Bug in std.string.format?
July 09, 2004
If I do:

//Also with any other ascii 8 bit chars:
char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");

The program says (at runtime):

Error: invalid UTF-8 sequence

AFAIK 'Ñ' is UTF-8.





July 09, 2004
Juanjo Álvarez wrote:

> If I do:
> 
> //Also with any other ascii 8 bit chars:
> char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");
<snip>

std.string.format isn't documented, as far as I can see.  Is this the string counterpart of writef, which I'd just pointed out we should have over on d.D?

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment.  Please keep replies on the 'group where everyone may benefit.
July 09, 2004
In article <cclofh$1qrr$1@digitaldaemon.com>, Juanjo Álvarez says...
>
>If I do:
>
>//Also with any other ascii 8 bit chars:
>char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");
>
>The program says (at runtime):
>
>Error: invalid UTF-8 sequence

This is not a bug. You have an invalid UTF-8 sequence. The library is correctly reporting it.



>AFAIK 'Ñ' is UTF-8.

It is not. The Unicode character U+00D1, LATIN CAPITAL N WITH TILDE is represented in UTF-8 by the two byte sequence { 0xC3, 0x91 }.

UTF-8 is backwardly compatible with ASCII. It is /not/, however, backwardly compatible with ISO-8859-1. Any character with codepoint greater than 0x7F must be correctly UTF-8 encoded.

You can get the correct UTF-8 sequence by starting with a string of dchars and passing it to std.utf.toUTF8().
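A minimal sketch of that conversion, assuming a current D2 compiler and Phobos (the thread predates both, so the 2004 `dchar[]` appears here as `dstring`). `utf8Bytes` is a hypothetical helper name, not part of the library; only `std.utf.toUTF8` and `std.string.representation` are library calls.

```d
import std.string : representation;
import std.utf : toUTF8;

/// Returns the UTF-8 bytes that encode the given UTF-32 text.
/// (Hypothetical helper, for illustration only.)
immutable(ubyte)[] utf8Bytes(dstring text)
{
    // toUTF8 re-encodes the code points as UTF-8 code units.
    string encoded = toUTF8(text);
    // representation exposes those code units as raw bytes.
    return encoded.representation;
}

unittest
{
    // "\u00D1" names the code point by escape, so this line is plain
    // ASCII and does not depend on how the source file was saved.
    immutable ubyte[] expected = [0xC3, 0x91];
    assert(utf8Bytes("\u00D1"d) == expected);

    // Code points up to 0x7F encode as one identical byte.
    immutable ubyte[] ascii = [0x4E];
    assert(utf8Bytes("N"d) == ascii);
}
```

The unittest block only runs when the file is compiled with unit tests enabled (`-unittest`).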

Arcane Jill





July 09, 2004
In article <ccm0is$2768$1@digitaldaemon.com>, Arcane Jill says...
>
>>char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");
>>
>>Error: invalid UTF-8 sequence
>
>This is not a bug. You have an invalid UTF-8 sequence. The library is correctly reporting it.

Oh - and here's the fix. Save your source-code text file in UTF-8 format before attempting to compile it. I suspect it is currently saved in some ANSI format or other - probably ISO-8859-1 or WINDOWS-1252 depending on your operating system. You need a text editor which can save in UTF-8.

D source files should always be saved in UTF-8 format if you want string literals to be correctly interpreted.

Jill


July 09, 2004
Actually, come to think of it, it would be very, very helpful to users of D if the D compiler actually checked the integrity of all string literals at compile time. If any string literal were found (at compile time) to contain an invalid UTF-8 sequence, it would help the user ENORMOUSLY if an error message along the lines of:

#   ERROR - D source file not saved as UTF-8. Cannot compile.

were to be printed. (Strictly speaking, the D compiler should always pass the
entire source file to toUTF32(), and generate the above error if toUTF32()
fails. However, the source file encoding won't make any difference EXCEPT to
string literals).

So ... although it /is/ a user-error, it is nonetheless a user-error which DMD could have detected at compile-time, instead of leaving the error reporting to run time. The error message itself (as it stands) doesn't really help people to understand what's wrong.

Arcane Jill



July 09, 2004
Arcane Jill wrote:

<snip>
> #   ERROR - D source file not saved as UTF-8. Cannot compile.

Hang on ... according to the docs, the compiler is supposed to accept UTF-16 and UTF-32 too.

<snip>
> So ... although it /is/ a user-error, it is nonetheless a user-error which DMD
> could have detected at compile-time, instead of leaving the error reporting to
> run time. The error message itself (as it stands) doesn't really help people to
> understand what's wrong.

Some debate is possible.  Obviously the compiler isn't being UTF compliant.  But what if someone wants to include, in a string literal, characters in the native OS or other character set that don't match UTF-8?  (FTM, how are escaped characters supposed to be handled ITR, considering that a string literal can be either a char[], wchar[] or dchar[]?)

Speaking of lexical.html...
"There are no digraphs or trigraphs in D."

What is meant by this, exactly?

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment.  Please keep replies on the 'group where everyone may benefit.
July 09, 2004
Arcane Jill wrote:

>>AFAIK 'Ñ' is UTF-8.
> 
> It is not. The Unicode character U+00D1, LATIN CAPITAL N WITH TILDE is represented in UTF-8 by the two byte sequence { 0xC3, 0x91 }.
> 
> UTF-8 is backwardly compatible with ASCII. It is /not/, however, backwardly compatible with ISO-8859-1. Any character with codepoint greater than 0x7F must be correctly UTF-8 encoded.

Then I was confused by the fact that inserting the line:

# -*- coding: UTF-8 -*-

at the start of a Python script makes the interpreter work with latin1 chars directly.

> You can get the correct UTF-8 sequence by starting with a string of dchars and passing it to std.utf.toUTF8().

Could you please provide an example of how that would be done? Because if I try:

dchar[] dstr = "ESPAÑA";

the compiler says:

otroformat.d(7): invalid UTF-8 sequence

and if I instead try:

dchar[] dstr = std.utf.toUTF8("ESPAÑA");


it says:

otroformat.d(7): function toUTF8 overloads char[](char[]s) and char[](dchar[]s) both match argument list for toUTF8

So I'm a little lost here.


July 09, 2004
In article <ccmo0h$8u0$1@digitaldaemon.com>, Stewart Gordon says...
>
>Arcane Jill wrote:
>
><snip>
>> #   ERROR - D source file not saved as UTF-8. Cannot compile.
>
>Hang on ... according to the docs, the compiler is supposed to accept UTF-16 and UTF-32 too.

I stand corrected. However, the UTFs are all very easy to tell apart. UTF-16 looks very different from UTF-8, and it only takes a simple algorithm to distinguish them. Ditto UTF-32.

What I *SHOULD* have said is that DMD assumes that the source file is encoded in UTF-8, UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE. What it can't do is tell 8-bit encodings apart from each other, so it assumes that, if it's an 8-bit encoding, it will be UTF-8.
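As a rough illustration of how simple such a check can be, here is a sketch that looks only at a leading byte-order mark. It is not DMD's actual detection code: the names `SourceEncoding` and `guessEncoding` are hypothetical, and a real compiler may also inspect files that carry no mark at all.

```d
/// The five source encodings named above.
enum SourceEncoding { utf8, utf16LE, utf16BE, utf32LE, utf32BE }

/// Guesses the encoding of raw file contents from a leading
/// byte-order mark. (Hypothetical helper, for illustration only.)
SourceEncoding guessEncoding(const(ubyte)[] data)
{
    // Check the 4-byte UTF-32 marks first: FF FE 00 00 also begins
    // with the UTF-16LE mark FF FE.
    if (data.length >= 4)
    {
        if (data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
            return SourceEncoding.utf32LE;
        if (data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
            return SourceEncoding.utf32BE;
    }
    if (data.length >= 2)
    {
        if (data[0] == 0xFF && data[1] == 0xFE)
            return SourceEncoding.utf16LE;
        if (data[0] == 0xFE && data[1] == 0xFF)
            return SourceEncoding.utf16BE;
    }
    // No 16- or 32-bit mark: an 8-bit file, which is assumed to be UTF-8.
    return SourceEncoding.utf8;
}

unittest
{
    immutable ubyte[] utf16 = [0xFF, 0xFE, 0x69, 0x00];
    immutable ubyte[] latin1 = [0x45, 0xD1];
    assert(guessEncoding(utf16) == SourceEncoding.utf16LE);
    // An ISO-8859-1 file has no mark either, so it is taken for UTF-8.
    assert(guessEncoding(latin1) == SourceEncoding.utf8);
}
```

The last assertion is the point of the paragraph above: nothing in the bytes distinguishes one 8-bit encoding from another, so the fallback has to be an assumption.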




><snip>
>> So ... although it /is/ a user-error, it is nonetheless a user-error which DMD could have detected at compile-time, instead of leaving the error reporting to run time. The error message itself (as it stands) doesn't really help people to understand what's wrong.
>
>Some debate is possible.  Obviously the compiler isn't being UTF compliant.

Yes, it is. The compiler is being 100% UTF compliant. Problems only arise if the source code isn't.


>But what if someone wants to include, in a string literal, characters in the native OS or other character set that don't match UTF-8?

There ain't no such character. UTF-8 can encode the whole of Unicode. I'm not sure there's an OS on the planet which uses characters which are not in Unicode.

Oh wait - I believe the ZX Spectrum had some weird clunky graphics characters which are not in Unicode. But we don't need to worry about that because D has not been ported to that platform.



>(FTM, how are escaped characters supposed to be handled ITR, considering that a string literal can be either a char[], wchar[] or dchar[]?)

They are supposed to be represented as is, not escaped in any way (beyond being encoded in UTF-whatever).

Unless of course you mean stuff like "\n" - which obviously is stored in source as backslash followed by 'n'. The compiler can figure THAT out because it's part of D.




>Speaking of lexical.html...
>"There are no digraphs or trigraphs in D."
>
>What is meant by this, exactly?

Old, old stuff from the early days of C. You have to go back a long time, but once, there were keyboards without square brackets or curly braces and things, and which were not remappable in software. Digraphs are two-character sequences which a C compiler will replace with those single missing characters. Trigraphs are similar three character sequences.



July 09, 2004
In article <ccn00c$khq$1@digitaldaemon.com>, Juanjo Álvarez says...

>Then I was confused by the fact that inserting the line:
>
># -*- coding: UTF-8 -*-
>
>at the start of a Python script makes the interpreter work with latin1 chars directly.

That may be a red herring, but I don't know what Python does and I'm not qualified to comment. If I had to guess, I'd say that declaration tells Python the encoding in which the source file was saved.

I can tell you though that D also interprets all Latin-1 characters (and indeed,
all Unicode characters) directly ... *IF* the source file is saved in a UTF
format. (See below).

DMD may be "deficient" in the sense that it does not understand ISO-8859-1, ISO-8859-2, WINDOWS-1252, etc, etc. - but I would regard that as a strength, not a weakness. Simple. Neat. Clean. However, this does need to be better documented.
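For text that already exists as ISO-8859-1 bytes, the conversion is mechanical, because each Latin-1 byte value is also the Unicode code point of the same character. The sketch below assumes a current D2 Phobos; `latin1ToUtf8` is a hypothetical helper name, and it covers ISO-8859-1 only (WINDOWS-1252 differs in the 0x80 to 0x9F range).

```d
import std.utf : toUTF8;

/// Converts ISO-8859-1 bytes to a UTF-8 string.
/// (Hypothetical helper, for illustration only.)
string latin1ToUtf8(const(ubyte)[] latin1)
{
    dchar[] codePoints;
    codePoints.reserve(latin1.length);
    foreach (b; latin1)
    {
        // In ISO-8859-1 the byte value is the code point itself.
        codePoints ~= cast(dchar) b;
    }
    // Bytes above 0x7F become two-byte UTF-8 sequences here.
    return toUTF8(codePoints);
}

unittest
{
    // "ESPAÑA" as saved by an editor using ISO-8859-1.
    immutable ubyte[] latin1 = [0x45, 0x53, 0x50, 0x41, 0xD1, 0x41];
    assert(latin1ToUtf8(latin1) == "ESPA\u00D1A");
}
```

Re-saving the file as UTF-8 in the editor remains the simpler fix; this only shows why the two encodings disagree on every byte above 0x7F.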



>Could you please provide an example of how that would be done? Because if I try:
>
>dchar[] dstr = "ESPAÑA";
>
>the compiler says:
>
>otroformat.d(7): invalid UTF-8 sequence

Honestly - this has got nothing whatsoever to do with the compiler. There's a stage BEFORE compiling - it's called saving the text file.

Let's say you're using Microsoft Notepad. Type something into it, such as:

#    dchar[] dstr = "ESPAÑA";

Now - instead of clicking on "Save", click instead on "Save As". You'll see three drop-down menus at the bottom of the dialog. One of them is labelled "Encoding", and it will have "ANSI" selected by default. *** CHANGE IT TO UTF-8 ***. Now save. Now the D compiler will be happy with it.

Pretty much all text editors these days offer such a choice - however it is usually not the default, so you have to remember to explicitly do the Save As / UTF-8 thing.

And you can use ALL characters too, not just Latin-1. You can use Latin-2, Greek, Russian, Chinese, whatever.

Just remember that trick - SAVE AS UTF-8 before you attempt to compile.



>and if I instead try:
>
>dchar[] dstr = std.utf.toUTF8("ESPAÑA");
>
>it says:
>
>otroformat.d(7): function toUTF8 overloads char[](char[]s) and char[](dchar[]s) both match argument list for toUTF8
>
>So I'm a little lost here.

I can understand that because, as I said, the DMD error message is not helpful. However, bear in mind that the fault lies with your use of the text editor, not with your use of D.

If Walter would care to help everyone out with this one by improving the error message (if only to lay blame somewhere other than DMD), what he should do is this. The compiler should pass the entire source file contents to std.utf.validate (or some equivalent function written in C/C++). If it passes, go ahead and compile. If it fails, issue an error message that the source file is not correctly encoded, and needs to be re-saved as UTF-8 before it will compile.

Of course, if the source file contains only ASCII characters then it is automatically valid UTF-8, even if it was saved as "ANSI".
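The check proposed above can be sketched in a few lines, assuming a current D2 Phobos in which `std.utf.validate` throws `UTFException` on malformed input. `isValidUtf8` is a hypothetical wrapper name, and this is not how DMD itself is implemented.

```d
import std.utf : UTFException, validate;

/// Reports whether the raw file contents are well-formed UTF-8.
/// (Hypothetical helper, for illustration only.)
bool isValidUtf8(const(ubyte)[] raw)
{
    try
    {
        // Reinterpret the bytes as UTF-8 code units and let the
        // library walk the whole buffer.
        validate(cast(const(char)[]) raw);
        return true;
    }
    catch (UTFException)
    {
        // A tool could report "source file is not saved as UTF-8" here.
        return false;
    }
}

unittest
{
    // "ESPAÑA" saved as ISO-8859-1: 0xD1 starts a two-byte sequence,
    // but 0x41 is not a continuation byte.
    immutable ubyte[] latin1 = [0x45, 0x53, 0x50, 0x41, 0xD1, 0x41];
    // The same text saved as UTF-8.
    immutable ubyte[] utf8 = [0x45, 0x53, 0x50, 0x41, 0xC3, 0x91, 0x41];
    // Pure ASCII is valid UTF-8 whatever the editor called the encoding.
    immutable ubyte[] ascii = [0x45, 0x53, 0x50];

    assert(!isValidUtf8(latin1));
    assert(isValidUtf8(utf8));
    assert(isValidUtf8(ascii));
}
```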

Arcane Jill




July 10, 2004
Arcane Jill wrote:


>>What is meant by this, exactly?
> 
> Old, old stuff from the early days of C. You have to go back a long time, but once, there were keyboards without square brackets or curly braces and things, and which were not remappable in software. Digraphs are two-character sequences which a C compiler will replace with those single missing characters. Trigraphs are similar three character sequences.

And they are (or at least were) extensively used in the obfuscated C
contests :)