September 29, 2004
In article <cj8fhp$1h9u$1@digitaldaemon.com>, Arcane Jill says... <snip>
> #   char[] foo = "\x8F";  // leaves foo = [ 0x8F ] -- not valid UTF-8
> #   wchar[] bar = "\x8F"; // leaves bar = [ 0x008F ] -- valid UTF-16 for U+008F
<snip>

No, "\x8F" _means_ the byte with value 0x8F, meant to be interpreted as UTF-8.  Somewhere in the docs there's an example or two of a wchar[] or dchar[] being initialised with UTF-8 in this way.

Stewart.


September 30, 2004
Arcane Jill wrote:

> In article <cj4c5p$1r6s$1@digitaldaemon.com>, Burton Radons says...
> 
> 
>>I don't think this will work; it requires specifying what encoding the compiler worked with internally.
>>
>>For example, DMD works in UTF-8 internally.
> 
> 
> Walter assures us that the D language itself is not prejudiced toward UTF-8;
> that UTF-16 and UTF-32 have equal status. I can think of one or two examples
> which seem to contradict this, but they are likely to disappear once D gives us
> implicit conversions between the UTFs.

I don't understand what political interpretation you gave my statement,
but it was only meant to introduce a concrete example.

>>Therefore the first string is okay but the second is not because the UTF-8 is broken:
>>
>>    char [] foo = "\x8F";
>>    wchar [] bar = "\x8F";
> 
> 
> I presume you meant that the other way round: the /first/ string is broken; the
> /second/ string is okay.

If the compiler uses UTF-8 internally, the first string compiles
correctly as a string of length one, while the second string does not,
because the compiler tries to re-encode it as UTF-16 during semantic
processing and fails.  I am describing DMD's current behaviour, mind.

If the compiler uses UTF-16 or UTF-32 internally (where it would convert
the source file into its native encoding during BOM processing), then
both strings compile.  The first string has length two; the second
string has length one.
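
To spell out the lengths under a UTF-16/UTF-32-internal compiler (a sketch
of the hypothetical behaviour just described, not DMD output):

    char [] foo = "\x8F";   // \x8F read as the character U+008F, encoded
                            // to UTF-8 as 0xC2 0x8F: foo.length == 2
    wchar [] bar = "\x8F";  // one UTF-16 code unit 0x008F: bar.length == 1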

[snip]
>>But if a compiler uses UTF-16 or UTF-32 internally, then it won't detect any problem with either of those strings.
> 
> 
> There isn't /necessarily/ anything wrong with either of those strings. For
> example:

If the compiler is using UTF-8 internally, there is no possible way to
re-encode the second string as UTF-16 while remaining consistent with
compilers that use different encodings.  To use your example encoding,
if the compiler uses UTF-8 internally, then this code:

   wchar [] s = "\xC4\x8F";

would result in a single-code string (you understand that D grammar is
contextless and that string escapes are interpreted during tokenisation,
right?).  However, if the compiler uses UTF-16 internally, it would
result in a two-code string.

This does show a third option, however: change string escapes so that
they are not interpreted until after semantic processing, where they
can be interpreted directly in their destination encoding.  But that
only serves to illustrate how unnatural this behaviour is; which may be
why no Unicode-supporting language I can find that handles \x
interprets it as anything but a character.
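
Under that scheme, the example above would become (a sketch of the
proposal; no current compiler does this):

    char [] a = "\xC4\x8F";   // the raw UTF-8 code units [ 0xC4, 0x8F ]
    wchar [] b = "\xC4\x8F";  // the UTF-16 code units [ 0x00C4, 0x008F ]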

[snip]

September 30, 2004
In article <cjg3qq$1ui6$1@digitaldaemon.com>, Burton Radons says...

>>>Therefore the first string is okay but the second is not because the UTF-8 is broken:
>>>
>>>    char [] foo = "\x8F";
>>>    wchar [] bar = "\x8F";
>> 
>> 
>> I presume you meant that the other way round: the /first/ string is broken; the /second/ string is okay.
>
>If the compiler uses UTF-8 internally, the first string compiles correctly as a string of length one, while the second string does not, because the compiler tries to re-encode it as UTF-16 during semantic processing and fails.  I am describing DMD's current behaviour, mind.

Yes. I was wrong.

The first example compiles okay, but results in foo containing an invalid UTF-8 sequence. The second example does not compile. (I assumed that it would, without testing the hypothesis. That'll teach me).


>To use your example encoding,
>if the compiler uses UTF-8 internally, then this code:
>
>    wchar [] s = "\xC4\x8F";
>

Again, I was wrong. I assumed (without testing) that this would compile to a two-wchar string constant, with s[0] containing U+00C4 and s[1] containing U+008F. In actual fact, what this code yields is a one-wchar string constant, with s[0] containing U+010F.

I would call that a bug. [ 0xC4, 0x8F ] is UTF-8 (not UTF-16) for U+010F. But s is a wchar string, so it's supposed to be UTF-16.
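
A complete test case makes the point (the asserts reflect the behaviour
reported above):

#    // compiles and runs under a UTF-8-internal compiler, per the above
#    void main()
#    {
#        wchar [] s = "\xC4\x8F";
#        assert(s.length == 1);   // one UTF-16 code unit, not two
#        assert(s[0] == 0x010F);  // the UTF-8 pair 0xC4 0x8F decoded to U+010F
#    }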



>(you understand that D grammar is
>contextless and that string escapes are interpreted during tokenisation,
>right?).  However, if the compiler uses UTF-16 internally, it would
>result in a two-code string.

I think you're right. That's what's happening. The compiler is interpreting all string constants as though they were UTF-8, regardless of the type of the destination.


>This does show a third option, however: change string escapes so that they are not interpreted until after semantic processing, where they can be interpreted directly in their destination encoding.

That's the way I assumed it would be done.


>But that
>only serves to illustrate how unnatural this behaviour is; which may be
>why no Unicode-supporting language I can find that handles \x
>interprets it as anything but a character.

Actually, C and C++ interpret \x as a LOCAL ENCODING character, and \u as a UNICODE character. Thus, in C++, if your local encoding were Windows-1252, then the following two statements would have identical effect:

#    // C++
#    char *s = "\x80";      // U+20AC (Euro sign) encoded in WINDOWS-1252
#    char *s = "\u20AC";

Both of these will leave the string s containing a single (byte-wide) char, with
value 0x80. (Plus the null-terminator, of course). Compare this with

#    // C++
#    char *e = "\u0080";

which /should/ fail to compile on a Windows-1252 machine.

So you /are/ right, but nonetheless there is a difference between \x and \u. And this presents a problem for D, because D aims to be portable between encodings. In D, therefore, \x SHOULD NOT be interpreted according to the local encoding, because this would immediately make code non-portable.

One way around this would be to assert that \x should mean exactly the same thing as \u and \U (that is, to specify a Unicode character). Now, that would be fine for those of us used to Latin-1, but Cyrillic users (for example) would be left out in the cold.
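
Under that reading (hypothetical; not what DMD does today), these two would
be equivalent:

#    // hypothetical: \x names a Unicode character, exactly like \u
#    wchar[] a = "\x8F";      // would mean the character U+008F ...
#    wchar[] b = "\u008F";    // ... just as this does: [ 0x008F ]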

Currently, I have come to the conclusion that \x should be deprecated. The escapes \u and \U explicitly specify a character set (i.e. Unicode), and that is what you need for portability. \x just has too many problems.
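
For reference, the portable spellings (my own examples; one BMP character
and one outside the BMP):

#    wchar[] bar  = "\u008F";      // U+008F as UTF-16: [ 0x008F ]
#    dchar[] clef = "\U0001D11E";  // U+1D11E (musical G clef) as UTF-32
#    wchar[] pair = "\U0001D11E";  // as UTF-16: the surrogate pair [ 0xD834, 0xDD1E ]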

Arcane Jill

