Thread overview

What to do about \x?
  Arcane Jill (Oct 01, 2004)
  Stewart Gordon (Oct 01, 2004)
  Arcane Jill (Oct 01, 2004)
  Arcane Jill (Oct 01, 2004)
  Stewart Gordon (Oct 01, 2004)
  Arcane Jill (Oct 01, 2004)
  Stewart Gordon (Oct 01, 2004)
  Arcane Jill (Oct 01, 2004)
  Stewart Gordon (Oct 04, 2004)
  Arcane Jill (Oct 04, 2004)
  Stewart Gordon (Oct 04, 2004)
  Walter (Oct 01, 2004)
  David L. Davis (Oct 01, 2004)
  Arcane Jill (Oct 01, 2004)
  Arcane Jill (Oct 04, 2004)
  Arcane Jill (Oct 04, 2004)
  Burton Radons (Oct 02, 2004)
  Stewart Gordon (Oct 04, 2004)
  Arcane Jill (Oct 04, 2004)
  Arcane Jill (Oct 05, 2004)
  Stewart Gordon (Oct 05, 2004)
  Arcane Jill (Oct 05, 2004)
  Stewart Gordon (Oct 05, 2004)
  Arcane Jill (Oct 06, 2004)
  Stewart Gordon (Oct 06, 2004)
  Arcane Jill (Oct 06, 2004)
  larrycowan (Oct 05, 2004)
  Arcane Jill (Oct 06, 2004)
  Stewart Gordon (Oct 05, 2004)

Function Resolution by Return Type (Was: Re: What to do about \x?)
  Benjamin Herr (Oct 04, 2004)
  Arcane Jill (Oct 05, 2004)
  Stewart Gordon (Oct 05, 2004)
October 01, 2004
The use of the escape sequence "\x" in string and character literals causes a lot of confusion in D.

(1) Most people on this forum use either WINDOWS-1252 or LATIN-1, and, consequently, expect "\xE9" to emit a lowercase 'e' with an acute accent. It does not.

(2) In a recent thread on this forum, novice (who is Russian) expected "\xC0" to
emit the Cyrillic letter capital A (because that's what happens in C++ on their
WINDOWS-1251 machine). It does not.

(3) Over in the bugs forum, it has been discovered (and is still being discussed) that the DMD compiler interprets "\x" as UTF-8 /even in wchar strings/ (which one might have expected to be UTF-16). This is clearly nonsense.
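
To make (3) concrete, here is a minimal example of my own, going only by the behaviour reported over there (not by anything in the spec):

#    wchar[] w = "\xC3\xA9";   // two UTF-8 bytes in a /UTF-16/ string...
#    assert(w.length == 1);    // ...silently become the single wchar U+00E9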

So maybe it's time to clear up the confusion once and for all. What /should/ "\x" do? As I see it, these are the options:

(1) \x should be interpreted in the user's default encoding
(2) \x should be interpreted in the source-code encoding
(3) \x should be interpreted according to the destination type of the literal
(4) \x should be interpreted as UTF-8
(5) \x should be interpreted as Latin-1
(6) \x should be interpreted as ASCII
(7) \x should be deprecated

Option (1) is what C++ does. However - it makes code non-portable across encodings: a source file created by one user will not necessarily compile correctly for another.

Option (2) is more restricted, since the source file encoding must be one of UTF-8, UTF-16 or UTF-32. I still don't like it though - the compiler shouldn't behave differently just because the source file is saved differently.

Option (3) is what I (incorrectly) assumed \x would do. However, D has a
context-free grammar, and parses string literals /before/ it knows what kind of
thing it's assigning. (Exactly how it manages to make

#    wchar[] s = "hello";

do the right thing is beyond me, but whether or not this contextual typing could be extended to include the interpretation of \x is something only Walter could answer). Anyway, this would still be strange behaviour, from the point of view of C++ programmers.

Option (4) is the status quo. It confuses everybody.

Option (5) is biased toward folk in the Western world. There is some justification for it, however, since Latin-1 is a subset of Unicode, having precisely the same codepoint-to-character mapping. If we went for this, then "\x##" would always be the same thing as "\u00##" or "\U000000##". I /think/ (though I'm not certain) that this is what Java does.
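
To spell that out (purely hypothetical - this is /not/ what DMD currently does):

#    // under option (5), all three of these would mean the same thing:
#    wchar[] a = "\xE9";
#    wchar[] b = "\u00E9";
#    wchar[] c = "\U000000E9";   // LATIN SMALL LETTER E WITH ACUTE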

Option (6) is not unreasonable. It would mean that "\x00" to "\x7F" would be the only legal \x escape sequences - these are unambiguous. Sequences "\x80" to "\xFF" would become compile-time errors. The error message should advise people to use "\u" instead.
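
Again hypothetically, option (6) would give us:

#    char[] ok  = "\x41";   // still fine - plain ASCII 'A'
#    char[] bad = "\xE9";   // compile-time error, telling you to write "\u00E9"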

Option (7) is foolproof. Any use of "\x" becomes a compile-time error. The error message should advise people to use "\u" instead.

What think you all?
Arcane Jill


October 01, 2004
Arcane Jill wrote:
> The use of the escape sequence "\x" in string and character literals causes a
> lot of confusion in D. 
> 
> (1) Most people on this thread use either WINDOWS-1252 or LATIN-1, and,
> consequently, expect "\xE9" to emit a lowercase 'e' with an acute accent. It
> does not.

It does _emit_ a lowercase 'e' with an acute accent, if its destination is anything in the Windows GUI and it was emitted using the A version of a Windows API function.

> (2) In a recent thread on this forum, novice (who is Russian) expected "\xC0" to
> emit the Cyrillic letter capital A (because that's what happens in C++ on their
> WINDOWS-1251 machine). It does not.
> 
> (3) Over in the bugs forum, it is being discussed and discovered that the DMD
> compiler interprets "\x" as UTF-8 /even in wchar strings/ (which one might have
> expected to be UTF-16). This is clearly nonsense.

In that case, how should it interpret \u or \U in a char string?

\x is a UTF-8 fragment, \u is a UTF-16 fragment, \U is a UTF-32 fragment.  Whatever rules we have for translating one into the other should be consistent.  And the current behaviour satisfies that criterion nicely.
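
For example (my own illustration of that consistency, not something taken from the spec), all three of these spell the Euro sign, and once converted to the destination type they come out identical:

    wchar[] a = "\xE2\x82\xAC";   // three UTF-8 fragments, converted to UTF-16
    wchar[] b = "\u20AC";         // a UTF-16 fragment, used as-is
    wchar[] c = "\U000020AC";     // a UTF-32 fragment, converted to UTF-16
    assert(a == b && b == c);     // all three are the one-wchar Euro string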

<snip>
> (Exactly how it manages to make
> 
> #    wchar[] s = "hello";
> 
> do the right thing is beyond me, but whether or not this contextual typing could
> be extended to include the intepretation of \x is something only Walter could
> answer).

Just thinking about it, I guess that DMD uses an 8-bit internal representation during the tokenising and parsing phase, converting \u and \U codes (and of course literal characters in UTF-16 or UTF-32 source text) to their UTF-8 counterparts.  During the semantic analysis, it then converts it back to UTF-16 or UTF-32 if it's assigning to a wchar[] or dchar[].

<snip>
> Option (4) is the status quo. It confuses everybody.
<snip>

Well, I'm not confused ... yet.

> What think you all?

My vote goes to leaving \x as it is.

Stewart.
October 01, 2004
In article <cjjcs7$9pf$1@digitaldaemon.com>, Stewart Gordon says...
>
>Arcane Jill wrote:
>> The use of the escape sequence "\x" in string and character literals causes a lot of confusion in D.
>> 
>> (1) Most people on this thread use either WINDOWS-1252 or LATIN-1, and, consequently, expect "\xE9" to emit a lowercase 'e' with an acute accent. It does not.
>
>It does _emit_ a lowercase 'e' with an acute accent, if its destination is anything in the Windows GUI and it was emitted using the A version of a Windows API function.

Sorry, I didn't follow that. Can you give me a code example? I only meant that

#    char[] s = "\xE9";

does not leave s containing an e with an acute accent, which some people might expect.



>In that case, how should it interpret \u or \U in a char string?

\u and \U are well defined universally. They do not depend on encoding.


>\x is a UTF-8 fragment,

That's the status quo in D, yes.

>\u is a UTF-16 fragment, \U is a UTF-32 fragment.

Incorrect. \u is /not/ a UTF-16 fragment. Why did you assume that? Did you assume that because, in D, \x is a UTF-8 fragment? If so, we may take that as further evidence that the current implementation of \x causes confusion.

In fact, both \u and \U specify Unicode /characters/ (not UTF fragments). By definition, \u#### = \U0000#### = the Unicode character U+####. \u and \U are identical in all respects other than the number of hex digits expected to follow them. (You only need to resort to \U if more than four digits are present.) Thus,

#    wchar[] s = "\uD800\uDC00"; // error

(correctly) fails to compile, (correctly) requiring you instead to do:

#    wchar[] s = "\U00110000";

So you see, \x really _is_ the odd one out.



>Whatever rules we have for translating one into the other should be consistent.

Such consistency would require option (5) from my original post, so that \x## = \u00## = \U000000## - in all cases regarded as a Unicode character, not a UTF fragment. (The primary argument /against/ this behaviour is that it effectively makes \x an ISO-8859-1 encoding, which could be considered to be Western bias.)


>And the current behaviour satisfies that criterion nicely.

My point is that the current behavior of \x is /not/ consistent with \u or \U. It is also not consistent with the expectations of users used to C++ or Java behavior.


>> Option (4) is the status quo. It confuses everybody.
><snip>
>
>Well, I'm not confused ... yet.

Well, I don't know about you, but the following confuses me:

#    wchar[] s = "\xC4\x8F";  // eh?
#    wchar[] t = "\u010F";
#    assert(s == t);          // yup -they're the same

Why should s - declared as a UTF-16 string - be able to accept UTF-8 literals, but not UTF-16 literals? Let's see that again with a different example:

#    dchar[] s = "\U00010000";        // Unicode - (correctly) compiles
#    dchar[] s = "\uD800\uDC00";      // UTF-16 - (correctly) fails to compile
#    dchar[] s = "\xE0\x90\x80\x80";  // UTF-8 - compiles

It's not consistent. (But \u and \U are implemented correctly).


>> What think you all?
>My vote goes to leaving \x as it is.
>Stewart.

For what it's worth, my vote goes to deprecating \x.
Arcane Jill


October 01, 2004
In article <cjjgs7$bs2$1@digitaldaemon.com>, Arcane Jill says...

Erratum

#    wchar[] s = "\U00110000";

should read:

#    wchar[] s = "\U00010000";

Jill


October 01, 2004
Arcane Jill wrote:

> In article <cjjcs7$9pf$1@digitaldaemon.com>, Stewart Gordon says...
<snip>
>> It does _emit_ a lowercase 'e' with an acute accent, if its destination is anything in the Windows GUI and it was emitted using the A version of a Windows API function.
> 
> Sorry, I didn't follow that. Can you give me a code example?

    char[] s = "\xE9";
    SendMessageA(hWnd, WM_SETTEXT, 0, cast(LPARAM) cast(char*) s);

<snip>
> In fact, both \u and \U specify Unicode /characters/ (not UTF fragments). By definition, \u#### = \U0000#### = the Unicode character U+####. \u and \U are identical in all respects other than the number of hex digits expected to follow them. (You only need to resort to \U if more than four digits are present.) Thus, 
> 
> #    wchar[] s = "\uD800\uDC00"; // error
> 
> (correctly) fails to compile, (correctly) requiring you instead to do:

I find in the spec:

lex.html
	\n			the linefeed character
	\t			the tab character
	\"			the double quote character
	\012			octal
	\x1A			hex
	\u1234			wchar character
	\U00101234		dchar character
	\r\n			carriage return, line feed

expression.html
"Character literals are single characters and resolve to one of type char, wchar, or dchar. If the literal is a \u escape sequence, it resolves to type wchar. If the literal is a \U escape sequence, it resolves to type dchar. Otherwise, it resolves to the type with the smallest size it will fit into."

What bit of the spec should I be reading instead?

<snip>
>> Whatever rules we have for translating one into the other should be consistent.
> 
> Such consistency would require option (5) from my original post, so that \x## = \u00## = \U000000## - in all cases regarded as a Unicode character, not a UTF fragment. (The primary argument /against/ this behaviour is that it effectively makes \x an ISO-8859-1 encoding, which could be considered to be Western bias.)

How does simply having \x, \u and \U represent UTF-8, UTF-16 and UTF-32 fragments respectively not achieve this consistency?

<snip>
> Well, I don't know about you, but the following confuses me:
> 
> #    wchar[] s = "\xC4\x8F";  // eh?
> #    wchar[] t = "\u010F";
> #    assert(s == t);          // yup -they're the same

I suppose it confused me before I realised what it meant.

The point, AIUI, is that all string literals are equal whether they are notated as UTF-8, UTF-16 or UTF-32, i.e. the lexer reduces all to the same thing.  This is kind of consistent with the principle that the permitted source text encodings are all treated equally.  This leaves the semantic analyser with only one kind of string literal to worry about, which it converts to the UTF of the target type.

Moreover, I imagine that string literals juxtaposed into one, or even the contents of a single " " pair, are allowed to mix the various escape notations.  In this case, trying to label string literals at lex-time as UTF-8, UTF-16 or UTF-32 would be fruitless.
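
e.g. something like this, if mixing really is allowed (I'm assuming so rather than quoting the spec):

    dchar[] s = "caf\xC3\xA9 \u2013 \U0001D11E";
    // UTF-8 fragments, a \u escape and a \U escape in one literal,
    // all reduced to one internal form and then converted to UTF-32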

> Why should s - declared as a UTF-16 string - be able to accept UTF-8 literals, but not UTF-16 literals? Let's see that again with a different example:
<snip>

That would sound like a bug to me.

Stewart.
October 01, 2004
In article <cjjo6a$g8o$1@digitaldaemon.com>, Stewart Gordon says...

>I find in the spec:
>
>lex.html
>	\n			the linefeed character
>	\t			the tab character
>	\"			the double quote character
>	\012			octal
>	\x1A			hex
>	\u1234			wchar character
>	\U00101234		dchar character
>	\r\n			carriage return, line feed

I would argue that the spec is wrong. \u and \U are Unicode things, and I distinctly remember a discussion on this very subject on the Unicode public forum a while back (though Walter could, if he were sufficiently perverse, give D a different definition). I suggest that the D spec /should/ read:

#	\u1234			Unicode character
#	\U00101234		Unicode character

This would be consistent with C++, C#, Java, the intent of the Unicode Consortium, and the actual current behavior of D.

Certainly, at present, actual behavior of \u in D is different from the documentation, so either there is a documentation error, or else there is a bug. I'm inclined to the belief that it's a documentation error. I certainly hope so, because \u and \U should always be independent of encoding. No-one should have to learn UTF-16 to use \u. It should be legal to write:

#    wchar[] s = "\U00101234";

(which of course it currently is).



>expression.html
>"Character literals are single characters and resolve to one of type
>char, wchar, or dchar. If the literal is a \u escape sequence, it
>resolves to type wchar. If the literal is a \U escape sequence, it
>resolves to type dchar. Otherwise, it resolves to the type with the
>smallest size it will fit into."

Bugger!

Again, that's not how Unicode is supposed to behave. The following should (and does) compile without complaint:

#    char c = '\U0000002A';

So again, D is behaving fine, but the documentation does not match reality. Documentation error or bug? I say it's a documentation error.



>What bit of the spec should I be reading instead?

You read the D docs correctly. I was going by previous discussions on the Unicode public forum (from memory). Obviously, those discussions weren't specifically about D.


>How does simply having \x, \u and \U represent UTF-8, UTF-16 and UTF-32 fragments respectively not achieve this consistency?

It's just not how Unicode is supposed to behave. \u and \U are supposed to be Unicode characters. Nothing more. Nothing less. (And that of course is exactly what D has implemented).

I'm going to have a hard time backing that up - so please trust me on this one. If not, I'll have to go trawling through the Unicode archives, or passing this question on to the Consortium folk. It's a complicated question, because the \u thing isn't actually something the UC can define, but nonetheless the same definition is used by C++, Java, Python, C#, various internet RFCs, etc. etc. D would be out on a very dodgy limb here if it were to do things differently.

(...and it's also what is implemented by D, so again, I claim it's the documentation which is in error).




>The point, AIUI, is that all string literals are equal whether they are notated as UTF-8, UTF-16 or UTF-32, i.e. the lexer reduces all to the same thing.

But one should not have to learn /any/ UTF to encode a string. Why would anyone want that?

All you should need to know to encode a character using an escape sequence is its codepoint. And that's what \u and \U are for.
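
For instance (just spelling out my point), exactly the same escape works regardless of the destination type, and the compiler does the encoding for you:

#    char[]  a = "\u00E9";   // stored as the UTF-8 bytes C3 A9
#    wchar[] b = "\u00E9";   // stored as the single UTF-16 code unit 00E9
#    dchar[] c = "\u00E9";   // stored as the single UTF-32 code unit 000000E9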


>This is kind of consistent with the principle that the permitted source text encodings are all treated equally.

Well, here at least we are in agreement. All supported encodings should be treated equally.


>> Why should s - declared as a UTF-16 string - be able to accept UTF-8 literals, but not UTF-16 literals? Let's see that again with a different example:
><snip>
>
>That would sound like a bug to me.

It's certainly an inconsistency - but it's the legality of \x, not the illegality of \u, about which I would complain.

Arcane Jill


October 01, 2004
Arcane Jill wrote:
<snip>
> I would argue that the spec is wrong. \u and \U are Unicode things, and I distinctly remember a discussion on this very subject on the Unicode public forum a while back (though Walter could, if he were sufficiently perverse, give D a different definition).

Which bit of The Unicode Standard should I read to find the meanings of \u and \U it sets in stone across every language ever invented?

<snip>
> Certainly, at present, actual behavior of \u in D is different from the documentation, so either there is a documentation error, or else there is a bug. I'm inclined to the belief that it's a documentation error. I certainly hope so, because \u and \U should always be independent of encoding.

How would fixing the compiler to follow the spec create dependence on encoding?

> No-one should have to learn UTF-16 to use \u. It should be legal to write:
> 
> #    wchar[] s = "\U00101234";
> 
> (which of course it currently is).

Agreed from the start.  Nobody suggested that anyone should have to learn UTF-16.  Nor that being _allowed_ to use UTF-16 and being _allowed_ to use UTF-32 (and hence actual codepoints) should be mutually exclusive.  I thought that was half the spirit of having both \u and \U - to give the programmer the choice.

>> expression.html
>> "Character literals are single characters and resolve to one of type char, wchar, or dchar. If the literal is a \u escape sequence, it resolves to type wchar. If the literal is a \U escape sequence, it resolves to type dchar. Otherwise, it resolves to the type with the smallest size it will fit into."
> 
> Bugger!
> 
> Again, that's not how Unicode is supposed to behave. The following should (and does) compile without complaint:
> 
> #    char c = '\U0000002A';
> 
> So again, D is behaving fine, but the documentation does not match reality. Documentation error or bug? I say it's a documentation error.

Ah, you have a point.  It appears that character and string literals are translated by the same function, rather than character literals being labelled as char, wchar or dchar.  So that could be either a doc error or a bug.  Walter?

<snip>
> It's just not how Unicode is supposed to behave. \u and \U are supposed to be Unicode characters. Nothing more. Nothing less. (And that of course is exactly what D has implemented).
> 
> I'm going to have a hard time backing that up - so please trust me on this one. If not, I'll have to go trawling through the Unicode archives, or passing this question on to the Consortium folk. It's a complicated question, because the \u thing isn't actually something the UC can define, but nonetheless the same definition is used by C++, Java, Python, C#, various internet RFCs, etc. etc. D would be out on a very dodgy limb here if it were to do things differently.

Well, at least what the D spec implies \u should mean is a superset of what you're implying is the 'correct' Unicode behaviour.

<snip>
>> The point, AIUI, is that all string literals are equal whether they are notated as UTF-8, UTF-16 or UTF-32, i.e. the lexer reduces all to the same thing.
> 
> But one should not have to learn /any/ UTF to encode a string. Why would anyone want that?

I can't see how that follows on from what I just said.  But agreed.

> All you should need to know to encode a character using an escape sequence is its codepoint. And that's what \u and \U are for.
<snip>

Exactly.  At least, whichever interpretation of \u we go by, it works for all codepoints below U+FFFE.  The only thing left to debate is whether it should also work for those UTF-16 fragments that don't directly correspond to codepoints.  If not, then there are two things to do:

- fix the documentation to explain this
- invent another escape to represent a UTF-16 fragment, for the sake of completeness.

Stewart.
October 01, 2004
This is a good summary of the situation. I don't know what the right answer is yet, so it's a good topic for discussion. One issue not mentioned yet is the (not that unusual) practice of stuffing arbitrary data into a string with \x. For example, in C one might create a length prefixed 'pascal style' string with:

    unsigned char* p = "\x03abc";

where the \x03 is not character data, but length data. I'm hesitant to say "you cannot do that in D". I also want to support things like:

    byte[] a = "\x05\x89abc\xFF";

where clearly a bunch of binary data is desired.

One possibility is to use prefixes on the string literals:

    "..."    // Takes its type from the context, only ascii \x allowed
    c"..."    // char[] string literal, only ascii \x allowed
    w"..."   // wchar[] string literal, \x not allowed
    d"..."    // dchar[] string literal, \x not allowed
    b"..."    // byte[] binary data string literal, \x allowed, \u and \U
not allowed
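
Usage might then look something like this (hypothetical, of course, since none of it is implemented):

    char[]  s = c"ascii only: \x41\x42";   // \x80-\xFF would be errors here
    wchar[] w = w"caf\u00E9";              // \u fine, any \x rejected
    byte[]  b = b"\x05\x89abc\xFF";        // raw bytes; \u and \U rejected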


October 01, 2004
In article <cjk0rv$oqo$1@digitaldaemon.com>, Walter says...
>
>This is a good summary of the situation. I don't know what the right answer is yet, so it's a good topic for discussion. One issue not mentioned yet is the (not that unusual) practice of stuffing arbitrary data into a string with \x. For example, in C one might create a length prefixed 'pascal style' string with:
>
>    unsigned char* p = "\x03abc";
>
>where the \x03 is not character data, but length data. I'm hesitant to say "you cannot do that in D". I also want to support things like:
>
>    byte[] a = "\x05\x89abc\xFF";
>
>where clearly a bunch of binary data is desired.
>
>One possibility is to use prefixes on the string literals:
>
>    "..."    // Takes its type from the context, only ascii \x allowed
>    c"..."    // char[] string literal, only ascii \x allowed
>    w"..."   // wchar[] string literal, \x not allowed
>    d"..."    // dchar[] string literal, \x not allowed
>    b"..."    // byte[] binary data string literal, \x allowed, \u and \U
>not allowed
>
>
Walter: Personally, I for one like the prefix idea for the string literals (which should also make the ~ string concatenation operator behave correctly...so, no more need to cast every darn string literal), and the cases of when \x can and cannot be used for each. I'm actually holding my breath for this one to happen! ;)

David L.

P.S. Welcome Back!! Hope you got some much needed rest.

-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
October 01, 2004
In article <cjjvnq$nje$1@digitaldaemon.com>, Stewart Gordon says...
>
>Arcane Jill wrote:
><snip>
>> I would argue that the spec is wrong. \u and \U are Unicode things, and I distinctly remember a discussion on this very subject on the Unicode public forum a while back (though Walter could, if he were sufficiently perverse, give D a different definition).
>
>Which bit of The Unicode Standard should I read to find the meanings of \u and \U it sets in stone across every language ever invented?

It doesn't, of course. The meaning of \u and \U is defined separately in each programming language, and not by the UC. Walter is free to define them how he likes for D.


><snip>
>> Certainly, at present, actual behavior of \u in D is different from the documentation, so either there is a documentation error, or else there is a bug. I'm inclined to the belief that it's a documentation error. I certainly hope so, because \u and \U should always be independent of encoding.
>
>How would fixing the compiler to follow the spec create dependence on encoding?

Tricky. Well, if "\uD800" were allowed, for example, it would open up the door to the very nasty possibility that "\U0000D800" might also be allowed - and that /definitely/ shouldn't be, because it's not a valid character. I dunno - I just think it could be very confusing and counterintuitive.

Everyone expects \u and \U to prefix /characters/, not fragments of some encoding scheme. Why would you want it any different?


>Agreed from the start.  Nobody suggested that anyone should have to learn UTF-16.  Nor that being _allowed_ to use UTF-16 and being _allowed_ to use UTF-32 (and hence actual codepoints) should be mutually exclusive.  I thought that was half the spirit of having both \u and \U - to give the programmer the choice.

In other languages, \u#### is identical in meaning to \U0000####. Why make D different?

If you want to hand-code UTF-16, you can always do it like this:

#    wchar[] s = [ 0xD800, 0xDC00 ];


>Ah, you have a point.  It appears that character and string literals are translated by the same function, rather than character literals being labelled as char, wchar or dchar.  So that could be either a doc error or a bug.  Walter?

Well, as you know, I suspect a documentation error, but I leave it to Walter to answer once and for all.


>Well, at least what the D spec implies \u should mean is a superset of what you're implying is the 'correct' Unicode behaviour.

Guess I can't argue with that.


>Exactly.  At least, whichever interpretation of \u we go by, it works for all codepoints below U+FFFE.  The only thing left to debate is whether it should also work for those UTF-16 fragments that don't directly correspond to codepoints.  If not, then there are two things to do:
>
>- fix the documentation to explain this

Agreed.

>- invent another escape to represent a UTF-16 fragment, for the sake of completeness.

I'm not sure you need an escape sequence. Why not just replace

#    wchar c = '\uD800';

with

#    wchar c = 0xD800;

What requirement is there to hand-code UTF-8 /inside a string literal/? If you get it right, then you might just as well have used \U and the codepoint; if you get it wrong, you're buggered.

Arcane Jill

