September 22, 2004
On Wed, 22 Sep 2004 07:21:27 +0000 (UTC), Arcane Jill <Arcane_member@pathlink.com> wrote:
> In article <opseo5g1si5a2sq9@digitalmars.com>, Regan Heath says...
>
>> I agree.. however doesn't this make it possible to create an invalid UTF-8
>> sequence?
>
> Yup. If you use \x in a char array you are doing /low level stuff/. You are
> doing encoding-by-hand - and it's up to you to get it right.

I agree.

>> Does the compiler/program catch this invalid sequence?
>> I believe it should.
>
> I disagree. If you're using \x then you're working at the byte level. You might
> be doing some system-programming-type stuff where you actually /want/ to break
> the rules.

I disagree. char is 'defined' as being UTF encoded; IMO it should never be anything else.
If you want to 'break the rules' you can/should use ubyte[]; then you're not breaking any rules.
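For example (a rough sketch of what I mean):

    char[]  s = "\x8F";       // claims to be UTF-8, but a lone 0x8F isn't valid
    ubyte[] b;
    b ~= cast(ubyte) 0x8F;    // just a byte value - no encoding rule to break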

> The compiler will catch it if and when you pass it to a toUTF function, and that's good enough for me.

Probably fair enough.. however, I think it would be more robust if it were made impossible to have an invalid UTF-8/16/32 sequence. That may be an impossible dream..

> People simply need to understand the difference between \u and \x.

But of course.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
September 22, 2004
On Wed, 22 Sep 2004 10:49:03 +0000 (UTC), Stewart Gordon <Stewart_member@pathlink.com> wrote:
> In article <opseo5g1si5a2sq9@digitalmars.com>, Regan Heath says...
> <snip>
>> I agree..  however doesn't this make it possible to create an
>> invalid UTF-8 sequence?  Does the compiler/program catch this
>> invalid sequence?  I believe it should.
>
> I firmly don't believe in any attempts to force a specific character
> encoding on every char[] ever created.

But it's 'defined' as having that encoding. If you don't want it, don't use char[]; use byte[] instead.

> As said before, it should
> remain possible for char[] literals to contain character codes that
> aren't UTF-8, for such purposes as interfacing OS APIs.

A C/C++ char* points to signed 8-bit values with no specified encoding. D's byte[] matches that perfectly. Maybe byte[] should be implicitly convertible to char* (if it's not already).
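e.g. something like this today (the binding below is hypothetical, just to show the cast that's currently needed):

    // hypothetical extern(C) binding, with C's char* kept as char* on the D side
    extern(C) void legacy_write(char* s, int len);

    void callIt(byte[] buf)
    {
        // today the byte[] has to be turned into a char* by hand
        // (assumes buf is non-empty); an implicit byte[] -> char*
        // conversion would remove this cast
        legacy_write(cast(char*) &buf[0], cast(int) buf.length);
    }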

> The ability to use arbitrary \x codes provides this neatly.  I
> imagine few people would use it to insert UTF-8 characters in
> practice - if they want the checking, they can either type the
> character directly or use the \u code, which is much simpler than
> manually converting it to UTF-8.

Sure, really I'm playing devil's advocate.. I question the logic of 'defining' char to be UTF-8 if you're not going to enforce it.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
September 23, 2004
In article <opseq16svz5a2sq9@digitalmars.com>, Regan Heath says...


>I disagree. char is 'defined' as being UTF encoded; IMO it should never
>be anything else.
>If you want to 'break the rules' you can/should use ubyte[]; then you're
>not breaking any rules.

Okay, you've convinced me.

In that case, \x## should be forbidden in char, wchar, dchar, char[], wchar[] and dchar[] literals, while \u#### and \U######## should be forbidden in all other integer and integer array literals.

But that would be a real headache for Walter, since D is supposed to have a context-free grammar. It's not clear to me how the compiler could parse the difference.
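To illustrate (this is the proposed rule, not current compiler behaviour):

#    char[] s = "\u20AC";   // fine: \u names the Unicode character U+20AC
#    char[] t = "\x8F";     // would become an error: a raw byte in a char[] literal
#    // ...and conversely, \u and \U would be rejected in the other literal types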

Arcane Jill


September 23, 2004
In article <opseq16svz5a2sq9@digitalmars.com>, Regan Heath says... <snip>
> I disagree.  char is 'defined' as being UTF encoded; IMO it should never be anything else.  If you want to 'break the rules' you can/should use ubyte[]; then you're not breaking any rules.

Do char[] and ubyte[] implicitly convert between each other?  If not, it could make code that interfaces a foreign API somewhat cluttered with casts.

And besides this, which is more self-documenting for the purpose?

ubyte[] obscures the fact that it's a string, rather than any old block of 8-bit numbers.  char[] denotes a string, but is it any more misleading?  People coming from a C(++) background are likely to see it and think 'string' rather than 'UTF-8'.  (Does anyone actually come from a D background yet?)

>> The compiler will catch it if and when you pass it to a toUTF function, and that's good enough for me.
> 
> Probably fair enough..  however, I think it would be more robust if it was made impossible to have an invalid utf8/16/32 sequence. That may be an impossible dream..
<snip>

That would mean that a single char value would be restricted to the ASCII set, wouldn't it?

Stewart.


September 23, 2004
In article <ciu7q5$s35$1@digitaldaemon.com>, Stewart Gordon says...

>Do char[] and ubyte[] implicitly convert between each other?

Of course not. They can't even /ex/plicitly convert. How could they? You'd be converting from UTF-8 to ... what exactly?

But I suspect you meant implicitly /cast/. In which case, no, they don't do that either.

>If not, it could make code that interfaces a foreign API somewhat cluttered with casts.

Not really, since foreign API functions should be /expecting/ C-strings, that is, pointers to arrays of bytes (not chars), terminated with the byte value \0. So, for example, strcat() should be declared in D as:

#    extern(C) byte * strcat(byte * dest, byte * src); // correct

and not as:

#    extern(C) char * strcat(char * dest, char * src); // incorrect



>And besides this, which is more self-documenting for the purpose?

Well, this of course is the big area of disagreement. We all want code to be easily maintainable. That means, more readable; more self-documenting. Readable code is a good thing. The problem is that, some of us (Regan and I, for example) look at a declaration of char[] and see "A string of Unicode characters encoded in UTF-8". It is eminently self-documenting, by the very definition of char[]. We also look at a declaration of byte[] and see "An array of bytes whose interpretation depends on what you do with them".

Others (yourself included) apparently see things differently. You look at a declaration of char[] and see "A string of not-necessarily-Unicode characters encoded in some unspecified way", and see byte[] as "An array of bytes whose interpretation is anything /other/ than a sequence of characters".

It is not really possible for code to be simultaneously self-documenting in both paradigms - but you might like to consider the fact that in C and C++, an array of C chars must be interpreted as "An array of bytes whose interpretation depends on what you do with them" - because C/C++ don't actually /have/ a character type, merely an overused byte type. As soon as you start to think:

D           Java              C/C++
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
byte        byte              signed char
ubyte       no equivalent     unsigned char
char        no equivalent     no equivalent
wchar       char              wchar_t

and /stop/ imagining that D's char == C's char (which it clearly doesn't) then everything makes sense.


>ubyte[] obscures the fact that it's a string, rather than any old block of 8-bit numbers.

And how would you make such a distinction in C?



>char[] denotes a string, but is it any more misleading?  People coming from a C(++) background are likely to see it and think 'string' rather than 'UTF-8'.  (Does anyone actually come from a D background yet?)

Maybe you're answering your own question there. Stop thinking in C. This is D. Think in D. Even if nobody comes from a D background yet - let's just assume that one day, they will.

It has been suggested over on the main forum that D's types be renamed. If no type called "char" existed in D; if instead, you had to choose between the types "utf8", "uint8" and "int8", it would be obvious which one you'd go for.



>That would mean that a single char value would be restricted to the ASCII set , wouldn't it?

You're not thinking in Unicode. A D char stores a "code unit" (a UTF-8 fragment), not a character codepoint. UTF-8 code-units coincide with character codepoints /only/ in the ASCII range. A single char value, however, can store any valid UTF-8 fragment. You would be wrong, however, to interpret this as a character. For example:

#    char[] euroSign = "€";    // Note that euroSign.length == 3
#    char e0 = euroSign[0];    // perfectly valid
#    char e1 = euroSign[1];    // perfectly valid
#    char e2 = euroSign[2];    // perfectly valid
#
#    char[] s;
#    s ~= e0;   // s is now temporarily invalid
#    s ~= e1;   // s is now temporarily invalid
#    s ~= e2;   // s is now fully constructed
#    assert(s == euroSign);

Arcane Jill


September 23, 2004
In article <ciudb4$11ba$1@digitaldaemon.com>, Arcane Jill says...

> In article <ciu7q5$s35$1@digitaldaemon.com>, Stewart Gordon says...
> 
>> Do char[] and ubyte[] implicitly convert between each other?
> 
> Of course not.  They can't even /ex/plicitly convert.  How could they?  You'd be converting from UTF-8 to ...  what exactly?

I wouldn't.  I'd be converting from bytes interpreted as chars to bytes interpreted as bytes of arbitrary semantics.

<snip>
> Not really, since foreign API functions should be /expecting/ C-strings, that is, pointers to arrays of bytes (not chars), terminated with the byte value \0.

Even if they're written in/for Pascal or Fortran?

> So, for example, strcat() should be declared in D as:
> 
> #    extern(C) byte * strcat(byte * dest, byte * src); // correct
> 
> and not as:
> 
> #    extern(C) char * strcat(char * dest, char * src); // incorrect

Then how would I write the C call

strcat(qwert, "yuiop");

in D?

<snip>
> Others (yourself included) apparently see things differently.  You look at a declaration of char[] and see "A string of not-necessarily-Unicode characters encoded in some unspecified way", and see byte[] as "An array of bytes whose interpretation is anything /other/ than a sequence of characters".

Did I say that?  I didn't mean to indicate that byte[] necessarily isn't an array of characters.  Merely that I don't see people as seeing it and thinking 'string'.

<snip>
>> ubyte[] obscures the fact that it's a string, rather than any old block of 8-bit numbers.
> 
> And how would you make such a distinction in C?

With a typedef.

<snip>
> Maybe you're answering your own question there.  Stop thinking in C.  This is D.  Think in D.

I do on the whole.  But trying to think in Windows API at the same time isn't easy.  It'll probably be easier once the D Windows headers are finished.

> Even if nobody comes from a D background yet - let's just assume that one day, they will.
> 
> It has been suggested over on the main forum that D's types be renamed.  If no type called "char" existed in D; if instead, you had to choose between the types "utf8", "uint8" and "int8", it would be obvious which one you'd go for.

Then if only such types existed as "ansi", "windows1252", "windows1253", "ibm", "iso8859_5", "macdevanagari" then the list would be complete.

<snip>
> You're not thinking in Unicode.  A D char stores a "code unit" (a UTF-8 fragment), not a character codepoint.  UTF-8 code-units coincide with character codepoints /only/ in the ASCII range.  A single char value, however, can store any valid UTF-8 fragment. You would be wrong, however, to interpret this as a character.
<snip>

That makes sense....

Stewart.


September 24, 2004
On Thu, 23 Sep 2004 12:51:16 +0000 (UTC), Stewart Gordon <Stewart_member@pathlink.com> wrote:

<snip>

>> Even if nobody comes from a D background yet - let's just assume
>> that one day, they will.
>>
>> It has been suggested over on the main forum that D's types be
>> renamed.  If no type called "char" existed in D; if instead, you
>> had to choose between the types "utf8", "uint8" and "int8", it
>> would be obvious which one you'd go for.
>
> Then if only such types existed as "ansi", "windows1252",
> "windows1253", "ibm", "iso8859_5", "macdevanagari" then the list
> would be complete.

It appears to me that Walter has decided on having only 3 types with a specified encoding, and all other encodings will be handled by using ubyte[]/byte[] and conversion functions.

I think this is the right choice. I see Unicode as the future and other encodings as legacy encodings, whose use I hope will gradually disappear.

Of course, if there is a valid reason for a certain encoding to remain, for speed/space/other reasons, and D wanted the same sort of built-in support as it has for UTF-8/16/32, then a new type might emerge.
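For example, a legacy 8-bit encoding could be handled with something like this (a rough sketch; the function name is mine, not an existing Phobos routine):

    // Latin-1 bytes in, valid UTF-8 out - each byte maps directly to a codepoint
    char[] latin1ToUtf8(ubyte[] src)
    {
        char[] dst;
        foreach (ubyte b; src)
        {
            if (b < 0x80)
                dst ~= cast(char) b;                   // ASCII stays a single code unit
            else
            {
                dst ~= cast(char) (0xC0 | (b >> 6));   // two-code-unit UTF-8 sequence
                dst ~= cast(char) (0x80 | (b & 0x3F));
            }
        }
        return dst;
    }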

<snip>

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
September 24, 2004
In article <opses3k8mv5a2sq9@digitalmars.com>, Regan Heath says...

>It appears to me that Walter has decided on having only 3 types with a specified encoding, and all other encodings will be handled by using ubyte[]/byte[] and conversion functions.

Completely in agreement with you there. However, Stewart did actually ask a question which I couldn't answer, and which we shouldn't ignore. Maybe you have some ideas.

..anyway...

I'm moving my reply to the main forum. I think it's more appropriate there.

Arcane Jill


September 25, 2004
Stewart Gordon wrote:

> In article <cioggi$k7p$1@digitaldaemon.com>, Walter says...
> <snip>
> 
>>>It remains x"8F".  Thus the \x specifies a literal byte in the character
>>>stream as implemented.  The specification doesn't mention this twist, if
>>>it was intentional.
>>
>>I wasn't sure what to do about that case, so I left the \x as whatever the
>>programmer wrote. The \u, though, is definitely meant as unicode and so is
>>converted to UTF-8.
> 
> 
> Here's somewhere I agree with your choice of behaviour, where \x denotes
> byte values, not Unicode codepoints.  Hence here, the coder who writes \x8F
> intends the byte with this value - a single value of type char.  Moreover,
> it follows the "looks like C, acts like C" principle.

I don't think this will work; it requires specifying what encoding the compiler works with internally.

For example, DMD works in UTF-8 internally.  Therefore the first string is okay but the second is not because the UTF-8 is broken:

    char [] foo = "\x8F";
    wchar [] bar = "\x8F";

But if a compiler uses UTF-16 or UTF-32 internally, then it won't detect any problem with either of those strings.

So a literal string must be valid for arbitrary conversion between any encoding (that can only be interpreted as "\x specifies a UNICODE character"), OR there must be a mandate for what encoding the compiler uses internally.  I think the former is less odious; as soon as you start depending upon features of an encoding, you get into trouble.
September 27, 2004
In article <cj4c5p$1r6s$1@digitaldaemon.com>, Burton Radons says...

>I don't think this will work; it requires specifying what encoding the compiler worked with internally.
>
>For example, DMD works in UTF-8 internally.

Walter assures us that the D language itself is not prejudiced toward UTF-8; that UTF-16 and UTF-32 have equal status. I can think of one or two examples which seem to contradict this, but they are likely to disappear once D gives us implicit conversions between the UTFs.


>Therefore the first string is okay but the second is not because the UTF-8 is broken:
>
>     char [] foo = "\x8F";
>     wchar [] bar = "\x8F";

I presume you meant that the other way round: the /first/ string is broken; the /second/ string is okay.

#   char[] foo = "\x8F";  // leaves foo = [ 0x8F ] -- not valid UTF-8
#   wchar[] bar = "\x8F"; // leaves bar = [ 0x008F ] -- valid UTF-16 for U+008F

But that's okay. Anyone using \x in a char[], wchar[] or dchar[] is expected to know what they're doing, otherwise they should be using \u.


>But if a compiler uses UTF-16 or UTF-32 internally, then it won't detect any problem with either of those strings.

There isn't /necessarily/ anything wrong with either of those strings. For example:

#    char[] s = "\xC4";             // -- not valid UTF-8 ON PURPOSE
#    char[] t = "\x8F";             // -- not valid UTF-8 ON PURPOSE
#    char[] u = s ~ t;              // -- valid UTF-8 for U+010F
#    wchar[] v = toUTF16( s ~ t );  // -- valid UTF-16 for U+010F

The requirement for using \x is merely that the programmer knows their UTF.


>So a literal string must be valid for arbitrary conversion between any encoding (that can only be interpreted as "\x specifies a UNICODE character"),

No, the requirement is that programmers /must not/ use \x within a string unless they understand exactly how it will be interpreted.

For most normal purposes, stick to this golden rule:

*) For char[], wchar[] or dchar[] - use \u
*) For all other arrays - use \x


>OR there must be a mandate for what encoding the compiler uses internally.

I don't see a need for that.


>I think the former is less odious; as soon as you start depending upon features of an encoding, you get into trouble.

Right. Which is why \x in strings should be considered "experts only". But I would hesitate to call that a "bug".

It would be /possible/ for D's lexer to distinguish character string constants from byte string constants in some cases. I don't know if that would be a good idea. What I mean is:

#    "hello"           // char[] literal
#    "hell\u006F"      // char[] literal because of the embedded \u
#    "hell\x6F"        // ubyte[] literal because of the embedded \x
#    "hell\u006F\x6F"  // syntax error

This would catch a lot of such bugs at compile time. Maybe Walter could be persuaded to go for this, I don't know. But \x bugs are bugs in user code, not in the compiler.

Arcane Jill