Standard omission or compiler bug: Hexadecimal escapes don't encode into UTF-8
September 20, 2004
The following string is encoded into UTF-8:

   char [] c = "\u7362";

It encodes into x"E7 8D A2".  The following string, however, does not get encoded:

   char [] c = "\x8F";

It remains x"8F".  Thus the \x specifies a literal byte in the character stream as implemented.  The specification doesn't mention this twist, so it's unclear whether it was intentional.
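As a quick check (a sketch, assuming a compiler of the period; x"..." is a hex string literal, and == compares array contents):

   void main()
   {
       char [] a = "\u7362";
       char [] b = "\x8F";

       // \u is encoded: U+7362 becomes the three UTF-8 bytes E7 8D A2
       assert(cast(ubyte[]) a == cast(ubyte[]) x"E7 8D A2");

       // \x is not: the single byte 8F is stored verbatim
       assert(cast(ubyte[]) b == cast(ubyte[]) x"8F");
   }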
September 21, 2004
"Burton Radons" <burton-radons@shaw.ca> wrote in message news:cilbgp$1p3a$1@digitaldaemon.com...
> The following string is encoded into UTF-8:
>
>     char [] c = "\u7362";
>
> It encodes into x"E7 8D A2".  The following string, however, does not get encoded:
>
>     char [] c = "\x8F";
>
> It remains x"8F".  Thus the \x specifies a literal byte in the character stream as implemented.  The specification doesn't mention this twist, if it was intentional.

I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as Unicode and so is converted to UTF-8.


September 21, 2004
In article <cioggi$k7p$1@digitaldaemon.com>, Walter says... <snip>
>> It remains x"8F".  Thus the \x specifies a literal byte in the character stream as implemented.  The specification doesn't mention this twist, if it was intentional.
>
>I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as Unicode and so is converted to UTF-8.

Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints.  Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char.  Moreover, it follows the "looks like C, acts like C" principle.

Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.

Stewart.


September 21, 2004
In article <ciout3$skf$1@digitaldaemon.com>, Stewart Gordon says and then some program or another makes a mess of...
<snip>
>Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints.  Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char.  Moreover, it follows the "looks like C, acts like C" principle.
>
>Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.

Just what is wrong with this web newsgroup interface?  I should've carried on using my quote tidier.

If anyone else is having the same troubles, my quote tidier is here:

http://smjg.port5.com/faqs/usenet/quotetidy.html

Hopefully my regular posting environment will soon have a working power supply once again....

Stewart.


September 21, 2004
On Tue, 21 Sep 2004 10:13:55 +0000 (UTC), Stewart Gordon <Stewart_member@pathlink.com> wrote:

> In article <cioggi$k7p$1@digitaldaemon.com>, Walter says...
> <snip>
>>> It remains x"8F".  Thus the \x specifies a literal byte in the character
>>> stream as implemented.  The specification doesn't mention this twist, if
>>> it was intentional.
>>
>> I wasn't sure what to do about that case, so I left the \x as whatever the
> programmer wrote. The \u, though, is definitely meant as Unicode and so is
>> converted to UTF-8.
>
> Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints.  Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char.  Moreover, it follows the "looks like C, acts like C" principle.

I agree... however, doesn't this make it possible to create an invalid UTF-8 sequence?
Does the compiler/program catch this invalid sequence?
I believe it should.

> Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.
>
> Stewart.



-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
September 21, 2004
"Regan Heath" <regan@netwin.co.nz> wrote in message news:opseo5g1si5a2sq9@digitalmars.com...
> I agree... however, doesn't this make it possible to create an invalid UTF-8 sequence?

Yes.

> Does the compiler/program catch this invalid sequence?
> I believe it should.

Only if the string is interpreted as a wchar[] or dchar[].
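A sketch of the distinction (assuming the compiler behaviour described here):

    char [] a = "\x8F";   // accepted: \x inserts the raw byte, valid or not
    wchar[] b = "\x8F";   // rejected: the literal must be transcoded to
                          // UTF-16, and a lone 0x8F is not valid UTF-8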



September 22, 2004
In article <cilbgp$1p3a$1@digitaldaemon.com>, Burton Radons says...
>
>The following string is encoded into UTF-8:
>
>    char [] c = "\u7362";
>
>It encodes into x"E7 8D A2".  The following string, however, does not get encoded:
>
>    char [] c = "\x8F";
>
>It remains x"8F".  Thus the \x specifies a literal byte in the character stream as implemented.  The specification doesn't mention this twist, if it was intentional.

This is correct behavior. You should be using \u for Unicode characters. \x is for literal bytes. \u is supposed to understand the encoding. \x is not.

In D, the source code encoding must always be a UTF, but in other computer languages, this is not so. Imagine a C++ program in which the source code encoding were WINDOWS-1252. In such a case, the following two lines would be equivalent:

#    const char *s = "\x80";   // Euro sign in WINDOWS-1252 (C++)
#    const char *s = "\u20AC"; // Euro sign in Unicode (C++)

In both cases, a single byte [0x80] will be placed in the string s. And now, here's the same thing in a C++ program in which the source code encoding is WINDOWS-1251:

#    const char *s = "\x88";   // Euro sign in WINDOWS-1251 (C++)
#    const char *s = "\u20AC"; // Euro sign in Unicode (C++)

In both cases, a single byte [0x88] will be placed in the string s. Now, since D does not allow non-UTF source code encodings, the distinction may appear blurred, but it's still there.

Just remember:
\x => insert this literal byte
\u => insert this Unicode character, encoded in the appropriate encoding.
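In D itself, where the execution encoding is always a UTF, the euro sign therefore comes out like this (a sketch):

#    char[] s1 = "\u20AC"; // three bytes: E2 82 AC - UTF-8 for U+20AC (D)
#    char[] s2 = "\x80";   // one raw byte: 80 - not the euro sign in D,
#                          // and not valid UTF-8 on its own (D)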

Arcane Jill


September 22, 2004
In article <opseo5g1si5a2sq9@digitalmars.com>, Regan Heath says...

>I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence?

Yup. If you use \x in a char array you are doing /low level stuff/. You are doing encoding-by-hand - and it's up to you to get it right.

>Does the compiler/program catch this invalid sequence?
>I believe it should.

I disagree. If you're using \x then you're working at the byte level. You might be doing some system-programming-type stuff where you actually /want/ to break the rules.

The compiler will catch it if and when you pass it to a toUTF function, and that's good enough for me.
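For instance (a sketch against the std.utf of the time - the exact exception class is elided to plain Exception):

#    import std.utf;
#
#    void main()
#    {
#        char[] s = "\x8F";   // compiles: \x may break UTF-8 on purpose
#        try
#            toUTF16(s);      // decoding has to happen here...
#        catch (Exception e)
#        {
#            // ...and the stray 0x8F trailing byte is rejected
#        }
#    }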

People simply need to understand the difference between \u and \x.

Arcane Jill


September 22, 2004
In article <opseo5g1si5a2sq9@digitalmars.com>, Regan Heath says... <snip>
> I agree...  however, doesn't this make it possible to create an invalid UTF-8 sequence?  Does the compiler/program catch this invalid sequence?  I believe it should.

I firmly don't believe in any attempts to force a specific character encoding on every char[] ever created.  As said before, it should remain possible for char[] literals to contain character codes that aren't UTF-8, for such purposes as interfacing OS APIs.

The ability to use arbitrary \x codes provides this neatly.  I imagine few people would use it to insert UTF-8 characters in practice - if they want the checking, they can either type the character directly or use the \u code, which is much simpler than manually converting it to UTF-8.

Stewart.


September 22, 2004
In article <cirlav$2a9s$1@digitaldaemon.com>, Stewart Gordon says...
>
>I firmly don't believe in any attempts to force a specific character encoding on every char[] ever created.

I do, since it's documented that way.


>As said before, it should remain possible for char[] literals to contain character codes that aren't UTF-8, for such purposes as interfacing OS APIs.

I agree that it should remain possible - but I disagree with the reason. Non-UTF encodings are more properly stored as ubyte[] arrays in D. Remember, C and C++ simply don't /have/ a type equivalent to D's char, so functions written in C or C++ were /never/ intended to receive such a type. C's char == D's byte or ubyte.
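As a sketch of that convention (x"..." is a hex string literal; the payload bytes are just an example):

#    // Non-UTF data modelled as bytes, not as D chars:
#    ubyte[] win1252 = cast(ubyte[]) x"80 41 42";   // "€AB" in WINDOWS-1252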

The possible reasons why one might want to store arbitrary byte values in chars include scary hand-encoding of UTF-8 and possibly some esoteric custom extensions (for example, imagine you invent some backwardly compatible UTF-8-PLUS). Such uses are, however, rare. About as rare as, say, needing to write a custom allocator because "new" isn't good enough. It should always be possible, but never commonplace.



>The ability to use arbitrary \x codes provides this neatly.  I imagine few people would use it to insert UTF-8 characters in practice - if they want the checking, they can either type the character directly or use the \u code, which is much simpler than manually converting it to UTF-8.

Of course this makes perfect logical sense - /if/ you're talking about a ubyte[] array, not a char[] array.

Jill

