Standard omission or compiler bug: Hexadecimal escapes don't encode into UTF-8
September 20, 2004
The following string is encoded into UTF-8:

   char [] c = "\u7362";

It encodes into x"E7 8D A2".  The following string, however, does not get encoded:

   char [] c = "\x8F";

It remains x"8F".  Thus the \x specifies a literal byte in the character stream as implemented.  The specification doesn't mention this twist, so it's unclear whether it was intentional.
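As a quick check (a sketch, assuming a compiler of the period; x"..." is a hex string literal, and == compares array contents):

   void main()
   {
       char [] a = "\u7362";
       char [] b = "\x8F";

       // \u is encoded: U+7362 becomes the three UTF-8 bytes E7 8D A2
       assert(cast(ubyte[]) a == cast(ubyte[]) x"E7 8D A2");

       // \x is not: the single byte 8F is stored verbatim
       assert(cast(ubyte[]) b == cast(ubyte[]) x"8F");
   }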
September 21, 2004
"Burton Radons" <burton-radons@shaw.ca> wrote in message news:cilbgp$1p3a$1@digitaldaemon.com...
> The following string is encoded into UTF-8:
>
>     char [] c = "\u7362";
>
> It encodes into x"E7 8D A2".  The following string, however, does not get encoded:
>
>     char [] c = "\x8F";
>
> It remains x"8F".  Thus the \x specifies a literal byte in the character stream as implemented.  The specification doesn't mention this twist, if it was intentional.

I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as Unicode and so is converted to UTF-8.


September 21, 2004
In article <cioggi$k7p$1@digitaldaemon.com>, Walter says... <snip>
>> It remains x"8F".  Thus the \x specifies a literal byte in the character stream as implemented.  The specification doesn't mention this twist, if it was intentional.
>
>I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as Unicode and so is converted to UTF-8.

Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints.  Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char.  Moreover, it follows the "looks like C, acts like C" principle.

Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.

Stewart.


September 21, 2004
In article <ciout3$skf$1@digitaldaemon.com>, Stewart Gordon says and then some program or another makes a mess of...
<snip>
>Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints.  Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char.  Moreover, it follows the "looks like C, acts like C" principle.
>
>Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.

Just what is wrong with this web newsgroup interface?  I should've carried on using my quote tidier.

If anyone else is having the same troubles, my quote tidier is here:

http://smjg.port5.com/faqs/usenet/quotetidy.html

Hopefully my regular posting environment will soon have a working power supply once again....

Stewart.


September 21, 2004
On Tue, 21 Sep 2004 10:13:55 +0000 (UTC), Stewart Gordon <Stewart_member@pathlink.com> wrote:

> In article <cioggi$k7p$1@digitaldaemon.com>, Walter says...
> <snip>
>>> It remains x"8F".  Thus the \x specifies a literal byte in the character
>>> stream as implemented.  The specification doesn't mention this twist, if
>>> it was intentional.
>>
>> I wasn't sure what to do about that case, so I left the \x as whatever the
> programmer wrote. The \u, though, is definitely meant as Unicode and so is
>> converted to UTF-8.
>
> Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints.  Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char.  Moreover, it follows the "looks like C, acts like C" principle.

I agree... however, doesn't this make it possible to create an invalid UTF-8 sequence?
Does the compiler/program catch this invalid sequence?
I believe it should.

> Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.
>
> Stewart.



-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
September 21, 2004
"Regan Heath" <regan@netwin.co.nz> wrote in message news:opseo5g1si5a2sq9@digitalmars.com...
> I agree... however, doesn't this make it possible to create an invalid UTF-8 sequence?

Yes.

> Does the compiler/program catch this invalid sequence?
> I believe it should.

Only if the string is interpreted as a wchar[] or dchar[].
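A sketch of the distinction (assuming the compiler behaviour described here):

    char [] a = "\x8F";   // accepted: \x inserts the raw byte, valid or not
    wchar[] b = "\x8F";   // rejected: the literal must be transcoded to
                          // UTF-16, and a lone 0x8F is not valid UTF-8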



September 22, 2004
In article <cilbgp$1p3a$1@digitaldaemon.com>, Burton Radons says...
>
>The following string is encoded into UTF-8:
>
>    char [] c = "\u7362";
>
>It encodes into x"E7 8D A2".  The following string, however, does not get encoded:
>
>    char [] c = "\x8F";
>
>It remains x"8F".  Thus the \x specifies a literal byte in the character stream as implemented.  The specification doesn't mention this twist, if it was intentional.

This is correct behavior. You should be using \u for Unicode characters. \x is for literal bytes. \u is supposed to understand the encoding. \x is not.

In D, the source code encoding must always be a UTF, but in other computer languages, this is not so. Imagine a C++ program in which the source code encoding were WINDOWS-1252. In such a case, the following two lines would be equivalent:

#    const char *s = "\x80";   // Euro sign in WINDOWS-1252 (C++)
#    const char *s = "\u20AC"; // Euro sign in Unicode (C++)

In both cases, a single byte [0x80] will be placed in the string s. And now, here's the same thing in a C++ program in which the source code encoding is WINDOWS-1251:

#    const char *s = "\x88";   // Euro sign in WINDOWS-1251 (C++)
#    const char *s = "\u20AC"; // Euro sign in Unicode (C++)

In both cases, a single byte [0x88] will be placed in the string s. Now, since D does not allow non-UTF source code encodings, the distinction may appear blurred, but it's still there.

Just remember:
\x => insert this literal byte
\u => insert this Unicode character, encoded in the appropriate encoding.
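In D itself, where the execution encoding is always a UTF, the euro sign therefore comes out like this (a sketch):

#    char[] s1 = "\u20AC"; // three bytes: E2 82 AC - UTF-8 for U+20AC (D)
#    char[] s2 = "\x80";   // one raw byte: 80 - not the euro sign in D,
#                          // and not valid UTF-8 on its own (D)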

Arcane Jill


September 22, 2004
In article <opseo5g1si5a2sq9@digitalmars.com>, Regan Heath says...

>I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence?

Yup. If you use \x in a char array you are doing /low level stuff/. You are doing encoding-by-hand - and it's up to you to get it right.

>Does the compiler/program catch this invalid sequence?
>I believe it should.

I disagree. If you're using \x then you're working at the byte level. You might be doing some system-programming-type stuff where you actually /want/ to break the rules.

The compiler will catch it if and when you pass it to a toUTF function, and that's good enough for me.
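For instance (a sketch against the std.utf of the time - the exact exception class is elided to plain Exception):

#    import std.utf;
#
#    void main()
#    {
#        char[] s = "\x8F";   // compiles: \x may break UTF-8 on purpose
#        try
#            toUTF16(s);      // decoding has to happen here...
#        catch (Exception e)
#        {
#            // ...and the stray 0x8F trailing byte is rejected
#        }
#    }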

People simply need to understand the difference between \u and \x.

Arcane Jill


September 22, 2004
In article <opseo5g1si5a2sq9@digitalmars.com>, Regan Heath says... <snip>
> I agree...  however, doesn't this make it possible to create an invalid UTF-8 sequence?  Does the compiler/program catch this invalid sequence?  I believe it should.

I firmly don't believe in any attempts to force a specific character encoding on every char[] ever created.  As said before, it should remain possible for char[] literals to contain character codes that aren't UTF-8, for such purposes as interfacing OS APIs.

The ability to use arbitrary \x codes provides this neatly.  I imagine few people would use it to insert UTF-8 characters in practice - if they want the checking, they can either type the character directly or use the \u code, which is much simpler than manually converting it to UTF-8.

Stewart.


September 22, 2004
In article <cirlav$2a9s$1@digitaldaemon.com>, Stewart Gordon says...
>
>I firmly don't believe in any attempts to force a specific character encoding on every char[] ever created.

I do, since it's documented that way.


>As said before, it should remain possible for char[] literals to contain character codes that aren't UTF-8, for such purposes as interfacing OS APIs.

I agree that it should remain possible - but I disagree with the reason. Non-UTF encodings are more properly stored as ubyte[] arrays in D. Remember, C and C++ simply don't /have/ a type equivalent to D's char, so functions written in C or C++ were /never/ intended to receive such a type. C's char == D's byte or ubyte.
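As a sketch of that convention (x"..." is a hex string literal; the payload bytes are just an example):

#    // Non-UTF data modelled as bytes, not as D chars:
#    ubyte[] win1252 = cast(ubyte[]) x"80 41 42";   // "€AB" in WINDOWS-1252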

The possible reasons why one might want to store arbitrary byte values in chars include scary hand-encoding of UTF-8 and possibly some esoteric custom extensions (for example, imagine you invent some backwardly compatible UTF-8-PLUS). Such uses are, however, rare. About as rare as, say, needing to write a custom allocator because "new" isn't good enough. It should always be possible, but never commonplace.



>The ability to use arbitrary \x codes provides this neatly.  I imagine few people would use it to insert UTF-8 characters in practice - if they want the checking, they can either type the character directly or use the \u code, which is much simpler than manually converting it to UTF-8.

Of course this makes perfect logical sense - /if/ you're talking about a ubyte[] array, not a char[] array.

Jill

