September 20, 2004 Standard omission or compiler bug: Hexadecimal escapes don't encode into UTF-8

The following string is encoded into UTF-8:

    char [] c = "\u7362";

It encodes into x"E4 98 B7". The following string, however, does not get encoded:

    char [] c = "\x8F";

It remains x"8F". Thus the \x specifies a literal byte in the character stream as implemented. The specification doesn't mention this twist, if it was intentional.
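A quick way to check claims like this is to dump the bytes a literal actually contains. The sketch below assumes a current D2 compiler (so the literals are `string` rather than the `char[]` of the 2004 posts) and uses `std.string.representation` from Phobos; `bytesOf` is a hypothetical helper name. Worked by hand, U+7362 encodes as E7 8D A2; the E4 98 B7 quoted in the post is the UTF-8 form of U+4637, so one of the two values there may be a typo.

```d
import std.stdio : writefln;
import std.string : representation;

// Hypothetical helper: view a string's UTF-8 code units as raw bytes.
immutable(ubyte)[] bytesOf(string s)
{
    return s.representation;
}

void main()
{
    // \u names a Unicode code point; the compiler stores its UTF-8 encoding.
    writefln("%(%02X %)", bytesOf("\u7362")); // E7 8D A2
    // \x names one byte value; it is stored as written, with no encoding step.
    writefln("%(%02X %)", bytesOf("\x8F"));   // 8F
}
```

The second literal is a lone continuation byte, which is why it is not well-formed UTF-8 on its own.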
September 21, 2004 Re: Standard omission or compiler bug: Hexadecimal escapes don't encode into UTF-8

Posted in reply to Burton Radons

"Burton Radons" <burton-radons@shaw.ca> wrote in message news:cilbgp$1p3a$1@digitaldaemon.com...
> The following string is encoded into UTF-8:
>
>     char [] c = "\u7362";
>
> It encodes into x"E4 98 B7". The following string, however, does not get encoded:
>
>     char [] c = "\x8F";
>
> It remains x"8F". Thus the \x specifies a literal byte in the character stream as implemented. The specification doesn't mention this twist, if it was intentional.

I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.
September 21, 2004 Re: Standard omission or compiler bug: Hexadecimal escapes don't encode into UTF-8

Posted in reply to Walter

In article <cioggi$k7p$1@digitaldaemon.com>, Walter says...
<snip>
>> It remains x"8F". Thus the \x specifies a literal byte in the character stream as implemented. The specification doesn't mention this twist, if it was intentional.
>
>I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.

Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints. Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char. Moreover, it follows the "looks like C, acts like C" principle.

Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.

Stewart.
September 21, 2004 Re: Standard omission or compiler bug: Hexadecimal escapes don't encode into UTF-8

Posted in reply to Stewart Gordon

In article <ciout3$skf$1@digitaldaemon.com>, Stewart Gordon says and then some program or another makes a mess of...
<snip>
>Here's somewhere I agree with your choice of behaviour, where \x denotes byte
>values, not Unicode
>codepoints. Hence here, the coder who writes \x8F intended the byte having this
>value - a single value
>of type char. Moreover, it follows the "looks like C, acts like C" principle.
>
>Of course, if circumstances dictate that the string be interpreted as a wchar[]
>or dchar[], then that's
>another matter.

Just what is wrong with this web newsgroup interface? I should've carried on using my quote tidier. If anyone else is having the same troubles, you're pointed here....

http://smjg.port5.com/faqs/usenet/quotetidy.html

Hopefully my regular posting environment will soon have a working power supply once again....

Stewart.
September 21, 2004 Re: Standard omission or compiler bug: Hexadecimal escapes don't encode into UTF-8

Posted in reply to Stewart Gordon

On Tue, 21 Sep 2004 10:13:55 +0000 (UTC), Stewart Gordon <Stewart_member@pathlink.com> wrote:
> In article <cioggi$k7p$1@digitaldaemon.com>, Walter says...
> <snip>
>>> It remains x"8F". Thus the \x specifies a literal byte in the character
>>> stream as implemented. The specification doesn't mention this twist, if
>>> it was intentional.
>>
>> I wasn't sure what to do about that case, so I left the \x as whatever the
>> programmer wrote. The \u, though, is definitely meant as unicode and so is
>> converted to UTF-8.
>
> Here's somewhere I agree with your choice of behaviour, where \x denotes byte
> values, not Unicode codepoints. Hence here, the coder who writes \x8F intended
> the byte having this value - a single value of type char. Moreover, it follows
> the "looks like C, acts like C" principle.

I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence?
Does the compiler/program catch this invalid sequence?
I believe it should.

> Of course, if circumstances dictate that the string be interpreted as a wchar[]
> or dchar[], then that's another matter.
>
> Stewart.

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
September 21, 2004 Re: Standard omission or compiler bug: Hexadecimal escapes don't encode into UTF-8

Posted in reply to Regan Heath

"Regan Heath" <regan@netwin.co.nz> wrote in message news:opseo5g1si5a2sq9@digitalmars.com...
> I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence?

Yes.

> Does the compiler/program catch this invalid sequence?
> I believe it should.

Only if the string is interpreted as a wchar[] or dchar[].
September 22, 2004 Re: Standard omission or compiler bug: Hexadecimal escapes don't encode

Posted in reply to Burton Radons

In article <cilbgp$1p3a$1@digitaldaemon.com>, Burton Radons says...
>
>The following string is encoded into UTF-8:
>
>    char [] c = "\u7362";
>
>It encodes into x"E4 98 B7". The following string, however, does not get encoded:
>
>    char [] c = "\x8F";
>
>It remains x"8F". Thus the \x specifies a literal byte in the character stream as implemented. The specification doesn't mention this twist, if it was intentional.

This is correct behavior. You should be using \u for Unicode characters. \x is for literal bytes.

\u is supposed to understand the encoding. \x is not. In D, the source code encoding must always be a UTF, but in other computer languages, this is not so. Imagine a C++ program in which the source code encoding were WINDOWS-1252. In such a case, the following two lines would be equivalent:

#    char s[] = "\x80";   // Euro sign in WINDOWS-1252 (C++)
#    char s[] = "\u20AC"; // Euro sign in Unicode (C++)

In both cases, a single byte [0x80] will be placed in the string s. And now, here's the same thing in a C++ program in which the source code encoding is WINDOWS-1251:

#    char s[] = "\x88";   // Euro sign in WINDOWS-1251 (C++)
#    char s[] = "\u20AC"; // Euro sign in Unicode (C++)

In both cases, a single byte [0x88] will be placed in the string s.

Now, since D does not allow non-UTF source code encodings, the distinction may appear blurred, but it's still there. Just remember:

\x => insert this literal byte
\u => insert this Unicode character, encoded in the appropriate encoding.

Arcane Jill
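The Euro-sign example can be approximated from D by transcoding explicitly, since a D literal itself is always UTF. This is a rough sketch, assuming Phobos's `std.encoding` module with its `Windows1252String` type and `transcode` function; `euroInWindows1252` is a made-up helper name, not anything from the thread.

```d
import std.encoding : transcode, Windows1252String;
import std.stdio : writefln;
import std.string : representation;

// Hypothetical helper: the Euro sign re-encoded from UTF-8 to Windows-1252.
Windows1252String euroInWindows1252()
{
    Windows1252String legacy;
    transcode("\u20AC", legacy); // the D literal holds U+20AC as UTF-8
    return legacy;
}

void main()
{
    // The UTF-8 bytes stored for the \u escape in a D string literal.
    writefln("%(%02X %)", "\u20AC".representation);

    // The same character after conversion to the legacy code page.
    auto legacy = euroInWindows1252();
    writefln("%s byte(s), first = %02X", legacy.length, cast(ubyte) legacy[0]);
}
```

The point of the comparison is that \u names the character and leaves the byte sequence to the encoding in use, whereas \x80 fixes the byte and leaves its meaning to whoever reads it.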
September 22, 2004 Re: Standard omission or compiler bug: Hexadecimal escapes don't encode into UTF-8

Posted in reply to Regan Heath

In article <opseo5g1si5a2sq9@digitalmars.com>, Regan Heath says...
>I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence?

Yup. If you use \x in a char array you are doing /low level stuff/. You are doing encoding-by-hand - and it's up to you to get it right.

>Does the compiler/program catch this invalid sequence?
>I believe it should.

I disagree. If you're using \x then you're working at the byte level. You might be doing some system-programming-type stuff where you actually /want/ to break the rules. The compiler will catch it if and when you pass it to a toUTF function, and that's good enough for me.

People simply need to understand the difference between \u and \x.

Arcane Jill
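What "catching it later" can look like in practice: a minimal sketch, assuming a current D2 toolchain and Phobos's `std.utf.validate`, which throws `UTFException` on malformed input. `isWellFormed` is a hypothetical helper, and the check here happens at run time in the library rather than in the compiler proper.

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isWellFormed(string s)
{
    try
    {
        validate(s); // throws UTFException on the first malformed sequence
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    writeln(isWellFormed("\u7362"));       // code point encoded by the compiler
    writeln(isWellFormed("\x8F"));         // lone continuation byte
    writeln(isWellFormed("\xE7\x8D\xA2")); // U+7362 encoded by hand with \x
}
```

The third literal shows the encoding-by-hand case: \x escapes can produce valid UTF-8, but nothing checks that until some code decodes or validates the string.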
September 22, 2004 Re: Standard omission or compiler bug: Hexadecimal escapes don't encode into UTF-8

Posted in reply to Regan Heath

In article <opseo5g1si5a2sq9@digitalmars.com>, Regan Heath says...
<snip>
> I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence? Does the compiler/program catch this invalid sequence? I believe it should.

I firmly don't believe in any attempts to force a specific character encoding on every char[] ever created.

As said before, it should remain possible for char[] literals to contain character codes that aren't UTF-8, for such purposes as interfacing OS APIs.

The ability to use arbitrary \x codes provides this neatly. I imagine few people would use it to insert UTF-8 characters in practice - if they want the checking, they can either type the character directly or use the \u code, which is much simpler than manually converting it to UTF-8.

Stewart.
September 22, 2004 Re: Standard omission or compiler bug: Hexadecimal escapes don't encode into UTF-8

Posted in reply to Stewart Gordon

In article <cirlav$2a9s$1@digitaldaemon.com>, Stewart Gordon says...
>
>I firmly don't believe in any attempts to force a specific character encoding on every char[] ever created.

I do, since it's documented that way.

>As said before, it should remain possible for char[] literals to contain character codes that aren't UTF-8, for such purposes as interfacing OS APIs.

I agree that it should remain possible - but I disagree with the reason. Non-UTF encodings are more properly stored as ubyte[] arrays in D. Remember, C and C++ simply don't /have/ a type equivalent to D's char, so functions written in C or C++ were /never/ intended to receive such a type. C's char == D's byte or ubyte.

The possible reasons why one might want to store arbitrary byte values in chars include scary hand-encoding of UTF-8 and possibly some esoteric custom extensions (for example, imagine you invent some backwardly compatible UTF-8-PLUS). Such uses are, however, rare. About as rare as, say, needing to write a custom allocator because "new" isn't good enough. It should always be possible, but never commonplace.

>The ability to use arbitrary \x codes provides this neatly. I imagine few people would use it to insert UTF-8 characters in practice - if they want the checking, they can either type the character directly or use the \u code, which is much simpler than manually converting it to UTF-8.

Of course this makes perfect logical sense - /if/ you're talking about a ubyte[] array, not a char[] array.

Jill
Copyright © 1999-2021 by the D Language Foundation