October 06, 2004
Arcane Jill wrote:
<snip>
> POSSIBILITY ONE:
> 
> Suppose that a user accustomed to WINDOWS-1252 wanted to hand-code a string containing a multiplication sign ('×') immediately followed by a Euro sign ('€'). Such a user might mistakenly type
> 
> #    char[] s = "\xD7\x80";
> 
> since 0xD7 is the WINDOWS-1252 codepoint for '×', and 0x80 is the WINDOWS-1252 codepoint for '€'. This /does/ compile under present rules - but results in s containing the single Unicode character U+05C0 (Hebrew punctuation PASEQ). This is not what the user was expecting, and it results entirely from the system interpreting the \x bytes as UTF-8 when they were never meant as UTF-8. If the user had been required to instead type
> 
> #    char[] s = "\u00D7\u20AC";
> 
> then they would have been protected from that error.

I see....
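
(For what it's worth, that reinterpretation is easy to reproduce. A minimal sketch, decoding the bytes with std.utf; the .dup is just to get a mutable char[]:)

    import std.stdio;
    import std.utf;

    void main()
    {
        // The bytes a WINDOWS-1252 user might type, taken as UTF-8:
        char[] s = "\xD7\x80".dup;
        size_t i = 0;
        writefln("U+%04X", cast(uint) decode(s, i));  // one character: U+05C0

        // What was actually meant: '×' then '€', encoded as UTF-8
        char[] t = "\u00D7\u20AC".dup;
        writefln("%s bytes", t.length);               // 5 bytes (2 + 3)
    }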

> POSSIBILITY TWO:
> 
> Suppose a user decides to hand-code UTF-8 on purpose /and gets it wrong/. As in:
> 
> #    char[] s = "\xE2\x82\x8C";  // whoops - should be "\xE2\x82\xAC"
> 
> 
> who's going to notice? The compiler? Not in this case. Again, if the user had been required instead to type:
> 
> #    char[] s = "\u20AC";
> 
> then they would have been protected from that error.

How are these less typo-prone?  Even if I did that, how would the compiler know that I didn't mean to type

    char[] s = "\u208C";

?  As I started to say, people who choose to hand-code UTF-8, any other encoding, or even raw codepoints should know what they're doing and that they have to be careful.
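
(After all, the "whoops" bytes in POSSIBILITY TWO are themselves perfectly well-formed UTF-8 for that very codepoint, so neither the compiler nor a runtime validation pass has anything to object to. A quick check, again with std.utf:)

    import std.stdio;
    import std.utf;

    void main()
    {
        char[] typo = "\xE2\x82\x8C".dup;  // the mistyped sequence
        validate(typo);                    // no exception: it's valid UTF-8
        size_t i = 0;
        writefln("U+%04X", cast(uint) decode(typo, i));  // prints U+208C
    }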

<snip>
>> Maybe you're right.  The trouble is that you have some points I don't really agree with, like that writing strings in UTF-8 or UTF-16 should be illegal.
> 
> Not illegal - I'm only asking for a b prefix. As in b"...".
> 
>> And some of my thoughts are:
>> 
>> - it should remain straightforward to interface legacy APIs, whatever character sets/encodings they may rely on
> 
> Yes, absolutely. That would be one use of b"...".

I see....

>> Maybe we can come to a best of all three worlds.
> 
> Of course we can. We're good. I'm not "making a stand" or "taking a position". I'm open to persuasion. I can be persuaded to change my mind. Assuming that's also true of you, it should just be a matter of logicking out all the pros and cons, and then letting Walter be the judge.

Good idea.

Stewart.
October 06, 2004
In article <ck0gia$239i$1@digitaldaemon.com>, Stewart Gordon says...

>> POSSIBILITY TWO:
>> 
>> Suppose a user decides to hand-code UTF-8 on purpose /and gets it wrong/. As in:
>> 
>> #    char[] s = "\xE2\x82\x8C";  // whoops - should be "\xE2\x82\xAC"
>> 
>> 
>> who's going to notice? The compiler? Not in this case. Again, if the user had been required instead to type:
>> 
>> #    char[] s = "\u20AC";
>> 
>> then they would have been protected from that error.
>
>How are these less typo-prone?  Even if I did that, how would the compiler know that I didn't mean to type
>
>     char[] s = "\u208C";

True enough. But I guess what I was trying to say is that \u20AC is something that any /human/ can look up in Unicode code charts. By contrast, you are unlikely to find a convenient lookup table anywhere on the web that will let you look up \xE2\x82\xAC. So the \u version is just more maintainable, and the \x version more obfuscated. I suppose I'm saying that if a project is maintained by more than one person, an error involving \u is more likely to be spotted by someone else in the team than an error involving a sequence of \x's.
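
To put a number on the obfuscation: checking a three-byte \x sequence means undoing the UTF-8 bit-packing in your head. A sketch of that arithmetic (standard UTF-8 packing, nothing D-specific; encode3 is just an illustrative name, and I've left out the surrogate-range checks):

#    // Pack a codepoint in U+0800..U+FFFF into three UTF-8 bytes:
#    // 1110xxxx 10xxxxxx 10xxxxxx
#    ubyte[3] encode3(dchar c)
#    {
#        ubyte[3] b;
#        b[0] = cast(ubyte)(0xE0 | (c >> 12));          // U+20AC -> 0xE2
#        b[1] = cast(ubyte)(0x80 | ((c >> 6) & 0x3F));  // U+20AC -> 0x82
#        b[2] = cast(ubyte)(0x80 | (c & 0x3F));         // U+20AC -> 0xAC
#        return b;
#    }

Flip one hex digit in the last byte (\xAC to \x8C) and the result is still well-formed UTF-8 - just for U+208C instead of U+20AC. That's exactly why nobody downstream notices.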

Maybe that's just subjective. Maybe I'm wrong. I dunno any more.


>?  As I started to say, people who choose to hand-code UTF-8, any other encoding, or even raw codepoints should know what they're doing and that they have to be careful.

Well obviously we both agree on that one. I think we only disagree on whether or not you should need a b before the "...".

Jill

