October 06, 2004
Arcane Jill wrote:
<snip>
> POSSIBILITY ONE:
> 
> Suppose that a user accustomed to WINDOWS-1252 wanted to hand-code a string containing a multiplication sign ('×') immediately followed by a Euro sign ('€'). Such a user might mistakenly type
> 
> #    char[] s = "\xD7\x80";
> 
> since 0xD7 is the WINDOWS-1252 codepoint for '×', and 0x80 is the WINDOWS-1252 codepoint for '€'. This /does/ compile under present rules - but results in s containing the single Unicode character U+05C0 (Hebrew punctuation PASEQ). This is not what the user was expecting, and it results entirely from the system interpreting the \x bytes as UTF-8 when they were never meant as UTF-8. If the user had been required to instead type
> 
> #    char[] s = "\u00D7\u20AC";
> 
> then they would have been protected from that error.

I see....
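
(For what it's worth, that reinterpretation is easy to reproduce. A minimal sketch, decoding the bytes with std.utf; the .dup is just to get a mutable char[]:)

    import std.stdio;
    import std.utf;

    void main()
    {
        // The bytes a WINDOWS-1252 user might type, taken as UTF-8:
        char[] s = "\xD7\x80".dup;
        size_t i = 0;
        writefln("U+%04X", cast(uint) decode(s, i));  // one character: U+05C0

        // What was actually meant: '×' then '€', encoded as UTF-8
        char[] t = "\u00D7\u20AC".dup;
        writefln("%s bytes", t.length);               // 5 bytes (2 + 3)
    }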

> POSSIBILITY TWO:
> 
> Suppose a user decides to hand-code UTF-8 on purpose /and gets it wrong/. As in:
> 
> #    char[] s = "\xE2\x82\x8C";  // whoops - should be "\xE2\x82\xAC"
> 
> 
> who's going to notice? The compiler? Not in this case. Again, if the user had been required instead to type:
> 
> #    char[] s = "\u20AC";
> 
> then they would have been protected from that error.

How are these less typo-prone?  Even if I did that, how would the compiler know that I didn't mean to type

    char[] s = "\u208C";

?  As I started to say, people who choose to hand-code UTF-8, any other encoding, or even raw codepoints should know what they're doing and that they have to be careful.
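
(After all, the "whoops" bytes in POSSIBILITY TWO are themselves perfectly well-formed UTF-8 for that very codepoint, so neither the compiler nor a runtime validation pass has anything to object to. A quick check, again with std.utf:)

    import std.stdio;
    import std.utf;

    void main()
    {
        char[] typo = "\xE2\x82\x8C".dup;  // the mistyped sequence
        validate(typo);                    // no exception: it's valid UTF-8
        size_t i = 0;
        writefln("U+%04X", cast(uint) decode(typo, i));  // prints U+208C
    }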

<snip>
>> Maybe you're right.  The trouble is that you have some points I don't really agree with, like that writing strings in UTF-8 or UTF-16 should be illegal.
> 
> Not illegal - I'm only asking for a b prefix. As in b"...".
> 
>> And some of my thoughts are:
>> 
>> - it should remain straightforward to interface legacy APIs, whatever character sets/encodings they may rely on
> 
> Yes, absolutely. That would be one use of b"...".

I see....

>> Maybe we can come to a best of all three worlds.
> 
> Of course we can. We're good. I'm not "making a stand" or "taking a position". I'm open to persuasion. I can be persuaded to change my mind. Assuming that's also true of you, it should just be a matter of logicking out all the pros and cons, and then letting Walter be the judge.

Good idea.

Stewart.
October 06, 2004
In article <ck0gia$239i$1@digitaldaemon.com>, Stewart Gordon says...

>> POSSIBILITY TWO:
>> 
>> Suppose a user decides to hand-code UTF-8 on purpose /and gets it wrong/. As in:
>> 
>> #    char[] s = "\xE2\x82\x8C";  // whoops - should be "\xE2\x82\xAC"
>> 
>> 
>> who's going to notice? The compiler? Not in this case. Again, if the user had been required instead to type:
>> 
>> #    char[] s = "\u20AC";
>> 
>> then they would have been protected from that error.
>
>How are these less typo-prone?  Even if I did that, how would the compiler know that I didn't mean to type
>
>     char[] s = "\u208C";

True enough. But I guess what I was trying to say is that \u20AC is something that any /human/ can look up in Unicode code charts. By contrast, you are unlikely to find a convenient lookup table anywhere on the web that will let you look up \xE2\x82\xAC. So the \u version is just more maintainable, and the \x version more obfuscated. I suppose I'm saying that if a project is maintained by more than one person, an error involving \u is more likely to be spotted by someone else in the team than an error involving a sequence of \x's.
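
To put a number on the obfuscation: checking a three-byte \x sequence means undoing the UTF-8 bit-packing in your head. A sketch of that arithmetic (standard UTF-8 packing, nothing D-specific; encode3 is just an illustrative name, and I've left out the surrogate-range checks):

#    // Pack a codepoint in U+0800..U+FFFF into three UTF-8 bytes:
#    // 1110xxxx 10xxxxxx 10xxxxxx
#    ubyte[3] encode3(dchar c)
#    {
#        ubyte[3] b;
#        b[0] = cast(ubyte)(0xE0 | (c >> 12));          // U+20AC -> 0xE2
#        b[1] = cast(ubyte)(0x80 | ((c >> 6) & 0x3F));  // U+20AC -> 0x82
#        b[2] = cast(ubyte)(0x80 | (c & 0x3F));         // U+20AC -> 0xAC
#        return b;
#    }

Flip one hex digit in the last byte (\xAC to \x8C) and the result is still well-formed UTF-8 - just for U+208C instead of U+20AC. That's exactly why nobody downstream notices.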

Maybe that's just subjective. Maybe I'm wrong. I dunno any more.


>?  As I started to say, people who choose to hand-code UTF-8, any other encoding, or even raw codepoints should know what they're doing and that they have to be careful.

Well obviously we both agree on that one. I think we only disagree on whether or not you should need a b before the "...".

Jill

