October 06, 2004
Re: What to do about \x?
Posted in reply to Arcane Jill

Arcane Jill wrote:
<snip>
> POSSIBILITY ONE:
>
> Suppose that a user accustomed to WINDOWS-1252 wanted to hand-code a
> string containing a multiplication sign ('×') immediately followed by
> a Euro sign ('€'). Such a user might mistakenly type
>
> #    char[] s = "\xD7\x80";
>
> since 0xD7 is the WINDOWS-1252 codepoint for '×', and 0x80 is the
> WINDOWS-1252 codepoint for '€'. This /does/ compile under present
> rules - but results in s containing the single Unicode character
> U+05C0 (Hebrew punctuation PASEQ). This is not what the user was
> expecting, and results entirely from the system trying to interpret
> \x as UTF-8 when it wasn't. If the user had been required to instead
> type
>
> #    char[] s = "\u00D7\u20AC";
>
> then they would have been protected from that error.

I see....

> POSSIBILITY TWO:
>
> Suppose a user decides to hand-code UTF-8 on purpose /and gets it
> wrong/. As in:
>
> #    char[] s = "\xE2\x82\x8C";    // whoops - should be "\xE2\x82\xAC"
>
> who's going to notice? The compiler? Not in this case. Again, if the
> user had been required instead to type:
>
> #    char[] s = "\u20AC";
>
> then they would have been protected from that error.

How are these less typo-prone? Even if I did that, how would it know that I didn't mean to type

	char[] s = "\u208C";

? As I started to say, people who choose to hand-code UTF-8, any other encoding or even uncoded codepoints should know what they're doing and that they have to be careful.

<snip>
>> Maybe you're right. The trouble is that you have some points I don't
>> really agree with, like that writing strings in UTF-8 or UTF-16
>> should be illegal.
>
> Not illegal - I'm only asking for a b prefix. As in b"...".
>
>> And some of my thoughts are:
>>
>> - it should remain straightforward to interface legacy APIs, whatever
>> character sets/encodings they may rely on
>
> Yes, absolutely. That would be one use of b"...".

I see....

>> Maybe we can come to a best of all three worlds.
>
> Of course we can.
> We're good. I'm not "making a stand" or "taking a position". I'm open
> to persuasion. I can be persuaded to change my mind. Assuming that's
> also true of you, it should just be a matter of logicking out all the
> pros and cons, and then letting Walter be the judge.

Good idea.

Stewart.
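The two failure modes Jill describes above can be checked at the byte level. A minimal sketch follows; it uses Python's codecs purely for illustration (the original discussion is about D string literals, not Python), since the encoding arithmetic is the same in any language:

```python
# POSSIBILITY ONE: the WINDOWS-1252 bytes for '×' (0xD7) and '€' (0x80)...
win1252_bytes = bytes([0xD7, 0x80])
assert win1252_bytes.decode("windows-1252") == "\u00D7\u20AC"  # '×€' as intended

# ...also happen to form a valid UTF-8 sequence, so a D compiler treating
# \x pairs as UTF-8 accepts them silently - as one unrelated character:
assert win1252_bytes.decode("utf-8") == "\u05C0"  # Hebrew punctuation PASEQ

# POSSIBILITY TWO: a one-byte typo in hand-coded UTF-8 still decodes
# cleanly, just to the wrong character, so no tool flags it:
assert b"\xE2\x82\xAC".decode("utf-8") == "\u20AC"  # '€', what was meant
assert b"\xE2\x82\x8C".decode("utf-8") == "\u208C"  # subscript '=', the typo
```

Both decodes succeed without error, which is exactly Jill's point: the mistake is invisible to the compiler and only surfaces at display time.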
October 06, 2004
Re: What to do about \x?
Posted in reply to Stewart Gordon

In article <ck0gia$239i$1@digitaldaemon.com>, Stewart Gordon says...
>> POSSIBILITY TWO:
>>
>> Suppose a user decides to hand-code UTF-8 on purpose /and gets it
>> wrong/. As in:
>>
>> #    char[] s = "\xE2\x82\x8C";    // whoops - should be "\xE2\x82\xAC"
>>
>> who's going to notice? The compiler? Not in this case. Again, if the
>> user had been required instead to type:
>>
>> #    char[] s = "\u20AC";
>>
>> then they would have been protected from that error.
>
> How are these less typo-prone? Even if I did that, how would it know
> that I didn't mean to type
>
> 	char[] s = "\u208C";

True enough. But I guess what I was trying to say is that \u20AC is something that any /human/ can look up in the Unicode code charts. By contrast, you are unlikely to find a convenient lookup table anywhere on the web which will let you look up \xE2\x82\xAC. So the \u version is just more maintainable, and the \x version more obfuscated.

I suppose I'm saying that if a project is maintained by more than one person, an error involving \u is more likely to be spotted by someone else on the team than an error involving a sequence of \x's. Maybe that's just subjective. Maybe I'm wrong. I dunno any more.

> ? As I started to say, people who choose to hand-code UTF-8, any other
> encoding or even uncoded codepoints should know what they're doing and
> that they have to be careful.

Well, obviously we both agree on that one. I think we only disagree on whether or not you should need a b before the "...".

Jill
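Jill's maintainability point can be made concrete: recovering a codepoint from a hand-coded 3-byte UTF-8 sequence requires bit arithmetic, whereas the \u form states the codepoint directly. A small sketch (Python for illustration; the helper function name is my own, not from the thread):

```python
def utf8_3byte_to_codepoint(b1: int, b2: int, b3: int) -> int:
    """Reassemble a 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)
    into the codepoint it encodes - the work a human reviewer must do
    mentally to audit a run of \\x escapes."""
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

# The correct Euro sign and the typo from the post differ by one byte;
# both are well-formed UTF-8, and only the reassembled codepoints
# reveal which character was actually encoded:
assert utf8_3byte_to_codepoint(0xE2, 0x82, 0xAC) == 0x20AC  # '€'
assert utf8_3byte_to_codepoint(0xE2, 0x82, 0x8C) == 0x208C  # subscript '='
```

A reviewer scanning "\u20AC" can confirm it against a code chart in one step; confirming "\xE2\x82\xAC" means redoing this arithmetic, which is where typos slip through.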
Copyright © 1999-2021 by the D Language Foundation