October 05, 2004
In article <cjrd3d$2cep$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <cjr7ar$29j8$1@digitaldaemon.com>, Stewart Gordon says...

>>Other thing we would need to be careful about:
>>
>>(a) whether an expression like
>>
>>     "£10"[1..3]
>>
>>should be allowed, and what it should do
>
>An interesting conundrum!

That got me thinking a lot. Finally, it occurred to me that if the type of "..." cannot be determined by context then it should be a syntax error. What's more, non-ASCII characters shouldn't be allowed in a string constant of unknown type. So, in effect, I'm saying:

#    c"£10"[1..3];   // results in c"£1"
#    w"£10"[1..3];   // results in w"£10"

And the complete set of rules should now be:

For string literals with a prefix, the following rules should apply:

#    Literal   /x        /u      /U     non-ASCII   type
#    -------------------------------------------------------
#    c"..."    ASCII     yes     yes    yes         char[]
#    w"..."    no        yes     yes    yes         wchar[]
#    d"..."    no        yes     yes    yes         dchar[]
#    b"..."    yes       no      no     no          ubyte[]

Note the extra column, "non-ASCII". What this means is that statements like

#    ubyte[] x = b"€100";

must be forbidden, because '€' is a non-ASCII character, and its representation
is unspecified. Is it UTF-8 ("\xE2\x82\xAC")? Is it UTF-16BE ("\x20\xAC")?
UTF-16LE ("\xAC\x20")? Is it WINDOWS-1252 ("\x80")? You see the problem.

For string literals /without/ a prefix, the rules should be determined by context, as follows:

#    Context           Treat as
#    -------------------------------------
#    char[]            c"..."
#    wchar[]           w"..."
#    dchar[]           d"..."
#    ubyte[]           b"..."
#    byte[]            b"..."
#    indeterminate     compile-time error

I believe that these rules would lead to a resolution of Stewart's conundrum.

#    "£10"[1..3];    // illegal - type of "..." not known

String literals will now be consistent, logical, and sufficiently powerful for Walter's purposes. The distinction between byte[] and ubyte[] would then be the only remaining problem.
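For comparison, here is a minimal sketch of the same conundrum in present-day D, which ended up with postfix c/w/d suffixes rather than the prefixes proposed here (so treat the syntax as illustrative, not as this proposal):

#    void main()
#    {
#        string  s8  = "£10"c;    // UTF-8:  '£' is the two code units C2 A3
#        wstring s16 = "£10"w;    // UTF-16: '£' is the single code unit 00A3
#
#        assert(s8.length  == 4);
#        assert(s16.length == 3);
#
#        assert(s8[1] == 0xA3);           // [1..3] would cut '£' in half
#        assert(s16[1 .. 3] == "10"w);    // [1..3] gives "10"
#    }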

Arcane Jill


October 05, 2004
In article <cjrq9c$uea$1@digitaldaemon.com>, Benjamin Herr says...
>
>Stewart Gordon wrote:
>> I thought it was by design that string literals take their type from the context
>
>Eh, if we are that far, what prevents resolving overloaded function
>calls based on their return type? Then we could even have a sensible opCast.
>Sorry to hijack the thread, but I never quite realised this feature.
>
>-ben

As I understand it, the context-detection for string literals is very, very crude - basically limited to assignment statements of the form:

#    T s = "literal";

although it ought to be fairly easily extended to:

#    T s = /*stuff*/ ~ "literal" ~ /*stuff*/;

It's not a full context analysis - and even if it were, it would be an /exception/ to D's overall context free grammar, not the rule. The language may be able to withstand one or two very simple exceptions, but that is a far cry from generically being able to tell the context of everything in all circumstances. The latter would (I suspect) require a complete language rewrite.

Arcane Jill

PS. I want a sensible opCast() too - but that's for another thread.


October 05, 2004
Arcane Jill wrote:
<snip>
> Currently, strings /cannot/ be initialized with UTF-16, and I believe that
> behavior to be correct. (You believe it to be incorrect, I know. Like I said,
> opinions differ).
> 
> I don't believe that allowing them to be initialised with UTF-8 is a good idea
> either. It's dumb.
> 
> Regardless of the destination type! 

If people choose to initialise a string with UTF-8 or UTF-16, then it automatically follows that they should know what they're doing.  What would we gain by stopping these people?

<snip>
> I posted an update this morning with better rules, which I think would keep
> everyone happy. Well - everyone apart from those who want to explicitly stick
> UTF-8 and UTF-16 into char[], wchar[] and dchar[] anyway.

Hmm......

>>This leaves two things to do:
>>
>>1. Stop string literals from jumping to the wrong conclusions.
> 
> Prefixes do that nicely. In the absense of prefixes, you can often figure it
> out. But /sometimes/ you just can't. For instance: f("..."), where f() is
> overloaded.

But they take away the convenience Walter has gone to the trouble to create, of which this is just one example:

http://www.digitalmars.com/d/ctod.html#ascii

<snip>
>>2. Clean up the issue of whether arbitrary UTF-16 fragments are allowed by \u escapes - get the compiler matching the spec or vice versa.  If we don't allow them, then if only for the sake of completeness, we should invent a new escape to denote UTF-16 fragments.
> 
> Disagree. I still think that encoding-by-hand inside a string literal is a dumb
> idea - both for UTF-8 and for UTF-16. What on Earth is wrong with \u#### and
> \U00###### (where #### and ###### are just Unicode codepoints in hex)?

Nothing - these are perfectly valid at the moment, and remain perfectly valid whether \u is interpreted as codepoints or UTF-16 fragments.

> Walter suggests (and I agree with him) that \x should be for inserting arbitrary
> binary data into binary strings, and ASCII characters into text strings. I see
> no /point/ in defining \x to be UTF-8 ... unless of course you want to enter an
> obfuscated D contest with code like this:
> 
> #    wchar c = '\xE2\x82\xAC';    // currently legal
> 
> instead of either of:
> 
> #    wchar c = '\u20AC';
> #    wchar c = '€';
> 
> It's just crazy.

Maybe.  But there's method in some people's madness.

Stewart.
October 05, 2004
Arcane Jill wrote:

> In article <cjrq9c$uea$1@digitaldaemon.com>, Benjamin Herr says...
> 
>>Stewart Gordon wrote:
>>
>>>I thought it was by design that string literals take their type from the context
>>
>>Eh, if we are that far, what prevents resolving overloaded function calls based on their return type? Then we could even have a sensible opCast.
>>Sorry to hijack the thread, but I never quite realised this feature.

There would be a lot more cases to deal with; it would probably be complex to implement, and it could get utterly confusing to determine which path the types are going through in a complex expression.

Simple type resolution of string literals is, OTOH, a relatively simple feature.

>>-ben
> 
> 
> As I understand it, the context-detection for string literals is very, very
> crude - basically limited to assignment statements of the form:
> 
> #    T s = "literal";
> 
> although it ought to be fairly easily extended to:
> 
> #    T s = /*stuff*/ ~ "literal" ~ /*stuff*/;
> 
> It's not a full context analysis - and even if it were, it would be an
> /exception/ to D's overall context free grammar, not the rule.
<snip>

How would it destroy CFG?  Type resolution and expression simplification are part of semantic analysis.

The mechanism would be simple and unambiguous to this extent:

literal ~ literal => literal (concatenated at compile time)
char[]  ~ literal => char[]
wchar[] ~ literal => wchar[]
dchar[] ~ literal => dchar[]
literal ~ char[]  => char[]
literal ~ wchar[] => wchar[]
literal ~ dchar[] => dchar[]

char[] ~ wchar[] => invalid?

char[]  = literal => char[]
wchar[] = literal => wchar[]
dchar[] = literal => dchar[]
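(For reference, a hedged sketch of the cases present-day D already handles this way; the mixed char[]/wchar[] lines are the open question and are not shown:)

#    void main()
#    {
#        string  s = "abc" ~ "def";   // literal ~ literal: folded into one string
#        wstring w = "abcdef";        // unprefixed literal takes its type from the context
#        assert(s == "abcdef");
#        assert(w == "abcdef"w);
#    }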

Stewart.
October 05, 2004
Arcane Jill wrote:

<snip>
> That got me thinking a lot. Finally, it occurred to be that if the type of "..."
> cannot be determined by context then it should be a syntax error.

Making it a _syntax_ error while retaining CFG might be tricky, if not impossible.  Perhaps better would be to make it an error at the semantic analysis level.

> What's more, non-ASCII characters shouldn't be allowed in a string constant of unknown type.
> So, in effect, I'm saying:
> 
> #    c"£10"[1..3];   // results in c"£1"
> #    w"£10"[1..3];   // results in w"£10"

You seem to've got your indexing mixed up.  Surely it would be c"\xA31" and w"10"?

> And the complete set of rules should now be:
> 
> For string literals with a prefix, the following rules should apply:
> 
> #    Literal   /x        /u      /U     non-ASCII   type
> #    -------------------------------------------------------
> #    c"..."    ASCII     yes     yes    yes         char[]
> #    w"..."    no        yes     yes    yes         wchar[]
> #    d"..."    no        yes     yes    yes         dchar[]
> #    b"..."    yes       no      no     no          ubyte[]

Where did those forward slashes come from?

> Note the extra column, "non-ASCII". What this means is that statements like
> 
> #    ubyte[] x = b"€100";
> 
> must be forbidden, because '€' is a non-ASCII character, and its representation
> is unspecified. Is it UTF-8 ("\xE2\x82\xAC")? Is it UTF-16BE ("\x20\xAC")?
> UTF-16LE ("\xAC\x20")? Is it WINDOWS-1252 ("\x80")? You see the problem.

Indeed, if we had b"..." then this would be the case.

> For string literals /without/ a prefix, the rules should be determined by
> context, as follows:
> 
> #    Context           Treat as
> #    -------------------------------------
> #    char[]            c"..."
> #    wchar[]           w"..."
> #    dchar[]           d"..."
> #    ubyte[]           b"..."
> #    byte[]            b"..."
> #    indeterminate     compile-time error
> 
> I believe that these rules would lead to a resolution of Stewart's conundrum.
<snip>

Depends on what you mean by "treat as".  My thought is that it is simplest to keep translation of string escapes at the lexical level.  Of course, the prefix (or lack thereof) would be carried forward to the SA.  This of course would retain the current 'anything goes' approach for unprefixed literals, but also retains the flexibility that some of us want or even need.

Of course, if the context is this very conundrum, then it would indeed be a compile-time error, at least if it contains anything non-ASCII at all.

Stewart.
October 05, 2004
In article <cjtufp$2rjd$1@digitaldaemon.com>, Stewart Gordon says...
>
>> #    c"£10"[1..3];   // results in c"£1"
>> #    w"£10"[1..3];   // results in w"£10"
>
>You seem to've got your indexing mixed up.  Surely it would be c"\xA31" and w"10"?

Whoops. My brain was seeing [0..2] for some reason. Of course it should be:

#    c"£10"[1..3];    // c"\xA31"
#    w"£10"[1..3];    // array bounds exception



>> #    Literal   /x        /u      /U     non-ASCII   type
>Where did those forward slashes come from?

Errr. Typo. Read as backslash.


>Indeed, if we had b"..." then this would be the case.

That was Walter's idea, but I like it.


>Depends on what you mean by "treat as".  My thought is that it is simplest to keep translation of string escapes at the lexical level.  Of course, the prefix (or lack thereof) would be carried forward to the SA.

You're starting to lose me. Compiler front ends are not my strong point. But, by "treat as", I meant, allow the same escape sequences. I couldn't tell you whether or not this would be feasible for a D compiler, but it's a suggestion to which Walter knows the answer, so it didn't seem unreasonable to suggest it.



>  This of course would retain the current 'anything goes' approach for
>unprefixed literals, but also retains the flexibility that some of us want or even need.

Need?

You have yet to provide a convincing argument as to why you (or, indeed, anyone) would /need/ the syntax:

#    wchar[] s = "\xE2\x82\xAC";

which doesn't even make sense (unless you DEFINE \x as meaning a UTF-8 fragment, but there's just no need for that). "€" tells you you've got the Euro character; "\u20AC" tells you you've got the character U+20AC (which any code chart will tell you is the Euro character). "\xE2\x82\xAC" tells you ... what?

If you need to place arbitrary binary data into a string literal, then Walter's idea makes the most sense. He proposes:

#    byte[] s = "\x80\x81\x82";    // byte[] tells you this is binary data
#    f( b"\x80\x81\x82" );         // b"..." tells you this is binary data

There is really no way to do this if \x MUST be UTF-8.
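(If all you need is raw bytes, present-day D also lets you skip string syntax entirely; a trivial, hedged sketch:)

#    immutable ubyte[] payload = [0x80, 0x81, 0x82];   // arbitrary bytes, no text encoding involved
#
#    void main() { assert(payload[0] == 0x80); }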

Combining two posts into one here...
In article <cjtt82$2qsb$1@digitaldaemon.com>, Stewart Gordon says...

>If people choose to initialise a string with UTF-8 or UTF-16, then it automatically follows that they should know what they're doing.  What would we gain by stopping these people?

Defining rules for string literals doesn't stop anyone doing anything. Look at it this way - I'm saying that the following line would be outlawed:

#    char[] s = "\xE2\xA2\xAC";    // illegal under proposed new rules

but of course we still allow this:

#    char[] s = "\u20AC";          // perfectly ok

but there would still be nothing - absolutely nothing - to stop you from initialising a string with UTF-8, by doing any of:

#    byte[] s = "\xE2\xA2\xAC";
#    char[] s = [ 0xE2, 0xA2, 0xAC ];
#    char[] s = cast(char[]) b"\xE2\xA2\xAC";

if you really wanted to. It seems to me to be the best of both worlds. People who really know what they're doing and who /insist/ on hand-coding their UTF-8 will still be allowed to do so, at the cost of a slightly more involved syntax, but everyone else will be protected from silly mistakes. And I speak as someone who /can/ hand-code UTF-8, and /still/ wants to be protected from doing it accidentally.


>But they take away the convenience Walter has gone to the trouble to create, of which this is just one example:
>
>http://www.digitalmars.com/d/ctod.html#ascii

You mean this?

>The D Way
>The type of a string is determined by semantic analysis,
>so there is no need to wrap strings in a macro call:
>
>    char[] foo_ascii = "hello";        // string is taken to be ascii
>    wchar[] foo_wchar = "hello";       // string is taken to be wchar

Under the proposed new rules, that example would still work exactly as documented.

Everything would still work, Stewart - everything that you would ever (sensibly) want to do - PLUS you'd get arbitrary binary strings thrown in as a bonus - AND you could still hand-code whatever you wanted (with the inherent risk of getting it wrong, obviously) with a small amount of extra typing. It's a no-lose situation.


>>What on Earth is wrong with \u#### and
>> \U00###### (where #### and ###### are just Unicode codepoints in hex)?
>
>Nothing - these are perfectly valid at the moment, and remain perfectly valid whether \u is interpreted as codepoints or UTF-16 fragments.

"\uD800" won't compile, so I'd hardly call it "perfectly valid at the moment".

Under the proposed new rules, however, you /would/ be able to do any of the following:

#    wchar c = 0xD800;                        // UTF-16
#    wchar[] s = cast(wchar[]) b"\x00\xD8";   // UTF-16LE
#    wchar[] s = cast(wchar[]) b"\xD8\x00";   // UTF-16BE

or even

#    wchar[] s = "string with a " ~ cast(wchar)0xD800 ~ " in it";

I don't see a problem with this. If you want to create invalid strings, you should not be surprised if you need a bit of casting.
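(As it turns out, present-day D behaves roughly along these lines; a hedged sketch reflecting my understanding of the current compiler rather than anything in this proposal:)

#    void main()
#    {
#        wchar c = 0xD800;              // fine: just a 16-bit value
#        // wchar d = '\uD800';         // rejected: \u must denote a valid codepoint
#        wchar[] s = ['h', 'i', c];     // an ill-formed UTF-16 sequence, built explicitly
#        assert(s.length == 3);
#    }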



>> ... unless of course you want to enter an
>> obfuscated D contest with code like this:
>> 
>> #    wchar c = '\xE2\x82\xAC';    // currently legal
>> 
>> instead of either of:
>> 
>> #    wchar c = '\u20AC';
>> #    wchar c = '€';
>> 
>> It's just crazy.
>
>Maybe.  But there's method in some people's madness.

But is there sufficient method in:

#    wchar c = '\xE2\x82\xAC';

(which leaves c containing the value 0x20AC) to justify its current status as legal D? I mean, what methodic madness makes the above line better than any of:

#    wchar c = '\u20AC';
#    wchar c = 0x20AC;
#    wchar c = '€';

..really?
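(Indeed, all three spellings produce exactly the same value; a quick, hedged check in present-day D:)

#    void main()
#    {
#        wchar a = '\u20AC';
#        wchar b = 0x20AC;
#        wchar c = '€';
#        assert(a == b && b == c);
#    }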

Providing that D can do a reasonable job of auto-detecting the type of "..." (and '...') in the most common circumstances (which it can), the rest just becomes logical. I mean - we have an /opportunity/ here - an opportunity to make D better; to make strings do the /obvious/, /intuitive/ thing, to allow arbitrary binary strings (which we don't have currently), and all this /without/ preventing programmers from doing "under the hood" stuff if they really want to. Let us seize this opportunity while we can. It's not often that such an opportunity arises that /actually has Walter's backing/ (and was actually Walter's idea). We should be uniting on this one, not arguing with each other (although it is good that differing arguments get thrashed out).

Arcane Jill


October 05, 2004
Arcane Jill wrote:

<snip>
> Whoops. My brain was seeing [0..2] for some reason. Of course it should be:
> 
> #    c"£10"[1..3];    // c"\xA31"
> #    w"£10"[1..3];    // array bounds exception

Wrong again.  "£10" is three wchars: '£', '1', '0'.  w"10" is correct.

<snip>
>>Depends on what you mean by "treat as".  My thought is that it is simplest to keep translation of string escapes at the lexical level.  Of course, the prefix (or lack thereof) would be carried forward to the SA. 
> 
> You're starting to lose me. Compiler front ends are not my strong point. But, by
> "treat as", I meant, allow the same escape sequences. I couldn't tell you
> whether or not this would be feasible for a D compiler, but it's a suggestion to
> which Walter knows the answer, so it didn't seem unreasonable to suggest it.

It would somewhat complicate the lexical analysis with little or no real benefit, which is one of the reasons I suggest allowing \x, \u or \U equally in unprefixed literals.

>> This of course would retain the current 'anything goes' approach for unprefixed literals, but also retains the flexibility that some of us want or even need.
> 
> Need?

As I said before, need to be able to interface foreign APIs relying on non-Unicode character sets/encodings.

<snip>
> if you really wanted to. It seems to me to be the best of both worlds. People
> who really know what they're doing and who /insist/ on hand-coding their UTF-8
> will still be allowed to do so, at the cost of a slightly more involved syntax,
> but everyone else will be protected from silly mistakes. And I speak as someone
> who /can/ hand-code UTF-8, and /still/ wants to be protected from doing it
> accidently.

What are the ill consequences of hand-coding UTF-8, from which one would need to be protected?

>>But they take away the convenience Walter has gone to the trouble to create, of which this is just one example:
>>
>>http://www.digitalmars.com/d/ctod.html#ascii
> 
> You mean this?
> 
>>The D Way
>>The type of a string is determined by semantic analysis,
>>so there is no need to wrap strings in a macro call:
>>
>>   char[] foo_ascii = "hello";        // string is taken to be ascii
>>   wchar[] foo_wchar = "hello";       // string is taken to be wchar

That's indeed the only bit of D code in that section of the page.

<snip>
>>>What on Earth is wrong with \u#### and
>>>\U00###### (where #### and ###### are just Unicode codepoints in hex)?
>>
>>Nothing - these are perfectly valid at the moment, and remain perfectly valid whether \u is interpreted as codepoints or UTF-16 fragments.
> 
> "\uD800" won't compile, so I'd hardly call it "perfectly valid at the moment".

I meant that, _if_ #### happens to be a Unicode codepoint, as you were saying, \u#### would be equally legal whether \u is defined to take a Unicode codepoint or a UTF-16 fragment.

<snip>
> But is there sufficient method in:
> 
> #    wchar c = '\xE2\x82\xAC';
> 
> (which leaves c containing the value 0x20AC) to justify its current status as
> legal D. I mean, what methodic madness makes the above line better than any of:
> 
> #    wchar c = '\u20AC';
> #    wchar c = 0x20AC;
> #    wchar c = '€';
> 
> ..really?

Using DMD as a UTF conversion tool, perhaps?  :-)

Maybe someone can come up with a variety of uses for this.

> Providing that D can do a reasonable job of auto-detecting the type of "..."
> (and '...') in the most common circumstances (which it can), the rest just
> becomes logical. I mean - we have an /opportunity/ here - an opportunity to make
> D better; to make strings to the /obvious/, /intuitive/ thing, to allow
> arbitrary binary strings (which we don't have currently), and all this /without/
> preventing programmers from doing "under the hood" stuff if they really want to.
> Let us seize this opportunity while we can. It's not often that such an
> opportunity arises that /actually has Walter's backing/ (and was actually
> Walter's idea). We should be uniting on this one not arguing with each other
> (although it is good that differing arguments get thrashed out).

Maybe you're right.  The trouble is that you have some points I don't really agree with, like that writing strings in UTF-8 or UTF-16 should be illegal.  And some of my thoughts are:

- it should remain straightforward to interface legacy APIs, whatever character sets/encodings they may rely on
- the implementation should remain simple, complete with context-free lexical analysis

But at least we seem to agree that:
- string literals should remain specifiable in terms of actual Unicode codepoints
- prefixed string literals might come in useful one day as an alternative to unprefixed ones

Maybe we can come to a best of all three worlds.

Stewart.
October 05, 2004
In article <cjtgk8$2h40$1@digitaldaemon.com>, Arcane Jill says...
>...
>String literals will now be consistent, logical, and sufficiently powerful for
>Walter's purposes. The distinction between byte[] and ubyte[] would then be the
>only remaining problem.
>
>Arcane Jill
>
>
Why is there any problem here? The content is not different.  ubyte and byte are distinguished only when used singly (or in groups) as binary values in arithmetic evaluation.  The only difference I can see is for something like:

ubyte what = -123;  // valid
and
byte what = -123; // invalid, would be = 133 to same content

but that is a different type of initialization.



October 06, 2004
In article <cjur82$lpv$1@digitaldaemon.com>, larrycowan says...

>Why is there any problem here?

It's possible that there isn't. I don't know enough about how the compiler can sort these things out.


>The content is not different.

Well, there is a sense in which the content is different. For example:

#    byte[] x = "\x80\x90";
#    ubyte[] y = "\x80\x90";

#    assert( x[1] < 0 );
#    assert( y[1] >= 0 );
#    assert( x[1] != y[1] );

I guess I incline to the opinion that only ubyte[] should be used for binary string literals, but maybe that's not the only answer.
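(The signedness point in isolation, as a hedged sketch in present-day D:)

#    void main()
#    {
#        byte  sb = cast(byte)  0x90;   // -112 once reinterpreted as signed
#        ubyte ub = cast(ubyte) 0x90;   //  144
#        assert(sb < 0 && ub >= 0 && sb != ub);   // same bit pattern, different values
#    }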




>ubyte and
>byte are distinguished only when used singly (or grouped) as binary for
>an arithmetic evaluation.  The only difference I can see is for something
>like:
>
>ubyte what = -123;  // valid
>and
>byte what = -123; // invalid, would be = 133 to same content
>
>but that is a different type of initialization.

You mean assignment. Assignment can happen at times other than just initialization.

But I was more getting at stuff like this:

#    byte[] x = "\x80\x90";
#    ubyte[] y = x;         // A lossless conversion? Should this compile?

or

#    byte[] x = "\x80\x90";
#    ubyte[] y = "\x80\x90";
#    if (x == y)           // Again, should this compile?
#                          // And, if so, should it evaluate to true or false?

or

#    void f(byte[] s) { /*stuff*/ }
#    void f(ubyte[] s) { /*stuff*/ }
#    f( b"\x80\xC0" );     // which f gets called?

Jill


October 06, 2004
In article <cjuhhv$bn7$1@digitaldaemon.com>, Stewart Gordon says...

>Wrong again.  "£10" is three wchars: '£', '1', '0'.  w"10" is correct.

<Hangs head in shame>
What can I say? One of these days I must learn to count!



>As I said before, need to be able to interface foreign APIs relying on non-Unicode character sets/encodings.

I think you must be misunderstanding me. Fair enough - maybe I'm not that good at explaining things.

Okay, let's say you have a foreign API whose signature is something like:

#    extern(C) char * strstr(char *s1, char *s2);

Hopefully, we're all familiar with that one. Now, the following would all be legal under both Jill's rules and Stewart's rules:

#    strstr("hello", "e");
#    strstr("€100", "€");
#    strstr("€100", "\u20AC");

The following would be legal under Stewart's rules, but illegal under Jill's:

#    strstr("€100", "\xE2\x82\xAC");  // note that this will /succeed/ #                                     // - that is, return non-null

However, even under Jill's rules, you could still do:

#    strstr("€100", cast(char*) b"\xE2\x82\xAC");

if you really wanted to insist on stuffing encoding details into a literal.
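(For concreteness, the plain-literal version already works when calling the C function from present-day D. A hedged sketch - std.string.toStringz is today's name, and I declare the prototype with const parameters so D string literals can be passed directly:)

#    extern(C) char* strstr(const char* s1, const char* s2);
#
#    import std.string : toStringz;
#
#    void main()
#    {
#        auto hit = strstr(toStringz("€100"), toStringz("€"));
#        assert(hit !is null);    // the UTF-8 bytes of '€' are found as a substring
#    }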

But you said "relying on non-Unicode character sets/encodings" - so let's look at that now. Let's say we're running on a WINDOWS-1252 machine, in which character set the character '€' has codepoint 0x80. Now you'd have to write:

#    strstr("\x80100", "\x80");                            // Stewart's rules
#    strstr(cast(char*) b"\x80100", cast(char*) b"\x80");  // Jill's rules

Now here, of course, by use of those casts, you are explicitly telling the compiler "I know what I'm doing" -- which I don't think is unreasonable given that you are now hand-coding WINDOWS-1252.

Of course, I've been assuming here that the type of b"..." would be byte[] or ubyte[]. But it's equally possible that Walter might decide that the type of b"..." is actually to be char[] -- precisely so that it /can/ interface easily with foreign APIs, in which case we'd end up with:

#    strstr("\x80100", "\x80");      // Stewart's rules
#    strstr(b"\x80100", b"\x80");    // Jill's rules





>What are the ill consequences of hand-coding UTF-8, from which one would need to be protected?

POSSIBILITY ONE:

Suppose that a user accustomed to WINDOWS-1252 wanted to hand-code a
string containing a multiplication sign ('×') immediately followed by a Euro
sign ('€'). Such a user might mistakenly type

#    char[] s = "\xD7\x80";

since 0xD7 is the WINDOWS-1252 codepoint for '×', and 0x80 is the WINDOWS-1252 codepoint for '€'. This /does/ compile under present rules - but results in s containing the single Unicode character U+05C0 (Hebrew punctuation PASEQ). This is not what the user was expecting, and results entirely from the system trying to interpret \x as UTF-8 when it wasn't. If the user had been required to instead type

#    char[] s = "\u00D7\u20AC";

then they would have been protected from that error.
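(A quick, hedged check of that claim in present-day D, whose \x semantics match the current rules being criticised here:)

#    import std.utf : decode;
#
#    void main()
#    {
#        string s = "\xD7\x80";
#        size_t i = 0;
#        assert(decode(s, i) == '\u05C0');   // one PASEQ, not '×' followed by '€'
#    }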

POSSIBILITY TWO:

Suppose a user decides to hand-code UTF-8 on purpose /and gets it wrong/. As in:

#    char[] s = "\xE2\x82\x8C";  // whoops - should be "\xE2\x82\xAC"

who's going to notice? The compiler? Not in this case. Again, if the user had been required instead to type:

#    char[] s = "\u20AC";

then they would have been protected from that error.



>> Let us seize this opportunity while we can. It's not often that such an opportunity arises that /actually has Walter's backing/ (and was actually Walter's idea). We should be uniting on this one not arguing with each other (although it is good that differing arguments get thrashed out).
>
>Maybe you're right.  The trouble is that you have some points I don't really agree with, like that writing strings in UTF-8 or UTF-16 should be illegal.

Not illegal - I'm only asking for a b prefix. As in b"...".


>And some of my thoughts are:
>
>- it should remain straightforward to interface legacy APIs, whatever character sets/encodings they may rely on

Yes, absolutely. That would be one use of b"...".



>Maybe we can come to a best of all three worlds.

Of course we can. We're good. I'm not "making a stand" or "taking a position". I'm open to persuasion. I can be persuaded to change my mind. Assuming that's also true of you, it should just be a matter of logicking out all the pros and cons, and then letting Walter be the judge.

Arcane Jill