October 01, 2004
In article <cjk0rv$oqo$1@digitaldaemon.com>, Walter says...
>
>This is a good summary of the situation. I don't know what the right answer is yet, so it's a good topic for discussion. One issue not mentioned yet is the (not that unusual) practice of stuffing arbitrary data into a string with \x. For example, in C one might create a length prefixed 'pascal style' string with:
>
>    unsigned char* p = "\x03abc";
>
>where the \x03 is not character data, but length data. I'm hesitant to say "you cannot do that in D".

You've /already/ outlawed that in D, Walter - at least, if the length is greater than 127. Try compiling this:

#    char[] p = "\x81abcdefg...";



>I also want to support things like:
>
>    byte[] a = "\x05\x89abc\xFF";
>
>where clearly a bunch of binary data is desired.

And I. That's a good plan. But that's a byte[] literal you want there, not a char[] literal. How to tell them apart, that's the problem...?



>One possibility is to use prefixes on the string literals:
>
>    "..."    // Takes its type from the context, only ascii \x allowed
>    c"..."    // char[] string literal, only ascii \x allowed
>    w"..."   // wchar[] string literal, \x not allowed
>    d"..."    // dchar[] string literal, \x not allowed
>    b"..."    // byte[] binary data string literal, \x allowed, \u and \U
>not allowed

Perfect! I'd support that wholeheartedly. It definitely seems the right way to go - except that I'd allow the following:

#              \x        \u      \U
#    ------------------------------------
#     "..."    ASCII     yes     yes
#    c"..."    ASCII     yes     yes
#    w"..."    no        yes     yes
#    d"..."    no        yes     yes
#    b"..."    yes       no      no

That is, if the prefix is omitted, Unicode escapes should still be allowed. Otherwise people will complain when

#    char[] euroSign = "\u20AC";

fails to compile.
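
To spell out what that escape would mean for each destination type (assuming the usual UTF encodings of U+20AC, the euro sign):

#    char[]  e8  = "\u20AC";    // stored as the three UTF-8 bytes E2 82 AC
#    wchar[] e16 = "\u20AC";    // stored as the single UTF-16 unit 20AC
#    dchar[] e32 = "\u20AC";    // stored as the single UTF-32 unit 000020AC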

Oh - and one other thing: for the benefit of novices and other newcomers to D, it would be nice if the error message output when something like "\xC0" fails to compile were more helpful. "invalid UTF-8" has turned out to be confusing, and "invalid ASCII" would be even worse (as it would suggest that D only does ASCII). You need an error message which lets folk know that D can do Unicode, and that "\u" followed by the four-digit Unicode codepoint is most likely what they want.

Arcane Jill



October 02, 2004
Walter wrote:
> where the \x03 is not character data, but length data. I'm hesitant to say
> "you cannot do that in D". I also want to support things like:
> 
>     byte[] a = "\x05\x89abc\xFF";
> 
> where clearly a bunch of binary data is desired.

That code won't compile.  x"" literals should be changed to produce ubyte[] instead.

For the "meat is my potatoes" crew you can allow implicitly converting x"" literals to char[]; then they can embed random data within the string to their heart's content.

> One possibility is to use prefixes on the string literals:
> 
>     "..."    // Takes its type from the context, only ascii \x allowed
>     c"..."    // char[] string literal, only ascii \x allowed
>     w"..."   // wchar[] string literal, \x not allowed
>     d"..."    // dchar[] string literal, \x not allowed
>     b"..."    // byte[] binary data string literal, \x allowed, \u and \U
> not allowed

So the language should be made more complex and even more hostile to templating because... I don't have an answer there.  Just leave \x as it is, but have it encode.  It'll be a stumbling point for transitioning users, but that's what tutorials are for.
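
That is, presumably something along these lines (just a sketch of what "encode" would mean):

    char[] s = "\xC0";    // \xC0 taken as codepoint U+00C0 and stored as the UTF-8 bytes C3 80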
October 04, 2004
In article <cjkhnf$167v$1@digitaldaemon.com>, Arcane Jill says...

Here's my latest suggestion, only very slightly modified from the last one. The essential difference is the behavior of the default ("...") string type.

For string literals with a prefix, the following rules should apply:

#    Literal   \x        \u      \U     type
#    -------------------------------------------
#    a"..."    ASCII     no      no     char[]
#    c"..."    ASCII     yes     yes    char[]
#    w"..."    no        yes     yes    wchar[]
#    d"..."    no        yes     yes    dchar[]
#    b"..."    yes       no      no     ubyte[]

(Note the addition of the a"..." type. We don't actually need this, but it makes the table below make more sense).

For string literals /without/ a prefix, the rules should be determined by context, as follows:

#    Context           Treat as
#    ---------------------------
#    char[]            c"..."
#    wchar[]           w"..."
#    dchar[]           d"..."
#    ubyte[]           b"..."
#    byte[]            b"..."
#    indeterminate     a"..."
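
For example (foo() here is just some hypothetical function whose parameter doesn't pin the string type down):

#    char[]  s = "abc";          // context is char[]   => treated as c"abc"
#    ubyte[] t = "\x05\x89";     // context is ubyte[]  => treated as b"\x05\x89"
#    foo("abc");                 // no unique context   => treated as a"abc"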

Now there's only one remaining problem: should /both/ of the following lines compile, or should one of them need a cast? (Or, put another way, should byte[] and ubyte[] be mutually implicitly convertible?)

#    byte[] x = "\x05\x89abc\xFF";
#    ubyte[] y = "\x05\x89abc\xFF";

(by the context rules, these strings would be interpreted as b"\x05\x89abc\xFF")

Arcane Jill

PS. Of course, this sort of thing (defining consecutive bytes of mixed data) is simple to do in assembler, so it ought to be simple to do in a close-to-the-processor language like C:

#    db 5, 89h, "abc", 0FFh
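
Under the b"..." prefix proposed above, the D equivalent would presumably be just as terse:

#    ubyte[] a = b"\x05\x89abc\xFF";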



October 04, 2004
In article <cjqrt4$1vbg$1@digitaldaemon.com>, Arcane Jill says...

>PS. Of course, this sort of thing (defining consecutive bytes of mixed data) is simple to do in assembler, so it ought to be simple to do in a close-to-the-processor language like C:

Er - make that D.

>#    db 5, 89h, "abc", 0FFh



October 04, 2004
Arcane Jill wrote:
<snip>
> Tricky. Well, if "\uD800" were allowed, for example, it would open up the door
> to the very nasty possibility that "\U0000D800" might also be allowed - and that
> /definitely/ shouldn't be, because it's not a valid character. I dunno - I just
> think it could be very confusing and counterintuitive.
> 
> Everyone expects \u and \U to prefix /characters/, not fragments of some
> encoding scheme. Why would you want it any different?

Because that's how \x is, and at the moment we have an equivalent for UTF-16 according to the spec, but not according to the compiler.

>>Agreed from the start.  Nobody suggested that anyone should have to learn UTF-16.  Nor that being _allowed_ to use UTF-16 and being _allowed_ to use UTF-32 (and hence actual codepoints) should be mutually exclusive.  I thought that was half the spirit of having both \u and \U - to give the programmer the choice.
> 
> In other languages, \u#### is identical in meaning to \U0000####. Why make D
> different?

Are languages defined by the spec or the compiler?  I'd have thought the spec, in which case D is already different.

> If you want to hand-code UTF-16, you can always do it like this:
> 
> #    wchar[] s = [ 0xD800, 0xDC00 ];

If D's meant to be consistent, then I should also be able to do

    wchar[] s = "\uD800\uDC00";

or even

    dchar[] s = "\uD800\uDC00";

or

    char[] s = "\uD800\uDC00";

or the same with the 'u' replaced by some other letter defined with these semantics.
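
(For reference, D800 DC00 is the surrogate pair that UTF-16 uses for codepoint U+10000, so the codepoint form of that same string would be

    dchar[] s = "\U00010000";

and likewise for the other destination types.)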

<snip>
>>- invent another escape to represent a UTF-16 fragment, for the sake of completeness.
> 
> I'm not sure you need an escape sequence. Why not just replace
> 
> #    wchar c = '\uD800';
> 
> with
> 
> #    wchar c = 0xD800;

Because that doesn't enable whole string literals to be done like this.  And because it would be inconsistent to allow string literals to be encoded in anything except UTF-16.

> What requirement is there to hand-code UTF-8 /inside a string literal/? If you
> get it right, then you might just as well have used \U and the codepoint; if you
> get it wrong, you're buggered. 

The same as there was when the subject was first brought up.

Stewart.
October 04, 2004
Walter wrote:

<snip>
> One possibility is to use prefixes on the string literals:
> 
>     "..."    // Takes its type from the context, only ascii \x allowed
>     c"..."    // char[] string literal, only ascii \x allowed
>     w"..."   // wchar[] string literal, \x not allowed
>     d"..."    // dchar[] string literal, \x not allowed
>     b"..."    // byte[] binary data string literal, \x allowed, \u and \U
> not allowed

I thought it was by design that string literals take their type from the context, and that they could be initialised by UTF-8, UTF-16 or UTF-32 regardless of the destination type.  As such, it would be a step backward to restrict what can go in a "...".

This leaves two things to do:

1. Stop string literals from jumping to the wrong conclusions.  At parse time, a StringLiteral is just a StringLiteral, right?  It has no specific type yet.  On seeing a subexpression

    StringLiteral ~ StringLiteral

the compiler would concatenate the strings there and then, and the result would still be a StringLiteral.

Only when a specific type is required, e.g.

- assignment
- concatenation with a string of known type
- passing to a function or template
- an explicit cast

would the semantic analysis turn it into a char[], wchar[] or dchar[]. The spec would need to be clear about what happens if a given function/template name is overloaded to take different string types, and of course the type it would have in a variadic function.
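
For instance (a rough sketch, with f() standing in for some overloaded or variadic function):

    wchar[] w;
    char[]  a = "foo" ~ "bar";    // untyped ~ untyped: fold now, take char[] from the assignment
    wchar[] b = w ~ "bar";        // concatenation with a wchar[]: the literal becomes wchar[]
    f("baz");                     // f() overloaded or variadic: this is where the spec must decide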

Other things we would need to be careful about:

(a) whether an expression like

    "£10"[1..3]

should be allowed, and what it should do

(b) when we get array arithmetic, whether it should be allowed on strings.  String literals and strings of known type could be considered separately in this debate.

Of course, c"..." et al could still be invented, if only as syntactic sugar for cast(char[]) "..."....

2. Clean up the issue of whether arbitrary UTF-16 fragments are allowed by \u escapes - get the compiler matching the spec or vice versa.  If we don't allow them, then if only for the sake of completeness, we should invent a new escape to denote UTF-16 fragments.

Stewart.
October 04, 2004
In article <cjr5no$28em$1@digitaldaemon.com>, Stewart Gordon says...

>> What requirement is there to hand-code UTF-8 /inside a string literal/? If you get it right, then you might just as well have used \U and the codepoint; if you get it wrong, you're buggered.
>
>The same as there was when the subject was first brought up.

Which is what, exactly? Why would anyone need to hand-code UTF-8 inside a string literal? Or UTF-16 for that matter?

I can't think of any circumstance in which "\x##\x##\x##\x##" would be preferable to "\U00######". Can you?

Jill


October 04, 2004
In article <cjr7ar$29j8$1@digitaldaemon.com>, Stewart Gordon says...
>
>Walter wrote:
>
><snip>
>> One possibility is to use prefixes on the string literals:
>> 
>>     "..."    // Takes its type from the context, only ascii \x allowed
>>     c"..."    // char[] string literal, only ascii \x allowed
>>     w"..."   // wchar[] string literal, \x not allowed
>>     d"..."    // dchar[] string literal, \x not allowed
>>     b"..."    // byte[] binary data string literal, \x allowed, \u and \U
>> not allowed
>
>I thought it was by design that string literals take their type from the context, and that they could be initialised by UTF-8, UTF-16 or UTF-32 regardless of the destination type.

Opinions differ on that viewpoint.

Currently, strings /cannot/ be initialized with UTF-16, and I believe that behavior to be correct. (You believe it to be incorrect, I know. Like I said, opinions differ).

I don't believe that allowing them to be initialised with UTF-8 is a good idea either. It's dumb.

Regardless of the destination type!



>As such, it would be a step backward to restrict what can go in a "...".

I posted an update this morning with better rules, which I think would keep everyone happy. Well - everyone apart from those who want to explicitly stick UTF-8 and UTF-16 into char[], wchar[] and dchar[] anyway.



>This leaves two things to do:
>
>1. Stop string literals from jumping to the wrong conclusions.

Prefixes do that nicely. In the absence of prefixes, you can often figure it
out. But /sometimes/ you just can't. For instance: f("..."), where f() is
overloaded.
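
That is, given something like:

#    void f(char[] s);
#    void f(wchar[] s);
#
#    f("hello");    // no prefix, and no unique type from context - which overload?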

>The spec would need to be clear about what happens if a given function/template name is overloaded to take different string types, and of course the type it would have in a variadic function.

Yeah. The algorithm I posted this morning (which is almost the same as Walter's)
covers that nicely.


>Other things we would need to be careful about:
>
>(a) whether an expression like
>
>     "£10"[1..3]
>
>should be allowed, and what it should do

An interesting conundrum!


>2. Clean up the issue of whether arbitrary UTF-16 fragments are allowed by \u escapes - get the compiler matching the spec or vice versa.  If we don't allow them, then if only for the sake of completeness, we should invent a new escape to denote UTF-16 fragments.

Disagree. I still think that encoding-by-hand inside a string literal is a dumb idea - both for UTF-8 and for UTF-16. What on Earth is wrong with \u#### and \U00###### (where #### and ###### are just Unicode codepoints in hex)?

Walter suggests (and I agree with him) that \x should be for inserting arbitrary binary data into binary strings, and ASCII characters into text strings. I see no /point/ in defining \x to be UTF-8 ... unless of course you want to enter an obfuscated D contest with code like this:

#    wchar c = '\xE2\x82\xAC';    // currently legal

instead of either of:

#    wchar c = '\u20AC';
#    wchar c = '€';

It's just crazy.

Jill


October 04, 2004
Arcane Jill wrote:
> In article <cjr5no$28em$1@digitaldaemon.com>, Stewart Gordon says...
<snip>
>> The same as there was when the subject was first brought up.
> 
> Which is what, exactly? Why would anyone need to hand-code UTF-8 inside a string
> literal? Or UTF-16 for that matter?

Sorry.  Maybe I got mixed up and was thinking of the talked-to-death issue of initialising a char[] with arbitrary byte values.

> I can't think of any circumstance in which "\x##\x##\x##\x##" would be
> preferable to "\U00######". Can you?

Maybe it isn't in general.  But I suppose if you're doing file I/O and want your code to be self-documenting to the extent of indicating how many actual bytes are being transferred....
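
Something like this, say (just a sketch - the point being that the byte count is visible at a glance):

    char[] bom = "\xEF\xBB\xBF";    // exactly three bytes: the UTF-8 encoding of U+FEFF
    char[] alt = "\uFEFF";          // the same data, but the byte count isn't obvious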

Stewart.
October 04, 2004
Stewart Gordon wrote:
> I thought it was by design that string literals take their type from the context

Eh, if we're going that far, what prevents resolving overloaded function calls based on their return type? Then we could even have a sensible opCast.
Sorry to hijack the thread, but I had never quite realised this feature existed.
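
Something like this would presumably become possible (a sketch; at the moment a type can declare only one opCast, precisely because overloading on the return type alone isn't allowed):

    struct S
    {
        int opCast() { return 42; }
        // a second opCast returning, say, char[] is currently rejected,
        // which is what makes opCast of such limited use
    }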

-ben