Proposal: clean up semantics of array literals vs string literals
Oct 02, 2012
Don Clugston
October 02, 2012
The problem
-----------

String literals in D are a little bit magical: they have a trailing \0. This means that it is possible to write,

printf("Hello, World!\n");

without including a trailing \0. This is important for compatibility with C. This trailing \0 is mentioned in the spec but only incidentally, and generally in connection with printf.

But the semantics are not well defined.

printf("Hello, W" ~ "orld!\n");

Does this have a trailing \0 ? I think it should, because it improves readability of string literals that are longer than one line. Currently DMD adds a \0, but it is not in the spec.

Now consider array literals.

printf(['H','e', 'l', 'l','o','\n']);

Does this have a trailing \0 ? Currently DMD does not put one in.
How about ['H','e', 'l', 'l','o'] ~ " World!\n"  ?

And "Hello " ~ ['W','o','r','l','d','\n']   ?

And "Hello World!" ~ '\n' ?
And  null ~ "Hello World!\n" ?

Currently DMD puts \0 in some cases but not others, and it's rather random.

The root cause is that this trailing zero is not part of the type, it's part of the literal. There are no rules for how literals are propagated inside expressions, they are just literals. This is a mess.

There is a second difference.
Array literals of char type have completely different semantics from string literals. In module scope:

char[] x = ['a'];  // OK -- array literals can have an implicit .dup
char[] y = "b";    // illegal

This is a big problem for CTFE, because for CTFE, a string is just a compile-time value, it's neither string literal nor array literal!

See bug 8660 for further details of the problems this causes.
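
A minimal sketch of the second difference, written at function scope rather than module scope and not taken from the bug report. The helper name mutableCopy is hypothetical. It only assumes that a string is immutable(char)[], so the copy that an array literal gets implicitly has to be spelled out for a string literal:

```d
// Hypothetical helper: makes explicit the copy that an array literal gets implicitly.
char[] mutableCopy(string s)
{
    return s.dup;
}

void main()
{
    char[] x = ['a'];            // OK: the array literal yields fresh mutable storage
    // char[] y = "b";           // rejected: a string is immutable(char)[]
    char[] y = mutableCopy("b"); // the explicit .dup
    x[0] = 'c';                  // both arrays are writable
    y[0] = 'd';
    assert(x == "c" && y == "d");
}
```

The two declarations differ only in the form of the literal, which is the inconsistency described above.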


A proposal to clean up this mess
--------------------------------

Any compile-time value of type immutable(char)[] or const(char)[] behaves as string literals currently do, and will have a \0 appended when it is stored in the executable.

ie,

enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
printf(hello);

will work.

Any value of type char[], which is generated at compile time, will not have the trailing \0, and it will do an implicit dup (as current array literals do).

char [] foo()
{
    return "abc";
}

char [] x = foo();

// x does not have a trailing \0, and it is implicitly duped, even though it was not declared with an array literal.

-------------------
So that the difference between string literals and char array literals would simply be that the latter are polysemous. There would be no semantics associated with the form of the literal itself.


We still have this oddity:


void foo(char qqq = 'b') {

   string x = "abc";            // trailing \0
   string y = ['a', 'b', 'c'];  // trailing \0
   string z = ['a', qqq, 'c'];  // no trailing \0
}

This is because we made the (IMHO mistaken) decision to allow variables inside array literals.
This is the reason why I listed _compile time value_ in the requirement for having a \0, rather than entirely basing it on the type.

We could fix that with a language change: an array literal which contains a variable should not be of immutable type. It should be of mutable type (or const, in the case where it contains other, immutable values).

So char [] w = ['a', qqq, 'c']; should compile (it currently doesn't, even though w is allocated on the heap).

But that's a separate proposal from the one I'm making here. I just need a decision on the main proposal so that I can fix a pile of CTFE bugs.
October 02, 2012
On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:
> The problem
> -----------
>
> String literals in D are a little bit magical: they have a trailing \0. This means that it is possible to write,
>
> printf("Hello, World!\n");
>
> without including a trailing \0. This is important for compatibility with C. This trailing \0 is mentioned in the spec but only incidentally, and generally in connection with printf.
>
> But the semantics are not well defined.
>
> printf("Hello, W" ~ "orld!\n");
>
If every string literal is \0-terminated, then there should be two \0 in the final string. I guess that's not the case, and that's actually my preferred behaviour, but the spec should make it crystal clear in which situations a string literal gets a terminator and in which it does not.

October 02, 2012
Well, the whole mess comes from the fact that D conflates C strings and D strings.

The first problem comes from the fact that D arrays are implicitly convertible to pointers. So calling a D function that expects a char* is possible with a D string, even though it is unsafe and will not work in the general case.

The fact that D provides tricks that make it work in special cases is harmful, as previous discussions have shown (many D programmers assume that this will always work because of toy tests they have made, whereas in the general case it won't and toStringz must be used).

The only sane solution I can think of is to:
 - disallow implicit conversion of slices to pointers. .ptr is made for that.
 - Do not put any trailing 0 in a string literal, unless it is specified explicitly ( "foobar\0" ).
 - Except if a const(char)* is expected from the string literal, in which case it becomes a Cstring literal, with a trailing 0. This allows uses like printf("foobar");

In other words, the receiver type is used to decide whether the compiler generates a string literal or a Cstring literal.

Other additions of 0 are just confusing, and will make incorrect code work in special cases, which is something you usually don't want. Code that works by accident often backfires in spectacular ways at the least expected moment.
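
A small sketch of the "toy test" trap, not taken from the thread. The helper name cLength is hypothetical, and the only library calls assumed are std.string.toStringz and the C strlen. A slice of a literal has no \0 right after it, so handing its .ptr to C reads past the end of the slice:

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper: the length of s as a C function would see it.
size_t cLength(string s)
{
    // toStringz yields a pointer to zero-terminated data, copying if it has to.
    return strlen(s.toStringz);
}

void main()
{
    string whole = "Hello, World!";
    string slice = whole[0 .. 5];  // "Hello"; the next byte in memory is ','

    assert(slice.length == 5);
    assert(cLength(slice) == 5);   // safe: goes through toStringz
    // Passing slice.ptr straight to C runs on into ", World!" and only
    // stops at the terminator that follows the whole literal.
    assert(strlen(slice.ptr) == whole.length);
}
```

The last assert holds only because the slice happens to point into a literal. For a string built at run time there is no such guarantee.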
October 02, 2012
On 10/2/12, Don Clugston <dac@nospam.com> wrote:
> A proposal to clean up this mess
> --------------------------------
>
> Any compile-time value of type immutable(char)[] or const(char)[]
> behaves as string literals currently do, and will have a \0 appended when
> it is stored in the executable.
>
> ie,
>
> enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
> printf(hello);
>
> will work.

What about these, will these pass?:

enum string x = "foo";
assert(x.length == 3);

void test(string x) { assert(x.length == 3); }
test(x);

If these don't pass the proposal will break code.
October 02, 2012
On 02/10/12 14:02, Andrej Mitrovic wrote:
> On 10/2/12, Don Clugston <dac@nospam.com> wrote:
>> A proposal to clean up this mess
>> --------------------------------
>>
>> Any compile-time value of type immutable(char)[] or const(char)[]
>> behaves as string literals currently do, and will have a \0 appended when
>> it is stored in the executable.
>>
>> ie,
>>
>> enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
>> printf(hello);
>>
>> will work.
>
> What about these, will these pass?:
>
> enum string x = "foo";
> assert(x.length == 3);
>
> void test(string x) { assert(x.length == 3); }
> test(x);
>
> If these don't pass the proposal will break code.

Yes, they pass. The \0 is not included in the string length. It's effectively in the data segment, not in the string.
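
A minimal sketch of that point, not part of the original reply. The function name literalHasTerminator is hypothetical. It assumes the current behaviour for plain string literals, where the \0 sits just past the end of the slice:

```d
import core.stdc.string : strlen;

// Hypothetical check: is the byte just past the end of a string literal '\0'?
bool literalHasTerminator()
{
    string s = "foo";
    // s.ptr[3] reads one byte beyond the slice; that is only valid here
    // because s refers directly to a string literal.
    return s.length == 3 && s.ptr[3] == '\0';
}

void test(string x) { assert(x.length == 3); }

void main()
{
    enum string x = "foo";
    assert(x.length == 3);       // the terminator is not counted
    test(x);
    assert(strlen(x.ptr) == 3);  // C code still finds the end of the string
    assert(literalHasTerminator());
}
```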


October 02, 2012
On 02/10/12 13:26, deadalnix wrote:
> Well, the whole mess comes from the fact that D conflates C strings and D
> strings.
>
> The first problem comes from the fact that D arrays are implicitly
> convertible to pointers. So calling a D function that expects a char* is
> possible with a D string, even though it is unsafe and will not work in
> the general case.
>
> The fact that D provides tricks that make it work in special cases is
> harmful, as previous discussions have shown (many D programmers assume
> that this will always work because of toy tests they have made, whereas
> in the general case it won't and toStringz must be used).
>
> The only sane solution I can think of is to:
>   - disallow implicit conversion of slices to pointers. .ptr is made for that.
>   - Do not put any trailing 0 in a string literal, unless it is specified
> explicitly ( "foobar\0" ).
>   - Except if a const(char)* is expected from the string literal, in which
> case it becomes a Cstring literal, with a trailing 0. This allows uses
> like printf("foobar");
>
> In other words, the receiver type is used to decide whether the compiler
> generates a string literal or a Cstring literal.

This still doesn't solve the problem of the difference between array literals and string literals (the magical implicit .dup), which is the key problem I'm trying to solve.

October 02, 2012
On 02/10/12 13:18, Tobias Pankrath wrote:
> On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:
>> The problem
>> -----------
>>
>> String literals in D are a little bit magical; they have a trailing
>> \0. This means that it is possible to write,
>>
>> printf("Hello, World!\n");
>>
>> without including a trailing \0. This is important for compatibility
>> with C. This trailing \0 is mentioned in the spec but only
>> incidentally, and generally in connection with printf.
>>
>> But the semantics are not well defined.
>>
>> printf("Hello, W" ~ "orld!\n");
>>
> If every string literal is \0-terminated, then there should be two \0 in
> the final string. I guess that's not the case, and that's actually my
> preferred behaviour, but the spec should make it crystal clear in which
> situations a string literal gets a terminator and in which it does not.

The \0 is *not* part of the string, it lies after the string.
It's as if all memory is cleared, then the string literals are copied into it, with a gap of at least one byte between each. The 'trailing 0' is not part of the literal, it's the underlying cleared memory.

At least, that's how I understand it. The spec is very vague.

October 02, 2012
2012/10/2 Don Clugston <dac@nospam.com>:
> The problem
> -----------
>
> String literals in D are a little bit magical: they have a trailing \0. This means that it is possible to write,
>
> printf("Hello, World!\n");
>
> without including a trailing \0. This is important for compatibility with C. This trailing \0 is mentioned in the spec but only incidentally, and generally in connection with printf.
>
> But the semantics are not well defined.
>
> printf("Hello, W" ~ "orld!\n");
>
> Does this have a trailing \0 ? I think it should, because it improves readability of string literals that are longer than one line. Currently DMD adds a \0, but it is not in the spec.
>
> Now consider array literals.
>
> printf(['H','e', 'l', 'l','o','\n']);
>
> Does this have a trailing \0 ? Currently DMD does not put one in.
> How about ['H','e', 'l', 'l','o'] ~ " World!\n"  ?
>
> And "Hello " ~ ['W','o','r','l','d','\n']   ?
>
> And "Hello World!" ~ '\n' ?
> And  null ~ "Hello World!\n" ?
>
> Currently DMD puts \0 in some cases but not others, and it's rather random.
>
> The root cause is that this trailing zero is not part of the type, it's part of the literal. There are no rules for how literals are propagated inside expressions, they are just literals. This is a mess.
>
> There is a second difference.
> Array literals of char type have completely different semantics from string
> literals. In module scope:
>
> char[] x = ['a'];  // OK -- array literals can have an implicit .dup
> char[] y = "b";    // illegal
>
> This is a big problem for CTFE, because for CTFE, a string is just a compile-time value, it's neither string literal nor array literal!
>
> See bug 8660 for further details of the problems this causes.
>
>
> A proposal to clean up this mess
> --------------------------------
>
> Any compile-time value of type immutable(char)[] or const(char)[] behaves as
> string literals currently do, and will have a \0 appended when it is stored
> in the executable.
>
> ie,
>
> enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
> printf(hello);
>
> will work.
>
> Any value of type char[], which is generated at compile time, will not have the trailing \0, and it will do an implicit dup (as current array literals do).
>
> char [] foo()
> {
>     return "abc";
> }
>
> char [] x = foo();
>
> // x does not have a trailing \0, and it is implicitly duped, even though it was not declared with an array literal.
>
> -------------------
> So that the difference between string literals and char array literals would simply be that the latter are polysemous. There would be no semantics associated with the form of the literal itself.
>
>
> We still have this oddity:
>
>
> void foo(char qqq = 'b') {
>
>    string x = "abc";            // trailing \0
>    string y = ['a', 'b', 'c'];  // trailing \0
>    string z = ['a', qqq, 'c'];  // no trailing \0
> }
>
> This is because we made the (IMHO mistaken) decision to allow variables
> inside array literals.
> This is the reason why I listed _compile time value_ in the requirement for
> having a \0, rather than entirely basing it on the type.
>
> We could fix that with a language change: an array literal which contains a variable should not be of immutable type. It should be of mutable type (or const, in the case where it contains other, immutable values).
>
> So char [] w = ['a', qqq, 'c']; should compile (it currently doesn't, even though w is allocated on the heap).
>
> But that's a separate proposal from the one I'm making here. I just need a decision on the main proposal so that I can fix a pile of CTFE bugs.

Maybe your proposal is correct.
I think the key idea is the *polysemous typed string literal*.

Based on the Ideal D Interpreter in my brain, the organized rules would be as follows.

1-1) At the semantic level, D should have just one polysemous string
literal, which is "an array of char".
1-2) At the token level, D has two representations for the polysemous string
literal: "str" and ['s','t','r'].

2) The polysemous string literal is implicitly convertible to
[wd]?char[] and immutable([wd]?char)[] (I think const([wd]?char)[] is
not needed, because immutable([wd]?char)[] is implicitly convertible to
it).

3) The result of concatenating polysemous literals is still polysemous, but its representation depends on both sides of the operator.

   "str" ~ "str";         // "strstr"
   "str" ~ ['s','t','r']; // ['s','t','r','s','t','r']
   "str" ~ 's';           // "strs"
   ['s','t','r'] ~ 's';   // ['s','t','r','s']
   "str" ~ null;          // "str"
   ['s','t','r'] ~ null;  // ['s','t','r']

4) After semantic analysis _and_ optimization, a polysemous string literal
that is represented as
 4-1) "str" is typed as immutable([wd]?char)[] (the char type
depends on the literal suffix).
 4-2) ['s','t','r'] is typed as ([wd]?char)[] (the char type
depends on the common type of its elements).

5) In the object file generation phase, a string literal that is typed as
  5-1) immutable([wd]?char)[] is stored in the executable and
implicitly terminated with \0.
  5-2) [wd]?char[] is stored in the executable as the original image
and implicitly 'dup'ed at run time.

----
Additionally, in the following cases, both concatenations should generate
polysemous string literals at CT and RT,
because, after concatenation of chars and char arrays, the newly allocated
strings are *purely immutable* values and implicitly convertible to
mutable.

immutable char ic = 'a';
pragma(msg, typeof(['s', 't', ic, 'r']));   // prints const(char)[]
immutable(char)[] s = ['s', 't', ic, 'r'];  // BUT, should be allowed

char mc = 'a';
pragma(msg, typeof("st"~mc~"r"));   // prints const(char)[]
char[] s = "st"~mc~"r";             // BUT, should be allowed

Kenji Hara
October 02, 2012
On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:
> [SNIP]
> A proposal to clean up this mess
> [SNIP]

While I think it is convenient to be able to write 'printf("world");', as you point out, I think that the fact that it works "inconsistently" (and by that, I mean there are rules and exceptions), is even more dangerous.

If at all possible, I'd rather side with consistency than with the "we got your back... except when we don't" approach, i.e.: strings are NEVER null terminated.

In theory, how often do you *really* need null terminated strings? And when you do, wouldn't it be safer to just write 'printf("world\0")' or 'printf(str ~ "world" ~ '\0');' rather than wonder "Am I in a case where it is null terminated? Yeah... 90% confident I am..."

If you want 0 termination, then make it explicit, that's my opinion.

Besides, as you said, the null termination is not documented, so anything relying on it is a bug really. Just an observation of an implementation detail.
October 02, 2012
Le 02/10/2012 15:12, Don Clugston a écrit :
> On 02/10/12 13:26, deadalnix wrote:
>> Well, the whole mess comes from the fact that D conflates C strings and D
>> strings.
>>
>> The first problem comes from the fact that D arrays are implicitly
>> convertible to pointers. So calling a D function that expects a char* is
>> possible with a D string, even though it is unsafe and will not work in
>> the general case.
>>
>> The fact that D provides tricks that make it work in special cases is
>> harmful, as previous discussions have shown (many D programmers assume
>> that this will always work because of toy tests they have made, whereas
>> in the general case it won't and toStringz must be used).
>>
>> The only sane solution I can think of is to:
>> - disallow implicit conversion of slices to pointers. .ptr is made for that.
>> - Do not put any trailing 0 in a string literal, unless it is specified
>> explicitly ( "foobar\0" ).
>> - Except if a const(char)* is expected from the string literal, in which
>> case it becomes a Cstring literal, with a trailing 0. This allows uses
>> like printf("foobar");
>>
>> In other words, the receiver type is used to decide whether the compiler
>> generates a string literal or a Cstring literal.
>
> This still doesn't solve the problem of the difference between array
> literals and string literals (the magical implicit .dup), which is the
> key problem I'm trying to solve.
>

OK, in fact we have 2 different and unrelated problems here. I have to say I have no idea how to solve the second one.