Proposal: clean up semantics of array literals vs string literals (page 2)

October 02, 2012

Re: Proposal: clean up semantics of array literals vs string literals

Posted by Andrei Alexandrescu
in reply to Don Clugston

On 10/2/12 7:11 AM, Don Clugston wrote:
> The problem
> -----------
>
> String literals in D are a little bit magical; they have a trailing \0.
[snip]

I don't mean to be Debbie Downer on this because I reckon it addresses an issue that some have, although I never do. With that warning, a few candid opinions follow.

First, I think zero-terminated strings shouldn't be needed frequently enough in D code to make this necessary.

Second, a simple and workable solution to this would be to address the matter dynamically: make toStringz opportunistically look whether there's a \0 beyond the end of the string, EXCEPT when the string happens to end exactly at a page boundary (in which case accessing memory beyond the end of the string may produce a page fault). With this simple dynamic test we don't need precise and stringent rules for the implementation.
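
A minimal sketch of the dynamic test described above, assuming 4096-byte pages. The name toStringzOpportunistic and the pageSize constant are hypothetical, chosen for illustration; this is not a claim about how Phobos implements toStringz.

```d
import std.string : toStringz;

// Assumption: 4096-byte pages; real code would query the OS for the page size.
enum size_t pageSize = 4096;

// Hypothetical helper illustrating the idea; this is not Phobos's toStringz.
immutable(char)* toStringzOpportunistic(string s)
{
    if (s.length == 0)
        return "".ptr; // string literals are followed by a \0

    immutable(char)* end = s.ptr + s.length;

    // Peek one byte past the end only if that byte shares a page with the
    // last character, so the read itself cannot fault.
    if (cast(size_t) end % pageSize != 0 && *end == '\0')
        return s.ptr;

    // Otherwise fall back to the library routine, which may copy.
    return toStringz(s);
}
```

Note that the byte past the end may belong to a different object, so a \0 found there is not guaranteed to stay a \0; the sketch only shows the check itself.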

Third, the complex set of rules proposed pushes the number of cases in which the \0 is guaranteed, but doesn't make for a clear and easy to remember boundary. Therefore people will need to remember some more rules to make sure they can, well, avoid a call to toStringz.

On 10/2/12 10:55 AM, Regan Heath wrote:
> Recent discussions on the zero terminated string problems and
> inconsistency of string literals has me, again, wondering why D
> doesn't have a 'type' to represent C's zero terminated strings.  It
> seems to me that having a type, and typing C functions with it would
> solve a lot of problems.
[snip]
> I am probably missing something obvious, or I have forgotten one of
> the array/slice complexities which makes this a nightmare.

You're not missing anything and defining a zero-terminated type is something I considered doing and have been highly interested in. My interest is motivated by the fact that sentinel-terminated structures are a very interesting example of forward ranges that are also contiguous. That sets them apart from both singly-linked lists and simple arrays, and gives them interesting properties.

I'd be interested in defining the more general:

struct SentinelTerminatedSlice(T, T terminator)
{
    private T* data;
    ...
}

That would be a forward range and the instantiation SentinelTerminatedSlice!(char, 0) would be CString.
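
One way such a range could look, purely as a sketch: the constructor and the empty/front/popFront/save members below are assumptions added for illustration and are not the elided design. The terminator is written as '\0' here, which denotes the same value as 0.

```d
import std.range.primitives : isForwardRange;

// Sketch only: every member besides `data` is an assumption.
struct SentinelTerminatedSlice(T, T terminator)
{
    private T* data;

    // The caller must guarantee that a terminator is reachable from p.
    this(T* p) { data = p; }

    // The range is exhausted once the sentinel is reached.
    @property bool empty() const { return *data == terminator; }
    @property ref T front() { return *data; }
    void popFront() { ++data; }

    // Copying the pointer is enough to save the iteration state.
    @property SentinelTerminatedSlice save() { return this; }
}

alias CString = SentinelTerminatedSlice!(char, '\0');

static assert(isForwardRange!CString);
```

Because the length is never stored, the range cannot offer length or random access without first scanning for the sentinel, which is what makes it a forward range even though the storage is contiguous.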

However, so far I held off of defining such a range because C-strings are seldom useful in D code and there are not many other compelling examples of sentinel-terminated ranges. Maybe it's time to dust off that idea; I'd love it if we gathered enough motivation for it.

Andrei

On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu wrote:
> However, so far I held off of defining such a range because C-strings are seldom useful in D code [...]

I think your view of what is common in D code is not representative. You are primarily a library writer, which means you rarely have to interface with other code. Please correct me if I'm wrong, but I don't believe you've written much application-level D code.

For people that write applications, we have the unfortunate chore of having to call lots of C APIs to get things done. There's a long list of things for which there is no D interface (graphics, audio, input, GUI, database, platform APIs, various 3rd party libs). Invariably these interfaces require C strings.

In short, if you write applications in D, you need C strings.

I don't know what the right decision is here, but please do not say that C-strings are seldom useful in D code.

On 02/10/12 17:14, Andrei Alexandrescu wrote:
> On 10/2/12 7:11 AM, Don Clugston wrote:
>> The problem
>> -----------
>>
>> String literals in D are a little bit magical; they have a trailing \0.
> [snip]
>
> I don't mean to be Debbie Downer on this because I reckon it addresses
> an issue that some have, although I never do. With that warning, a few
> candid opinions follow.
>
> First, I think zero-terminated strings shouldn't be needed frequently
> enough in D code to make this necessary.
[snip]

You're missing the point, a bit. The zero-terminator is only one symptom of the underlying problem: string literals and array literals have the same type but different semantics.

The other symptoms are:

* the implicit .dup that happens with array literals, but not string literals. This is a silent performance killer. It's probably the most common performance bug we find in our code, and it's completely ungreppable.

* string literals are polysemous with width (c, w, d) but array literals are not (they are polysemous with constness). For example, "abc" ~ 'ü' is legal, but ['a', 'b', 'c'] ~ 'ü' is not. This has nothing to do with the zero terminator.
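
The allocation difference and the trailing \0 can be observed with a small sketch like the following. The function names are hypothetical, and the array case assumes the compiler does not optimize the allocation away, which it cannot do when the array is returned to the caller.

```d
// String literal: refers to static data, so nothing is allocated per call.
string literalString() { return "abc"; }

// Array literal: evaluated at run time, yielding a fresh array per call.
int[] literalArray() { return [1, 2, 3]; }

void demo()
{
    auto s1 = literalString();
    auto s2 = literalString();
    assert(s1.ptr is s2.ptr);   // same static storage both times

    auto a1 = literalArray();
    auto a2 = literalArray();
    assert(a1.ptr !is a2.ptr);  // two distinct allocations
    a1[0] = 42;                 // mutating one copy...
    assert(a2[0] == 1);         // ...leaves the other untouched

    // The string literal is followed by a \0 that is not part of its length.
    assert("abc".length == 3 && "abc".ptr[3] == '\0');
}
```

Nothing at the use site distinguishes the allocating form from the non-allocating one, which is why such allocations are hard to find by searching the source.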

On Tuesday, 2 October 2012 at 14:03:36 UTC, monarch_dodra wrote:
> If you want 0 termination, then make it explicit, that's my opinion.

That ship has long since sailed. You'll break code in an incredibly dangerous way if you were to change it now.

On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu wrote:
> First, I think zero-terminated strings shouldn't be needed frequently enough in D code to make this necessary.

My experience has been much different. Interfacing with C occurs in nearly every D program I write, and I usually end up passing a string literal. Anecdotes!

On Thursday, 4 October 2012 at 07:57:16 UTC, Bernard Helyer wrote:
> On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu wrote:
>> First, I think zero-terminated strings shouldn't be needed frequently enough in D code to make this necessary.
>
> My experience has been much different. Interfacing with C occurs
> in nearly every D program I write, and I usually end up passing
> a string literal. Anecdotes!

Agreed. I'm always happy when I find that the particular C API I am working with supports passing strings as a pointer/length pair :)

Anyway, toStringz (and the wchar and dchar equivalents in std.utf) needs to be fixed regardless - it currently does a dangerous optimization if the string is immutable, otherwise it unconditionally concatenates. We cannot rely on strings being GC allocated based on mutability. Memory is outside the scope of the D type system - we cannot make assumptions about memory based on types.
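
For illustration, the two calling styles mentioned above can be sketched with functions from the C standard library. The helper names formatSlice and cLength are hypothetical; snprintf and strlen merely stand in for whichever C API is being called.

```d
import core.stdc.stdio : snprintf;
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper: hands a D slice to C as a pointer/length pair.
// "%.*s" reads an int precision and then the pointer, so no terminator
// is needed and nothing is copied beforehand.
size_t formatSlice(const(char)[] s, char[] buf)
{
    int n = snprintf(buf.ptr, buf.length, "%.*s", cast(int) s.length, s.ptr);
    return n < 0 ? 0 : cast(size_t) n;
}

// Hypothetical helper: the C function insists on a terminated string,
// so an arbitrary slice has to go through toStringz first.
size_t cLength(const(char)[] s)
{
    return strlen(toStringz(s));
}
```

A string literal can be passed to such C functions directly, as in strlen("abc"), because the literal converts implicitly to a pointer and is followed by a \0.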
