June 27, 2012
On Wed, Jun 27, 2012 at 10:09 PM, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
> On Wed, 27 Jun 2012 13:30:48 -0400, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
>
>
>> I don't see why having the literal be a string would make anything
>> confusing.
>> The fact that a string is considered a range of dchar rather than char
>> could
>> be, but I don't see why having a string literal be a dstring instead of a
>> string would help with that. Besides, it's generally expected that you'll
>> use
>> string for strings unless you specifically need wstring or dstring for
>> some
>> reason.
>
>
> No, the reason is:
>
> 1. T[] is a range of T, unless T == char or T == wchar, and then it's a
> range of dchar (huh?)
> 2. char[] is not a random access range, even though str[i] and str.length
> work.
>
> The fundamental flaw in the way this works is that phobos is pretending
> immutable(char)[] is not an array.  immutable(char)[] should be an array of
> immutable char, string should be a *separate type* of a range of dchar,
> perhaps with immutable(char)[] as its underlying storage.
>
> D needs a full, library-defined string type.  Until it has that, it's going to cause endless confusion and WATs.
>
> -Steve

Agreed. Having struct strings (with slices and everything) will set
the record straight.

-- 
Bye,
Gor Gyolchanyan.
June 27, 2012
On Wednesday, June 27, 2012 22:29:25 Gor Gyolchanyan wrote:
> Agreed. Having struct strings (with slices and everything) will set
> the record straight.

Except that they couldn't have slicing, because it would be very inefficient. You'd have to get at the actual array of code units to slice anything. A struct string type would have to be restricted to exactly the same set of operations that range-based functions consider strings to have and then give you a way to get at the underlying code unit representation to be able to use it when special-casing for strings for efficiency, just like you do now.
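
To illustrate the cost being described: slicing by code-point position first needs that position translated into a code-unit index, which means walking the string. The following is a minimal sketch using `std.utf.toUTFindex`; the helper name `sliceFromCodePoint` is hypothetical and only serves the example.

```d
import std.utf : toUTFindex;

// Hypothetical helper: slice `s` starting at the n-th code point.
// toUTFindex has to walk the string from the start, so this is linear
// in n, whereas s[i .. $] on a known code-unit index is constant time.
string sliceFromCodePoint(string s, size_t n)
{
    immutable i = s.toUTFindex(n); // code-unit index of code point n
    return s[i .. $];              // the slice itself is on code units
}

unittest
{
    string s = "h\u00E9llo";       // U+00E9 takes two UTF-8 code units
    assert(s.length == 6);         // .length counts code units
    assert(sliceFromCodePoint(s, 2) == "llo");
}
```

The index translation is the expensive part; once the code-unit index is known, the slice is an ordinary array slice.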

You _can't_ get away from the fact that you're dealing with an array (or list or whatever) of code units even if you do want to operate on it as a range of code points most of the time. Having a struct would fix the issues like foreach iterating over char by default whereas range-based functions iterate over dchar - it would make it consistent by making it dchar for everything - but the issue of code unit vs code point still remains and you can't get rid of it. Anyone wanting to write efficient string-processing code _needs_ to understand unicode. There's no way around it (which is part of the reason that Walter isn't keen on the idea of changing how strings work in the language itself).
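
A short sketch of the foreach inconsistency mentioned above, assuming a Phobos version in which narrow strings are decoded by the range primitives. The function names are made up for the example.

```d
import std.range.primitives : front;

// foreach with an inferred element type walks UTF-8 code units.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s)
        ++n;
    return n;
}

// Asking foreach for dchar makes it decode code points instead.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

unittest
{
    string s = "h\u00E9";              // 'h' plus U+00E9
    assert(countCodeUnits(s) == 3);    // 1 + 2 code units
    assert(countCodePoints(s) == 2);   // 2 code points
    // The range primitive front yields a decoded dchar for a string.
    static assert(is(typeof(s.front) == dchar));
}
```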

So, while having a string type which is a struct does help eliminate the schizophrenia, the core problem of code unit vs code point is still there, and you still need to understand it. There is no fix for it, because it's intrinsic to how unicode works.

- Jonathan M Davis
June 27, 2012
On Wed, Jun 27, 2012 at 10:42 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> On Wednesday, June 27, 2012 22:29:25 Gor Gyolchanyan wrote:
>> Agreed. Having struct strings (with slices and everything) will set
>> the record straight.
>
> Except that they couldn't have slicing, because it would be very inefficient. You'd have to get at the actual array of code units to slice anything. A struct string type would have to be restricted to exactly the same set of operations that range-based functions consider strings to have and then give you a way to get at the underlying code unit representation to be able to use it when special-casing for strings for efficiency, just like you do now.
>
> You _can't_ get away from the fact that you're dealing with an array (or list or whatever) of code units even if you do want to operate on it as a range of code points most of the time. Having a struct would fix the issues like foreach iterating over char by default whereas range-based functions iterate over dchar - it would make it consistent by making it dchar for everything - but the issue of code unit vs code point still remains and you can't get rid of it. Anyone wanting to write efficient string-processing code _needs_ to understand unicode. There's no way around it (which is part of the reason that Walter isn't keen on the idea of changing how strings work in the language itself).
>
> So, while having a string type which is a struct does help eliminate the schizophrenia, the core problem of code unit vs code point is still there, and you still need to understand it. There is no fix for it, because it's intrinsic to how unicode works.
>
> - Jonathan M Davis

Yes, you can get away from it. The struct string would have ubyte[], ushort[] and uint[] as the representation. Maybe even the char[], wchar[] and dchar[], but those won't be strings as we know them now. The string struct will take care of encoding 100% transparently and will provide access to the representation, which is good for bit blitting and other encoding-agnostic operations, but the representation is then known NOT to be a valid string and will need to be placed into the string struct in order to use string operations.
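
A rough sketch of what such a wrapper might look like, purely as an illustration: the type name `UString` and its `rep` accessor are hypothetical and not part of druntime or Phobos. It stores UTF-8 as `immutable(ubyte)[]` and exposes only an input range of `dchar`.

```d
import std.string : representation;
import std.utf : decode, stride;

// Hypothetical string struct: UTF-8 bytes inside, code points outside.
struct UString
{
    private immutable(ubyte)[] data;

    this(string s) { data = s.representation; }

    @property bool empty() { return data.length == 0; }

    // Decode the first code point without consuming it.
    @property dchar front()
    {
        auto s = cast(string) data;
        size_t i = 0;
        return decode(s, i);
    }

    // Skip as many code units as the first code point occupies.
    void popFront()
    {
        auto s = cast(string) data;
        data = data[stride(s, 0) .. $];
    }

    // Raw storage for encoding-agnostic operations.
    @property immutable(ubyte)[] rep() { return data; }
}
```

Note that this sketch offers no indexing or slicing on code points; adding them would need the same linear walk over the code units.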

-- 
Bye,
Gor Gyolchanyan.
June 27, 2012
On 06/27/2012 08:09 PM, Steven Schveighoffer wrote:
> On Wed, 27 Jun 2012 13:30:48 -0400, Jonathan M Davis
> <jmdavisProg@gmx.com> wrote:
>
>
>> I don't see why having the literal be a string would make anything
>> confusing.
>> The fact that a string is considered a range of dchar rather than char
>> could
>> be, but I don't see why having a string literal be a dstring instead of a
>> string would help with that. Besides, it's generally expected that
>> you'll use
>> string for strings unless you specifically need wstring or dstring for
>> some
>> reason.
>
> No, the reason is:
>
> 1. T[] is a range of T, unless T == char or T == wchar, and then it's a
> range of dchar (huh?)
> 2. char[] is not a random access range, even though str[i] and
> str.length work.
>
> The fundamental flaw in the way this works is that phobos is pretending
> immutable(char)[] is not an array. immutable(char)[] should be an array
> of immutable char, string should be a *separate type* of a range of
> dchar, perhaps with immutable(char)[] as its underlying storage.
>
> D needs a full, library-defined string type. Until it has that, it's
> going to cause endless confusion and WATs.
>
> -Steve

There is no reason for anyone to be confused about this endlessly. It
is simple to understand. Furthermore, think about the implications of a
library-defined string type: it just introduces the problem of what the
type of built-in string literals should be. This would cause endless
pain with type deduction, ifti, string mixins, ... A library-defined
string type cannot be a full string type. Pretending that it can has no
value.

alias immutable(char)[] string is just fine.
June 27, 2012
On 06/27/2012 08:54 PM, Gor Gyolchanyan wrote:
> On Wed, Jun 27, 2012 at 10:42 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
>> On Wednesday, June 27, 2012 22:29:25 Gor Gyolchanyan wrote:
>>> Agreed. Having struct strings (with slices and everything) will set
>>> the record straight.
>>
>> Except that they couldn't have slicing, because it would be very inefficient.
>> You'd have to get at the actual array of code units to slice anything. A
>> struct string type would have to be restricted to exactly the same set of
>> operations that range-based functions consider strings to have and then give
>> you a way to get at the underlying code unit representation to be able to use
>> it when special-casing for strings for efficiency, just like you do now.
>>
>> You _can't_ get away from the fact that you're dealing with an array (or list
>> or whatever) of code units even if you do want to operate on it as a range of
>> code points most of the time. Having a struct would fix the issues like foreach
>> iterating over char by default whereas range-based functions iterate over
>> dchar - it would make it consistent by making it dchar for everything - but
>> the issue of code unit vs code point still remains and you can't get rid of
>> it. Anyone wanting to write efficient string-processing code _needs_ to
>> understand unicode. There's no way around it (which is part of the reason that
>> Walter isn't keen on the idea of changing how strings work in the language
>> itself).
>>
>> So, while having a string type which is a struct does help eliminate the
>> schizophrenia, the core problem of code unit vs code point is still there, and
>> you still need to understand it. There is no fix for it, because it's intrinsic
>> to how unicode works.
>>
>> - Jonathan M Davis
>
> Yes you can get away. The struct string would have ubyte[] ushort[]
> and uint[] as the representation. Maybe even the char[], wchar[] and
> dchar[], but those won't be strings as we know them now. The string
> struct will take care of encoding 100% transparently

Encoding cannot be taken care of 100% transparently. It has performance implications.

> and will provide access to the representation, which is good for bit blitting and other
> encoding-agnostic operations, but the representation is then known NOT
> to be a valid string

It is NOT known not to be a valid string. Furthermore, this directly contradicts what you claimed above. If the representation is exposed,
it is certainly not transparent.

> and will need to be placed into the string struct in order to use string operations.
>

aliasing..?
June 27, 2012
On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:

> There is no reason for anyone to be confused about this endlessly. It
> is simple to understand. Furthermore, think about the implications of a
> library-defined string type: it just introduces the problem of what the
> type of built-in string literals should be. This would cause endless
> pain with type deduction, ifti, string mixins, ... A library-defined
> string type cannot be a full string type. Pretending that it can has no
> value.

Default type of the literal should be the library type.  If you want immutable(char)[], use "abc".codeunits or equivalent.

Of course, it should by default work as a zero-terminated char * for C compatibility.
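
For reference, this is how the existing built-in literals behave today, which is the behaviour a library type would presumably have to keep. The helper name `cLength` is made up for the example.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Slices are not guaranteed to be zero-terminated, so a non-literal
// string goes through toStringz before being handed to C.
size_t cLength(string s)
{
    return strlen(s.toStringz);
}

unittest
{
    // A string literal is zero-terminated and converts implicitly
    // to const(char)*, so it can be passed to C functions directly.
    assert(strlen("abc") == 3);
    assert(cLength("hello"[0 .. 2]) == 2);
}
```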

The current situation is not simple to understand.  Generic code that accepts arrays has to special-case narrow-width strings if you plan to use phobos with them in some cases.  That is a horrible situation.

> alias immutable(char)[] string is just fine.

That is technically fine, but if phobos wants to treat immutable(char)[] as something other than an array, it is not fine.

-Steve
June 27, 2012
On 06/27/2012 10:22 PM, Steven Schveighoffer wrote:
> On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:
>
>> There is no reason for anyone to be confused about this endlessly. It
>> is simple to understand. Furthermore, think about the implications of a
>> library-defined string type: it just introduces the problem of what the
>> type of built-in string literals should be. This would cause endless
>> pain with type deduction, ifti, string mixins, ... A library-defined
>> string type cannot be a full string type. Pretending that it can has no
>> value.
>
> Default type of the literal should be the library type.

Then it is not a library type, but a built-in type. Are you planning to
inject a dependency on Phobos into the compiler?

> If you want immutable(char)[], use "abc".codeunits or equivalent.
>

I really don't want to type .codeunits, but I want to use
immutable(char)[] everywhere. This 'library type' is just an interface
change that makes writing nice and efficient code a kludge.

> Of course, it should by default work as a zero-terminated char * for C
> compatibility.
>
> The current situation is not simple to understand.

It is simple, even if not immediately obvious. It does not have to be
immediately obvious without explanation. It needs to be convenient.

> Generic code that accepts arrays  has to special-case narrow-width strings if you plan to
> use phobos with them in some cases. That is a horrible situation.
>

Generic code accepts ranges, not arrays. All necessary (or maybe
unnecessary, I don't know) special casing is already done for you in
Phobos. The _only_ thing that is problematic is the inconsistent
'foreach' behaviour.

>> alias immutable(char)[] string is just fine.
>
> That is technically fine, but if phobos wants to treat immutable(char)[]
> as something other than an array, it is not fine.
>
> -Steve

Phobos does not treat immutable(char)[] as something other than an
array. It does not treat all arrays uniformly though.
June 27, 2012
On Wednesday, June 27, 2012 22:54:28 Gor Gyolchanyan wrote:
> On Wed, Jun 27, 2012 at 10:42 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> > On Wednesday, June 27, 2012 22:29:25 Gor Gyolchanyan wrote:
> >> Agreed. Having struct strings (with slices and everything) will set
> >> the record straight.
> > 
> > Except that they couldn't have slicing, because it would be very inefficient. You'd have to get at the actual array of code units to slice anything. A struct string type would have to be restricted to exactly the same set of operations that range-based functions consider strings to have and then give you a way to get at the underlying code unit representation to be able to use it when special-casing for strings for efficiency, just like you do now.
> > 
> > You _can't_ get away from the fact that you're dealing with an array (or list or whatever) of code units even if you do want to operate on it as a range of code points most of the time. Having a struct would fix the issues like foreach iterating over char by default whereas range-based functions iterate over dchar - it would make it consistent by making it dchar for everything - but the issue of code unit vs code point still remains and you can't get rid of it. Anyone wanting to write efficient string-processing code _needs_ to understand unicode. There's no way around it (which is part of the reason that Walter isn't keen on the idea of changing how strings work in the language itself).
> > 
> > So, while having a string type which is a struct does help eliminate the schizophrenia, the core problem of code unit vs code point is still there, and you still need to understand it. There is no fix for it, because it's intrinsic to how unicode works.
> > 
> > - Jonathan M Davis
> 
> Yes you can get away. The struct string would have ubyte[] ushort[] and uint[] as the representation. Maybe even the char[], wchar[] and dchar[], but those won't be strings as we know them now. The string struct will take care of encoding 100% transparently and will provide access to the representation, which is good for bit blitting and other encoding-agnostic operations, but the representation is then known NOT to be a valid string and will need to be placed into the string struct in order to use string operations.

If you want efficient strings, you _must_ worry about the encoding. It's _impossible_ for it to be otherwise. It helps quite a bit if you're using functions that someone else already wrote which take this into account rather than having to write the functions yourself, but if you're doing much in the way of string processing, you _must_ understand Unicode in order to handle them properly. I fully understand that it's something that most people don't want to have to worry about, but the reality of the matter is that you can't avoid it unless you don't care about efficiency. The fact that strings use a variable-length encoding has a huge impact on how they need to be used if you care about both correctness and efficiency. You can't escape it.

- Jonathan M Davis
June 27, 2012
On Wed, 27 Jun 2012 16:55:49 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:

> On 06/27/2012 10:22 PM, Steven Schveighoffer wrote:
>> On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:
>>
>>> There is no reason for anyone to be confused about this endlessly. It
>>> is simple to understand. Furthermore, think about the implications of a
>>> library-defined string type: it just introduces the problem of what the
>>> type of built-in string literals should be. This would cause endless
>>> pain with type deduction, ifti, string mixins, ... A library-defined
>>> string type cannot be a full string type. Pretending that it can has no
>>> value.
>>
>> Default type of the literal should be the library type.
>
> Then it is not a library type, but a built-in type. Are you planning to
> inject a dependency on Phobos into the compiler?

No, druntime, and include minimal utf support.  We do the same thing with AssociativeArray.

>> If you want immutable(char)[], use "abc".codeunits or equivalent.
>>
>
> I really don't want to type .codeunits, but I want to use
> immutable(char)[] everywhere. This 'library type' is just an interface
> change that makes writing nice and efficient code a kludge.

When most string functions take strings, why would you want to use immutable(char)[] everywhere?

>> Of course, it should by default work as a zero-terminated char * for C
>> compatibility.
>>
>> The current situation is not simple to understand.
>
> It is simple, even if not immediately obvious. It does not have to be
> immediately obvious without explanation. It needs to be convenient.

Try sorting an array of ascii characters.

>> Generic code that accepts arrays  has to special-case narrow-width strings if you plan to
>> use phobos with them in some cases. That is a horrible situation.
>>
>
> Generic code accepts ranges, not arrays. All necessary (or maybe
> unnecessary, I don't know) special casing is already done for you in
> Phobos. The _only_ thing that is problematic is the inconsistent
> 'foreach' behaviour.

Plenty of generic code specializes on arrays.

>>> alias immutable(char)[] string is just fine.
>>
>> That is technically fine, but if phobos wants to treat immutable(char)[]
>> as something other than an array, it is not fine.
>>
>> -Steve
>
> Phobos does not treat immutable(char)[] as something other than an
> array. It does not treat all arrays uniformly though.

It certainly does.  An array by definition is a random-access range.  It does not treat strings as random access ranges.

-Steve
June 27, 2012
On Wednesday, June 27, 2012 17:11:56 Steven Schveighoffer wrote:
> On Wed, 27 Jun 2012 16:55:49 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:
> > On 06/27/2012 10:22 PM, Steven Schveighoffer wrote:
> >> On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:
> >>> There is no reason for anyone to be confused about this endlessly. It is simple to understand. Furthermore, think about the implications of a library-defined string type: it just introduces the problem of what the type of built-in string literals should be. This would cause endless pain with type deduction, ifti, string mixins, ... A library-defined string type cannot be a full string type. Pretending that it can has no value.
> >> 
> >> Default type of the literal should be the library type.
> > 
> > Then it is not a library type, but a built-in type. Are you planning to inject a dependency on Phobos into the compiler?
> 
> No, druntime, and include minimal utf support. We do the same thing with AssociativeArray.
> 
> >> If you want immutable(char)[], use "abc".codeunits or equivalent.
> > 
> > I really don't want to type .codeunits, but I want to use immutable(char)[] everywhere. This 'library type' is just an interface change that makes writing nice and efficient code a kludge.
> 
> When most string functions take strings, why would you want to use
> immutable(char)[] everywhere?
> 
> >> Of course, it should by default work as a zero-terminated char * for C compatibility.
> >> 
> >> The current situation is not simple to understand.
> > 
> > It is simple, even if not immediately obvious. It does not have to be immediately obvious without explanation. It needs to be convenient.
> 
> Try sorting an array of ascii characters.

Cast it to ubyte[]. Problem solved. I honestly don't think that operating on code units like that should be encouraged at all, so if it's a bit hard to do, then that's a _good_ thing (but since all that's required is casting to ubyte[], it's still quite easy - you just have to tell the compiler that that's what you really want to do rather than it being the default behavior). The problem that we have is the inconsistencies between how the language treats strings and how the library does, not the fact that operating on char[] as if it were ASCII rather than UTF-8 requires some casting.
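
A small sketch of that workaround. It uses `std.string.representation`, which yields the same `ubyte[]` view as the cast, and the helper name `sortAscii` is hypothetical.

```d
import std.algorithm.sorting : sort;
import std.string : representation;

// sort rejects char[] directly, because Phobos does not consider it
// a random-access range.
static assert(!__traits(compiles, sort((char[]).init)));

// Sort by code unit, in place. Only meaningful when every element is
// ASCII; multi-byte sequences would be torn apart.
char[] sortAscii(char[] a)
{
    sort(a.representation); // ubyte[] view over the same memory
    return a;
}

unittest
{
    assert(sortAscii("hello".dup) == "ehllo");
}
```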

> >> Generic code that accepts arrays has to special-case narrow-width
> >> strings if you plan to
> >> use phobos with them in some cases. That is a horrible situation.
> > 
> > Generic code accepts ranges, not arrays. All necessary (or maybe unnecessary, I don't know) special casing is already done for you in Phobos. The _only_ thing that is problematic is the inconsistent 'foreach' behaviour.
> 
> Plenty of generic code specializes on arrays.

You're stuck doing that regardless of how strings are represented. You have to operate on them as ranges of code points (or even graphemes) if you want correct string processing, but that's inefficient, so any code that cares about efficiency, and that can gain it by using knowledge of how Unicode works and operating on the code units directly, will need to special-case strings. Whether string is an array or a struct has zero effect on that. All that it affects is what operates on it as an array of code units vs a range of code points.
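
The kind of special-casing meant here could look roughly like the following. This is a sketch only: `countAscii` is a hypothetical function, not something taken from Phobos.

```d
import std.string : representation;
import std.traits : isNarrowString;

// Count the elements equal to an ASCII needle in any range.
size_t countAscii(R)(R haystack, char needle)
{
    assert(needle < 0x80, "needle must be ASCII");
    size_t n;
    static if (isNarrowString!R)
    {
        // A code unit below 0x80 never occurs inside a multi-unit
        // UTF-8 or UTF-16 sequence, so scanning raw code units is
        // correct here and avoids decoding.
        foreach (u; haystack.representation)
            if (u == needle)
                ++n;
    }
    else
    {
        // Generic path: compare element by element.
        foreach (e; haystack)
            if (e == needle)
                ++n;
    }
    return n;
}

unittest
{
    assert(countAscii("h\u00E9llo", 'l') == 2);  // narrow-string path
    assert(countAscii("hello"d, 'l') == 2);      // generic path
}
```

The same branch would be needed if the string were a struct; it would just test for that type and ask it for its code units.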

> >>> alias immutable(char)[] string is just fine.
> >> 
> >> That is technically fine, but if phobos wants to treat immutable(char)[] as something other than an array, it is not fine.
> >> 
> >> -Steve
> > 
> > Phobos does not treat immutable(char)[] as something other than an array. It does not treat all arrays uniformly though.
> 
> It certainly does. An array by definition is a random-access range. It does not treat strings as random access ranges.

Well, now you're getting into a semantics argument. isRandomAccessRange defines what a random access range is. All arrays which aren't narrow strings qualify. Narrow strings do not. Yes, they do have random-access operations, but they aren't random-access ranges, because they're ranges of code points, not code units.
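
These traits can be checked at compile time. The asserts below are a sketch that assumes a Phobos version in which narrow strings are auto-decoded by the range primitives.

```d
import std.range.primitives : ElementType, hasLength, hasSlicing,
    isRandomAccessRange;
import std.traits : isNarrowString;

// Ordinary arrays, including dstring, are random-access ranges.
static assert( isRandomAccessRange!(int[]));
static assert( isRandomAccessRange!dstring);

// Narrow strings are not, even though s[i] and s.length compile.
static assert( isNarrowString!string);
static assert(!isRandomAccessRange!string);
static assert(!isRandomAccessRange!wstring);

// As ranges they have dchar elements and no length or slicing.
static assert(is(ElementType!string == dchar));
static assert(!hasLength!string);
static assert(!hasSlicing!string);
```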

Yes, this makes it so that character arrays are treated inconsistently from other arrays, but the library is very consistent in how it handles them, because it _never_ deals with strings as being made of code units. If it's operating on them as arrays, then it takes Unicode into account, and if it's operating on them as ranges, it treats them as ranges of code points. It _always_ makes sure that it's operating on code points. Plenty of code specializes on strings so that it can deal with the code units in an efficient manner rather than having to decode them all the time, but Phobos is completely consistent with regard to how it treats strings. The _only_ inconsistencies are between the language and the library - namely how foreach iterates on code units by default and the fact that while the language defines length, slicing, and random-access operations for strings, the library effectively does not consider strings to have them.

- Jonathan M Davis