June 27, 2012
On 06/27/2012 11:11 PM, Steven Schveighoffer wrote:
> On Wed, 27 Jun 2012 16:55:49 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:
>
>> On 06/27/2012 10:22 PM, Steven Schveighoffer wrote:
>>> On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr@gmx.ch>
>>> wrote:
>>>
>>>> There is no reason for anyone to be confused about this endlessly. It
>>>> is simple to understand. Furthermore, think about the implications of a
>>>> library-defined string type: it just introduces the problem of what the
>>>> type of built-in string literals should be. This would cause endless
>>>> pain with type deduction, ifti, string mixins, ... A library-defined
>>>> string type cannot be a full string type. Pretending that it can has no
>>>> value.
>>>
>>> Default type of the literal should be the library type.
>>
>> Then it is not a library type, but a built-in type. Are you planning to
>> inject a dependency on Phobos into the compiler?
>
> No, druntime, and include minimal utf support. We do the same thing with
> AssociativeArray.
>

In this case it is misleading to call it a library type.

>>> If you want immutable(char)[], use "abc".codeunits or equivalent.
>>>
>>
>> I really don't want to type .codeunits, but I want to use
>> immutable(char)[] everywhere. This 'library type' is just an interface
>> change that makes writing nice and efficient code a kludge.
>
> When most string functions take strings, why would you want to use
> immutable(char)[] everywhere?
>

Because the proposed 'string' interface is inconvenient to use and useless. It is a struct with one data member and no additionally
maintained invariant, and it strictly narrows the essential parts of
the interface to the data that is reachable without a large typing
overhead. immutable(char)[] supports exactly the operations I usually
need. Maybe I'm not representative.

>>> Of course, it should by default work as a zero-terminated char * for C
>>> compatibility.
>>>
>>> The current situation is not simple to understand.
>>
>> It is simple, even if not immediately obvious. It does not have to be
>> immediately obvious without explanation. It needs to be convenient.
>
> Try sorting an array of ascii characters.
>

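import std.algorithm : sort;
// Copy first: string literals are immutable and must not be sorted in place.
auto asciitext = cast(ubyte[])"I am ascii text".dup;
sort(asciitext);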


>>> Generic code that accepts arrays has to special-case narrow-width
>>> strings if you plan to use phobos with them in some cases. That is a
>>> horrible situation.
>>>
>>
>> Generic code accepts ranges, not arrays. All necessary (or maybe
>> unnecessary, I don't know) special casing is already done for you in
>> Phobos. The _only_ thing that is problematic is the inconsistent
>> 'foreach' behaviour.
>
> Plenty of generic code specializes on arrays.
>

Ok, point taken. But plenty of generic code then specializes on
strings as well. Would the net gain be so huge? There is also always
the option of just not passing strings to some helper template function
you defined.

There are multiple valid contradictory considerations on the topic, but
I have found the current way of dealing with strings very pleasant.

>>>> alias immutable(char)[] string is just fine.
>>>
>>> That is technically fine, but if phobos wants to treat immutable(char)[]
>>> as something other than an array, it is not fine.
>>>
>>> -Steve
>>
>> Phobos does not treat immutable(char)[] as something other than an
>> array. It does not treat all arrays uniformly though.
>
> It certainly does. An array by definition is a random-access range. It
> does not treat strings as random access ranges.
>
> -Steve

You are right about the random-access part, but the definition of an
array does not depend on the 'range' concept.
June 28, 2012
On Wednesday, June 27, 2012 23:41:14 Timon Gehr wrote:
> On 06/27/2012 11:11 PM, Steven Schveighoffer wrote:
> > When most string functions take strings, why would you want to use
> > immutable(char)[] everywhere?
> 
> Because the proposed 'string' interface is inconvenient to use and useless. It is a struct with one data member and no additionally maintained invariant, and it strictly narrows the essential parts of the interface to the data that is reachable without a large typing overhead. immutable(char)[] supports exactly the operations I usually need. Maybe I'm not representative.

I think that a lot of programmers want to be able to use strings without worrying about any of the details (like unicode). The fact that foreach and the library don't treat strings the same is confusing, and the fact that narrow strings are ranges of dchar (with all that that implies with regards to the operations that they support) seems to confuse a lot of people. If we had a struct for a string type, then the usage would be consistent (always a range of dchar), allowing the average programmer to more or less ignore unicode considerations as long as they don't care about efficiency, but it would still allow those who _do_ care to get at the underlying representation. So, a struct would be an improvement in that regard.

But for those who know what they're doing with regards to unicode and understand the fact that foreach treats strings one way and the library treats them another way, it really isn't a problem. It works quite well (which is one of the reasons that Walter isn't too keen on changing strings). It just isn't terribly newbie-friendly.

- Jonathan M Davis
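
A minimal sketch of the inconsistency described above, assuming a Phobos version in which narrow strings are decoded by the range primitives. The helper names countCodeUnits and countCodePoints are made up for illustration; only foreach and front come from the language and library.

```d
import std.range; // brings in front for arrays and strings

// foreach with an inferred element type visits UTF-8 code units.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s) // c is immutable(char)
        ++n;
    return n;
}

// Asking for dchar makes foreach decode code points instead.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

// The range primitive decodes, while plain indexing does not.
static assert(is(typeof("".front) == dchar));
static assert(is(typeof(""[0]) == immutable(char)));

unittest
{
    // "h\u00e9llo" is 5 code points but 6 UTF-8 code units.
    assert(countCodeUnits("h\u00e9llo") == 6);
    assert(countCodePoints("h\u00e9llo") == 5);
}
```

Compiling with -unittest -main runs the checks.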
June 28, 2012
On Thu, Jun 28, 2012 at 12:22 AM, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
> On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:
>
>> There is no reason for anyone to be confused about this endlessly. It is simple to understand. Furthermore, think about the implications of a library-defined string type: it just introduces the problem of what the type of built-in string literals should be. This would cause endless pain with type deduction, ifti, string mixins, ... A library-defined string type cannot be a full string type. Pretending that it can has no value.
>
>
> Default type of the literal should be the library type.  If you want immutable(char)[], use "abc".codeunits or equivalent.
>
> Of course, it should by default work as a zero-terminated char * for C compatibility.
>
> The current situation is not simple to understand.  Generic code that accepts arrays has to special-case narrow-width strings if you plan to use phobos with them in some cases.  That is a horrible situation.
>
>
>> alias immutable(char)[] string is just fine.
>
>
> That is technically fine, but if phobos wants to treat immutable(char)[] as something other than an array, it is not fine.
>
> -Steve

Currently strings below dstring are only applicable in ForwardRange and below, but not RandomAccessRange as they should be.

-- 
Bye,
Gor Gyolchanyan.
June 28, 2012
On Thursday, June 28, 2012 08:59:32 Gor Gyolchanyan wrote:
> On Thu, Jun 28, 2012 at 12:22 AM, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
> > On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:
> >> There is no reason for anyone to be confused about this endlessly. It is simple to understand. Furthermore, think about the implications of a library-defined string type: it just introduces the problem of what the type of built-in string literals should be. This would cause endless pain with type deduction, ifti, string mixins, ... A library-defined string type cannot be a full string type. Pretending that it can has no value.
> > 
> > Default type of the literal should be the library type.  If you want immutable(char)[], use "abc".codeunits or equivalent.
> > 
> > Of course, it should by default work as a zero-terminated char * for C compatibility.
> > 
> > The current situation is not simple to understand.  Generic code that accepts arrays has to special-case narrow-width strings if you plan to use phobos with them in some cases.  That is a horrible situation.
> > 
> >> alias immutable(char)[] string is just fine.
> > 
> > > That is technically fine, but if phobos wants to treat immutable(char)[] as something other than an array, it is not fine.
> > 
> > -Steve
> 
> Currently strings below dstring are only applicable in ForwardRange and below, but not RandomAccessRange as they should be.

Except that they shouldn't be, because you can't do random access on a narrow string in O(1). If you can't index or slice a range in O(1), it has no business having those operations. The same goes for length. That's why narrow strings do not have any of those operations as far as ranges are concerned. Having those operations in anything worse than O(1) violates the algorithmic complexity guarantees that ranges are supposed to provide, which would seriously harm the efficiency of algorithms which rely on them. It's the same reason why std.container defines the algorithmic complexity of all the operations in std.container. If you want a random-access range which is a string type, you need dchar[], const(dchar)[], or dstring. That is very much on purpose and would not change even if strings were structs.

- Jonathan M Davis
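
The range classification described above can be checked directly with the traits in std.range. This is only a sketch of how a reader might verify it, assuming a Phobos version in which narrow strings are decoded by the range primitives.

```d
import std.range;

// Narrow strings are bidirectional ranges of dchar, nothing more.
static assert( isBidirectionalRange!string);
static assert(!isRandomAccessRange!string);
static assert(!isRandomAccessRange!wstring);
static assert(!hasLength!string);
static assert(!hasSlicing!string);

// UTF-32 strings have one code unit per code point, so O(1) indexing works.
static assert( isRandomAccessRange!dstring);
static assert( hasLength!dstring);

unittest
{
    string s = "h\u00e9llo";
    assert(s.length == 6);     // built-in length counts code units
    assert(s.walkLength == 5); // range length is O(n) and counts code points

    dstring d = "h\u00e9llo"d;
    assert(d.length == 5);
    assert(d[1] == '\u00e9');
}
```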
June 28, 2012
"Jonathan M Davis" , dans le message (digitalmars.D:170852), a écrit :
> completely consistent with regards to how it treats strings. The _only_ inconsistencies are between the language and the library - namely how foreach iterates on code units by default and the fact that while the language defines length, slicing, and random-access operations for strings, the library effectively does not consider strings to have them.

char[] is not treated as an array by the library, and is not treated as a RandomAccessRange. That is a second inconsistency, and it would be avoided if string were a struct.

I won't repeat arguments that were already said, but if it matters, to me, things should be such that:

 - string is a druntime defined struct, with an underlying
immutable(char)[]. It is a BidirectionalRange of dchar. Slicing is
provided for convenience, but not as opSlice, since it is not O(1), but
as a method with a separate name. Direct access to the underlying
char[]/ubyte[] is provided.

 - similar structs are provided to hold underlying const(char)[] and
char[]

 - similar structs are provided for wstring

 - dstring is a druntime defined alias to dchar[] or a struct with the
same functionalities for consistency with narrow string being struct.

 - All those structs may be provided as a template.
struct string(T = immutable(char)) {...}
alias string!(immutable(wchar)) wstring;
alias string!(immutable(dchar)) dstring;

string!(const(char)) and string!(char) ... are the other types of strings.

 - this string template could also be defined as a wrapper to convert
any range of char/wchar into a range of dchar. That does not need to be
in druntime. Only types necessary for string literals should be in
druntime.

 - string should not be convertible to char*. Use toStringz to interface
with C code, or the underlying char[] if you know it is
zero-terminated, at your own risk. Only string literals need to be
convertible to char*, and I would say that they should be
zero-terminated only when they are directly used as char*, to allow the
compiler to optimize memory.

 - char /may/ disappear in favor of ubyte (or the contrary, or one could
alias the other), if there is no other need to keep separate types than
having strings that are different from ubyte[]. Only dchar is necessary,
and it could just be called char.

That is ideal to me. Of course, I understand code compatibility is important, and compromises have to be made. The current situation is a compromise, but I don't like it because it is a WAT for every newcomer. But the last point, for example, would bring no more than code breakage. Such code breakage may make us find bugs however...

-- 
Christophe
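
A rough sketch of what the first bullet of that proposal might look like, written as an ordinary library struct rather than anything in druntime. The names Utf8String and codeunits are hypothetical, and this is not taken from any existing implementation; it only relies on std.utf.decode and std.utf.strideBack.

```d
import std.range;
import std.utf : decode, strideBack;

// Hypothetical wrapper: a bidirectional range of dchar over UTF-8 data.
struct Utf8String
{
    private immutable(char)[] data;

    this(string s) { data = s; }

    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i); // decodes the first code point
    }

    void popFront()
    {
        size_t i = 0;
        decode(data, i);        // advances i past the first code point
        data = data[i .. $];
    }

    @property dchar back()
    {
        size_t i = data.length - strideBack(data, data.length);
        return decode(data, i); // decodes the last code point
    }

    void popBack()
    {
        data = data[0 .. $ - strideBack(data, data.length)];
    }

    @property Utf8String save() { return this; }

    // Explicit access to the underlying code units.
    @property immutable(char)[] codeunits() { return data; }
}

static assert( isBidirectionalRange!Utf8String);
static assert(!isRandomAccessRange!Utf8String);
static assert(is(ElementType!Utf8String == dchar));

unittest
{
    auto s = Utf8String("h\u00e9llo");
    assert(s.front == 'h');
    s.popFront();
    assert(s.front == '\u00e9');
    assert(s.back == 'o');
    assert(s.codeunits.length == 5); // 2 bytes for the accented letter + "llo"
}
```

Because the struct defines no opIndex, opSlice, or length, code that wants those has to go through codeunits, which is the explicitness the proposal asks for.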
June 28, 2012
On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
> "Jonathan M Davis" , dans le message (digitalmars.D:170852), a écrit :
> > completely consistent with regards to how it treats strings. The _only_ inconsistencies are between the language and the library - namely how foreach iterates on code units by default and the fact that while the language defines length, slicing, and random-access operations for strings, the library effectively does not consider strings to have them.

> char[] is not treated as an array by the library

Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and char[] works with the functions in std.array. It's just that they're all special-cased appropriately to handle narrow strings properly. What it doesn't do is treat char[] as a range of char.

> and is not treated as a RandomAccessRange.

Which is what I already said.

> That is a second inconsistency, and it would be avoided if string were a struct.

No, it wouldn't. It is _impossible_ to implement length, slicing, and indexing for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a struct to represent them is irrelevant. And if you can't do those operations in O(1), then they can't be random access ranges.

The _only_ thing that using a struct for narrow strings fixes is the inconsistencies with foreach (it would then use dchar just like all of the range stuff does), and slicing, indexing, and length wouldn't be on it, eliminating the oddity of them existing but not considered to exist by range-based functions. It _would_ make things somewhat nicer for newbies, but it would not give you one iota more of functionality. Narrow strings would still be bidirectional ranges but not random-access ranges, and you would still have to operate on the underlying array to operate on strings efficiently.

If we were to start from scratch, it probably would be better to go with a struct type for strings, but it would break far too much code for far too little benefit at this point. You need to understand the unicode stuff regardless - like the difference between code units and code points. So, if anything, the fact that strings are treated inconsistently and are treated as ranges of dchar - which confuses so many newbies - is arguably a _good_ thing in that it forces newbies to realize and understand the unicode issues involved rather than blindly using strings in a horribly inefficient manner as would inevitably occur with a struct string type.

So, no, the situation is not exactly ideal, and yes, a struct string type might have been a better solution, but I think that many of the folks who are pushing for a struct string type are seriously overestimating the problems that it would solve. Yes, it would make the language and library more consistent, but that's it. You'd still have to use strings in essentially the same way that you do now. It's just that you wouldn't have to explicitly use dchar with foreach, and you'd have to get at the property which returned the underlying array in order to operate on the code units as you need to do in many functions to make your code appropriately efficient rather than simply using the string that way directly by not using its range-based functions. There is a difference, but it's a lot smaller than many people seem to think.

- Jonathan M Davis
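
As an illustration of operating on the code units directly for efficiency, here is a small sketch using std.string.representation, which exposes a string's underlying bytes without copying. The helper name countCommas is made up for this example.

```d
import std.algorithm : count;
import std.range;
import std.string : representation;

// Counting an ASCII delimiter needs no decoding: in UTF-8 the byte for ','
// never appears inside a multi-byte sequence.
size_t countCommas(string s)
{
    auto units = s.representation; // immutable(ubyte)[], no copy
    static assert(isRandomAccessRange!(typeof(units)));
    return units.count(',');
}

unittest
{
    assert(countCommas("a,b,\u00e9,c") == 3);
    assert("a,b,\u00e9,c".representation.length == 8);
}
```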
June 28, 2012
Jonathan M Davis , dans le message (digitalmars.D:170872), a écrit :
> On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
>> "Jonathan M Davis" , dans le message (digitalmars.D:170852), a écrit :
>> > completely consistent with regards to how it treats strings. The _only_ inconsistencies are between the language and the library - namely how foreach iterates on code units by default and the fact that while the language defines length, slicing, and random-access operations for strings, the library effectively does not consider strings to have them.
> 
>> char[] is not treated as an array by the library
> 
> Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and char[] works with the functions in std.array. It's just that they're all special-cased appropriately to handle narrow strings properly. What it doesn't do is treat char[] as a range of char.
> 
>> and is not treated as a RandomAccessRange.

All arrays are treated as RandomAccessRanges, except for char[] and wchar[]. So I think I am entitled to say that strings are not treated as arrays. And I would say I am also entitled to say strings are not normal ranges, since they define length but hasLength is false for them, and they define opIndex and opSlice but are not RandomAccessRanges.

The fact that isDynamicArray!(char[]) is true, but isRandomAccessRange is not is just another aspect of the schizophrenia. The behavior of a templated function on a string will depend on which was used as a guard.

> 
> Which is what I already said.
> 
>> That is a second inconsistency, and it would be avoided if string were a struct.
> 
> No, it wouldn't. It is _impossible_ to implement length, slicing, and indexing for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a struct to represent them is irrelevant. And if you can't do those operations in O(1), then they can't be random access ranges.

I never said strings should support length and slicing. I even said they should not. foreach is inconsistent with the way strings are treated in phobos, but opIndex, opSlice and length are inconsistent too. string[0] and string.front do not even return the same....

Please read my posts a little bit more carefully before answering them.

About the rest of your post, I basically say the same as you in shorter terms, except that I am in favor of changing things (but I didn't even say they should be changed in my conclusion).

Newcomers are troubled by this problem, and I think it is important. They will make mistakes when using both array and range functions on strings in the same algorithm, or when using array functions without knowing about UTF-8 encoding issues (the fact that array functions are also valid range functions if not for strings does not help). But I also think experienced programmers can be affected, because of inattention, reusing code written by inexperienced programmers, or inappropriate template guard usage.

As a more general comment, I think having a consistent language is a very important goal to achieve when designing a language. It makes everything simpler, from language design to user through compiler and library development. It may not be too late for D.

-- 
Christophe
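
A small sketch of the guard-dependent behavior mentioned above, assuming a Phobos version in which narrow strings are decoded by the range primitives. The two helpers countAsArray and countAsRange are hypothetical and differ only in their template constraint.

```d
import std.range;
import std.traits : isDynamicArray;

// Guarded as an array: sees code units.
size_t countAsArray(T)(T t) if (isDynamicArray!T)
{
    return t.length;
}

// Guarded as a range: sees decoded code points.
size_t countAsRange(R)(R r) if (isInputRange!R)
{
    return r.walkLength;
}

// A string passes the array guard but not the random-access guard.
static assert(isDynamicArray!string && !isRandomAccessRange!string);

unittest
{
    string s = "\u00e9t\u00e9"; // 3 code points, 5 UTF-8 code units
    assert(s[0] == 0xC3);        // first code unit of the accented letter
    assert(s.front == '\u00e9'); // first decoded code point
    assert(countAsArray(s) == 5);
    assert(countAsRange(s) == 3);

    // For any other array the two views agree.
    int[] a = [1, 2, 3];
    assert(countAsArray(a) == countAsRange(a));
}
```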
June 28, 2012
On Thursday, June 28, 2012 09:28:52 Christophe Travert wrote:
> I never said strings should support length and slicing. I even said they should not. foreach is inconsistent with the way strings are treated in phobos, but opIndex, opSlice and length are inconsistent too. string[0] and string.front do not even return the same....
> 
> Please read my post a little bit more carefully before answering them.

You said this:

> char[] is not treated as an array by the library, and is not treated as a RandomAccessRange. That is a second inconsistency, and it would be avoided if string were a struct.

So, it looked to me like you were saying that making string a struct would make it so that it was a random access range, which would mean implementing length, opSlice, and opIndex.

- Jonathan M Davis
June 28, 2012
On Thursday, 28 June 2012 at 09:49:19 UTC, Jonathan M Davis wrote:
>> char[] is not treated as an array by the library, and is not treated as a RandomAccessRange. That is a second inconsistency, and it would be avoided if string were a struct.
>
> So, it looked to me like you were saying that making string a struct would
> make it so that it was a random access range, which would mean implementing
> length, opSlice, and opIndex.

I think he meant that the problem would be solved because people would be less likely to expect it to be a random access range in the first place.

What troubles me most with having is(string == immutable(char)[]) is that it more or less precludes us from adding small string optimizations, etc. in the future…

David
June 28, 2012
On Thursday, 28 June 2012 at 05:10:43 UTC, Jonathan M Davis wrote:
> On Thursday, June 28, 2012 08:59:32 Gor Gyolchanyan wrote:
>> Currently strings below dstring are only applicable in ForwardRange
>> and below, but not RandomAccessRange as they should be.
>
> Except that they shouldn't be, because you can't do random access on a narrow
> string in O(1). If you can't index or slice a range in O(1), it has no
> business having those operations. The same goes for length. That's why narrow
> strings do not have any of those operations as far as ranges are concerned.
> Having those operations in anything worse than O(1) violates the algorithmic
> complexity guarantees that ranges are supposed to provide, which would
> seriously harm the efficiency of algorithms which rely on them. It's the same
> reason why std.container defines the algorithmic complexity of all the
> operations in std.container. If you want a random-access range which is a
> string type, you need dchar[], const(dchar)[], or dstring. That is very much
> on purpose and would not change even if strings were structs.
>
> - Jonathan M Davis

Pedantically speaking, it is possible to index a string with about 50-51% memory overhead to get random access in O(1) time. Best-performing algorithms can do random access in about 35-50 nanoseconds per operation for strings up to tens of megabytes. For bigger strings (tested up to 1GB) or when some other memory-intensive calculations are performed simultaneously, random access takes up to 200 nanoseconds due to memory-access resolution process.
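
One simple way to get bounded-time indexing is a sampled offset table. The sketch below is an assumption made for illustration, not the algorithm the figures above refer to, and it makes no claim about matching that memory overhead or those timings. The names IndexedUtf8 and step are hypothetical.

```d
import std.utf : decode, stride;

// Hypothetical index: remembers the byte offset of every step-th code point,
// so a lookup walks at most step - 1 code points from a known position.
struct IndexedUtf8
{
    enum size_t step = 8; // sampling interval, chosen arbitrarily
    string data;
    size_t[] offsets;     // byte offsets of code points 0, step, 2*step, ...
    size_t count;         // total number of code points

    this(string s)
    {
        data = s;
        size_t i = 0;
        while (i < s.length)
        {
            if (count % step == 0)
                offsets ~= i;
            i += stride(s, i); // byte length of the code point at i
            ++count;
        }
    }

    dchar opIndex(size_t n)
    {
        assert(n < count);
        size_t i = offsets[n / step];
        foreach (_; 0 .. n % step) // bounded by step, independent of length
            i += stride(data, i);
        return decode(data, i);
    }
}

unittest
{
    auto s = IndexedUtf8("h\u00e9llo w\u00f6rld, \u4e16\u754c");
    assert(s.count == 15);
    assert(s[1] == '\u00e9');
    assert(s[7] == '\u00f6');
    assert(s[13] == '\u4e16');
    assert(s[14] == '\u754c');
}
```

A smaller step trades more memory for fewer strides per lookup.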