September 21, 2011
On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
> Timon Gehr wrote in message (digitalmars.D:144889):
> > unicode natively. Yet the 'D strings are strange and confusing' argument comes up quite often on the web.
> 
> Well, I think they are. The ptr+length stuff is amazing, but the behavior of strings in phobos is weird.
> 
> mini-quiz: what should std.range.drop(some_string, 1) do ?
> hint: what it actually does is not what the documentation of phobos
> suggests*...

What do you mean? It does exactly what it says that it does.

> Strings are array of char, but they appear like a lazy range of dchar to phobos. I could cope with the fact that this is a little unexpected for beginners. But well, that creates a lot of exceptions in phobos, like the fact that you can't even copy a char[] to a char[] with std.algorithm.copy. And I don't mention all the optimization that are not/cannot be performed for those strings. I'll just remember to use ubyte[] wherever I can...

Yeah, well, as long as char is a Unicode code unit, that's the way that it goes. In general, Phobos does a good job of using slicing and other optimizations on strings when it can, in spite of the fact that they're not sliceable ranges, but there are cases where the fact that you _have_ to process them to be able to find the nth code point means that you just can't process them as efficiently as your typical array. That's life with a variable-length encoding, and that includes std.algorithm.copy. But that's an easy one to get around, since if you wanted to ignore Unicode safety and just copy some chunk of the string, you can always just slice it directly with no need for copy.
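
As a minimal sketch of the slicing workaround described above: array assignment copies code units directly, with no decoding involved. The variable names are illustrative only, and the escape \u00E4 stands in for 'ä' so the example does not depend on the source file's encoding.

```d
void main()
{
    string src = "\u00E4bc";            // 'ä' occupies two UTF-8 code units
    auto dst = new char[](src.length);  // destination sized in code units
    dst[] = src[];                      // element-wise copy of code units
    assert(dst == src);
    assert(dst.length == 4);

    // Copying only a chunk: the slice bounds are code-unit indices, so
    // picking them on a code point boundary is the caller's responsibility.
    auto tail = new char[](2);
    tail[] = src[2 .. $];
    assert(tail == "bc");
}
```

Slicing at index 1 instead of 2 would cut 'ä' in half and leave invalid UTF-8, which is the "ignore Unicode safety" caveat.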

> * Please, someone just adds in the documentation of IsSliceable that narrow strings are an exception, like it was recently added to hasLength.

A good point.

- Jonathan M Davis
September 21, 2011
On 2011-09-21 15:52, Timon Gehr wrote:
> On 09/21/2011 09:37 AM, bearophile wrote:
>> Andrei Alexandrescu:
>>
>>> http://hackerne.ws/item?id=3014861
>>>
>>> Apparently we're still having a PR issue.
>>
>> I think the Wikipedia D page needs to be rewritten, leaving 80-90% of
>> its space to D (meaning D2).
>>
>
> Yes, that is important. Wikipedia is usually the first place people go
> looking for information, and much of the information given there is
> horribly outdated/wrong and mostly only concerns. Many people think
> what's on Wikipedia is true. [citation needed]
>
> "For performance reasons, string slicing and the length property operate
> on code units rather than code points (characters), which frequently
> confuses developers.[27]"
>
> The link at [27] only says that many programmers who have not had to
> handle unicode have trouble understanding how unicode works initially.
> It is not a D thing in any other way than that D actually supports
> unicode natively. Yet the 'D strings are strange and confusing' argument
> comes up quite often on the web, probably because many feel they are
> competent enough to discuss the language after having read the Wikipedia
> article.

Ruby before 1.9 behaves similarly. Well, actually, Ruby before 1.9 is not encoding-aware at all, if I recall correctly.

-- 
/Jacob Carlborg
September 21, 2011
On 9/21/11 10:16 AM, Christophe wrote:
> Timon Gehr wrote in message (digitalmars.D:144889):
>> unicode natively. Yet the 'D strings are strange and confusing' argument
>> comes up quite often on the web.
>
> Well, I think they are. The ptr+length stuff is amazing, but the
> behavior of strings in phobos is weird.
>
> mini-quiz: what should std.range.drop(some_string, 1) do ?
> hint: what it actually does is not what the documentation of phobos
> suggests*...
>
> Strings are array of char, but they appear like a lazy range of dchar to
> phobos. I could cope with the fact that this is a little unexpected for
> beginners. But well, that creates a lot of exceptions in phobos, like
> the fact that you can't even copy a char[] to a char[] with
> std.algorithm.copy. And I don't mention all the optimization that are
> not/cannot be performed for those strings. I'll just remember to use
> ubyte[] wherever I can...

String handling in D is good modulo the oddities you noticed. What would make it perfect would be:

* Add property .rep that returns byte[], ushort[], or uint[] for char[], wchar[], dchar[] respectively (with the appropriate qualifier).

* Replace .length with .codeUnits.

* Disallow [n] and [m .. n]

This would upgrade D's strings from good to awesome. Really it would be a dream come true. Unfortunately it would also break most D code there is out there. I don't see how we can improve the current situation while staying backward compatible.
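
For comparison, later Phobos versions provide std.string.representation, which is close in spirit to the proposed .rep property (it returns the unsigned types ubyte, ushort, and uint). A short sketch, assuming a current Phobos; the variable names are illustrative:

```d
import std.string : representation;

void main()
{
    string s = "\u00E4bc";                      // 'ä' is U+00E4
    immutable(ubyte)[] raw = s.representation;  // same memory, viewed as bytes
    assert(raw.length == s.length);
    assert(raw[0] == 0xC3 && raw[1] == 0xA4);   // UTF-8 encoding of U+00E4
    assert(raw[2] == 'b');

    wstring w = "\u00E4bc"w;
    immutable(ushort)[] raw16 = w.representation;
    assert(raw16.length == 3);                  // one UTF-16 code unit each here
}
```

Unlike the proposal, this is purely additive: .length, [n], and [m .. n] still work on the string itself.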


Andrei
September 21, 2011
On Wed, 21 Sep 2011 11:39:03 -0500, Andrei Alexandrescu wrote:

> On 9/21/11 10:16 AM, Christophe wrote:
>> Timon Gehr wrote in message (digitalmars.D:144889):
>>> unicode natively. Yet the 'D strings are strange and confusing' argument comes up quite often on the web.
>>
>> Well, I think they are. The ptr+length stuff is amazing, but the behavior of strings in phobos is weird.
>>
>> mini-quiz: what should std.range.drop(some_string, 1) do ? hint: what it actually does is not what the documentation of phobos suggests*...
>>
>> Strings are array of char, but they appear like a lazy range of dchar to phobos. I could cope with the fact that this is a little unexpected for beginners. But well, that creates a lot of exceptions in phobos, like the fact that you can't even copy a char[] to a char[] with std.algorithm.copy. And I don't mention all the optimization that are not/cannot be performed for those strings. I'll just remember to use ubyte[] wherever I can...
> 
> String handling in D is good modulo the oddities you noticed. What would make it perfect would be:
> 
> * Add property .rep that returns byte[], ushort[], or uint[] for char[], wchar[], dchar[] respectively (with the appropriate qualifier).
> 
> * Replace .length with .codeUnits.
> 
> * Disallow [n] and [m .. n]
> 
> This would upgrade D's strings from good to awesome. Really it would be a dream come true. Unfortunately it would also break most D code there is out there. I don't see how we can improve the current situation while staying backward compatible.
> 
> 
> Andrei

1. Let "string" remain an alias for "immutable(char)[]", and introduce a new struct, "text!charType", that does the awesome stuff; provide good conversion routines and casts between text instances and old-school strings.

2. Provide an awesome "std.text" library for the text type and related operations, absorbing std.uni and other similar modules. Adapt Phobos to take "text" in virtually every function that currently expects a "string".

3. ...

4. Profit.
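
A rough sketch of what step 1 could look like, restricted to UTF-8 for brevity. The name Text and its members codeUnits and rep are hypothetical (borrowed from the suggestions earlier in the thread) and are not part of Phobos; only std.utf.decode, std.utf.stride, and the range traits are real library calls.

```d
import std.range.primitives : isForwardRange, hasLength, hasSlicing,
    ElementType, walkLength;
import std.utf : decode, stride;

/// Hypothetical wrapper type: a forward range of dchar over UTF-8 data,
/// deliberately without length, indexing, or slicing.
struct Text
{
    private string data;

    this(string s) { data = s; }

    @property bool empty() const { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);            // decodes one code point
    }

    void popFront()
    {
        data = data[stride(data, 0) .. $]; // skip one whole code point
    }

    @property Text save() { return this; }

    /// Number of code units, named explicitly instead of .length.
    @property size_t codeUnits() const { return data.length; }

    /// Raw view of the underlying bytes.
    @property immutable(ubyte)[] rep() const
    {
        return cast(immutable(ubyte)[]) data;
    }
}

void main()
{
    static assert(isForwardRange!Text);
    static assert(is(ElementType!Text == dchar));
    static assert(!hasLength!Text && !hasSlicing!Text);

    auto t = Text("\u00E4bc");
    assert(t.codeUnits == 4);   // code units
    assert(t.walkLength == 3);  // code points, computed in O(n)
    assert(t.front == '\u00E4');
}
```

Because string itself is untouched, a type like this could coexist with existing code; the cost is the conversion routines and the Phobos overloads that steps 1 and 2 call for.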

Graham
September 21, 2011
Jonathan M Davis wrote in message (digitalmars.D:144896):
> On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
>> Timon Gehr wrote in message (digitalmars.D:144889):
>> > unicode natively. Yet the 'D strings are strange and confusing' argument comes up quite often on the web.
>> 
>> Well, I think they are. The ptr+length stuff is amazing, but the behavior of strings in phobos is weird.
>> 
>> mini-quiz: what should std.range.drop(some_string, 1) do ?
>> hint: what it actually does is not what the documentation of phobos
>> suggests*...
> 
> What do you mean? It does exactly what it says that it does.

It does say it uses the slice operator if the range is sliceable, and the documentation of isSliceable fails to specify that a narrow string is not sliceable.

> Yeah, well, as long as char is a unicode code unit, that's the way that it goes.

They are not unicode units.

import std.stdio;

void main() {
  char a = 'ä';
  writeln(a); // outputs: \344
  writeln('ä'); // outputs: ä
}

Obviously, a code unit doesn't fit in a char.
Thus 'char[]' is not what the name claims it is.

Unicode operations should be supported by a different class that is really a lazy range of dchar implemented as an underlying char[], with no length, index, or stride operator, and appropriate optimizations.

> In general, Phobos does a good job of using slicing and other
> optimizations on strings when it can in spite of the fact that they're not
> sliceable ranges, but there are cases where the fact that you _have_ to
> process them to be able to find the nth code point means that you just can't
> process them as efficiently as your typical array. That's life with a
> variable-length encoding, and that includes std.algorithm.copy. But that's an easy one
> to get around, since if you wanted to ignore unicode safety and just copy some
> chunk of the string, you can always just slice it directly with no need for
> copy.

Dealing with UTF-encoded strings is less efficient, but there are a number of algorithms that can be optimized for UTF-encoded strings, like copying or finding an ASCII char in a string. Unfortunately, there is no practical way to do this with the current range API.

About copy, it's not that easy to overcome the problem if you are using a template, and that template happens to be instantiated for strings.

>> * Please, someone just adds in the documentation of IsSliceable that narrow strings are an exception, like it was recently added to hasLength.
> 
> A good point.

The main point of my post actually.

-- 
Christophe
September 21, 2011
On 21/09/11 5:39 PM, Andrei Alexandrescu wrote:
> On 9/21/11 10:16 AM, Christophe wrote:
>> Timon Gehr wrote in message (digitalmars.D:144889):
>>> unicode natively. Yet the 'D strings are strange and confusing' argument
>>> comes up quite often on the web.
>>
>> Well, I think they are. The ptr+length stuff is amazing, but the
>> behavior of strings in phobos is weird.
>>
>> mini-quiz: what should std.range.drop(some_string, 1) do ?
>> hint: what it actually does is not what the documentation of phobos
>> suggests*...
>>
>> Strings are array of char, but they appear like a lazy range of dchar to
>> phobos. I could cope with the fact that this is a little unexpected for
>> beginners. But well, that creates a lot of exceptions in phobos, like
>> the fact that you can't even copy a char[] to a char[] with
>> std.algorithm.copy. And I don't mention all the optimization that are
>> not/cannot be performed for those strings. I'll just remember to use
>> ubyte[] wherever I can...
>
> String handling in D is good modulo the oddities you noticed. What would
> make it perfect would be:
>
> * Add property .rep that returns byte[], ushort[], or uint[] for char[],
> wchar[], dchar[] respectively (with the appropriate qualifier).
>
> * Replace .length with .codeUnits.
>
> * Disallow [n] and [m .. n]
>
> This would upgrade D's strings from good to awesome. Really it would be
> a dream come true. Unfortunately it would also break most D code there
> is out there. I don't see how we can improve the current situation while
> staying backward compatible.
>
>
> Andrei

From what I can see, the problem with D strings is that they are a 'magic' special case for arrays.

char[] should be an array of char, just like int[] is an array of int. If you have a T[] arr, then typeof(arr.front) should be T. This is what everyone would expect. char[] should essentially be the same as byte[], although char[] would be more natural for ASCII strings.

string should be something different, a separate type. As you say, disallowing [n] and [m..n] would be good, as they make no sense with a variable-length encoding. You could have .length and .codeUnits, but length would have to be O(n). That's not ideal, but since string wouldn't be an array, it doesn't need to have the same complexity guarantees.

Same for wchar[], dchar[], wstring and dstring.

Of course, making that change would break existing code. Maybe D3? :-)


September 21, 2011
On Wed, 21 Sep 2011 20:20:55 +0200, Christophe Travert <travert@phare.normalesup.org> wrote:


>> Yeah, well, as long as char is a unicode code unit, that's the way that it
>> goes.
>
> They are not unicode units.
>
> void main() {
>   char a = 'ä';
>   writeln(a); // outputs: \344
>   writeln('ä'); // outputs: ä
> }
>
> Obviously, a code unit doesn't fit in a char.
> Thus 'char[]' is not what the name claims it is.

Oh, it absolutely is. According to the Unicode Consortium, a code unit is
"The minimal bit combination that can represent a unit of encoded text
for processing or interchange. The Unicode Standard uses 8-bit code units
in the UTF-8 encoding form [...]".

What you are thinking about is a code point.
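
To make the distinction concrete, here is a small sketch using std.utf: the single code point U+00E4 ('ä') needs two UTF-8 code units, so it cannot fit in one char, but it fits in a single UTF-16 code unit.

```d
import std.utf : encode, codeLength;

void main()
{
    dchar c = '\u00E4';               // one code point: 'ä'

    char[4] buf;
    size_t n = encode(buf, c);        // encode to UTF-8 code units
    assert(n == 2);
    assert(buf[0] == 0xC3 && buf[1] == 0xA4);

    assert(codeLength!char(c) == 2);  // UTF-8: two 8-bit code units
    assert(codeLength!wchar(c) == 1); // UTF-16: one 16-bit code unit
}
```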


> Unicode operations should be supported by a different class that
> is really a lazy range of dchar implemented as an underlying char[], with
> no length, index, or stride operator, and appropriate optimizations.

I can agree with this, but the benefits over what we already have are nigh
zilch.


-- 
  Simen
September 21, 2011
On Wednesday, September 21, 2011 11:20 Christophe Travert wrote:
> Jonathan M Davis wrote in message (digitalmars.D:144896):
> > On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
> >> Timon Gehr wrote in message (digitalmars.D:144889):
> >> > unicode natively. Yet the 'D strings are strange and confusing' argument comes up quite often on the web.
> >> 
> >> Well, I think they are. The ptr+length stuff is amazing, but the behavior of strings in phobos is weird.
> >> 
> >> mini-quiz: what should std.range.drop(some_string, 1) do ?
> >> hint: what it actually does is not what the documentation of phobos
> >> suggests*...
> > 
> > What do you mean? It does exactly what it says that it does.
> 
> It does say it uses the slice operator if the range is sliceable, and the documentation of isSliceable fails to specify that a narrow string is not sliceable.

1. drop says nothing about slicing.
2. popFrontN (which drop calls) says that it slices for ranges that support
slicing. Strings do not unless they're arrays of dchar.

Yes, hasSlicing should probably be clearer about narrow strings, but that has nothing to do with drop.
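
The behavior in question can be checked directly. This is a small sketch against current Phobos, where narrow strings are still treated as ranges of dchar; the escape \u00E4 stands in for 'ä'.

```d
import std.range : drop, hasLength, hasSlicing, walkLength, ElementType;

void main()
{
    // As ranges, narrow strings have neither slicing nor length...
    static assert(!hasSlicing!string);
    static assert(!hasLength!string);
    // ...while arrays of dchar have both.
    static assert(hasSlicing!dstring);
    static assert(hasLength!dstring);
    static assert(is(ElementType!string == dchar));

    string s = "\u00E4bc";       // 'ä' takes two code units
    assert(s.length == 4);       // the array property counts code units
    assert(s.walkLength == 3);   // the range view counts code points
    assert(s.drop(1) == "bc");   // drops one code point, i.e. two code units
    assert(s[1 .. $] != "bc");   // a raw slice at index 1 splits 'ä'
}
```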

> > Yeah, well, as long as char is a unicode code unit, that's the way that it goes.
> 
> They are not unicode units.
> 
> void main() {
> char a = 'ä';
> writeln(a); // outputs: \344
> writeln('ä'); // outputs: ä
> }
> 
> Obviously, a code unit doesn't fit in a char.
> Thus 'char[]' is not what the name claims it is.
> 
> Unicode operations should be supported by a different class that
> is really a lazy range of dchar implemented as an underlying char[], with
> no length, index, or stride operator, and appropriate optimizations.

The problem with

char a = 'ä';

is completely separate from char[]. That has to do with the fact that the compiler isn't properly dealing with narrowing conversions for an individual character. A char is _by definition_ a UTF-8 code unit.

char a = 'ä';

shouldn't even be legal. It's a compiler bug. Most code which operates on individual chars or wchars is buggy. dchar is what should be used for individual characters. The code you give is buggy because it's trying to use char as an individual character, and the compiler has a bug, so it doesn't catch it.

> > In general, Phobos does a good job of using slicing and other optimizations on strings when it can in spite of the fact that they're not sliceable ranges, but there are cases where the fact that you _have_ to process them to be able to find the nth code point means that you just can't process them as efficiently as your typical array. That's life with a variable-length encoding, and that includes std.algorithm.copy. But that's an easy one to get around, since if you wanted to ignore unicode safety and just copy some chunk of the string, you can always just slice it directly with no need for copy.
> 
> Dealing with UTF-encoded strings is less efficient, but there are a number of algorithms that can be optimized for UTF-encoded strings, like copying or finding an ASCII char in a string. Unfortunately, there is no practical way to do this with the current range API.
>
> About copy, it's not that easy to overcome the problem if you are using a template, and that template happens to be instantiated for strings.

In general, you _must_ treat strings as ranges of dchar if you want your code to be correct with regard to code points. So, that's the default. If you know that your particular algorithm can be optimized for strings without treating them strictly as ranges of dchar and still deal with Unicode correctly (e.g. by slicing them, because you know the correct point to slice to), then you special-case your template for narrow strings. Phobos does that in a number of places. But you _have_ to special-case it because being able to do so depends entirely on your algorithm. You can't treat strings that way in the general case, or your code is not going to handle Unicode correctly.
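
A minimal sketch of that kind of special-casing, using the ASCII search mentioned earlier in the thread. The function countAscii is a hypothetical helper written for illustration, not a Phobos function; isNarrowString and the range primitives are real.

```d
import std.range.primitives : isInputRange, empty, front, popFront;
import std.traits : isNarrowString;

/// Hypothetical helper: counts occurrences of an ASCII character.
size_t countAscii(R)(R r, char needle)
if (isInputRange!R)
{
    assert(needle < 0x80, "needle must be ASCII");
    size_t n = 0;
    static if (isNarrowString!R)
    {
        // In UTF-8 and UTF-16, a code unit below 0x80 is always a complete
        // ASCII character, so scanning code units without decoding is safe.
        foreach (unit; r)            // foreach over an array yields code units
            if (unit == needle) ++n;
    }
    else
    {
        // Generic path: walk the range element by element.
        for (; !r.empty; r.popFront())
            if (r.front == needle) ++n;
    }
    return n;
}

void main()
{
    assert(countAscii("a\u00E4a", 'a') == 2);   // narrow-string fast path
    assert(countAscii("a\u00E4a"d, 'a') == 2);  // generic range path
}
```

The shortcut is valid only because of a property of this particular algorithm (ASCII code units never appear inside a multi-unit sequence), which is why the choice has to be made per algorithm rather than once for all strings.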

No, the way that strings are handled in D is not perfect. But we're trying to balance correctness and efficiency, and it does a good job of that. Using strings as ranges of dchar is correct, and in a large number of cases, it is as efficient as you're going to get it (perhaps particular functions could be better optimized, but the API can't be made more efficient). And in the cases where you know you can safely get better efficiency out of strings by special-casing them, that's what you do. It works. It works fairly well. And no one has been able to come up with a solution that's definitely better.

If there's an issue, it's that the message on how to correctly handle strings is not necessarily being communicated as well as it needs to be. In a few places, the documentation could probably be improved, but what we probably really need in order to help get the message across is more articles on ranges and strings as ranges. I've actually partially written an article on that, but due to some compiler bugs regarding std.container, I ended up temporarily shelving it.

- Jonathan M Davis
September 21, 2011
"Simen Kjaeraas" wrote in message (digitalmars.D:144921):
> What you are thinking about is a code point.

Yes, sorry. Then I disagree with "as long as char is a unicode code unit, that's the way that it goes", since myString.front should then return a code unit, whereas it actually returns a code point.

>> Unicode operations should be supported by a different class that
>> is really a lazy range of dchar implemented as an underlying char[], with
>> no length, index, or stride operator, and appropriate optimizations.
> 
> I can agree with this, but the benefits over what we already have are nigh zilch.

I think no one here has any illusions about changing this in D2.

However, this class could be introduced in phobos right now, without changing anything about string. It would simply be a bit safer than strings.

-- 
Christophe
September 21, 2011
"Jonathan M Davis" wrote in message (digitalmars.D:144922):
> 1. drop says nothing about slicing.
> 2. popFrontN (which drop calls) says that it slices for ranges that support
> slicing. Strings do not unless they're arrays of dchar.
> 
> Yes, hasSlicing should probably be clearer about narrow strings, but that has nothing to do with drop.

I never said there was a problem with drop.

> char a = 'ä';
> 
> shouldn't even be legal. It's a compiler bug.

I figured that out.
I wanted to show that a char couldn't hold a code point, but I was too
fast and confused code points with code units.

>> Dealing with UTF-encoded strings is less efficient, but there are a number of algorithms that can be optimized for UTF-encoded strings, like copying or finding an ASCII char in a string. Unfortunately, there is no practical way to do this with the current range API.

Maybe there should be a way for the designer of a class to provide an overload for some algorithms, like forwarding to myClass.algorithm for instance. The problem is that this opens the door to unintentional hacking.

Oh, I just noticed I'm actually replying to myself. Thinking out loud, am I?

> [...]

After having read all of you, I have no problem with string being a lazy range of dchar. But I have a problem with immutable(char)[] being a lazy range of dchar (i.e. not being an array), and I have a problem with string being immutable(char)[] (i.e. providing length, opIndex, and opSlice).

Thanks
-- 
Christophe