September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
> Timon Gehr wrote in message (digitalmars.D:144889):
> > unicode natively. Yet the 'D strings are strange and confusing' argument
> > comes up quite often on the web.
> 
> Well, I think they are. The ptr+length stuff is amazing, but the
> behavior of strings in phobos is weird.
> 
> mini-quiz: what should std.range.drop(some_string, 1) do ?
> hint: what it actually does is not what the documentation of phobos
> suggests*...

What do you mean? It does exactly what it says that it does.

> Strings are arrays of char, but they appear as a lazy range of dchar to
> phobos. I could cope with the fact that this is a little unexpected for
> beginners. But that creates a lot of exceptions in phobos, like
> the fact that you can't even copy a char[] to a char[] with
> std.algorithm.copy. And I haven't even mentioned all the optimizations that
> are not/cannot be performed for those strings. I'll just remember to use
> ubyte[] wherever I can...

Yeah, well, as long as char is a unicode code unit, that's the way that it 
goes. In general, Phobos does a good job of using slicing and other 
optimizations on strings when it can in spite of the fact that they're not 
sliceable ranges, but there are cases where the fact that you _have_ to 
process them to be able to find the nth code point means that you just can't 
process them as efficiently as your typical array. That's life with a variable-
length encoding - and that includes std.algorithm.copy. But that's an easy one 
to get around, since if you wanted to ignore unicode safety and just copy some 
chunk of the string, you can always just slice it directly with no need for 
copy.
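
To make the drop-versus-slice distinction concrete, here is a minimal sketch, assuming a current Phobos in which narrow strings are auto-decoded as ranges. The string contents are made up for illustration.

```d
import std.range : drop, ElementType, hasLength, hasSlicing;

void main()
{
    string s = "äbc";              // 'ä' occupies two UTF-8 code units

    assert(s.length == 4);         // .length counts code units, not characters
    assert(s.drop(1) == "bc");     // drop skips one code point, i.e. two code units
    assert(s[2 .. $] == "bc");     // direct slicing uses code-unit offsets

    // As ranges, narrow strings yield dchar and advertise neither length nor slicing.
    static assert(is(ElementType!string == dchar));
    static assert(!hasLength!string);
    static assert(!hasSlicing!string);
}
```

Slicing at an offset that falls inside a multi-byte sequence (for example s[1 .. $] here) would produce invalid UTF-8, which is why the generic range functions do not slice narrow strings.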

> * Please, could someone add to the documentation of IsSliceable that
> narrow strings are an exception, like it was recently added to
> hasLength.

A good point.

- Jonathan M Davis
September 21, 2011
Re: D on hackernews
On 2011-09-21 15:52, Timon Gehr wrote:
> On 09/21/2011 09:37 AM, bearophile wrote:
>> Andrei Alexandrescu:
>>
>>> http://hackerne.ws/item?id=3014861
>>>
>>> Apparently we're still having a PR issue.
>>
>> I think the Wikipedia D page needs to be rewritten, leaving 80-90% of
>> its space to D (meaning D2).
>>
>
> Yes, that is important. Wikipedia is usually the first place people go
> looking for information, and much of the information given there is
> horribly outdated/wrong and mostly only concerns. Many people think
> what's on Wikipedia is true. [citation needed]
>
> "For performance reasons, string slicing and the length property operate
> on code units rather than code points (characters), which frequently
> confuses developers.[27]"
>
> The link at [27] only says that many programmers who haven't had to
> handle unicode have trouble understanding how unicode works initially.
> It is not a D thing in any other way than that D actually supports
> unicode natively. Yet the 'D strings are strange and confusing' argument
> comes up quite often on the web, probably because many feel they are
> competent enough to discuss the language after having read the Wikipedia
> article.

Ruby pre 1.9 behaves similarly. Well, actually Ruby pre 1.9 is not 
encoding aware at all, if I recall correctly.

-- 
/Jacob Carlborg
September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
On 9/21/11 10:16 AM, Christophe wrote:
> Timon Gehr wrote in message (digitalmars.D:144889):
>> unicode natively. Yet the 'D strings are strange and confusing' argument
>> comes up quite often on the web.
>
> Well, I think they are. The ptr+length stuff is amazing, but the
> behavior of strings in phobos is weird.
>
> mini-quiz: what should std.range.drop(some_string, 1) do ?
> hint: what it actually does is not what the documentation of phobos
> suggests*...
>
> Strings are arrays of char, but they appear as a lazy range of dchar to
> phobos. I could cope with the fact that this is a little unexpected for
> beginners. But that creates a lot of exceptions in phobos, like
> the fact that you can't even copy a char[] to a char[] with
> std.algorithm.copy. And I haven't even mentioned all the optimizations that
> are not/cannot be performed for those strings. I'll just remember to use
> ubyte[] wherever I can...

String handling in D is good modulo the oddities you noticed. What would 
make it perfect would be:

* Add property .rep that returns byte[], ushort[], or uint[] for char[], 
wchar[], dchar[] respectively (with the appropriate qualifier).

* Replace .length with .codeUnits.

* Disallow [n] and [m .. n]

This would upgrade D's strings from good to awesome. Really it would be 
a dream come true. Unfortunately it would also break most D code there 
is out there. I don't see how we can improve the current situation while 
staying backward compatible.
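
For readers on a current Phobos: as far as I know, the library later gained helpers close in spirit to the first two bullets, namely std.string.representation and std.utf.byCodeUnit. The following is a minimal sketch under that assumption. Note that representation yields ubyte[] (not byte[]) for char[], and the sample strings are made up.

```d
import std.range : walkLength;
import std.string : representation;
import std.utf : byCodeUnit;

void main()
{
    string s = "äb";

    // Raw storage view: the qualifier is carried over from the string.
    immutable(ubyte)[] raw = s.representation;
    assert(raw.length == 3);
    assert(raw[0] == 0xC3 && raw[1] == 0xA4); // UTF-8 encoding of U+00E4

    // UTF-16 strings map to ushort storage.
    immutable(ushort)[] raw16 = "äb"w.representation;
    assert(raw16.length == 2);

    // Explicit code-unit view versus decoded code-point count.
    assert(s.byCodeUnit.length == 3);
    assert(s.walkLength == 2);
}
```

Neither helper changes what .length or indexing do on string itself, so the backward-compatibility concern raised here still stands.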


Andrei
September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
On Wed, 21 Sep 2011 11:39:03 -0500, Andrei Alexandrescu wrote:

> On 9/21/11 10:16 AM, Christophe wrote:
>> Timon Gehr wrote in message (digitalmars.D:144889):
>>> unicode natively. Yet the 'D strings are strange and confusing'
>>> argument comes up quite often on the web.
>>
>> Well, I think they are. The ptr+length stuff is amazing, but the
>> behavior of strings in phobos is weird.
>>
>> mini-quiz: what should std.range.drop(some_string, 1) do ? hint: what
>> it actually does is not what the documentation of phobos suggests*...
>>
>> Strings are arrays of char, but they appear as a lazy range of dchar
>> to phobos. I could cope with the fact that this is a little unexpected
>> for beginners. But that creates a lot of exceptions in phobos,
>> like the fact that you can't even copy a char[] to a char[] with
>> std.algorithm.copy. And I haven't even mentioned all the optimizations
>> that are not/cannot be performed for those strings. I'll just remember
>> to use ubyte[] wherever I can...
> 
> String handling in D is good modulo the oddities you noticed. What would
> make it perfect would be:
> 
> * Add property .rep that returns byte[], ushort[], or uint[] for char[],
> wchar[], dchar[] respectively (with the appropriate qualifier).
> 
> * Replace .length with .codeUnits.
> 
> * Disallow [n] and [m .. n]
> 
> This would upgrade D's strings from good to awesome. Really it would be
> a dream come true. Unfortunately it would also break most D code there
> is out there. I don't see how we can improve the current situation while
> staying backward compatible.
> 
> 
> Andrei

1. Let "string" remain an alias for "immutable(char)[]", and introduce a 
new struct, "text!charType", that does the awesome stuff; provide good 
conversion routines and casts between text instances and old-school 
strings.

2. Provide an awesome "std.text" library for the text type and related 
operations, absorbing std.uni and other similar modules. Adapt Phobos to 
take "text" in virtually every function that currently expects a "string".

3. ...

4. Profit.

Graham
September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
Jonathan M Davis wrote in message (digitalmars.D:144896):
> On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
>> Timon Gehr wrote in message (digitalmars.D:144889):
>> > unicode natively. Yet the 'D strings are strange and confusing' argument
>> > comes up quite often on the web.
>> 
>> Well, I think they are. The ptr+length stuff is amazing, but the
>> behavior of strings in phobos is weird.
>> 
>> mini-quiz: what should std.range.drop(some_string, 1) do ?
>> hint: what it actually does is not what the documentation of phobos
>> suggests*...
> 
> What do you mean? It does exactly what it says that it does.

It does say it uses the slice operator if the range is sliceable, and
the documentation of isSliceable fails to specify that a narrow string 
is not sliceable.

> Yeah, well, as long as char is a unicode code unit, that's the way that it 
> goes.

They are not unicode units.

import std.stdio;

void main() {
 char a = 'ä';
 writeln(a); // outputs: \344
 writeln('ä'); // outputs: ä
}

Obviously, a code unit doesn't fit in a char.
Thus 'char[]' is not what the name claims it is.

Unicode operations should be supported by a different class that 
is really a lazy range of dchar implemented over an underlying char[], with 
no length, index, or stride operator, and appropriate optimizations.
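
A rough sketch of what such a type could look like, assuming current Phobos. The name Text and its codeUnits member are hypothetical, chosen only for illustration. This is not an existing Phobos type.

```d
import std.algorithm.comparison : equal;
import std.range.primitives : ElementType, hasLength, hasSlicing, isForwardRange;
import std.utf : decode, stride;

/// Hypothetical wrapper: a forward range of dchar over UTF-8 storage,
/// deliberately offering no length, opIndex, or opSlice.
struct Text
{
    private string data;

    @property bool empty() { return data.length == 0; }

    /// Decodes the first code point without consuming it.
    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    /// Skips the code units that make up the first code point.
    void popFront() { data = data[stride(data, 0) .. $]; }

    @property Text save() { return this; }

    /// Size of the underlying storage, named so it cannot be mistaken
    /// for a character count.
    @property size_t codeUnits() { return data.length; }
}

void main()
{
    static assert(isForwardRange!Text);
    static assert(is(ElementType!Text == dchar));
    static assert(!hasLength!Text && !hasSlicing!Text);

    auto t = Text("äb");
    assert(t.codeUnits == 3);
    assert(equal(t, "äb"d));
}
```

Because the wrapper exposes neither length nor slicing, generic code cannot accidentally treat it as a random-access array of characters.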

> In general, Phobos does a good job of using slicing and other 
> optimizations on strings when it can in spite of the fact that they're not 
> sliceable ranges, but there are cases where the fact that you _have_ to 
> process them to be able to find the nth code point means that you just can't 
> process them as efficiently as your typical array. That's life with a variable-
> length encoding. - and that includes std.algorithm.copy. But that's an easy one 
> to get around, since if you wanted to ignore unicode safety and just copy some 
> chunk of the string, you can always just slice it directly with no need for 
> copy.

Dealing with UTF-encoded strings is less efficient, but there are a number 
of algorithms that can be optimized for UTF-encoded strings, like copying 
or finding an ASCII char in a string. Unfortunately, there is no 
practical way to do this with the current range API.

About copy, it's not that easy to overcome the problem if you are using 
a template, and that template happens to be instantiated for strings.
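
For the copy case specifically, two possible workarounds can be sketched, assuming a current Phobos that provides std.string.representation. The buffer names are made up, and this only sidesteps the issue for code that knows it is dealing with char[]; it does not solve the generic-template problem described above.

```d
import std.algorithm.mutation : copy;
import std.string : representation;

void main()
{
    char[] src = "äb".dup;

    // Workaround 1: array slice assignment copies code units directly.
    char[] dst = new char[](src.length);
    dst[] = src[];
    assert(dst == "äb");

    // Workaround 2: view both buffers as ubyte[] so that copy sees
    // ordinary arrays instead of auto-decoded ranges of dchar.
    char[] dst2 = new char[](src.length);
    auto rest = copy(src.representation, dst2.representation);
    assert(rest.length == 0);  // the target was filled completely
    assert(dst2 == "äb");
}
```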

>> * Please, could someone add to the documentation of IsSliceable that
>> narrow strings are an exception, like it was recently added to
>> hasLength.
> 
> A good point.

The main point of my post actually.

-- 
Christophe
September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
On 21/09/11 5:39 PM, Andrei Alexandrescu wrote:
> On 9/21/11 10:16 AM, Christophe wrote:
>> Timon Gehr wrote in message (digitalmars.D:144889):
>>> unicode natively. Yet the 'D strings are strange and confusing' argument
>>> comes up quite often on the web.
>>
>> Well, I think they are. The ptr+length stuff is amazing, but the
>> behavior of strings in phobos is weird.
>>
>> mini-quiz: what should std.range.drop(some_string, 1) do ?
>> hint: what it actually does is not what the documentation of phobos
>> suggests*...
>>
>> Strings are arrays of char, but they appear as a lazy range of dchar to
>> phobos. I could cope with the fact that this is a little unexpected for
>> beginners. But that creates a lot of exceptions in phobos, like
>> the fact that you can't even copy a char[] to a char[] with
>> std.algorithm.copy. And I haven't even mentioned all the optimizations that
>> are not/cannot be performed for those strings. I'll just remember to use
>> ubyte[] wherever I can...
>
> String handling in D is good modulo the oddities you noticed. What would
> make it perfect would be:
>
> * Add property .rep that returns byte[], ushort[], or uint[] for char[],
> wchar[], dchar[] respectively (with the appropriate qualifier).
>
> * Replace .length with .codeUnits.
>
> * Disallow [n] and [m .. n]
>
> This would upgrade D's strings from good to awesome. Really it would be
> a dream come true. Unfortunately it would also break most D code there
> is out there. I don't see how we can improve the current situation while
> staying backward compatible.
>
>
> Andrei

From what I can see, the problem with D strings is that they are a 
'magic' special case for arrays.

char[] should be an array of char, just like int[] is an array of int. 
If you have a T[] arr, then typeof(arr.front) should be T. This is what 
everyone would expect. char[] should essentially be the same as byte[], 
although char[] would be more natural for ASCII strings.

string should be something different, a separate type. As you say, 
disallowing [n] and [m..n] would be good, as they make no sense with VLE. 
You could have .length and .codeUnits, but length would have to be O(n). 
That's not ideal, but since string wouldn't be an array, it doesn't need 
to have the same complexity guarantees.

Same for wchar[], dchar[], wstring and dstring.

Of course, making that change would break existing code. Maybe D3? :-)
September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
On Wed, 21 Sep 2011 20:20:55 +0200, Christophe Travert  
<travert@phare.normalesup.org> wrote:


>> Yeah, well, as long as char is a unicode code unit, that's the way that  
>> it
>> goes.
>
> They are not unicode units.
>
> void main() {
>   char a = 'ä';
>   writeln(a); // outputs: \344
>   writeln('ä'); // outputs: ä
> }
>
> Obviously, a code unit doesn't fit in a char.
> Thus 'char[]' is not what the name claims it is.

Oh, it absolutely is. According to the Unicode Consortium, A code unit is
"The minimal bit combination that can represent a unit of encoded text
for processing or interchange. The Unicode Standard uses 8-bit code units
in the UTF-8 encoding form [...]".

What you are thinking about is a code point.
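
The distinction can be checked directly. This is a small sketch assuming current Phobos; std.utf.codeLength reports how many code units of a given character type one code point needs.

```d
import std.range : walkLength;
import std.utf : codeLength;

void main()
{
    dchar cp = 'ä';                      // one code point, U+00E4
    assert(cp == 0xE4);

    assert(codeLength!char(cp) == 2);    // two UTF-8 code units
    assert(codeLength!wchar(cp) == 1);   // one UTF-16 code unit

    string s = "ä";
    assert(s.length == 2);               // code units stored in the array
    assert(s.walkLength == 1);           // code points seen when decoding

    // char, wchar and dchar are the 8-, 16- and 32-bit code unit types.
    static assert(char.sizeof == 1 && wchar.sizeof == 2 && dchar.sizeof == 4);
}
```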


> Unicode operations should be supported by a different class that
> is really a lazy range of dchar implemented over an underlying char[], with
> no length, index, or stride operator, and appropriate optimizations.

I can agree with this, but the benefits over what we already have are nigh
zilch.


-- 
  Simen
September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
On Wednesday, September 21, 2011 11:20 Christophe Travert wrote:
> Jonathan M Davis wrote in message (digitalmars.D:144896):
> > On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
> >> Timon Gehr wrote in message (digitalmars.D:144889):
> >> > unicode natively. Yet the 'D strings are strange and confusing'
> >> > argument comes up quite often on the web.
> >> 
> >> Well, I think they are. The ptr+length stuff is amazing, but the
> >> behavior of strings in phobos is weird.
> >> 
> >> mini-quiz: what should std.range.drop(some_string, 1) do ?
> >> hint: what it actually does is not what the documentation of phobos
> >> suggests*...
> > 
> > What do you mean? It does exactly what it says that it does.
> 
> It does say it uses the slice operator if the range is sliceable, and
> the documentation of isSliceable fails to specify that a narrow string
> is not sliceable.

1. drop says nothing about slicing.
2. popFrontN (which drop calls) says that it slices for ranges that support 
slicing. Strings do not unless they're arrays of dchar.

Yes, hasSlicing should probably be clearer about narrow strings, but that has 
nothing to do with drop.

> > Yeah, well, as long as char is a unicode code unit, that's the way that
> > it goes.
> 
> They are not unicode units.
> 
> void main() {
> char a = 'ä';
> writeln(a); // outputs: \344
> writeln('ä'); // outputs: ä
> }
> 
> Obviously, a code unit doesn't fit in a char.
> Thus 'char[]' is not what the name claims it is.
> 
> Unicode operations should be supported by a different class that
> is really a lazy range of dchar implemented over an underlying char[], with
> no length, index, or stride operator, and appropriate optimizations.

The problem with

char a = 'ä';

is completely separate from char[]. That has to do with the fact that the 
compiler isn't properly dealing with narrowing conversions for an individual 
character. A char is _by definition_ a UTF-8 code unit. 

char a = 'ä';

shouldn't even be legal. It's a compiler bug. Most code which operates on 
individual chars or wchars is buggy. dchar is what should be used for 
individual characters. The code you give is buggy because it's trying to use 
char as individual character, and the compiler has a bug, so it doesn't catch 
it.
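
To illustrate why dchar is the type to use for individual characters, here is a minimal sketch: iterating the same string with a char loop variable visits code units, while a dchar loop variable makes foreach decode code points. The sample string is made up.

```d
void main()
{
    string s = "äb";
    size_t units, points;

    foreach (char c; s)  ++units;   // visits raw UTF-8 code units
    foreach (dchar c; s) ++points;  // decodes and visits code points

    assert(units == 3);   // 'ä' contributes two code units
    assert(points == 2);  // but is a single code point
}
```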

> > In general, Phobos does a good job of using slicing and other
> > optimizations on strings when it can in spite of the fact that they're
> > not sliceable ranges, but there are cases where the fact that you _have_
> > to process them to be able to find the nth code point means that you
> > just can't process them as efficiently as your typical array. That's
> > life with a variable- length encoding. - and that includes
> > std.algorithm.copy. But that's an easy one to get around, since if you
> > wanted to ignore unicode safety and just copy some chunk of the string,
> > you can always just slice it directly with no need for copy.
> 
> Dealing with UTF-encoded strings is less efficient, but there are a number
> of algorithms that can be optimized for UTF-encoded strings, like copying
> or finding an ASCII char in a string. Unfortunately, there is no
> practical way to do this with the current range API.
>
> About copy, it's not that easy to overcome the problem if you are using
> a template, and that template happens to be instantiated for strings.

In general, you _must_ treat strings as ranges of dchar if you want your code 
to be correct with regards to code points. So, that's the default. If you know 
that your particular algorithm can be optimized for strings without treating 
them strictly as ranges of dchar and still deal with unicode correctly (e.g. 
by slicing them, because you know the correct point to slice to), then you 
special-case your template for narrow strings. Phobos does that in a number of 
places. But you _have_ to special case it because being able to do so depends 
entirely on your algorithm. You can't treat strings that way in the general 
case, or your code is not going to handle unicode correctly.
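
As a sketch of the special-casing described here, assuming current Phobos: the function name countAscii is hypothetical. It counts occurrences of an ASCII character, which is safe to do on raw code units because values below 0x80 never occur inside a multi-unit UTF-8 or UTF-16 sequence.

```d
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;
import std.string : representation;
import std.traits : isNarrowString;

/// Hypothetical helper: counts occurrences of an ASCII character in a range
/// of characters, with a fast path for narrow strings.
size_t countAscii(R)(R r, char needle)
if (isInputRange!R && is(ElementType!R : dchar))
{
    assert(needle < 0x80, "needle must be ASCII");
    size_t n;
    static if (isNarrowString!R)
    {
        // Fast path: scan code units directly, no decoding needed.
        foreach (unit; r.representation)
            if (unit == needle) ++n;
    }
    else
    {
        // Generic path: walk the range element by element.
        for (; !r.empty; r.popFront())
            if (r.front == needle) ++n;
    }
    return n;
}

void main()
{
    assert(countAscii("äbäb", 'b') == 2);   // narrow string, fast path
    assert(countAscii("äbäb"d, 'b') == 2);  // dstring, generic path
}
```

The fast path is only valid because of a property of this particular algorithm, which is the point made above: whether a narrow string can be handled by code unit depends on what the algorithm does.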

No, the way that strings are handled in D is not perfect. But we're trying to 
balance correctness and efficiency, and it does a good job of that. Using 
strings as ranges of dchar is correct, and in a large number of cases, it is 
as efficient as you're going to get it (perhaps particular functions could be 
better optimized, but the API can't be made more efficient). And in the cases 
where you know you can safely get better efficiency out of strings by special 
casing them, that's what you do. It works. It works fairly well. And no one 
has been able to come up with a solution that's definitely better.

If there's an issue, it's that the message on how to correctly handle strings 
is not necessarily being communicated as well as it needs to be. In a few 
places, the documentation could probably be improved, but what we probably 
really need in order to help get the message across is more articles on ranges 
and strings as ranges. I've actually partially written an article on that, but 
due to some compiler bugs regarding std.container, I ended up temporarily
shelving it.

- Jonathan M Davis
September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
"Simen Kjaeraas" wrote in message (digitalmars.D:144921):
> What you are thinking about is a code point.

Yes, sorry. Then I disagree with "as long as char is a unicode code 
unit, that's the way that it goes", since myString.front should then 
return a code unit, whereas it actually returns a code point.
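
The mismatch is easy to see side by side. This is a minimal sketch assuming a current Phobos with auto-decoding and std.utf.byCodeUnit; the sample string is made up.

```d
import std.range.primitives : front;
import std.utf : byCodeUnit;

void main()
{
    string s = "äb";

    // Indexing the array yields a code unit...
    static assert(is(typeof(s[0]) == immutable(char)));
    assert(s[0] == 0xC3);

    // ...while the range primitive front decodes a whole code point.
    auto f = s.front;
    static assert(is(typeof(f) == dchar));
    assert(f == 'ä');

    // An explicit code-unit view makes front agree with indexing again.
    assert(s.byCodeUnit.front == 0xC3);
}
```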

>> Unicode operations should be supported by a different class that
>> is really a lazy range of dchar implemented over an underlying char[], with
>> no length, index, or stride operator, and appropriate optimizations.
> 
> I can agree with this, but the benefits over what we already have are nigh
> zilch.

I think no one here has any illusions about changing this in D2.

However, this class could be introduced in phobos right now, without 
changing anything about string. It would simply be a bit safer than 
strings.

-- 
Christophe
September 21, 2011
Re: D's confusing strings (was Re: D on hackernews)
"Jonathan M Davis" wrote in message (digitalmars.D:144922):
> 1. drop says nothing about slicing.
> 2. popFrontN (which drop calls) says that it slices for ranges that support 
> slicing. Strings do not unless they're arrays of dchar.
> 
> Yes, hasSlicing should probably be clearer about narrow strings, but that has 
> nothing to do with drop.

I never said there was a problem with drop.

> char a = 'ä';
> 
> shouldn't even be legal. It's a compiler bug.

I figured that out.
I wanted to show that a char couldn't hold a code point, but I was too 
fast and confused code points with code units.

>> Dealing with UTF-encoded strings is less efficient, but there are a number
>> of algorithms that can be optimized for UTF-encoded strings, like copying
>> or finding an ASCII char in a string. Unfortunately, there is no
>> practical way to do this with the current range API.

Maybe there should be a way for the designer of a class to provide an 
overload for some algorithms, like forwarding to myClass.algorithm for 
instance. The problem is that this is an open door for involuntary 
hacking.

Oh, I just noticed I'm actually replying to myself. Thinking out loud, 
am I?

> [...]

After having read all of you, I have no problem with string being a 
lazy range of dchar. But I have a problem with immutable(char)[] being 
a lazy range of dchar (i.e. not being an array), and I have a problem with 
string being immutable(char)[] (i.e. providing length, opIndex, and 
opSlice).

Thanks
-- 
Christophe