July 14, 2017
On 07/14/2017 09:50 PM, Anton Fediushin wrote:
> But why? This should be true for `char[]`, isn't it?
> -----
> if ((ss == SwapStrategy.unstable && (hasSwappableElements!Range || hasAssignableElements!Range) || ss != SwapStrategy.unstable && hasAssignableElements!Range) && isRandomAccessRange!Range && hasSlicing!Range && hasLength!Range)
> -----

No, those are all false. char[] is treated as a range of code points (dchar), not code units (char). It's decoded on-the-fly ("auto decoding").

Code unit sequences are variable in length. So to get the n-th code point you have to decode everything before it. That's too expensive to be considered random access.

Same for swappable/assignable elements: When the new sequence has a different length than the old one, you have to move everything that follows to make room or to close the gap. Too expensive to be accepted by the templates.

As for why auto-decoding is a thing:

The idea was to treat char[] and friends as ranges of visual characters to have a robust default where you don't accidentally mess your strings up. Unfortunately, a code point isn't always what's commonly understood as a  character (grapheme), and the whole things falls apart. It's a historical accident, really.

Here's a huge discussion thread about maybe getting rid of it:

http://forum.dlang.org/post/nh2o9i$hr0$1@digitalmars.com

And here's what I took away from that:

http://forum.dlang.org/post/nirpdo$167i$1@digitalmars.com
July 14, 2017
On Friday, July 14, 2017 7:50:17 PM MDT Anton Fediushin via Digitalmars-d- learn wrote:
> On Friday, 14 July 2017 at 17:23:41 UTC, Steven Schveighoffer
>
> wrote:
> > Don't do this, because it's not what you think. It's not actually calling std.algorithm.sort, but the builtin array sort property. This will be going away soon.
>
> This sucks. I know, that `.sort` will be removed, but I thought it won't break any code.
>
> > With 2.075, it won't compile even without the parentheses, because a char[] is not an array according to std.algorithm...
>
> But why? This should be true for `char[]`, isn't it?
> -----
> if ((ss == SwapStrategy.unstable && (hasSwappableElements!Range
>
> || hasAssignableElements!Range) || ss != SwapStrategy.unstable &&
>
> hasAssignableElements!Range) && isRandomAccessRange!Range &&
> hasSlicing!Range && hasLength!Range)
> -----
> (It's from
> https://dlang.org/phobos/std_algorithm_sorting.html#sort)

It has to do with Unicode and Andrei's attempt to make ranges Unicode correct without having to think about it. It was a nice thought, but it really doesn't work, and it causes a number of annoying problems. As a quick explanation, in Unicode, you have code units, which combine to make code points. In UTF-8, a code unit is 8 bits. In UTF-16, it's 16 bits, and in UTF-32, it's 32 bits. char is a UTF-8 code unit. wchar is a UTF-16 code unit, and dchar is a UTF-32 cod unit. 32 bits is enough to represent any code point, so a valid dchar is not only a code unit, it's guaranteed to be a code point. So, indexing or slicing a dstring will not cut into the middle of any code points. However, because UTF-8 and UTF-16 potentially require multiple code units in order to represent a code point, you can't just arbitrarily index a string or wstring (or arbitrarily slice either of them) or you risk breaking up a code point.

To avoid this problem and guarantee Unicode correctness, Andrei made it so that the range API does not treat strings or wstrings as either random access or sliceable. So, isRandomAccessRange!string and hasSlicing!string are false. And front returns dchar for all string types, because it decodes the first code point from the code units, which means that popFront could pop one char or wchar off, or it could pop several. Similarly, because the number of code points can't be known without iterating them, hasLength!string is false.

This does prevent you from blindly doing things to your strings using the range API which will make them have invalid code points, but it also makes it really annoying for stuff like sort, because sort requires a random access range to sort, but char[] and wchar[] are not considered to be random access ranges, because they're considered to be ranges of dchar rather than char or wchar. It also hurts performance, because many algorithms don't actually need to decode the code points - which is why many Phobos functions have special overloads for strings; they do the decinding only when required or skip it entirely.

To make matters worse, not only is all of this frustrating to deal with, but it's not even fully correct. When Andrei added the range API to Phobos, he thought that code points where the character you see on the screen (and annoyingly, Unicode _does_ call them characters for some stupid reason), but they really aren't. They do represent printable items, but they don't just include characters such as A. They also include stuff like accents or subscripts such that you sometimd actually need multiple code points to represent an actual character on the screen - and a group of code points which represent a character or glyph that you'd see on the screen are called grapheme clusters. So, to be fully correct, ranges would have to decode clear to grapheme clusters by default, which would horribly inefficient. The correct way to handle Unicode requires that the programmer understand it well enough to know when they should be operating at the code unit level, when they should be operating at the code point level, and when they should be operating at the grapheme level. You really can't automate it - especially not if you want to be efficent. And you rarely want the code point level, making Andrei's choice particularly bad.

So really, ranges should not be treating strings in a special manner; they should be treated as ranges of code units and require the programmer to wrap them in ranges of code points or graphemes as appropriate. But unfortunately, making that change now would break a lot of code. So, we seem to be stuck. The result is that you're forced to either specialize your functions on strings or use functions like std.string.representation or std.utf.byCodeUnit to work around the problem. And of course, this whole issue is incredibly confusing to anyone coming to D - especially those who aren't well-versed in Unicode. :(

- Jonathan M Davis

July 15, 2017
On Friday, 14 July 2017 at 17:23:41 UTC, Steven Schveighoffer wrote:

> import std.string: representation, assumeUTF;
> import std.algorithm: sort;
>
> auto bytes = line.representation.dup;
> bytes.sort;
> auto result = bytes.assumeUTF; // result is now char[]


Why does it have to be char[]?

 auto bytes = line.representation.dup;
 bytes.sort;
 string result = bytes.assumeUTF;

works too.
July 15, 2017
On 07/15/2017 04:33 AM, Namal wrote:
> Why does it have to be char[]?
> 
>   auto bytes = line.representation.dup;
>   bytes.sort;
>   string result = bytes.assumeUTF;
> 
> works too.

That's a compiler bug. The code should not compile, because now you can mutate `result`'s elements through `bytes`. But `result`'s elements are supposed to be immutable.

I've filed an issue: https://issues.dlang.org/show_bug.cgi?id=17654
July 15, 2017
On Friday, 14 July 2017 at 20:52:54 UTC, Jonathan M Davis wrote:
> And of course, this whole issue is incredibly confusing to anyone
> coming to D - especially those who aren't well-versed in Unicode.

Right on. Thanks for your very clear summary (the whole thing, not just the last line!). Much appreciated.
1 2
Next ›   Last »