March 09, 2014
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
> On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of grapheme, it was still a design decision made of win.
>
> I'd be tempted to not ask "how do we back out", but rather, "how can we take this further"? I'd love to ditch the whole "char"/"dchar" thing altogether, and work with graphemes. But that would be massive involvement.

Why do you think it is better?

Let's be clear here: if you are searching/iterating/comparing by code point then your program is either not correct, or no better than doing so by code unit. Graphemes don't really fix this either.

I think this is the main confusion: the belief that iterating by code point has utility.

If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets).

If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos.

AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character?
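The `'é'` failure mode is easy to demonstrate concretely. This Python sketch (used here only because it makes the code points visible; the same applies to iterating a D string by dchar) shows that a code-point search for the composed character misses the same text in decomposed form:

```python
import unicodedata

# "é" can be one code point (NFC: U+00E9) or two (NFD: U+0065 + U+0301).
nfc = unicodedata.normalize("NFC", "caf\u00e9")
nfd = unicodedata.normalize("NFD", "caf\u00e9")

print(len(nfc))         # 4 code points
print(len(nfd))         # 5 code points
print("\u00e9" in nfc)  # True:  code-point search finds the composed form
print("\u00e9" in nfd)  # False: the same text, decomposed, is not found
```

So even with full decoding, a by-code-point search is only correct if both sides happen to share a normalization form.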

To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this?

I do think it's probably too late to change this, but I think there is value in at least getting everyone on the same page.
March 09, 2014
On Sunday, 9 March 2014 at 09:24:02 UTC, Nick Sabalausky wrote:
>
> I'm leaning the same way too. But I also think Andrei is right that, at this point in time, it'd be a terrible move to change things so that "by code unit" is default. For better or worse, that ship has sailed.
>
> Perhaps we *can* deal with the auto-decoding problem not by killing auto-decoding, but by marginalizing it in an additive way:
>
> Convincing arguments have been made that any string-processing code which *isn't* done entirely with the official Unicode algorithms is likely wrong *regardless* of whether std.algorithm defaults to per-code-unit or per-code-point.
>
> So...How's this?: We add any of these Unicode algorithms we may be missing, encourage their use for strings, discourage use of std.algorithm for string processing, and in the meantime, just do our best to reduce unnecessary decoding wherever possible. Then we call it a day and all be happy :)

I've been watching this discussion for the last few days, and I'm kind of a nobody jumping in pretty late, but after thinking about the problem for a while I would agree with a solution along the lines of what you have suggested.

I think Vladimir is definitely right when he says that when you have algorithms that deal with natural languages, simply working on the basis of a code unit isn't enough. I think it is also true that you need to select a particular algorithm for dealing with strings of characters, as there are many different algorithms for different languages which behave differently, perhaps several in a single language. I also think Andrei is right when he says we need to minimise code breakage, and that the default string decoding and encoding isn't the biggest of performance problems.

I think our best option is to offer a function in std.array which creates a range over the raw character data, without decoding to code points.

myArray.someAlgorithm; // std.array .front used today with decode calls
myArray.rawData.someAlgorithm; // New range which doesn't decode.

Then we could look at creating algorithms for string processing which don't use the existing dchar abstraction.

myArray.rawData.byNaturalSymbol!SomeIndianEncodingHere; // Range of strings, maybe range of range of characters, not dchars

Or even specialise the new algorithm so it looks for arrays and turns them into the ranges for you via the transformation myArray -> myArray.rawData.

myArray.byNaturalSymbol!SomeIndianEncodingHere;

Honestly, I'd leave the details of such an algorithm to Vladimir rather than myself, because he's spent far more time looking into Unicode processing than I have. My knowledge of Unicode pretty much just comes from having to deal with foreign-language customers and discovering the problems with the code unit abstraction most languages seem to use. (Java and Python suffer from similar issues, but they don't really have algorithms in the way that we do.)

This new set of algorithms taking settings for different encodings could be first implemented in a third party library, tested there, and eventually submitted to Phobos, probably in std.string.

There's my input, I'll duck before I'm beheaded.
March 09, 2014
> - In lots of places, I've discovered that Phobos did UTF decoding (thus murdering performance) when it didn't need to. Such cases included format (now fixed), appender (now fixed), startsWith (now fixed - recently), skipOver (still unfixed). These have caused latent bugs in my programs that happened to be fed non-UTF data. There's no reason for why D should fail on non-UTF data if it has no reason to decode it in the first place! These failures have only served to identify places in Phobos where redundant decoding was occurring.

With all due respect, D's string type is exclusively for UTF-8 strings. If it is not valid UTF-8, it should never have been a D string in the first place. For the other cases, ubyte[] is there.

March 09, 2014
On 09/03/14 04:26, Andrei Alexandrescu wrote:
>>> 2. Add byChar that returns a random-access range iterating a string by
>>> character. Add byWchar that does on-the-fly transcoding to UTF16. Add
>>> byDchar that accepts any range of char and does decoding. And such
>>> stuff. Then whenever one wants to go through a string by code point
>>> can just use str.byChar.
>>
>> This is confusing. Did you mean to say that byChar iterates a string by
>> code unit (not character / code point)?
>
> Unit. s.byChar.front is a (possibly ref, possibly qualified) char.

So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about?

In which case it seems to me a better solution -- "safe" strings by default, unsafe speed-focused solution available if you want it.  ("Safe" here in the more general sense of "Doesn't generate unexpected errors" rather than memory safety.)

March 09, 2014
On Sunday, 9 March 2014 at 11:34:31 UTC, Peter Alexander wrote:
> On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
>> On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of grapheme, it was still a design decision made of win.
>>
>> I'd be tempted to not ask "how do we back out", but rather, "how can we take this further"? I'd love to ditch the whole "char"/"dchar" thing altogether, and work with graphemes. But that would be massive involvement.
>
> Why do you think it is better?
>
> Let's be clear here: if you are searching/iterating/comparing by code point then your program is either not correct, or no better than doing so by code unit. Graphemes don't really fix this either.
>
> I think this is the main confusion: the belief that iterating by code point has utility.
>
> If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets).
>
> If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos.

IMO, the "normalization" argument is overrated. I've yet to encounter a real-world case of normalization: only hand-written counter-examples. Not saying it doesn't exist, just that:
1. It occurs only in special cases that the program should be aware of beforehand.
2. It can arguably be taken care of eagerly, or in a special pass.
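Point 2 (normalize eagerly, in a dedicated pass) can be sketched quickly. This Python fragment is only an illustration of the idea, with a hypothetical `prepare` boundary function; the D equivalent would use std.uni's normalization support:

```python
import unicodedata

def prepare(s):
    # Hypothetical ingestion step: normalize once, at the boundary,
    # so all later comparisons can be naive.
    return unicodedata.normalize("NFC", s)

a = "caf\u00e9"    # composed: 'é' is one code point
b = "cafe\u0301"   # decomposed: 'e' plus combining acute accent

print(a == b)                     # False before the pass
print(prepare(a) == prepare(b))   # True after it
```

Once all strings entering the program go through such a pass, plain equality and search work without per-operation normalization.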

As for "the belief that iterating by code point has utility", I have to strongly disagree. Unicode is composed of code points, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail.

As for the grapheme thing, I'm not actually so sure about it myself, so don't take it too seriously.

> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character?

But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter and findAmong don't decode if they don't have to.

AFAIK, the most common algorithm, "case-insensitive search", *must* decode.
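Why case-insensitive matching can't stay at the byte level is easy to show. In this Python sketch (illustrating the Unicode rule, not any D API), no byte-wise tolower can equate the two strings, because full case folding maps one code point to two characters:

```python
# Case mapping is defined on code points (really, on characters),
# not on UTF-8 code units.
a = "STRASSE"
b = "stra\u00dfe"   # "straße": U+00DF folds to "ss" under full case folding

print(a.lower() == b.lower())        # False: simple lowercasing is not enough
print(a.casefold() == b.casefold())  # True: full Unicode case folding
```

Anything that has to apply per-character mappings like this needs to see characters, which for a UTF-8 string means decoding.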

There may be cases where it still doesn't work as intended in the face of normalization, but it is still leaps and bounds better than what we get iterating by code units.

To turn it the other way around, *what* are you guys doing, that doesn't require decoding, and where performance is such a killer?

> To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this?

I do not know of a single bug report regarding buggy Phobos code that used front/popFront. Not_a_single_one (AFAIK).

On the other hand, there are plenty of bugs from attempting not to decode strings, or from decoding them incorrectly. They are being corrected on a continuous basis.

Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong. Even Walter pointed out that such code should work. *Maybe* it is still wrong in regards to graphemes and normalization, but at *least*, the result is not a corrupted UTF-8 stream.
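What goes wrong when you sort by code unit can be shown in a few lines. This Python sketch just mimics sorting the UTF-8 code units of a string containing one non-ASCII character:

```python
s = "ABCD\u00e9"   # 'é' is two UTF-8 code units: 0xC3 0xA9
units = bytes(sorted(s.encode("utf-8")))

print(units)       # b'ABCD\xa9\xc3': the two-unit sequence is torn apart
try:
    units.decode("utf-8")
except UnicodeDecodeError:
    print("sorted code units are no longer valid UTF-8")
```

Sorting by code point at least keeps every character intact, which is exactly the "not a corrupted UTF-8 stream" guarantee mentioned above.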

Walter keeps grinding on about "myCharArray.put('é')" not working, but I'm not sure he realizes how dangerous it would actually be to allow such a thing to work.

In particular, in all these cases, a simple call to "representation" will deactivate the feature, giving you the tools you want.

> I do think it's probably too late to change this, but I think there is value in at least getting everyone on the same page.

Me too. I do see the value in being able to do decode-less iteration. I just think the *default* behavior has the advantage of being correct *most* of the time, and definitely much more correct than without decoding.

I think opt-out of decoding is just a much much much saner approach to string handling.
March 09, 2014
On Friday, 7 March 2014 at 04:11:15 UTC, Nick Sabalausky wrote:
> What about this?:
>
> Anywhere we currently have a front() that decodes, such as your example:
>
>>   @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
>>   {
>>     assert(a.length, "Attempting to fetch the front of an empty array
>> of " ~
>>            T.stringof);
>>     size_t i = 0;
>>     return decode(a, i);
>>   }
>>
>
> We rip out that front() entirely. The result is *not* technically a range...yet! We could call it a protorange.
>
> Then we provide two functions:
>
> auto decode(someStringProtoRange) {...}
> auto raw(someStringProtoRange) {...}
>
> These convert the protoranges into actual ranges by adding the missing front() function. The 'decode' adds a front() which decodes into dchar, while the 'raw' adds a front() which simply returns the raw underlying type.
>
> I imagine the decode/raw would probably also handle any "length" property (if it exists in the protorange) accordingly.
>
> This way, the user is forced to specify "myStringRange.decode" or "myStringRange.raw" as appropriate, otherwise myStringRange can't be used since it isn't technically a range, only a protorange.
>
> (Naturally, ranges of dchar would always have front, since no decoding is ever needed for them anyway. For these ranges, the decode/raw funcs above would simply be no-ops.)

Strings can be iterated over by code unit, code point, grapheme, grapheme cluster (?), words, sentences, lines, paragraphs, and potentially other things. Therefore, it makes sense to require the same for ranges of dchar, too.

Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni.
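The point that one string has several legitimate iteration levels is easy to make concrete. This Python sketch counts the same short text three ways; note the grapheme count here is a deliberate simplification (merely folding combining marks into their base character), since real grapheme-cluster segmentation is defined by UAX #29 and is more involved:

```python
import unicodedata

s = "cafe\u0301"   # "café" written with a combining acute accent

code_units  = len(s.encode("utf-8"))   # 6: the accent alone is 2 UTF-8 bytes
code_points = len(s)                   # 5
# Crude grapheme count: ignore combining marks (illustration only).
graphemes = sum(1 for c in s if not unicodedata.combining(c))

print(code_units, code_points, graphemes)   # 6 5 4
```

Three different answers for "how long is this string", each correct at its own level, which is why the level should be an explicit choice.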
March 09, 2014
On Sunday, 9 March 2014 at 13:08:05 UTC, Marc Schütz wrote:
> Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to much the already existing `byGrapheme` in std.uni.

There already is a std.uni.byCodePoint. It is a higher order range that accepts ranges of graphemes and ranges of code points (such as strings).

`byCodeUnit` is essentially std.string.representation.
March 09, 2014
On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
> 2) It is regression back to C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if language/Phobos won't punish you for that, it will be extremely widespread.

Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.
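The "protorange" idea can be sketched outside of D as well. Here's a rough Python model (all names hypothetical), where the wrapper deliberately refuses plain iteration until the caller picks a level:

```python
class ProtoString:
    """Not iterable by itself: the caller must choose a decoding level."""

    def __init__(self, data: bytes):
        self.data = data

    def __iter__(self):
        raise TypeError("choose .raw() or .decode() first")

    def raw(self):
        # Range of code units (bytes), no decoding.
        return iter(self.data)

    def decode(self):
        # Range of code points.
        return iter(self.data.decode("utf-8"))

p = ProtoString("caf\u00e9".encode("utf-8"))
print(list(p.raw()))      # five code units
print(list(p.decode()))   # four code points
```

The key property is that forgetting to choose is a loud error rather than a silent default, which is the whole point of the proposal.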
March 09, 2014
On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
> On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev
>> Can we look at some example situations that this will break?
>
> Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string?

This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead.
March 09, 2014
On 2014-03-09 13:00:45 +0000, "monarch_dodra" <monarchdodra@gmail.com> said:

> AFAIK, the most common algorithm "case insensitive search" *must* decode.

Not necessarily. While the Unicode collation algorithms (which should be used to compare text) are defined in terms of code points, you could build a collation element table using code units as keys and bypass the decoding step when searching the table. I'm not sure if there would be a significant performance gain, though.

That remains an optimization though. The natural way to implement a Unicode algorithm is to base it on code points.
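The byte-keyed-table trick can be sketched as a toy. In this Python illustration the weights are made up (a real table would be derived from the Unicode Collation Algorithm's DUCET data); the point is only that longest-match lookup on UTF-8 code units yields collation elements without an explicit decode step:

```python
# Toy collation-element table keyed by UTF-8 code-unit sequences.
# Weights are invented for illustration only.
WEIGHTS = {
    b"a": 1,
    b"e": 2,
    b"\xc3\xa9": 2,   # U+00E9 'é': same primary weight as 'e'
}

def sort_key(data: bytes):
    key, i = [], 0
    while i < len(data):
        # Longest-match against the byte-keyed table: no decoding needed.
        for n in (2, 1):
            chunk = data[i:i + n]
            if chunk in WEIGHTS:
                key.append(WEIGHTS[chunk])
                i += n
                break
        else:
            raise KeyError(data[i:i + 1])
    return key

print(sort_key("ae\u00e9".encode("utf-8")))   # [1, 2, 2]
```

Whether skipping the decode actually wins anything in practice is, as said above, an open question.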

-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca