May 27, 2016
On Friday, 27 May 2016 at 13:47:32 UTC, ag0aep6g wrote:
>> Misunderstanding. All examples work properly today because of
>> autodecoding. -- Andrei
>
> They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes.
>
> https://dpaste.dzfl.pl/817dec505fd2

I agree. It has happened to me that characters like "é" return length == 2, which has been the cause of some bugs in my code. I'm wiser now, of course, but you wouldn't expect this if you write

if (input.length == 1)
  speakCharacter(input);  // e.g. when spelling a word
else
  processInput(input);

The worst thing is that you never know what's going on under the hood, or where autodecoding is slowing you down unbeknownst to you.
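For illustration, roughly what the different levels report for "é", using walkLength (code points, i.e. what autodecoding iterates) and std.uni.byGrapheme; whether the code point count is 1 or 2 depends on whether the input is precomposed or decomposed:

    import std.range : walkLength;
    import std.uni : byGrapheme;

    void main()
    {
        string precomposed = "\u00E9";   // é as a single code point
        string decomposed  = "e\u0301";  // e + combining acute accent

        assert(precomposed.length == 2);                // UTF-8 code units
        assert(precomposed.walkLength == 1);            // code points
        assert(precomposed.byGrapheme.walkLength == 1); // graphemes

        assert(decomposed.length == 3);
        assert(decomposed.walkLength == 2);             // still two code points
        assert(decomposed.byGrapheme.walkLength == 1);  // but one "character"
    }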


May 27, 2016
On Fri, May 27, 2016 at 03:47:32PM +0200, ag0aep6g via Digitalmars-d wrote:
> On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:
> > > > However the following do require autodecoding:
> > > > 
> > > > s.walkLength
> > > > s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> > > > s.count!(c => c >= 32) // non-control characters
> > > > 
> > > > Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
> > > 
> > > But how is the user supposed to know without being a core contributor to Phobos?
> > 
> > Misunderstanding. All examples work properly today because of autodecoding. -- Andrei
> 
> They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes.
> 
> https://dpaste.dzfl.pl/817dec505fd2

Exactly. And we just keep getting stuck on this point. It seems that the message just isn't getting through. The unfounded assumption continues to be made that iterating by code point is somehow "correct" by definition and nobody can challenge it.

String handling, especially in the standard library, ought to be (1) efficient where possible, and (2) as correct as possible (meaning, corresponding as closely as possible to user expectations -- the principle of least surprise). If we can't have both, we should at least have one, right? However, the way autodecoding is currently implemented, we have neither.

Firstly, it is beyond clear that autodecoding adds a significant amount of overhead, and because it's automatic, it applies to ALL string processing in D.  The only way around it is to fight against the standard library and use workarounds to bypass all that meticulously crafted autodecoding code, which raises the question of why we're even spending the effort on said code in the first place.

Secondly, it violates the principle of least surprise when the user, given a string of, say, Korean text, discovers that s.count() *doesn't* return the correct answer.  Oh, it's "correct", all right, if your definition of correct is "number of Unicode code points", but to a Korean user, such an answer is completely meaningless because it has little correspondence with what he would perceive as the number of "characters" in the string. It might as well be a random number and it would be just as meaningful.  It is just as wrong as s.count() returning the number of code units, except that in the current Euro-centric D community the wrong instances are less often encountered and so are often overlooked. But that doesn't change the fact that code that assumes s.count() returns anything remotely meaningful to the user is buggy. Autodecoding into code points only serves to hide the bugs.
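For instance, here's roughly what happens with a Hangul syllable written in decomposed form with conjoining jamo (precomposed NFC Korean happens to line up one code point per syllable, which only hides the problem):

    import std.range : walkLength;
    import std.uni : byGrapheme;

    void main()
    {
        string syllable = "\u1112\u1161\u11AB";       // 한 as three jamo
        assert(syllable.length == 9);                 // UTF-8 code units
        assert(syllable.walkLength == 3);             // code points: what count() sees
        assert(syllable.byGrapheme.walkLength == 1);  // user-perceived syllables
    }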

As has been said before already countless times, autodecoding, as currently implemented, is neither "correct" nor efficient. Iterating by code point is much faster, but more prone to user mistakes; whereas iterating by grapheme more often corresponds with user expectations but performs quite poorly. The current implementation of autodecoding represents the worst of both worlds: it is both inefficient *and* prone to user mistakes, and worse yet, it serves to conceal such user mistakes by giving the false sense of security that because we're iterating by code points we're somehow magically "correct" by definition.

The fact of the matter is that if you're going to write Unicode string processing code, you're gonna hafta know the dirty nitty-gritty of Unicode strings, including the fine distinctions between code units, code points, grapheme clusters, etc. Since this is required knowledge anyway, why not just let the user worry about how to iterate over the string? Let the user choose what best suits his application, whether it's working directly with code units for speed, or iterating over grapheme clusters for correctness (in terms of visual "characters"), instead of choosing the pessimal middle ground that's neither efficient nor correct?


T

-- 
Do not reason with the unreasonable; you lose by definition.
May 27, 2016
On 5/26/2016 9:00 AM, Andrei Alexandrescu wrote:
> My thesis: the D1 design decision to represent strings as char[] was disastrous
> and probably one of the largest weaknesses of D1. The decision in D2 to use
> immutable(char)[] for strings is a vast improvement but still has a number of
> issues.

The mutable vs. immutable distinction has nothing to do with autodecoding.


> On 05/12/2016 04:15 PM, Walter Bright wrote:
>> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
>> 2. Every time one wants an algorithm to work with both strings and
>> ranges, you wind up special casing the strings to defeat the
>> autodecoding, or to decode the ranges. Having to constantly special case
>> it makes for more special cases when plugging together components. These
>> issues often escape detection when unittesting because it is convenient
>> to unittest only with arrays.
>
> This is a consequence of 1. It is at least partially fixable.

It's a consequence of autodecoding, not arrays.


>> 4. Autodecoding is slow and has no place in high speed string processing.
> I would agree only with the amendment "...if used naively", which is important.
> Knowledge of how autodecoding works is a prerequisite for writing fast string
> code in D.

Having written high speed string processing code in D that also deals with Unicode (i.e. Warp), the only knowledge of autodecoding needed was how to have it not happen. Autodecoding made it slower than necessary in every case it was used. I found no place in Warp where autodecoding was desirable.


> Also, little code should deal with one code unit or code point at a
> time; instead, it should use standard library algorithms for searching, matching
> etc.

That doesn't work so well. There always seems to be a need for custom string processing. Worse, when pipelining strings, the autodecoding changes the element type to dchar, which then needs to be re-encoded into the result.
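Roughly what that looks like in practice, with filter standing in for any pipeline stage:

    import std.algorithm : filter;
    import std.array : array;
    import std.conv : to;

    void main()
    {
        auto pipeline = "héllo".filter!(c => c != 'l');
        static assert(is(typeof(pipeline.front) == dchar)); // autodecoded elements
        dchar[] wide = pipeline.array;   // materializes as dchar[], not char[]
        string back = wide.to!string;    // has to be re-encoded to UTF-8
        assert(back == "héo");
    }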

The std.string algorithms I wrote all work much better (i.e. faster) without autodecoding, while maintaining proper Unicode support. I.e. the autodecoding did not benefit the algorithms at all, and if the user is to use standard algorithms instead of custom ones, then autodecoding is not necessary.


> When needed, iterating every code unit is trivially done through indexing.

This implies replacing pipelining with loops, and also falls apart if indexing is redone to index by code points.


> Also allow me to point that much of the slowdown can be addressed tactically.
> The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore
> easily speculated. We can and we should arrange code to minimize impact.

I.e. special case the code to avoid autodecoding.

The trouble is that the low level code cannot avoid autodecoding, as it happens before the low level code gets it. This is conceptually backwards, and winds up requiring every algorithm to special case strings, even when completely unnecessary. (The 'copy' algorithm is an example of utterly unnecessary decoding.)

When teaching people how to write algorithms, having to write every one twice (once for general ranges and arrays, and again as a specialization for strings even when decoding is never necessary, such as for 'copy') is embarrassing.
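A rough sketch of the kind of duplication in question (a hypothetical helper, not Phobos code):

    import std.range.primitives : isInputRange, walkLength;
    import std.traits : isNarrowString;

    // Generic path walks (and decodes) the range; the string case
    // has to be special-cased to bypass autodecoding.
    size_t lengthOf(R)(R r)
        if (isInputRange!R)
    {
        static if (isNarrowString!R)
            return r.length;       // code units, O(1), no decoding
        else
            return r.walkLength;   // any other range: walk it element by element
    }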


>> 5. Very few algorithms require decoding.
> The key here is leaving it to the standard library to do the right thing instead
> of having the user wonder separately for each case. These uses don't need
> decoding, and the standard library correctly doesn't involve it (or if it
> currently does it has a bug):
>
> s.find("abc")
> s.findSplit("abc")
> s.findSplit('a')
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
>
> However the following do require autodecoding:
>
> s.walkLength
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> s.count!(c => c >= 32) // non-control characters
>
> Currently the standard library operates at code point level even though inside
> it may choose to use code units when admissible. Leaving such a decision to the
> library seems like a wise thing to do.

Running my char[] through a pipeline and having it come out sometimes as char[] and sometimes dchar[] and sometimes ubyte[] is hidden and surprising behavior.


>> 6. Autodecoding has two choices when encountering invalid code units -
>> throw or produce an error dchar. Currently, it throws, meaning no
>> algorithms using autodecode can be made nothrow.
> Agreed. This is probably the most glaring mistake. I think we should open a
> discussion on fixing this everywhere in the stdlib, even at the cost of breaking
> code.

A third option is to pass the invalid code units through unmolested, which won't work if autodecoding is used.


>> 7. Autodecode cannot be used with Unicode path/filenames, because it is
>> legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
>> out in the wild that pure Unicode is not universal - there's lots of
>> dirty Unicode that should remain unmolested, and autodecode does not play
>> well with that.
> If paths are not UTF-8, then they shouldn't have string type (instead use
> ubyte[] etc). More on that below.

Requiring code units to be all 100% valid is not workable, nor is redoing them to be ubytes. More on that below.


>> 8. In my work with UTF-8 streams, dealing with autodecode has caused me
>> considerably extra work every time. A convenient timesaver it ain't.
> Objection. Vague.

Sorry I didn't log the time I spent on it.


>> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
>> importing std.array one way or another, and then autodecode is there.
> Turning off autodecoding is as easy as inserting .representation after any
> string.

.representation changes the type to ubyte[]. All knowledge that this is a Unicode string then gets lost for the rest of the pipeline.
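Concretely, something like:

    import std.string : representation;

    void main()
    {
        string s = "é";
        auto bytes = s.representation;
        static assert(is(typeof(bytes) == immutable(ubyte)[]));
    }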


> (Not to mention using indexing directly.)

Doesn't work if you're pipelining.


>> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
>> benefit of being arrays in the first place.
> First off, you always have the option with .representation. That's a great name
> because it gives you the type used to represent the string - i.e. an array of
> integers of a specific width.

I found .representation to be unworkable because it changed the type.


>> 11. Indexing an array produces different results than autodecoding,
>> another glaring special case.
> This is a direct consequence of the fact that string is immutable(char)[] and
> not a specific type. That error predates autodecoding.

Even if it is made a special type, the problem of what an index means will remain. Of course, indexing by code point is an O(n) operation, which I submit is surprising and shouldn't be supported as [i] even by a special type (for the same reason that indexing of linked lists is frowned upon). Giving up indexing means giving up efficient slicing, which would be a major downgrade for D.
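Concretely, indexing today gives code units in O(1), while getting the n-th code point means walking the string:

    import std.range;

    void main()
    {
        string s = "héllo";
        char  unit  = s[1];                   // O(1): a UTF-8 code unit (0xC3, first byte of 'é')
        dchar point = s.dropExactly(1).front; // O(n): second code point ('é')
        assert(unit == 0xC3 && point == 'é');
    }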


> Overall, I think the one way to make real steps forward in improving string
> processing in the D language is to give a clear answer of what char, wchar, and
> dchar mean.

They mean code units. This is not ambiguous. Here's how a code unit is different from a ubyte:

A. I know you hate bringing up my personal experience, but here goes. I've programmed in C forever. In C, char is used for both small integers and characters. It's always been a source of confusion, and sometimes bugs, to conflate the two:

     struct S { char field; };

Which is it, a character or a small integer? I have to rely on reading the code. It's a definite improvement in D that they are distinguished, and I feel that improvement every time I have to deal with C/C++ code and see 'char' used as a small integer instead of a character.

B. Overloading treats them differently, and that's good. For example, writeln(T[]) produces different results for char[] and ubyte[], and this is unsurprising and expected. It "just works".
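For example:

    import std.stdio : writeln;
    import std.string : representation;

    void main()
    {
        writeln("abc");                 // prints: abc
        writeln("abc".representation);  // prints: [97, 98, 99]
    }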

C. More overloading:

      writeln('a');

Does anyone want that to print 97? Does anyone really want 'a' to be of type dchar? (The trouble with that is type inference when building up more complex types, as you'll wind up with hidden dchar[] if not careful. My experience with dchar[] is that it is almost never desirable, as it is too memory hungry.)

May 27, 2016
On 5/27/16 10:15 AM, Chris wrote:
> It has happened to me that characters like "é" return length == 2

Would normalization make length 1? -- Andrei
May 27, 2016
On Friday, 27 May 2016 at 18:11:22 UTC, Andrei Alexandrescu wrote:
> Would normalization make length 1? -- Andrei

In some, but not all cases.
May 27, 2016
On 5/27/16 1:11 PM, Walter Bright wrote:
> They mean code units.

Always valid or potentially invalid as well? -- Andrei
May 27, 2016
On 5/27/16 12:40 PM, H. S. Teoh via Digitalmars-d wrote:
> Exactly. And we just keep getting stuck on this point. It seems that the
> message just isn't getting through. The unfounded assumption continues
> to be made that iterating by code point is somehow "correct" by
> definition and nobody can challenge it.

Which languages are covered by code points, and which languages require graphemes consisting of multiple code points? How does normalization play into this? -- Andrei

May 27, 2016
On 05/27/2016 08:42 PM, Andrei Alexandrescu wrote:
> Which languages are covered by code points, and which languages require
> graphemes consisting of multiple code points? How does normalization
> play into this? -- Andrei

I don't think there is value in distinguishing by language. The point of Unicode is that you shouldn't need to do that.

I think there are scripts that use combining characters extensively, but Unicode also has stuff like combining arrows. Those can make sense in an otherwise plain English text.

For example: 'a' + U+20D7 = a⃗.

There is no precomposed character for that, so normalization can't do anything here.
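A rough sketch of that, assuming std.uni's byGrapheme and normalize follow UAX #29 / UAX #15 as documented:

    import std.range : walkLength;
    import std.uni : byGrapheme, normalize;

    void main()
    {
        string s = "a\u20D7";                 // 'a' + COMBINING RIGHT ARROW ABOVE
        assert(s.walkLength == 2);            // two code points
        assert(s.normalize.walkLength == 2);  // NFC (the default) can't compose them
        assert(s.byGrapheme.walkLength == 1); // one user-perceived character
    }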
May 27, 2016
On 5/27/16 3:10 PM, ag0aep6g wrote:
> I don't think there is value in distinguishing by language. The point of
> Unicode is that you shouldn't need to do that.

It seems code points are kind of useless because they don't really mean anything. Would that be accurate? -- Andrei

May 27, 2016
On 5/27/16 1:11 PM, Walter Bright wrote:
> The std.string algorithms I wrote all work much better (i.e. faster)
> without autodecoding, while maintaining proper Unicode support.

Violent agreement is occurring here. We have plenty of those and need more. -- Andrei