May 31, 2016
On 05/31/2016 12:45 PM, Jonathan M Davis via Digitalmars-d wrote:
> On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-d wrote:
>> On 5/31/16 3:56 AM, Walter Bright wrote:
>>> If there is an abstraction for strings that is efficient, consistent,
>>> useful, and hides the fact that it is UTF, I am not aware of it.
>>
>> It's been mentioned several times: a string type that does not offer
>> range primitives; instead it offers explicit primitives (such as
>> byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.
>
> Not exactly. Such a string type does not hide the fact that it's UTF.
> Rather, it forces you to deal with the fact that it's UTF.

How is that different from what I said? -- Andrei
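
For concreteness, a minimal sketch of the kind of type being discussed (illustrative only; the wrapper is hypothetical, not a worked-out proposal, and byCodePoint is approximated here with std.utf.byUTF because std.uni.byCodePoint operates on grapheme ranges):

struct String
{
    private string data;

    // Deliberately no front/popFront/empty: since this is not a range,
    // generic algorithms cannot silently iterate (and decode) it.

    auto byCodeUnit()  { import std.utf : byCodeUnit; return data.byCodeUnit; }
    auto byCodePoint() { import std.utf : byUTF;      return data.byUTF!dchar; }
    auto byGrapheme()  { import std.uni : byGrapheme; return data.byGrapheme; }
}

Callers would then have to pick an iteration level explicitly instead of getting code points by default.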
May 31, 2016
On 05/31/2016 12:54 PM, Jonathan M Davis via Digitalmars-d wrote:
> Equality does not require decoding. Similarly, functions like find don't
> either. Something like filter generally would, but it's also not
> particularly normal to filter a string on a by-character basis. You'd
> probably want to get to at least the word level in that case.

It's nice that the stdlib takes care of that.

> To make matters worse, functions like find or splitter are frequently used
> to look for ASCII delimiters, even when the strings themselves contain
> Unicode characters. So, even if decoding were necessary when looking for a
> Unicode character, it's utterly wasteful when the character you're looking
> for is ASCII.

Good idea. We could overload functions such as find on char, wchar, and dchar. Jonathan, could you look into a PR to do that?
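
Something along these lines could work (a hypothetical sketch; findAscii is a made-up name, not an existing or proposed Phobos symbol):

// For an ASCII needle, no decoding is needed: an ASCII byte never
// occurs inside a multi-byte UTF-8 sequence, so the raw code units
// can be searched directly.
auto findAscii(string haystack, char needle)
{
    import std.algorithm.searching : find;
    import std.string : representation;
    assert(needle < 0x80, "needle must be ASCII");
    return cast(string) haystack.representation.find(cast(ubyte) needle);
}

// e.g. assert("héllo, wörld".findAscii(',') == ", wörld");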

> But searching generally does not require decoding so long as
> the same character is always encoded the same way.

Yah, a good rule of thumb is to get the same (consistent, heh) results for a given string (including a given normalization) regardless of the encoding used. So e.g. it's nice that walkLength returns the same number for a given string whether it's UTF-8, UTF-16, or UTF-32.
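
For example (these hold under today's autodecoding behavior; 日本語 is three characters in all three encodings):

import std.range : walkLength;

void main()
{
    assert("日本語"c.walkLength == 3); // 9 UTF-8 code units, 3 code points
    assert("日本語"w.walkLength == 3); // 3 UTF-16 code units, 3 code points
    assert("日本語"d.walkLength == 3); // 3 UTF-32 code units, 3 code points
}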


Andrei
May 31, 2016
On Friday, May 27, 2016 09:40:21 H. S. Teoh via Digitalmars-d wrote:
> On Fri, May 27, 2016 at 03:47:32PM +0200, ag0aep6g via Digitalmars-d wrote:
> > On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:
> > > > > However the following do require autodecoding:
> > > > >
> > > > > s.walkLength
> > > > > s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> > > > > s.count!(c => c >= 32) // non-control characters
> > > > >
> > > > > Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
> > > >
> > > > But how is the user supposed to know without being a core contributor to Phobos?
> > >
> > > Misunderstanding. All examples work properly today because of autodecoding. -- Andrei
> >
> > They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes.
> >
> > https://dpaste.dzfl.pl/817dec505fd2
>
> Exactly. And we just keep getting stuck on this point. It seems that the message just isn't getting through. The unfounded assumption continues to be made that iterating by code point is somehow "correct" by definition and nobody can challenge it.
>
> String handling, especially in the standard library, ought to be (1) efficient where possible, and (2) be as correct as possible (meaning, most corresponding to user expectations -- principle of least surprise). If we can't have both, we should at least have one, right? However, the way autodecoding is currently implemented, we have neither.

Exactly. Saying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct. More full characters fit in a single code unit, but they still don't all fit. You have to go to the grapheme level for that.

IIRC, Andrei argued in TDPL that UTF-8 was better than UTF-16 because you figure out more quickly when you've screwed up your Unicode handling: very few Unicode characters fit in a single UTF-8 code unit, whereas many more fit in a single UTF-16 code unit, making errors harder to catch with UTF-16. Well, we're making the same mistake, but with UTF-32 instead of UTF-16. The code is still wrong; it's just that much harder to catch that it's wrong.

> Firstly, it is beyond clear that autodecoding adds a significant amount of overhead, and because it's automatic, it applies to ALL string processing in D. The only way around it is to fight against the standard library and use workarounds to bypass all that meticulously-crafted autodecoding code, raising the question of why we're even spending the effort on said code in the first place.

The standard library has to fight against itself because of autodecoding! The vast majority of the algorithms in Phobos are special-cased on strings in an attempt to get around autodecoding. That alone should highlight the fact that autodecoding is problematic.
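
Roughly the shape that special-casing takes (illustrative only, not actual Phobos source; myWalkLength is a stand-in name):

import std.traits : isNarrowString;

size_t myWalkLength(R)(R r)
{
    static if (isNarrowString!R)
    {
        // Fast path for char[]/wchar[]: count code points by striding
        // over code units, without decoding each one.
        import std.utf : stride;
        size_t n;
        for (size_t i = 0; i < r.length; i += r.stride(i))
            ++n;
        return n;
    }
    else
    {
        // Generic path: pop elements one at a time.
        import std.range.primitives : empty, popFront;
        size_t n;
        for (; !r.empty; r.popFront())
            ++n;
        return n;
    }
}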

> The fact of the matter is that if you're going to write Unicode string processing code, you're gonna hafta know the dirty nitty-gritty of Unicode strings, including the fine distinctions between code units, code points, grapheme clusters, etc. Since this is required knowledge anyway, why not just let the user worry about how to iterate over the string? Let the user choose what best suits his application, whether it's working directly with code units for speed, or iterating over grapheme clusters for correctness (in terms of visual "characters"), instead of choosing the pessimal middle ground that's neither efficient nor correct?

There is no solution here that's going to be both correct and efficient. So we either need to provide a fully correct solution that's dog slow, or we need to provide a solution that's efficient but requires that the programmer understand Unicode to write correct code. Right now, we have a slow solution that's incorrect.

- Jonathan M Davis

May 31, 2016
On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
> Saying that operating at the code point level - UTF-32 - is correct
> is like saying that operating at UTF-16 instead of UTF-8 is correct.

Could you please substantiate that? My understanding is that code point is a higher-level Unicode notion independent of encoding, whereas code unit is an encoding-dependent representation detail. -- Andrei
May 31, 2016
On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
> The standard library has to fight against itself because of autodecoding!
> The vast majority of the algorithms in Phobos are special-cased on strings
> in an attempt to get around autodecoding. That alone should highlight the
> fact that autodecoding is problematic.

The way I see it is it's specialization to speed things up without giving up the higher level abstraction. -- Andrei
May 31, 2016
On Tuesday, May 31, 2016 13:01:11 Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/31/2016 12:45 PM, Jonathan M Davis via Digitalmars-d wrote:
> > On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-d wrote:
> >> On 5/31/16 3:56 AM, Walter Bright wrote:
> >>> If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it.
> >>
> >> It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.
> >
> > Not exactly. Such a string type does not hide the fact that it's UTF. Rather, it forces you to deal with the fact that it's UTF.
>
> How is that different from what I said? -- Andrei

My point was that Walter was stating that you can't have a type that hides
the fact that it's dealing with Unicode while still being efficient, whereas
you mentioned a proposal for a type that does not hide the fact that
it's dealing with Unicode. So, you weren't really responding with a type
that rebutted Walter's statement. Rather, you responded with a type that
attempts to make its Unicode nature more explicit than immutable(char)[].

- Jonathan M Davis

May 31, 2016
On Friday, May 27, 2016 16:41:09 Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/27/2016 03:43 PM, H. S. Teoh via Digitalmars-d wrote:
> > That's what we've been trying to say all along!
>
> If that's the case things are pretty dire, autodecoding or not. -- Andrei

True enough. Correctly handling Unicode in the general case is ridiculously hard - especially if you want to be efficient. We could do everything at the grapheme level to get the correctness, but we'd be so slow that it would be ridiculous.

Fortunately, many string algorithms really don't need to care much about Unicode so long as the strings involved are normalized. For instance, a function like find can usually compare code units without decoding anything (though even then, depending on the normalization, you run the risk of finding only part of a character if it involves combining code points - e.g. searching for e could give you the first part of é if it's encoded with the e followed by the combining accent).
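
A small illustration of that risk (with autodecoding, find compares code points):

import std.algorithm.searching : find;

void main()
{
    // "café" with é in decomposed (NFD) form: 'e' + U+0301 combining acute.
    string s = "cafe\u0301";
    auto hit = s.find('e');
    // The search "succeeds", but it matched only the base letter of é,
    // splitting the grapheme in two.
    assert(hit == "e\u0301");
}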

But ultimately, fully correct string handling requires a far better understanding of Unicode than most programmers have. Even here, the percentage of programmers with that level of understanding isn't all that great, though the fact that D supports UTF-8, UTF-16, and UTF-32 the way that it does has led a number of us to dig further into Unicode and learn it better than we probably would have if all D had was char. It highlights, in a way that most languages don't, that there is something that needs to be learned to get this right.

- Jonathan M Davis


May 31, 2016
On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
> > Saying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct.
>
> Could you please substantiate that? My understanding is that code point is a higher-level Unicode notion independent of encoding, whereas code unit is an encoding-dependent representation detail. -- Andrei

Okay. If you have the letter A, it will fit in one UTF-8 code unit, one UTF-16 code unit, and one UTF-32 code unit (so, one code point).

assert("A"c.length == 1); // one UTF-8 code unit
assert("A"w.length == 1); // one UTF-16 code unit
assert("A"d.length == 1); // one code point

If you have 月, then you get

assert("月"c.length == 3); // three UTF-8 code units
assert("月"w.length == 1); // one UTF-16 code unit
assert("月"d.length == 1); // one code point

whereas if you have 𐀆, then you get

assert("𐀆"c.length == 4); // four UTF-8 code units
assert("𐀆"w.length == 2); // a surrogate pair in UTF-16
assert("𐀆"d.length == 1); // still one code point

So, with these characters, it's clear that UTF-8 and UTF-16 don't cut it for
holding an entire character, but it still looks like UTF-32 does. However,
what about characters like é or שׂ? Notice that שׂ takes up more than one code
point.

assert("שׂ"c.length == 4); // two code points at two UTF-8 code units each
assert("שׂ"w.length == 2); // one UTF-16 code unit per code point
assert("שׂ"d.length == 2); // two code points, even in UTF-32

It's ש with a combining dot marker (the Hebrew sin dot), but it's a single character in spite of the fact that it's multiple code points. é is in a similar, though more complicated, boat. With D, you'll get

assert("é"c.length == 2); // two UTF-8 code units
assert("é"w.length == 1); // one UTF-16 code unit
assert("é"d.length == 1); // one code point (the precomposed form)

because the compiler decides to use the version of é that's a single code point. However, Unicode is set up so that that accent can be its own code point and be applied to any other code point - be it an e, an a, or even something like the number 0. If we normalize é, we can see other versions of it that take up more than one code point. e.g.

import std.uni : normalize, NFC, NFD, NFKC, NFKD;

assert("é"d.normalize!NFC.length == 1);  // composed: one code point
assert("é"d.normalize!NFD.length == 2);  // decomposed: e + combining acute
assert("é"d.normalize!NFKC.length == 1);
assert("é"d.normalize!NFKD.length == 2);

And you can even put that accent on 0 by doing something like

assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d); // '0' with the combining acute applied

One or more code units combine to make a single code point, but one or more code points also combine to make a grapheme. So, while there is a definite layer of separation between code units and code points, it's still the case that a single code point is not guaranteed to be a single character. It's true that the encodings deal in code units while code points are encoding-independent (though code points still come in different normalizations, which is kind of like having different encodings), but in terms of correctness, treating code points as characters has the same problem as treating code units as characters: you're still not guaranteed that you're operating on full characters, and you risk chopping them up.

It's just that at the code point level, you're generally chopping up something that is visually separable (like an accent from a letter or a superscript on a symbol), whereas with code units, you end up with utter garbage when you chop them incorrectly.
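
To make the layering concrete (std.uni.byGrapheme does the code-point-to-grapheme merging):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string nfd = "e\u0301";                 // é as two code points (NFD)
    assert(nfd.length == 3);                // code units (UTF-8 bytes)
    assert(nfd.walkLength == 2);            // code points (via autodecoding)
    assert(nfd.byGrapheme.walkLength == 1); // graphemes: one visible character
}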

By operating at the code point level, we're correct for _way_ more characters than we would be if we treated char like a full character, but we're still not fully correct, and it's a lot harder to notice when you screw it up, because the number of characters which are handled incorrectly is far smaller.

- Jonathan M Davis


May 31, 2016
On Monday, May 30, 2016 14:24:23 Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/30/2016 12:34 PM, Jack Stouffer wrote:
> > On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
> >> D1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.
> >
> > Don't be so sure. All string handling code would become broken, even if it appears to work at first.
>
> That kind of makes this thread less productive than "How to improve autodecoding?" -- Andrei

I think that the first step is getting Phobos to work with all ranges of character types - be they char, wchar, dchar, or graphemes. Then the algorithms themselves will work whether we have auto-decoding or not. With that done, we can at minimum tell folks to use byCodeUnit, byChar!T, byGrapheme, etc. to get the correct, efficient behavior. Right now, if you try to use ranges like byCodeUnit, they work with some of Phobos but not enough to really work as a viable replacement to auto-decoding strings.
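
Where the range primitives do line up, this already works today (a small usage sketch):

import std.algorithm.searching : find;
import std.utf : byCodeUnit;

void main()
{
    // byCodeUnit yields a range of char, so find compares raw code
    // units with no decoding anywhere.
    auto r = "héllo, wörld".byCodeUnit.find(',');
    assert(!r.empty && r.front == ',');
}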

With all that done, it should at least be reasonably easy for folks to sanely get around auto-decoding, though the question remains how feasible it will then be to remove auto-decoding and treat ranges of char the way that byCodeUnit does. But at bare minimum, this is what we need to make it possible and reasonable to work around auto-decoding when necessary, while specifying the level of Unicode that you actually want to operate at.

- Jonathan M Davis

May 31, 2016
On 5/31/16 2:21 PM, Jonathan M Davis via Digitalmars-d wrote:
> I think that the first step is getting Phobos to work with all ranges of
> character types - be they char, wchar, dchar, or graphemes. Then the
> algorithms themselves will work whether we have auto-decoding or not. With
> that done, we can at minimum tell folks to use byCodeUnit, byChar!T,
> byGrapheme, etc. to get the correct, efficient behavior. Right now, if you
> try to use ranges like byCodeUnit, they work with some of Phobos but not
> enough to really work as a viable replacement to auto-decoding strings.

Great. Could you put together a sample PR so we understand the implications better? Thanks! -- Andrei