May 31, 2016
On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
> On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
>> On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
>>> Saying that operating at the code point level - UTF-32 - is correct
>>> is like saying that operating at UTF-16 instead of UTF-8 is correct.
>>
>> Could you please substantiate that? My understanding is that code unit
>> is a higher-level Unicode notion independent of encoding, whereas code
>> point is an encoding-dependent representation detail. -- Andrei
>
> Okay. If you have the letter A, it will fit in one UTF-8 code unit, one
> UTF-16 code unit, and one UTF-32 code unit (so, one code point).
>
> assert("A"c.length == 1);
> assert("A"w.length == 1);
> assert("A"d.length == 1);
>
> If you have 月, then you get
>
> assert("月"c.length == 3);
> assert("月"w.length == 1);
> assert("月"d.length == 1);
>
> whereas if you have 𐀆, then you get
>
> assert("𐀆"c.length == 4);
> assert("𐀆"w.length == 2);
> assert("𐀆"d.length == 1);
>
> So, with these characters, it's clear that UTF-8 and UTF-16 don't cut it for
> holding an entire character, but it still looks like UTF-32 does.

Does walkLength yield the same number for all representations?
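A quick sketch of the answer in D (assuming Phobos' autodecoding, under which ranging over any string type yields code points, so std.range.walkLength counts code points rather than code units):

    import std.range;

    void main()
    {
        // .length counts code units, so it differs per encoding...
        assert("月"c.length == 3);
        assert("月"w.length == 1);
        assert("月"d.length == 1);

        // ...while walkLength decodes, so every representation reports
        // the same number of code points.
        assert("月"c.walkLength == 1);
        assert("月"w.walkLength == 1);
        assert("月"d.walkLength == 1);
    }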

> However,
> what about characters like é or שׂ? Notice that שׂ takes up more than one code
> point.
>
> assert("שׂ"c.length == 4);
> assert("שׂ"w.length == 2);
> assert("שׂ"d.length == 2);
>
> It's ש with some sort of dot marker on it that they have in Hebrew, but it's
> a single character in spite of the fact that it's multiple code points. é is
> in a similar, though more complicated boat. With D, you'll get
>
> assert("é"c.length == 2);
> assert("é"w.length == 1);
> assert("é"d.length == 1);
>
> because the compiler decides to use the version of é that's a single code
> point.

Does walkLength yield the same number for all representations?
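For the multi-code-point character above, walkLength does agree across representations (a sketch, again assuming autodecoding), but the number it agrees on is still 2 for what reads as one character:

    import std.range;

    void main()
    {
        // All three encodings decode to the same two code points.
        assert("שׂ"c.walkLength == 2);
        assert("שׂ"w.walkLength == 2);
        assert("שׂ"d.walkLength == 2);
    }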

> However, Unicode is set up so that that accent can be its own code
> point and be applied to any other code point - be it an e, an a, or even
> something like the number 0. If we normalize é, we can see other
> versions of it that take up more than one code point. e.g.
>
> assert("é"d.normalize!NFC.length == 1);
> assert("é"d.normalize!NFD.length == 2);
> assert("é"d.normalize!NFKC.length == 1);
> assert("é"d.normalize!NFKD.length == 2);

Does walkLength yield the same number for all representations?
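Here the answer depends on normalization rather than on the encoding: walkLength gives different counts for the NFC and NFD forms of the same character, while a grapheme-level count does not. A sketch, assuming std.uni.normalize and std.uni.byGrapheme as used above:

    import std.range, std.uni;

    void main()
    {
        auto nfc = "é"d.normalize!NFC;   // one precomposed code point
        auto nfd = "é"d.normalize!NFD;   // 'e' plus a combining acute

        // Code point counts differ between normalization forms...
        assert(nfc.walkLength == 1);
        assert(nfd.walkLength == 2);

        // ...grapheme counts do not.
        assert(nfc.byGrapheme.walkLength == 1);
        assert(nfd.byGrapheme.walkLength == 1);
    }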

> And you can even put that accent on 0 by doing something like
>
> assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d);
>
> One or more code units combine to make a single code point, but one or more
> code points also combine to make a grapheme.

That's right. D's handling of UTF is at the code unit level (like all of Unicode is portably defined). If you want graphemes use byGrapheme.
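A minimal byGrapheme sketch for the "accent on 0" case above (the combining acute is written as an explicit \u0301 escape here for clarity):

    import std.range, std.uni;

    void main()
    {
        // '0' followed by a combining acute: two code points, one grapheme.
        auto zero = "0\u0301"d;

        assert(zero.length == 2);                  // code points in a dstring
        assert(zero.byGrapheme.walkLength == 1);   // one user-perceived character
    }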

It seems you destroyed your own argument, which was:

> Saying that operating at the code point level - UTF-32 - is correct
> is like saying that operating at UTF-16 instead of UTF-8 is correct.

You can't claim code units are just a special case of code points.


Andrei

May 31, 2016
On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d wrote:
> UTF-8 is an antiquated hack that needs to be eradicated.  It forces all languages other than English to be twice as long, for no good reason. Have fun with that when you're downloading text on a 2G connection in the developing world.  It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time till UTF-8 is ditched.

Considering that *nix land uses UTF-8 almost exclusively, and many C libraries do even on Windows, I very much doubt that UTF-8 is going anywhere anytime soon - if ever. The Win32 API does use UTF-16, and Java and C# do, but the vast sea of code that is C or C++ generally uses UTF-8, as do plenty of other programming languages.

And even aside from English, most European languages are going to be more efficient with UTF-8, because they're still primarily ASCII even if they contain characters that are not. Stuff like Chinese is definitely worse in UTF-8 than it would be in UTF-16, but there are a lot of languages other than English which are going to encode better with UTF-8 than UTF-16 - let alone UTF-32.
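A rough illustration with made-up sample strings (the exact counts depend on the text; it's the pattern that matters):

    void main()
    {
        // Mostly-ASCII European text: smaller as UTF-8 than as UTF-16.
        assert("Grüße"c.length == 7);    // 7 bytes in UTF-8
        assert("Grüße"w.length == 5);    // 5 UTF-16 code units = 10 bytes

        // CJK text: smaller as UTF-16 than as UTF-8.
        assert("你好"c.length == 6);      // 6 bytes in UTF-8
        assert("你好"w.length == 2);      // 2 UTF-16 code units = 4 bytes
    }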

Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too much uses it for it to be going anywhere, and most folks have no problem with that. Any attempt to get rid of it would be a huge, uphill battle.

But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even without involving the standard library - so anyone who wants to avoid UTF-8 is free to do so.
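For instance (a sketch using std.conv.to for transcoding; the sample text is arbitrary):

    import std.conv : to;

    void main()
    {
        string  u8  = "Grüße";          // UTF-8 source text
        wstring u16 = u8.to!wstring;    // transcoded to UTF-16
        dstring u32 = u8.to!dstring;    // transcoded to UTF-32

        assert(u16 == "Grüße"w);
        assert(u32 == "Grüße"d);
    }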

- Jonathan M Davis

May 31, 2016
On 31.05.2016 20:30, Andrei Alexandrescu wrote:
> D's

Phobos'

> handling of UTF is at the code unit

code point

> level (like all of Unicode is portably defined).

May 31, 2016
On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:
> On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
> > On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
> >> On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
> >>> Saying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct.
> >>
> >> Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
> >
> Does walkLength yield the same number for all representations?

walkLength treats a code point like it's a character. My point is that
that's incorrect behavior. It will not result in correct string processing
in the general case, because a code point is not guaranteed to be a
full character.

walkLength does not report the length of a character as one in all cases just like length does not report the length of a character as one in all cases. walkLength is counting bigger units than length, but it's still counting pieces of a character rather than counting full characters.
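Putting the three levels side by side for the Hebrew example from earlier in the thread (a sketch assuming autodecoding for walkLength and std.uni.byGrapheme for the grapheme count):

    import std.range, std.uni;

    void main()
    {
        auto s = "שׂ";   // one character: ש plus a combining dot

        assert(s.length == 4);                  // UTF-8 code units
        assert(s.walkLength == 2);              // code points (autodecoding)
        assert(s.byGrapheme.walkLength == 1);   // full characters (graphemes)
    }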

> > And you can even put that accent on 0 by doing something like
> >
> > assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d);
> >
> > One or more code units combine to make a single code point, but one or
> > more
> > code points also combine to make a grapheme.
>
> That's right. D's handling of UTF is at the code unit level (like all of Unicode is portably defined). If you want graphemes use byGrapheme.
>
> It seems you destroyed your own argument, which was:
> > Saying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct.
>
> You can't claim code units are just a special case of code points.

The point is that treating a code point like it's a full character is just as wrong as treating a code unit as if it were a full character. It's _not_ guaranteed to be a full character. Treating code points as full characters does give you the correct result in more cases than treating a code unit as a full character gives you the correct result, but it still gives you the wrong result in many cases. If we want to have fully correct behavior without making the programmer deal with all of the Unicode issues themselves, then we need to operate at the grapheme level so that we are operating on full characters (though that obviously comes at a high cost to efficiency).

Treating code points as characters like we do right now does not give the correct result in the general case just like treating code units as characters doesn't give the correct result in the general case. Both work some of the time, but neither works all of the time.
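A small sketch of how code-point-level operations go wrong on a decomposed character (forcing the two-code-point form with normalize!NFD, since the literal would otherwise be precomposed):

    import std.range, std.uni;

    void main()
    {
        auto s = "é".normalize!NFD;   // 'e' followed by a combining acute

        // Code point level: front yields a bare 'e', splitting the accent
        // off the character it belongs to.
        assert(s.front == 'e');

        // Grapheme level: the first element carries both code points.
        assert(s.byGrapheme.front.length == 2);
    }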

Autodecoding attempts to hide the fact that it's operating on Unicode but does not actually go far enough to result in correct behavior. So, we pay the cost of decoding without getting the benefit of correctness.

- Jonathan M Davis


May 31, 2016
On Friday, May 27, 2016 04:31:49 Vladimir Panteleev via Digitalmars-d wrote:
> On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
> >> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
> >> importing std.array one way or another, and then autodecode is there.
> >
> > Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
>
> This is neither easy nor practical. It makes writing reliable string handling code a chore in D. Because it is difficult to find all places where this must be done, it is not possible to do on a program-wide scale, thus bugs can only be discovered when this or that component fails because it was not tested with Unicode strings.

In addition, as soon as you have ubyte[], none of the string-related functions work. That's fixable, but as it stands, operating on ubyte[] instead of char[] is a royal pain.
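For reference, a sketch of the .representation route being discussed (std.string.representation returns the raw code units, here immutable(ubyte)[]):

    import std.range, std.string;

    void main()
    {
        string s = "日本語";

        // With autodecoding, the string ranges over code points.
        assert(s.walkLength == 3);

        // .representation exposes the raw code units, so nothing is decoded -
        // but the result is no longer a string type, and string-specific
        // functions no longer accept it.
        assert(s.representation.walkLength == 9);
    }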

- Jonathan M Davis
May 31, 2016
On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote:
> On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:
>> On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
>>> On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
>>>> On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
>>>>> Saying that operating at the code point level - UTF-32 - is correct
>>>>> is like saying that operating at UTF-16 instead of UTF-8 is correct.
>>>>
>>>> Could you please substantiate that? My understanding is that code unit
>>>> is a higher-level Unicode notion independent of encoding, whereas code
>>>> point is an encoding-dependent representation detail. -- Andrei
>>>
>> Does walkLength yield the same number for all representations?
> walkLength treats a code point like it's a character. My point is that
> that's incorrect behavior. It will not result in correct string processing
> in the general case, because a code point is not guaranteed to be a
> full character.
> ...

What's "correct"? Maybe the user intended to count the number of code points in order to pre-allocate a dchar[] of the correct size.

Generally, I don't see how algorithms become magically "incorrect" when applied to UTF code units.
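For example, the code point count is exactly the right size for a dchar buffer (a sketch; foreach with a dchar loop variable decodes as well):

    import std.range;

    void main()
    {
        string s = "日本語";

        auto buf = new dchar[](s.walkLength);   // one slot per code point

        size_t i;
        foreach (dchar c; s)
            buf[i++] = c;

        assert(buf == "日本語"d);
    }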

> walkLength does not report the length of a character as one in all cases
> just like length does not report the length of a character as one in all
> cases. walkLength is counting bigger units than length, but it's still
> counting pieces of a character rather than counting full characters.
>

The 'length' of a character is not one in all contexts.
The following text takes six columns in my terminal:

日本語
123456
May 31, 2016
On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...]
> Does walkLength yield the same number for all representations?

Let's put the question this way. Given the following string, what do *you* think walkLength should return?

	şŭt̥ḛ́k̠

I think any reasonable person would have to say it should return 5, because there are 5 visual "characters" here. Otherwise, what is even the meaning of walkLength?! For it to return anything other than 5 means that it's a leaky abstraction, because it's leaking low-level "implementation details" of the Unicode representation of this string.

However, with the current implementation of autodecoding, walkLength
returns 11.  Can anyone reasonably argue that it's reasonable for
"şŭt̥ḛ́k̠".walkLength to equal 11?  What difference does this make if we
get rid of autodecoding, and walkLength returns 17 instead? *Both* are
wrong.

17 is actually the right answer if you're looking to allocate a buffer large enough to hold this string, because that's the number of bytes it occupies.

5 is the right answer to an end user who knows nothing about Unicode.

11 is the answer to a question that only makes sense to a Unicode specialist, and that no layperson understands.

11 is the answer we currently give. And that, at the cost of across-the-board performance degradation.  Yet you're seriously arguing that 11 should be the right answer, by insisting that the current implementation of autodecoding is "correct".  It boggles the mind.
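The three counts can be reproduced directly (a sketch that spells the string out with explicit combining-mark escapes, matching the fully decomposed form used above):

    import std.range, std.uni;

    void main()
    {
        // s + cedilla, u + breve, t + ring below, e + tilde below + acute,
        // k + minus sign below: the şŭt̥ḛ́k̠ example, fully decomposed.
        auto s = "s\u0327u\u0306t\u0325e\u0330\u0301k\u0320";

        assert(s.length == 17);                 // UTF-8 code units
        assert(s.walkLength == 11);             // code points
        assert(s.byGrapheme.walkLength == 5);   // user-perceived characters
    }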


T

-- 
Today's society is one of specialization: as you grow, you learn more and more about less and less. Eventually, you know everything about nothing.
May 31, 2016
On 05/31/2016 02:46 PM, Timon Gehr wrote:
> On 31.05.2016 20:30, Andrei Alexandrescu wrote:
>> D's
>
> Phobos'

foreach, too. -- Andrei
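That is, foreach decoding is driven by the loop variable's type, independently of Phobos (a sketch; the literal's é is assumed to be the precomposed, single-code-point form):

    void main()
    {
        string s = "résumé";

        size_t units, points;
        foreach (char c; s)  ++units;    // iterates UTF-8 code units: 8
        foreach (dchar c; s) ++points;   // the language itself decodes: 6

        assert(units == 8 && points == 6);
    }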

May 31, 2016
On 05/31/2016 02:53 PM, Jonathan M Davis via Digitalmars-d wrote:
> walkLength treats a code point like it's a character.

No, it treats a code point like it's a code point. -- Andrei
May 31, 2016
On 05/31/2016 02:57 PM, Jonathan M Davis via Digitalmars-d wrote:
> In addition, as soon as you have ubyte[], none of the string-related
> functions work. That's fixable, but as it stands, operating on ubyte[]
> instead of char[] is a royal pain.

That'd be nice to fix indeed. Please break the ground? -- Andrei