May 27, 2016
On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
> On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
>> No, this is not the point of normalization.
>
> What is? -- Andrei

1) A grapheme may include several combining characters (such as diacritics) whose order is not supposed to be semantically significant. Normalization sorts them in a standardized way so that string comparisons return the expected result for graphemes which differ only by the internal order of their constituent combining code points.

2) Some graphemes (like accented latin letters) can be represented by a single code point OR a letter followed by a combining diacritic. Normalization either splits them all apart (NFD), or combines them whenever possible (NFC). Again, this is primarily intended to make things like string comparisons work as expected, and perhaps to simplify low-level tasks like graphical rendering of text.
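
To illustrate both points with std.uni.normalize (a minimal sketch; the particular code points are just examples):

import std.stdio : writeln;
import std.uni : normalize, NFC, NFD;

void main()
{
    // 1) Same grapheme, combining marks in a different order:
    string a = "a\u0323\u0308"; // a + dot below + diaeresis
    string b = "a\u0308\u0323"; // a + diaeresis + dot below
    writeln(a == b);                               // false: code units differ
    writeln(normalize!NFD(a) == normalize!NFD(b)); // true: canonical order

    // 2) Precomposed vs. decomposed form of é:
    writeln(normalize!NFC("e\u0301") == "\u00E9"); // true: combined by NFC
    writeln(normalize!NFD("\u00E9") == "e\u0301"); // true: split by NFD
}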

(Disclaimer: This is an oversimplification, because nothing about Unicode is ever simple.)

May 27, 2016
On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
> On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
>> On 27-May-2016 21:11, Andrei Alexandrescu wrote:
>>> On 5/27/16 10:15 AM, Chris wrote:
>>>> It has happened to me that characters like "é" return length == 2
>>>
>>> Would normalization make length 1? -- Andrei
>>
>> No, this is not the point of normalization.
>
> What is? -- Andrei

Here is an example of what normalization is for.

In Unicode, the grapheme Ä can be written as two code points: A (the ASCII A) followed by the combining diaeresis ¨.

However, one of the goals of Unicode was to be backwards compatible with earlier encodings that extended ASCII (codepages).
In some codepages, Ä was a single code point.

So in some cases you would have the Unicode form, which is two code points, and the form carried over from such codepages, which is a single code point.

Those should be the same though, i.e. compare equal. That is what normalization is for: what it does is _expand_ the single code point Ä into A + ¨.


May 27, 2016
On Friday, 27 May 2016 at 22:12:57 UTC, Minas Mina wrote:
> Those should be the same though, i.e. compare equal. That is what normalization is for: what it does is _expand_ the single code point Ä into A + ¨.

Unless I'm mistaken, this depends on the form used. For example, with NFKC you'd get the single code point Ä.
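
A quick check with std.uni bears this out (a minimal sketch; NFD expands, NFC and NFKC compose):

import std.stdio : writeln;
import std.uni : normalize, NFC, NFD, NFKC;

void main()
{
    string precomposed = "\u00C4";  // Ä as a single code point
    string decomposed  = "A\u0308"; // A + combining diaeresis

    writeln(normalize!NFD(precomposed) == decomposed);   // true: expanded
    writeln(normalize!NFC(decomposed)  == precomposed);  // true: composed
    writeln(normalize!NFKC(decomposed) == precomposed);  // true: composed
}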

 — David
May 27, 2016
On 5/27/2016 11:27 AM, Andrei Alexandrescu wrote:
> On 5/27/16 1:11 PM, Walter Bright wrote:
>> They mean code units.
>
> Always valid or potentially invalid as well? -- Andrei

Some years ago I would have said always valid. Experience, however, says that Unicode is often dirty and code should be tolerant of that.

Consider Unicode in a text editor. You can't have it throwing exceptions, silently changing things to replacement characters, etc., when there are a few invalid sequences in it. You also can't just say "the file isn't Unicode" and refuse to display the Unicode that is in it.

It isn't hard to deal with invalid Unicode in a user friendly manner.
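
For instance, along these lines (a minimal sketch using std.utf.decode with the replacement-dchar flag; the buffer itself is never touched):

import std.stdio : write;
import std.typecons : Yes;
import std.utf : decode;

// Display a possibly-dirty buffer: invalid sequences show up as U+FFFD,
// but the underlying bytes are not modified, so saving the file writes
// them back out unchanged.
void display(string buffer)
{
    size_t i = 0;
    while (i < buffer.length)
        write(decode!(Yes.useReplacementDchar)(buffer, i));
}

void main()
{
    display("hello \xFF world\n"); // \xFF is not valid UTF-8
}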
May 27, 2016
On Fri, May 27, 2016 at 04:41:09PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/27/2016 03:43 PM, H. S. Teoh via Digitalmars-d wrote:
> > That's what we've been trying to say all along!
> 
> If that's the case things are pretty dire, autodecoding or not. -- Andrei

Like it or not, Unicode ain't merely some glorified form of C's ASCII char arrays.  It's about time we faced the reality and dealt with it accordingly.  Trying to sweep the complexities of Unicode under the rug is not doing us any good.


T

-- 
The fact that anyone still uses AOL shows that even the presence of options doesn't stop some people from picking the pessimal one. - Mike Ellis
May 28, 2016
On 28-May-2016 01:04, tsbockman wrote:
> On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
>> On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
>>> No, this is not the point of normalization.
>>
>> What is? -- Andrei
>
> 1) A grapheme may include several combining characters (such as
> diacritics) whose order is not supposed to be semantically significant.
> Normalization sorts them in a standardized way so that string
> comparisons return the expected result for graphemes which differ only
> by the internal order of their constituent combining code points.
>
> 2) Some graphemes (like accented latin letters) can be represented by a
> single code point OR a letter followed by a combining diacritic.
> Normalization either splits them all apart (NFD), or combines them
> whenever possible (NFC). Again, this is primarily intended to make
> things like string comparisons work as expected, and perhaps to simplify
> low-level tasks like graphical rendering of text.

Quite an accurate statement of the goals. Normalization is all about having a canonical order of combining code points.

>
> (Disclaimer: This is an oversimplification, because nothing about
> Unicode is ever simple.)
>


-- 
Dmitry Olshansky
May 28, 2016
On Friday, 27 May 2016 at 13:34:33 UTC, Andrei Alexandrescu wrote:
> On 5/27/16 6:56 AM, Marc Schütz wrote:
>> It is not, which has been shown by various posts in this thread.
>
> Couldn't quite find strong arguments. Could you please be more explicit on which you found most convincing? -- Andrei

There are several possibilities for what iteration over a char range can mean. (For the sake of simplicity, let's ignore special cases like `find` and `split`; instead, let's look at `walkLength`, `retro` and similar.)

BEFORE the introduction of auto decoding, it used to iterate over UTF-8 code _units_, which is wrong for any non-ASCII data (except for the unlikely case where you really want code units).

AFTER the introduction of auto decoding, it iterates over UTF-8 code _points_, which is wrong for combined characters, e.g. the decomposed äöüéòàñ you get on Mac OS X, and more "exotic" ones everywhere (except for the even more unlikely case where you really want code points).

That is, both the BEFORE and AFTER behaviour are wrong, both break for various kinds of input in different ways.

So, is AFTER an improvement over BEFORE? The set of inputs where auto decoding produces wrong output is likely smaller, making it slightly less likely to encounter problems in practice; on the other hand, it's still wrong, and it's harder to find these problems during testing. That's like "improving" a bicycle so that it only breaks down after riding it for 30 minutes instead of just after 10 minutes, so you won't notice it during a test ride.

But there are even more possibilities. It could iterate over graphemes, which is expensive, but more likely to produce the results that the user wants. Or it could iterate by lines, or words (and there are different ways to define what a word is), and so on.

The fundamental problem is choosing one of those possibilities over the others without knowing what the user actually wants, which is what both BEFORE and AFTER do.
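
To make the difference concrete (a quick sketch; byCodeUnit restores the BEFORE behaviour, plain walkLength shows the AFTER one):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string s = "o\u0308"; // ö in decomposed form, as produced on Mac OS X
    writeln(s.byCodeUnit.walkLength); // 3 UTF-8 code units  (BEFORE)
    writeln(s.walkLength);            // 2 code points       (AFTER)
    writeln(s.byGrapheme.walkLength); // 1 grapheme
}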

So, what was the original goal when introducing auto decoding? To improve correctness, right? I would argue that this goal has not been achieved. Have a look at the article [1], which IMO gives good criteria for how a _correct_ string type should behave. Both BEFORE and AFTER fail most of them.

[1] https://mortoray.com/2013/11/27/the-string-type-is-broken/
May 28, 2016
On 5/28/16 6:59 AM, Marc Schütz wrote:
> The fundamental problem is choosing one of those possibilities over the
> others without knowing what the user actually wants, which is what both
> BEFORE and AFTER do.

OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string, and furthermore iterating over the individual constituents of a string should be fairly rare. Strings and substrings yes, but not individual points/units/graphemes unless expressly asked for. (Indeed some languages treat strings as first-class entities and individual characters are mere short substrings.)

So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.


Andrei


May 28, 2016
On Friday, 27 May 2016 at 18:11:22 UTC, Andrei Alexandrescu wrote:
> On 5/27/16 10:15 AM, Chris wrote:
>> It has happened to me that characters like "é" return length == 2
>
> Would normalization make length 1? -- Andrei

No, I've tried it. I think with dchar[] you get a length of 1, or else you check by grapheme.
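
Concretely (a small sketch; whether normalization "makes length 1" depends on which length you measure):

import std.conv : to;
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    writeln("\u00E9".length);                 // 2: UTF-8 code units of precomposed é
    writeln("\u00E9".to!(dchar[]).length);    // 1: code points, precomposed form
    writeln("e\u0301".to!(dchar[]).length);   // 2: code points, decomposed form
    writeln(normalize!NFC("e\u0301")
            .to!(dchar[]).length);            // 1: NFC recombines it
    writeln("e\u0301".byGrapheme.walkLength); // 1: graphemes, either way
}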
May 28, 2016
On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
> So it harkens back to the original mistake: strings should NOT be arrays with
> the respective primitives.

An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.

A string class does not do that (from the article: "I admit the correct answer is not always clear").