June 02, 2016
On Thu, Jun 02, 2016 at 04:29:48PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 06/02/2016 04:22 PM, cym13 wrote:
> > 
> > A:“We should decode to code points”
> > B:“No, decoding to code points is a stupid idea.”
> > A:“No it's not!”
> > B:“Can you show a concrete example where it does something useful?”
> > A:“Sure, look at that!”
> > B:“This isn't working at all, look at all those counter-examples!”
> > A:“It may not work for your examples but look how easy it is to
> >     find code points!”
> 
> With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- Andrei

With ASCII strings, all of std.algorithm operates correctly on ASCII bytes. So let's standardize on ASCII strings.

What a vacuous argument! Basically, you're saying "I define code points to be correct. Therefore, I conclude that decoding to code points is correct." Well, duh. Unfortunately, such vacuous conclusions have no bearing on the real world of Unicode handling.


T

-- 
I am Ohm of Borg. Resistance is voltage over current.
June 03, 2016
Am Thu, 2 Jun 2016 18:54:21 -0400
schrieb Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org>:

> On 06/02/2016 06:10 PM, Marco Leise wrote:
> > Am Thu, 2 Jun 2016 15:05:44 -0400
> > schrieb Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org>:
> > 
> >> On 06/02/2016 01:54 PM, Marc Schütz wrote:
> >>> Which practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units?
> >>
> >> Pretty much everything.
> >>
> >> s.all!(c => c == 'ö')
> >
> > Andrei, your ignorance is really starting to grind on everyone's nerves.
> 
> Indeed there seem to be serious questions about my competence, basic comprehension, and now knowledge.

That's not my general impression, but something is different with this thread.

> I understand it is tempting to assume that a disagreement is caused by the other simply not understanding the matter. Even if that were true it's not worth sacrificing civility over it.

Civility has kept us caught in a 36-page-long, tiresome debate in which we mostly talk past each other. I was being impolite and I can't say I regret it, because I prefer this answer over the rest of the thread. It's more informed, elaborate, and conclusive.

> > If after 350 posts you still don't see
> > why this is incorrect: s.any!(c => c == 'o'), you must be
> > actively skipping the informational content of this thread.
> 
> Is it 'o' with an umlaut or without?
>
> At any rate, consider s of type string and x of type dchar. The dchar type is defined as "a Unicode code point", or at least, to my understanding, that has been a reasonable definition to operate with in the D language ever since its first release. Also in the D language, the various string types char[], wchar[], etc., with their respective qualified versions, are meant to hold Unicode strings in one of the UTF-8, UTF-16, and UTF-32 encodings.
>
> Following these definitions, it stands to reason to infer that the call s.find(c => c == x) means "find the code point x in string s and return the balance of s positioned there". It's prima facie application of the definitions of the entities involved.
> 
> Is this the only possible or recommended meaning? Most likely not, viz. the subtle cases in which a given grapheme is represented via either one or multiple code points by means of combining characters. Is it the best possible meaning? It's even difficult to define what "best" means (fastest, covering most languages, etc).
> 
> I'm not claiming that meaning is the only possible, the only recommended, or the best possible. All I'm arguing is that it's not retarded, and within a certain universe confined to operating at code point level (which is reasonable per the definitions of the types involved) it can be considered correct.
> 
> If at any point in the reasoning above some rampant ignorance comes about, please point it out.

No, it's pretty close now. We can all agree that there is no "best" way, only different use cases. Just defining Phobos to work on code points gives the illusion that it does the correct thing in all use cases - after all, D claims to support Unicode. But if you want to iterate over visual letters it is incorrect, and it is needlessly slow when you work on ASCII-structured formats (JSON, XML, paths, Warp, ...). Then there is the job of explaining the different default iteration schemes when using foreach vs. the range API (no big deal, just not easily justified), and the implementation cost of dealing with char[]/wchar[].

From this observation we concluded that decoding should be opt-in and that when we need it, it should be a conscious decision. Unicode is quite complex and learning about the difference between code points and grapheme clusters when segmenting strings will benefit code quality.
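A rough sketch of what that opt-in looks like with the range adapters Phobos already ships (the helper names below are invented for illustration; only byCodeUnit and byGrapheme are real):

import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Hypothetical helpers, named only for illustration.
bool hasPathSeparator(string path)
{
    // ASCII-structured data: skip decoding and scan raw UTF-8 code units.
    return path.byCodeUnit.canFind('/');
}

size_t displayedLength(string s)
{
    // "Visual letters": consciously opt in to grapheme segmentation.
    return s.byGrapheme.walkLength;
}

Either way the caller states which level it wants, instead of inheriting code points by default.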

As for the question, do multi-code-point graphemes ever appear
in the wild? OS X is known to use NFD on its native file
system, and there is a hint on Wikipedia that some symbols
from Thai or from Hindi's Devanagari script need them:
https://en.wikipedia.org/wiki/UTF-8#Disadvantages
Some form of Lithuanian seems to have a use for them, too:
http://www.unicode.org/L2/L2012/12026r-n4191-lithuanian.pdf
Aside from those, there is nothing generally wrong with
decomposed letters appearing in strings, even though the
use of NFC is encouraged.
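To tie that back to the 'ö' example from up-thread, here is a minimal sketch (the strings are made up; 'ö' is U+00F6, U+0308 is the combining diaeresis) of how a code-point-level search answers differently for the NFC and NFD spellings of the same word:

import std.algorithm.searching : canFind;
import std.stdio : writeln;
import std.uni : normalize, NFC;

void main()
{
    string nfc = "sch\u00F6n";   // precomposed ö
    string nfd = "scho\u0308n";  // o + combining diaeresis, same visible text

    writeln(nfc.canFind('ö'));                // true
    writeln(nfd.canFind('ö'));                // false: decoding alone didn't help
    writeln(nfd.normalize!NFC.canFind('ö'));  // true again after normalizing
}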

> > […harsh tone removed…] in the end we have to assume you
> > will make a decisive vote against any PR with the intent
> > to remove auto-decoding from Phobos.
> 
> This seems to assume I have some vested interest in the position that makes it independent of facts. That is not the case. I do what I think is right to do, and you do what you think is right to do.

Your vote outweighs that of many others, for better or worse.
When a decision needs to be made and the community is divided,
we need you or Walter or anyone who is invested in the matter
to cast a ruling vote. However, when several dozen people
support an idea after discussion, after hearing everyone's
arguments, with practically no objections, and you overrule
them all, tensions build up. I welcome the idea of delegating
some tasks to smaller groups. No single person is knowledgeable
in every area of CS, and both a bus factor of 1 and too big a
group can hinder decision making.
It would help to know for the future whether you understand
your role as one with veto powers, or whether you could
arrange to hand some decisions over to the community, and if
so under what conditions.

> > Your so called vocal minority is actually D's panel of Unicode experts who understand that auto-decoding is a false ally and should be on the deprecation track.
> 
> They have failed to convince me. But I am more convinced than before that RCStr should not offer a default mode of iteration. I think its impact is lost in this discussion, because once it's understood RCStr will become D's recommended string type, the entire matter becomes moot.
>
> > Remember final-by-default? You promised that, given your objection to breaking code, D2 would only continue to be fixed in a backwards-compatible way, be it the implementation of shared or whatever else. Yet months later you opened a thread with the title "inout must go". So that must have been an appeasement back then. People don't forget these things easily, and RCStr seems to be a similar distraction, considering we haven't looked into borrowing/scoped enough and you promise wonders from it.
> 
> What the hell is this, digging up dirt on me? Paying back debts? Please stop that crap.

No, that was my actual impression. I must apologize for generalizing it to other people, though. I welcome the RCStr project and hope it will be good. At this time, though, it is not yet fleshed out and we can't tell how fast its adoption will be. Remember that DIPs on scope and RC have tended in the past to turn into long debates with unclear outcomes. Unlike this thread, which may be the first in D's forum history with such high agreement across the board.

> Andrei

-- 
Marco

June 03, 2016
On Thu, Jun 02, 2016 at 05:19:48PM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/2/2016 3:27 PM, John Colvin wrote:
> > > I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.
> > 
> > There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character?
> 
> I didn't say ordering, I said there should be no such thing as "normalization" in Unicode, where two codepoints are considered to be identical to some other codepoint.

I think it was a combination of historical baggage and trying to accommodate unusual but still valid use cases.

The historical baggage was that Unicode was trying to unify all of the various already-existing codepages out there, and many of those codepages already came with various precomposed characters. To maximize compatibility with existing codepages, Unicode tried to preserve as many of the original mappings as possible within each 256-point block, so these precomposed characters became part of the standard.

However, there weren't enough of them -- some people demanded less common character + diacritic combinations, and some languages had writing so complex their characters had to be composed from more basic parts. The original Unicode range was 16-bit, so there wasn't enough room to fit all of the precomposed characters people demanded, plus there were other things people wanted, like multiple diacritics (e.g., in IPA). So the concept of combining diacritics was invented, in part to prevent combinatorial explosion from soaking up the available code point space, in part to allow for novel combinations of diacritics that somebody out there somewhere might want to represent.  However, this meant that some precomposed characters were "redundant": they represented character + diacritic combinations that could equally well be expressed separately. Normalization was the inevitable consequence. (Normalization, of course, also subsumes a few other things, such as collation, but this is one of the factors behind it.)
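In D terms the redundancy looks like this (a minimal sketch; é is just an arbitrary example of a character with both spellings):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize, NFD;

void main()
{
    string composed   = "\u00E9";   // é as one precomposed code point
    string decomposed = "e\u0301";  // e followed by a combining acute accent

    writeln(composed == decomposed);                              // false
    writeln(composed.walkLength, " ", decomposed.walkLength);     // 1 2 (code points)
    writeln(decomposed.byGrapheme.walkLength);                    // 1 grapheme
    writeln(composed.normalize!NFD == decomposed.normalize!NFD);  // true
}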

(This is a greatly over-simplified description, of course. At the time Unicode also had to grapple with tricky issues like what to do with lookalike characters that served different purposes or had different meanings, e.g., the mu sign in the math block vs. the real letter mu in the Greek block, or the Cyrillic A which looks and behaves exactly like the Latin A, yet the Cyrillic Р, which looks like the Latin P, does *not* mean the same thing (it's the equivalent of R), or the Cyrillic В whose lowercase is в not b, and also had a different sound, but lowercase Latin b looks very similar to Cyrillic ь, which serves a completely different purpose (the uppercase is Ь, not B, you see). Then you have the wonderful Indic and Arabic cursive writings, where letterforms mutate depending on the surrounding context, which, if you were to include all variants as distinct code points, would occupy many more pages than they currently do.  And also sticky issues like the oft-mentioned Turkish i, which is encoded as a Latin i but behaves differently w.r.t. upper/lowercasing when in Turkish locale -- some cases of this, IIRC, are unfixable bugs in Phobos because we currently do not handle locales. So you see, imagining that code points == the solution to Unicode string handling is a joke. Writing correct Unicode handling is *hard*.)

As with all sufficiently complex software projects, Unicode represents a compromise between many contradictory factors -- writing systems in the world being the complex, not-very-consistent beasts they are -- so such "dirty" details are somewhat inevitable.


T

-- 
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
June 03, 2016
On Thursday, June 02, 2016 15:05:44 Andrei Alexandrescu via Digitalmars-d wrote:
> The intent of autodecoding was to make std.algorithm work meaningfully
> with strings. As it's easy to see I just went through
> std.algorithm.searching alphabetically and found issues literally with
> every primitive in there. It's an easy exercise to go forth with the others.

It comes down to the question of whether it's better to fail quickly when Unicode is handled incorrectly, so that it's obvious you're doing it wrong, or whether it's better for string code to work in a large number of cases so that it "just works" most of the time yet is still wrong in the general case. The latter failure is far less obvious, so many folks won't realize that they need to do more to make their string handling Unicode-correct.

With code units - especially UTF-8 - it becomes obvious very quickly that treating each element of the string/range as a character is wrong. With code points, you have to work far harder to find examples that are incorrect. So, it's not at all obvious (especially to the lay programmer) that the Unicode handling is incorrect and that their code is wrong - but their code will end up working a large percentage of the time in spite of it being wrong in the general case.

So, yes, it's trivial to show how operating on ranges of code units as if they were characters gives incorrect results far more easily than operating on ranges of code points does. But operating on code points as if they were characters is still going to give incorrect results in the general case.

Regardless of auto-decoding, the answer is that the programmer needs to understand the Unicode issues and use ranges of code units, code points, or graphemes as appropriate. It's just that if we default to handling code points, then a lot of code will be written which treats those as characters, and it will provide the correct result more often than it would if it treated code units as characters.
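For instance (a minimal sketch; the string is an arbitrary example containing a decomposed ë), the same text yields three different "lengths" depending on which layer you ask:

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string s = "noe\u0308l";  // "noël" with e + combining diaeresis

    writeln(s.byCodeUnit.walkLength);   // 6 UTF-8 code units
    writeln(s.walkLength);              // 5 code points (auto-decoded default)
    writeln(s.byGrapheme.walkLength);   // 4 graphemes - what users call characters
}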

In any case, I've probably posted too much in this thread already. It's clear that the first step to solving this problem is to improve Phobos so that it handles ranges of code units, code points, and graphemes correctly whether auto-decoding is involved or not, and only then can we consider the possibility of removing auto-decoding (and even then, the answer may still be that we're stuck, because we consider the resulting code breakage to be too great). But whether Phobos retains auto-decoding or not, the Unicode handling in general is the same, and what we need to do to improve the situation is the same. So, clearly, I need to do a much better job of finding time to work on D so that I can create some PRs to help the situation. Unfortunately, it's far easier to find a few minutes here and there while waiting on other stuff to shoot off a post or two in the newsgroup than it is to find time to substantively work on code. :|

- Jonathan M Davis

June 03, 2016
On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
> However, this
> meant that some precomposed characters were "redundant": they
> represented character + diacritic combinations that could equally well
> be expressed separately. Normalization was the inevitable consequence.

It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead.

There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple.

I.e. have the normalization up front when the text is created rather than everywhere else.
June 03, 2016
On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
> At the time
> Unicode also had to grapple with tricky issues like what to do with
> lookalike characters that served different purposes or had different
> meanings, e.g., the mu sign in the math block vs. the real letter mu in
> the Greek block, or the Cyrillic A which looks and behaves exactly like
> the Latin A, yet the Cyrillic Р, which looks like the Latin P, does
> *not* mean the same thing (it's the equivalent of R), or the Cyrillic В
> whose lowercase is в not b, and also had a different sound, but
> lowercase Latin b looks very similar to Cyrillic ь, which serves a
> completely different purpose (the uppercase is Ь, not B, you see).

I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs.

They should have put me in charge of Unicode. I'd have put a stop to much of the madness :-)
June 03, 2016
On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
> On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
>> However, this
>> meant that some precomposed characters were "redundant": they
>> represented character + diacritic combinations that could equally well
>> be expressed separately. Normalization was the inevitable consequence.
>
> It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead.
>
> There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple.
>
> I.e. have the normalization up front when the text is created rather than everywhere else.

I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposed characters, because doing so would turn previously valid sequences into invalid ones.
June 03, 2016
On Friday, 3 June 2016 at 10:08:43 UTC, Walter Bright wrote:
> On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
>> At the time
>> Unicode also had to grapple with tricky issues like what to do with
>> lookalike characters that served different purposes or had different
>> meanings, e.g., the mu sign in the math block vs. the real letter mu in
>> the Greek block, or the Cyrillic A which looks and behaves exactly like
>> the Latin A, yet the Cyrillic Р, which looks like the Latin P, does
>> *not* mean the same thing (it's the equivalent of R), or the Cyrillic В
>> whose lowercase is в not b, and also had a different sound, but
>> lowercase Latin b looks very similar to Cyrillic ь, which serves a
>> completely different purpose (the uppercase is Ь, not B, you see).
>
> I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs.

That's not right either. Cyrillic letters can look slightly different from their Latin lookalikes in some circumstances.

I'm sure there are extremely good reasons for not using the Latin lookalikes in the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use separate codes for the lookalikes. It's not restricted to Unicode.
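For what it's worth, the distinction is observable beyond the glyphs; a quick sketch with std.uni (the characters are the usual suspects from the post above):

import std.stdio : writeln;
import std.uni : toLower;

void main()
{
    writeln("A" == "\u0410");               // false: Latin A vs. Cyrillic А
    writeln("\u0412".toLower == "\u0432");  // true: Cyrillic В lowercases to в...
    writeln("B".toLower == "b");            // ...while Latin B lowercases to b
}

If the lookalikes shared a code point, case mapping alone would already be ambiguous.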

June 03, 2016
On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via Digitalmars-d wrote:
> On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
> > On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
> >> However, this
> >> meant that some precomposed characters were "redundant": they
> >> represented character + diacritic combinations that could
> >> equally well
> >> be expressed separately. Normalization was the inevitable
> >> consequence.
> >
> > It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead.
> >
> > There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple.
> >
> > I.e. have the normalization up front when the text is created rather than everywhere else.
>
> I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposed characters, because doing so would turn previously valid sequences into invalid ones.

I would have argued that no precomposed characters should ever have existed, regardless of what was done in previous encodings. They're redundant, and you need the decomposed forms anyway to avoid a combinatorial explosion of characters, so you can't consistently have characters that exist only in a precomposed version. However, the Unicode folks obviously didn't go that route. But given where we sit now, even though we're stuck with some precomposed characters, I'd argue that we should at least never add any new ones. But who knows what the Unicode folks are actually going to do.

As it is, you probably should normalize strings in many cases where they enter the program, just like ideally, you'd validate them when they enter the program. But regardless, you have to deal with the fact that multiple normalization schemes exist and that there's no guarantee that you're even going to get valid Unicode, let alone Unicode that's normalized the way you want.
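Something along these lines at the boundary (a sketch; std.utf.validate and std.uni.normalize are the relevant Phobos pieces, the input string is invented):

import std.stdio : writeln;
import std.uni : normalize, NFC, NFD;
import std.utf : validate;

void main()
{
    string input = "e\u0301tude";  // decomposed input, e.g. a file name from OS X

    validate(input);                       // throws UTFException on malformed UTF-8
    auto canonical = input.normalize!NFC;  // settle on one form once, up front

    writeln(canonical == "\u00E9tude");         // true
    writeln(canonical.normalize!NFD == input);  // NFD round-trips back to the input
}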

- Jonathan M Davis

June 03, 2016
On Friday, 3 June 2016 at 11:46:50 UTC, Jonathan M Davis wrote:
> On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via Digitalmars-d wrote:
>> On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
>> > On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
>> >> However, this
>> >> meant that some precomposed characters were "redundant": they
>> >> represented character + diacritic combinations that could
>> >> equally well
>> >> be expressed separately. Normalization was the inevitable
>> >> consequence.
>> >
>> > It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead.
>> >
>> > There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple.
>> >
>> > I.e. have the normalization up front when the text is created rather than everywhere else.
>>
>> I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposed characters, because doing so would turn previously valid sequences into invalid ones.
>
> I would have argued that no precomposed characters should ever have existed, regardless of what was done in previous encodings. They're redundant, and you need the decomposed forms anyway to avoid a combinatorial explosion of characters, so you can't consistently have characters that exist only in a precomposed version. However, the Unicode folks obviously didn't go that route. But given where we sit now, even though we're stuck with some precomposed characters, I'd argue that we should at least never add any new ones. But who knows what the Unicode folks are actually going to do.
>
> As it is, you probably should normalize strings in many cases where they enter the program, just like ideally, you'd validate them when they enter the program. But regardless, you have to deal with the fact that multiple normalization schemes exist and that there's no guarantee that you're even going to get valid Unicode, let alone Unicode that's normalized the way you want.
>
> - Jonathan M Davis

I do exactly this. Validate and normalize.