June 04, 2016
On Friday, 3 June 2016 at 20:53:32 UTC, H. S. Teoh wrote:
>>
> Even the Greek sigma has two forms depending on whether it's at the end of a word or not -- so should it be two code points or one? If you say two, then you'd have a problem with how to search for sigma in Greek text, and you'd have to search for either medial sigma or final sigma. But if you say one, then you'd have a problem with having two different letterforms for a single codepoint.

In Unicode there are two different codepoints for lowercase sigma, ς U+03C2 and σ U+03C3, but only one uppercase sigma, Σ U+03A3. Codepoint U+03A2 is unassigned. So your objection is not hypothetical; it is an actual issue for uppercase() and lowercase() functions.
Another difficulty, besides the dotless and dotted i of Turkic languages, is the digraphs used in the Latin transcription of Cyrillic text in Eastern and Southern Europe, dž, lj, nj and dz, which have an uppercase form (DŽ, LJ, NJ, DZ) and a titlecase form (Dž, Lj, Nj, Dz).
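
To make the sigma point concrete: with Unicode's per-codepoint (simple) case mapping, both lowercase forms map to the single uppercase Σ, so a plain toLower() cannot know which form to give back. Here is a minimal sketch in D, using std.uni's per-codepoint toUpper/toLower, which by design know nothing about word position:

import std.stdio : writeln;
import std.uni : toLower, toUpper;

void main()
{
    dchar medial = '\u03C3'; // σ
    dchar finalS = '\u03C2'; // ς

    // Both lowercase sigmas map to the one uppercase Σ (U+03A3)...
    writeln(medial.toUpper()); // Σ
    writeln(finalS.toUpper()); // Σ

    // ...so a per-codepoint toLower cannot know whether to produce σ or ς;
    // the simple case mapping just gives the medial form.
    writeln('\u03A3'.toLower()); // σ
}

A correct Greek lowercasing routine needs word context (is the sigma final or not?), which is exactly what a codepoint-level function does not have.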

>
> Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not.  And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows).  That sounds far worse than what we have today.

As an anecdote I can tell the story of the accession of Romania and Bulgaria to the European Union in 2007. The issue was that a few letters used by Romanian and Bulgarian had been forgotten by the Unicode consortium (Ș U+0218, ș U+0219, Ț U+021A, ț U+021B, plus 2 Cyrillic letters that I do not remember). The Romanians used Ş, ş, Ţ and ţ (U+015E, U+015F, U+0162 and U+0163) as replacements, which look somewhat alike. When the Commission finally managed to get Microsoft to correct the fonts to include the proper letters, we could start to correct the data. The transition was finished in 2012, and it was only possible because no other language we deal with uses the "wrong" codepoints (Turkish does, but fortunately we only have a handful of Turkish records in our databases). So: 5 years of ad hoc processing for the substitution of 4 codepoints.
BTW: using combining diacritics was out of the question at the time, simply because Microsoft Word didn't support them and many documents we encountered still used only codepages (one also has to remember that in a big institution like the EC, the IT is always several years behind the open market, which means that when a product is at release X, the Institution might still be using a release that is 5 years older).
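
For what it's worth, the mechanical part of such a cleanup is tiny; the five years were spent on everything around it. A minimal sketch (in D, with std.string.translate) of the codepoint substitution itself, assuming the data could simply be rewritten wholesale, which, as noted above, was only safe because no other language in the databases relied on the "wrong" codepoints:

import std.stdio : writeln;
import std.string : translate;

void main()
{
    // Map the cedilla stand-ins to the comma-below letters they stood in for.
    dchar[dchar] fix;
    fix['\u015E'] = '\u0218'; // Ş -> Ș
    fix['\u015F'] = '\u0219'; // ş -> ș
    fix['\u0162'] = '\u021A'; // Ţ -> Ț
    fix['\u0163'] = '\u021B'; // ţ -> ț

    writeln(translate("ştiinţă", fix)); // "știință", now with the proper codepoints
}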


June 04, 2016
One also has to take into consideration that Unicode is the way it is because it was not invented in a vacuum. It had to take the existing systems into account and find compromises that would allow its adoption. Even if they had invented the perfect encoding, NO ONE WOULD HAVE USED IT, as it would have fubar'd everything that already existed.
As it was designed, it allowed a (relatively) smooth transition. Here are some of the points that made it even possible for Unicode to be adopted at all:
- 16 bits: while that choice was a bit shortsighted, 16 bits is a good compromise between compactness and richness (the BMP suffices to express nearly all living languages).
- Using more or less the same arrangement of codepoints as in the different codepages. This made it possible to transform legacy documents with simple scripts (as a matter of fact, I wrote a script to repair misencoded Greek documents; it consisted mainly of the line "unich = ch > 0x80 ? ch + 0x2D0 : ch;" - see the sketch after this list).
- UTF-8: this was the stroke of genius, the encoding that allowed mixing it all without requiring awful acrobatics (Joakim is completely out to lunch on that one; shifting encodings without self-synchronisation are hellish, which is why the Chinese and Japanese adopted Unicode without hesitation: they had enough experience with their legacy encodings).
- Allowing time for the transition.
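
Since the Greek repair mentioned in the second point really is almost a one-liner, here is a minimal sketch of such a script (the helper name is mine, and it assumes the broken input was ISO-8859-7 bytes stored one per character; a real repair would still special-case the handful of codepoints in that codepage that don't follow the +0x2D0 offset):

import std.stdio : writeln;

// Hypothetical helper: turn ISO-8859-7 (Greek codepage) bytes into a Unicode string.
// Bytes in the upper half are shifted into the Greek and Coptic block,
// e.g. 0xC1 ('Α' in ISO-8859-7) + 0x2D0 = U+0391 'Α'.
dstring repairGreek(const(ubyte)[] greekBytes)
{
    dchar[] result;
    result.reserve(greekBytes.length);
    foreach (ch; greekBytes)
        result ~= cast(dchar)(ch > 0x80 ? ch + 0x2D0 : ch);
    return result.idup;
}

void main()
{
    ubyte[] bytes = [0xC1, 0xE8, 0xDE, 0xED, 0xE1]; // "Αθήνα" in ISO-8859-7
    writeln(repairGreek(bytes)); // Αθήνα
}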

So all the points that people here criticize were in fact the reasons why Unicode could even become the standard it is now.
June 04, 2016
On 6/3/2016 11:17 PM, H. S. Teoh via Digitalmars-d wrote:
> On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:
>> It works for books.
> Because books don't allow their readers to change the font.

Unicode is not the font.


> This madness already exists *without* Unicode. If you have a page with a
> single glyph 'm' printed on it and show it to an English speaker, he
> will say it's lowercase M. Show it to a Russian speaker, and he will say
> it's lowercase Т.  So which letter is it, M or Т?

It's not a problem that Unicode can solve. As you said, the meaning is in the context. Unicode has no context, and tries to solve something it cannot.

('m' doesn't always mean m in English, either. It depends on the context.)

Ya know, if Unicode actually solved these problems, you'd have a case. But it doesn't, and so you don't :-)


> If you're going to represent both languages, you cannot get away from
> needing to represent letters abstractly, rather than visually.

Books do visually just fine!


> So should O and 0 share the same glyph or not? They're visually the same
> thing,

No, they're not. Not even on old typewriters where every key was expensive. Even without the slash, the O tends to be fatter than the 0.


> The very fact that we distinguish between O and 0, independently of what
> Unicode did/does, is already proof enough that going by visual
> representation is inadequate.

Except that you right now are using a font where they are different enough that you have no trouble at all distinguishing them without bothering to look it up. And so am I.


> In other words toUpper and toLower does not belong in the standard
> library. Great.

Unicode and the standard library are two different things.

June 04, 2016
On 03/06/2016 20:12, Dmitry Olshansky wrote:
> On 02-Jun-2016 23:27, Walter Bright wrote:

>> I wonder what rationale there is for Unicode to have two different
>> sequences of codepoints be treated as the same. It's madness.
>
> Yeah, Unicode was not meant to be easy it seems. Or this is whatever
> happens with evolutionary design that started with "everything is a
> 16-bit character".
>

Typing as someone who has spent some time creating typefaces: having two representations makes sense, and it didn't start with Unicode; it started with movable type.

It is much easier for a font designer to create the two-codepoint versions of characters in most instances, i.e. to make the base letters and the diacritics once. Then what I often do is make single-codepoint versions of the ones I'm likely to use, but only if they need more tweaking than the kerning options of the font format allow. I'll omit the history lesson on how this was similar in the case of movable type.

Keyboards for different languages mean that a character that is a single keystroke in one language may be two keystrokes, together or in sequence, in another. This means that Unicode not only represents completed strings, but also strings that are mid-composition. The ordering that it uses to ensure that graphemes have a single canonical representation is based on the order in which those multi-key characters are entered. I wouldn't call it elegant, but it's not inelegant either.
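
To make the two-representation point concrete, here is a small sketch using D's std.uni normalization: a precomposed character and its base-plus-combining-mark spelling differ as codepoint sequences, but normalize to the same canonical form (NFC composed, NFD decomposed).

import std.stdio : writeln;
import std.uni : normalize, NFC, NFD;

void main()
{
    string precomposed = "\u00E9";   // é as a single codepoint
    string combining   = "e\u0301";  // e followed by a combining acute accent

    writeln(precomposed == combining);                                // false: different codepoint sequences
    writeln(normalize!NFC(precomposed) == normalize!NFC(combining));  // true: same composed form
    writeln(normalize!NFD(precomposed) == normalize!NFD(combining));  // true: same decomposed form
}

The canonical ordering mentioned above is what makes the decomposed form unique when a base letter carries more than one combining mark.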

Trying to represent all sufficiently similar glyphs with the same codepoint would lead to a layout problem: how would you order them so that strings of any language could be sorted by their local sorting rules, without having to special-case the algorithms?

Also consider ligatures, such as those for "ff", "fi", "ffi", "fl", "ffl" and many, many more. Typographers create these glyphs whenever available kerning tools do a poor job of combining them from the individual glyphs. From the point of view of meaning they should still be represented as individual codepoints, but for display (electronic or print) that sequence needs to be replaced with the single codepoint for the ligature.

I think that in order to understand the decisions of the Unicode committee, one has to consider that they are trying to unify the concerns of representing written information from two sides. One side prioritises storage and manipulation, while the other considers aesthetics and design workflow more important. My experience of using Unicode from both sides gives me a different appreciation for the difficulties of reconciling the two.

A...

P.S.

Then they started adding emojis, and I lost all faith in humanity ;)
June 05, 2016
On Friday, 3 June 2016 at 12:04:39 UTC, Chris wrote:
> I do exactly this. Validate and normalize.

And once you've done this, auto decoding is useless because the same character has the same representation anyway.

June 05, 2016
On Friday, 3 June 2016 at 18:43:07 UTC, Walter Bright wrote:
> On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:
>> Eventually you have no choice but to encode by logical meaning rather
>> than by appearance, since there are many lookalikes between different
>> languages that actually mean something completely different, and often
>> behaves completely differently.
>
> It's almost as if printed documents and books have never existed!

TIL: books are read by computers.

June 05, 2016
On Saturday, 4 June 2016 at 03:03:16 UTC, Walter Bright wrote:
> Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.
>

Interestingly enough, I've mentioned earlier here that only people from the US would believe that documents with mixed languages aren't commonplace. I wasn't expecting to be proven right that fast.

June 05, 2016
On Friday, June 03, 2016 15:38:38 Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 2:10 PM, Jonathan M Davis via Digitalmars-d wrote:
> > Actually, I would argue that the moment that Unicode is concerned with
> > what
> > the character actually looks like rather than what character it logically
> > is that it's gone outside of its charter. The way that characters
> > actually look is far too dependent on fonts, and aside from display code,
> > code does not care one whit what the character looks like.
>
> What I meant was pretty clear. Font is an artistic style that does not change context nor semantic meaning. If a font choice changes the meaning then it is not a font.

Well, maybe I misunderstood what was being argued, but it seemed like you've been arguing that two characters should be considered the same just because they look similar, whereas H. S. Teoh is arguing that two characters can be logically distinct while still looking similar and that they should be treated as distinct in Unicode because they're logically distinct. And if that's what's being argued, then I agree with H. S. Teoh.

I expect - at least ideally - for Unicode to contain identifiers for characters that are distinct from whatever their visual representation might be. Stuff like fonts then worries about how to display them, and hopefully don't do stupid stuff like make a capital I look like a lowercase l (though they often do, unfortunately). But if two characters in different scripts - be they latin and cyrillic or whatever - happen to often look the same but would be considered two different characters by humans, then I would expect Unicode to consider them to be different, whereas if no one would reasonably consider them to be anything but exactly the same character, then there should only be one character in Unicode.

However, if we really have crazy stuff where subtly different visual representations of the letter g are considered to be one character in English and two in Russian, then maybe those should be three different characters in Unicode so that the English text can clearly be operating on g, whereas the Russian text is doing whatever it does with its two characters that happen to look like g. I don't know. That sort of thing just gets ugly. But I definitely think that Unicode characters should be made up of what the logical characters are and leave the visual representation up to the fonts and the like.

Now, how to deal with uppercase vs lowercase and all of that sort of stuff is a completely separate issue IMHO, and that comes down to how the characters are somehow logically associated with one another, and it's going to be very locale-specific such that it's not really part of the core of Unicode's charter IMHO (though I'm not sure that it's bad if there's a set of locale rules that go along with Unicode for those looking to correctly apply such rules - they just have nothing to do with code points and graphemes and how they're represented in code).

- Jonathan M Davis
June 05, 2016
On Saturday, 4 June 2016 at 08:12:47 UTC, Walter Bright wrote:
> On 6/3/2016 11:17 PM, H. S. Teoh via Digitalmars-d wrote:
>> On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:
>>> It works for books.
>> Because books don't allow their readers to change the font.
>
> Unicode is not the font.
>
>
>> This madness already exists *without* Unicode. If you have a page with a
>> single glyph 'm' printed on it and show it to an English speaker, he
>> will say it's lowercase M. Show it to a Russian speaker, and he will say
>> it's lowercase Т.  So which letter is it, M or Т?
>
> It's not a problem that Unicode can solve. As you said, the meaning is in the context. Unicode has no context, and tries to solve something it cannot.
>
> ('m' doesn't always mean m in English, either. It depends on the context.)
>
> Ya know, if Unicode actually solved these problems, you'd have a case. But it doesn't, and so you don't :-)
>
>
>> If you're going to represent both languages, you cannot get away from
>> needing to represent letters abstractly, rather than visually.
>
> Books do visually just fine!
>
>
>> So should O and 0 share the same glyph or not? They're visually the same
>> thing,
>
> No, they're not. Not even on old typewriters where every key was expensive. Even without the slash, the O tends to be fatter than the 0.
>
>
>> The very fact that we distinguish between O and 0, independently of what
>> Unicode did/does, is already proof enough that going by visual
>> representation is inadequate.
>
> Except that you right now are using a font where they are different enough that you have no trouble at all distinguishing them without bothering to look it up. And so am I.
>
>
>> In other words toUpper and toLower does not belong in the standard
>> library. Great.
>
> Unicode and the standard library are two different things.

Even if characters in different languages share a glyph or look identical, though, it makes sense to duplicate them with different code points/units/whatever.

Simple functions like isCyrillicLetter() can then do a simple less-than / greater-than comparison instead of needing a lookup table to check different numeric representations scattered throughout the Unicode table. Functions like toUpper and toLower become easier to write as well (for SOME languages anyhow): it's simply myletter +/- numlettersinalphabet (see the sketch below). Redundancy here is very helpful.
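
As a rough sketch of that point: the helper names below are just hypothetical, in the spirit of the isCyrillicLetter() above, and this only covers the basic А..я block; Ё/ё and the other extensions immediately break the neat offset, which is the "for SOME languages" caveat.

// Hypothetical helpers. Basic Cyrillic block: А..Я at U+0410..U+042F, а..я at U+0430..U+044F.
bool isBasicCyrillicLetter(dchar c)
{
    return c >= '\u0410' && c <= '\u044F';
}

dchar toUpperBasicCyrillic(dchar c)
{
    // Lowercase letters sit exactly 32 codepoints (one alphabet length) above their uppercase forms.
    if (c >= '\u0430' && c <= '\u044F')
        return cast(dchar)(c - 32);
    return c;
}

unittest
{
    assert(isBasicCyrillicLetter('\u0434'));             // д
    assert(!isBasicCyrillicLetter('d'));
    assert(toUpperBasicCyrillic('\u0434') == '\u0414');  // д -> Д
}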

Maybe instead of Unicode they should have called it Babel... :)

"The Lord said, “If as one people speaking the same language they have begun to do this, then nothing they plan to do will be impossible for them. Come, let us go down and confuse their language so they will not understand each other.”"

-Jon
June 05, 2016
On 6/5/2016 1:07 AM, deadalnix wrote:
> On Saturday, 4 June 2016 at 03:03:16 UTC, Walter Bright wrote:
>> Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode
>> codepoint decisions.
>>
>
> Interestingly enough, I've mentioned earlier here that only people from the US
> would believe that documents with mixed languages aren't commonplace. I wasn't
> expecting to be proven right that fast.
>

You'd be in error. I've been casually working on my grandfather's thesis, trying to make a web version of it, and it is mixed German, French, and English. I've also made a digital version of an old history book that mixes English, Old English, German, French, Greek, Old Greek, and Egyptian hieroglyphs (available on Amazons in your neighborhood!).

I've also lived in Germany for 3 years, though that was before computers took over the world.