June 03, 2016
On Friday, June 03, 2016 03:08:43 Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
> > At the time
> > Unicode also had to grapple with tricky issues like what to do with
> > lookalike characters that served different purposes or had different
> > meanings, e.g., the mu sign in the math block vs. the real letter mu in
> > the Greek block, or the Cyrillic A which looks and behaves exactly like
> > the Latin A, yet the Cyrillic Р, which looks like the Latin P, does
> > *not* mean the same thing (it's the equivalent of R), or the Cyrillic В
> whose lowercase is в, not b, and also has a different sound, but
> > lowercase Latin b looks very similar to Cyrillic ь, which serves a
> > completely different purpose (the uppercase is Ь, not B, you see).
>
> I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs.
>
> They should have put me in charge of Unicode. I'd have put a stop to much of the madness :-)

Actually, I would argue that the moment Unicode concerns itself with what a character actually looks like rather than with what character it logically is, it has gone outside its charter. The way that characters actually look is far too dependent on fonts, and aside from display code, code does not care one whit what the character looks like.

For instance, take the capital letter I, the lowercase letter l, and the number one. In some fonts that are feeling cruel towards folks who actually want to read them, two of those characters - or even all three of them - look identical. But I think that you'll agree that those characters should be represented as distinct characters in Unicode regardless of what they happen to look like in a particular font.
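
To see that concretely, here is a minimal D sketch printing the code point behind each of those three characters; whatever a font does to their glyphs, the encoding keeps them distinct:

    import std.stdio : writefln;

    void main()
    {
        // Capital I, lowercase l, and the digit 1 may share a glyph in
        // unfortunate fonts, but each has its own code point:
        foreach (dchar c; "Il1")
            writefln("'%s' = U+%04X", c, cast(uint) c);
        // 'I' = U+0049, 'l' = U+006C, '1' = U+0031
    }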

Now, take a Cyrillic letter that looks similar to a Latin letter. If they're logically equivalent such that no code would ever want to distinguish between the two and such that no font would ever even consider representing them differently, then they're truly the same letter, and they should only have one Unicode representation. But if anyone would ever consider them to be logically distinct, then it makes no sense for them to be considered the same character by Unicode, because they don't have the same identity. And that distinction is quite clear if any font would ever consider representing the two characters differently, no matter how slight that difference might be.

Really, what a character looks like has nothing to do with Unicode. The exact same Unicode is used regardless of how the text is displayed. Rather, what Unicode is doing is providing logical identifiers for characters so that code can operate on them; display code can then do whatever it does to display those characters, whether they happen to look similar or not. The fact that non-display code does not care one whit what a character looks like, and that display code can have drastically different visual representations for the same character, should make it clear that Unicode is concerned with identifiers for logical characters, and that that is distinct from any visual representation.

- Jonathan M Davis


June 03, 2016
On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:
> But if we were to encode appearance instead of logical meaning, that
> would mean the *same* lowercase Cyrillic ь would have multiple,
> different encodings depending on which font was in use.

I don't see that consequence at all.


> That doesn't
> seem like the right solution either.  Do we really want Unicode strings
> to encode font information too??

No.

>  'Cos by that argument, serif and sans
> serif letters should have different encodings, because in languages like
> Hebrew, a tiny little serif could mean the difference between two
> completely different letters.

If they are different letters, then they should have different code points. I don't see why this is such a hard concept.


> And what of the Arabic and Indic scripts? They would need to encode the
> same letter multiple times, each being a variation of the physical form
> that changes depending on the surrounding context. Even the Greek sigma
> has two forms depending on whether it's at the end of a word or not --
> so should it be two code points or one?

Two. Again, why is this hard to grasp? If there is meaning in having two different visual representations, then they are two code points. If the visual representation is the same, then it is one code point. If the difference is only due to font selection, then it is the same code point.
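
For what it's worth, that is what Unicode actually did with sigma: medial σ (U+03C3) and final ς (U+03C2) are two code points, and the default case mapping sends both to the same Σ (U+03A3). A quick check in D, using Phobos' std.uni (which implements the Unicode default, locale-independent mappings):

    import std.stdio : writefln;
    import std.uni : toUpper;

    void main()
    {
        // Two distinct code points for the two sigma forms...
        writefln("U+%04X U+%04X", cast(uint) 'σ', cast(uint) 'ς');
        // ...both uppercase to the same letter:
        writefln("%s %s", toUpper('σ'), toUpper('ς')); // Σ Σ
    }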


> Besides, that still doesn't solve the problem of what "i".uppercase()
> should return. In most languages, it should return "I", but in Turkish
> it should not.
> And if we really went the route of encoding Cyrillic
> letters the same as their Latin lookalikes, we'd have a problem with
> what "m".uppercase() should return, because now it depends on which font
> is in effect (if it's a Cyrillic cursive font, the correct answer is
> "Т", if it's a Latin font, the correct answer is "M" -- the other
> combinations: who knows).  That sounds far worse than what we have
> today.

The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.
June 03, 2016
On 6/3/2016 2:10 PM, Jonathan M Davis via Digitalmars-d wrote:
> Actually, I would argue that the moment Unicode concerns itself with what
> a character actually looks like rather than with what character it
> logically is, it has gone outside its charter. The way that characters
> actually look is far too dependent on fonts, and aside from display code,
> code does not care one whit what the character looks like.

What I meant was pretty clear. A font is an artistic style that changes neither context nor semantic meaning. If a font choice changes the meaning, then it is not a font.

June 03, 2016
On Friday, 3 June 2016 at 22:38:38 UTC, Walter Bright wrote:
> If a font choice changes the meaning then it is not a font.

Nah, then it is an Awesome Font that is totally Web Scale!

i wish i was making that up http://fontawesome.io/ i hate that thing

But, it is kinda legal: gotta love the Unicode private use area!
June 04, 2016
On Friday, 3 June 2016 at 18:43:07 UTC, Walter Bright wrote:
> It's almost as if printed documents and books have never existed!
some old xUSSR books that contained some English text sometimes used a Cyrillic font to typeset it. it was awful, and barely readable. this was done to ease the work of compositors, and the result was unacceptable. do you feel a recognizable pattern here? ;-)
June 03, 2016
On Fri, Jun 03, 2016 at 03:35:18PM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:
[...]
> > 'Cos by that argument, serif and sans serif letters should have different encodings, because in languages like Hebrew, a tiny little serif could mean the difference between two completely different letters.
> 
> If they are different letters, then they should have different code points. I don't see why this is such a hard concept.
[...]

It's not a hard concept, except that these different letters have lookalike forms with completely unrelated letters. Again:

- Lowercase Latin m looks visually the same as lowercase Cyrillic Т in
  cursive form. In some font renderings the two are IDENTICAL glyphs, in
  spite of being completely different, unrelated letters.  However, in
  non-cursive form, Cyrillic lowercase т is visually distinct.

- Similarly, lowercase Cyrillic П in cursive font looks like lowercase
  Latin n, and in some fonts they are identical glyphs. Again,
  completely unrelated letters, yet they have the SAME VISUAL
  REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П is
  п, which is visually distinct from Latin n.

- These aren't the only ones, either.  Other Cyrillic false friends
  include cursive Д, which in some fonts looks like lowercase Latin g.
  But in non-cursive font, it's д.

Just given the above, it should be clear that going by visual representation is NOT enough to disambiguate between these different letters.  By your argument, since lowercase Cyrillic Т is, visually, just m, it should be encoded the same way as lowercase Latin m. But this is untenable, because the letterform changes with a different font. So you end up with the unworkable idea of a font-dependent encoding.

Similarly, since lowercase Cyrillic П is n (in cursive font), we should encode it the same way as Latin lowercase n. But again, the letterform changes based on font.  Your criterion of "same visual representation" does not work outside of English.  What you imagine to be a simple, straightforward concept is far from being simple once you're dealing with the diverse languages and writing systems of the world.

Or, to use an example closer to home, uppercase Latin O and the digit 0 are visually identical in many fonts. Should they be encoded as a single code point or two?  Worse, in some fonts, the digit 0 is rendered like Ø (to differentiate it from uppercase O). Does that mean it should be encoded the same way as the Danish letter Ø?  Obviously not, but according to your "visual representation" idea, the answer should be yes.

The bottom line is that uppercase O and the digit 0 represent different LOGICAL entities, in spite of sharing the same visual representation.  Eventually you have to resort to representing *logical* entities ("characters") rather than visual appearance, which is a property of the font and has no place in a digital text encoding.
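
And indeed that is how the encoding works today; a trivial D sketch showing the three distinct code points behind the lookalikes:

    import std.stdio : writefln;

    void main()
    {
        // Lookalike glyphs, distinct logical entities:
        writefln("O = U+%04X", cast(uint) 'O'); // LATIN CAPITAL LETTER O
        writefln("0 = U+%04X", cast(uint) '0'); // DIGIT ZERO
        writefln("Ø = U+%04X", cast(uint) 'Ø'); // LATIN CAPITAL LETTER O WITH STROKE
    }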


> > Besides, that still doesn't solve the problem of what
> > "i".uppercase() should return. In most languages, it should return
> > "I", but in Turkish it should not.
> > And if we really went the route of encoding Cyrillic letters the
> > same as their Latin lookalikes, we'd have a problem with what
> > "m".uppercase() should return, because now it depends on which font
> > is in effect (if it's a Cyrillic cursive font, the correct answer is
> > "Т", if it's a Latin font, the correct answer is "M" -- the other
> > combinations: who knows).  That sounds far worse than what we have
> > today.
> 
> The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.

But what should "i".toUpper return?  Or are you saying the standard library should not include something as basic as a case-changing function?
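
For context: Phobos' std.uni.toUpper implements the Unicode default case mappings with no locale tailoring, so "i".toUpper yields "I". A Turkish-aware version would have to be layered on top; the sketch below does that with turkishToUpper, a hypothetical helper (not a Phobos function) that handles only the i/İ pair:

    import std.algorithm : map;
    import std.conv : to;
    import std.stdio : writeln;
    import std.uni : toUpper;

    // Hypothetical locale-tailored helper: Turkish maps dotted i (U+0069)
    // to dotted capital İ (U+0130); everything else falls back to the
    // Unicode default mapping. (Real tailoring also covers the ı/I pair.)
    string turkishToUpper(string s)
    {
        return s.map!(c => c == 'i' ? 'İ' : toUpper(c)).to!string;
    }

    void main()
    {
        writeln("i".toUpper);         // "I" -- Unicode default mapping
        writeln(turkishToUpper("i")); // "İ" -- Turkish tailoring
    }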


T

-- 
Customer support: the art of getting your clients to pay for your own incompetence.
June 03, 2016
On 6/3/2016 5:42 PM, ketmar wrote:
> sometimes used Cyrillic font to represent English.

Nobody here suggested using the wrong font; it's completely irrelevant.

June 04, 2016
On Saturday, 4 June 2016 at 02:46:31 UTC, Walter Bright wrote:
> On 6/3/2016 5:42 PM, ketmar wrote:
>> sometimes used Cyrillic font to represent English.
>
> Nobody here suggested using the wrong font; it's completely irrelevant.

you suggested that unicode designers should make similar-looking glyphs share the same code point, and it reminds me of this little story. maybe i misunderstood you, though.
June 03, 2016
On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:
> It's not a hard concept, except that these different letters have
> lookalike forms with completely unrelated letters. Again:
>
> - Lowercase Latin m looks visually the same as lowercase Cyrillic Т in
>   cursive form. In some font renderings the two are IDENTICAL glyphs, in
>   spite of being completely different, unrelated letters.  However, in
>   non-cursive form, Cyrillic lowercase т is visually distinct.
>
> - Similarly, lowercase Cyrillic П in cursive font looks like lowercase
>   Latin n, and in some fonts they are identical glyphs. Again,
>   completely unrelated letters, yet they have the SAME VISUAL
>   REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П is
>   п, which is visually distinct from Latin n.
>
> - These aren't the only ones, either.  Other Cyrillic false friends
>   include cursive Д, which in some fonts looks like lowercase Latin g.
>   But in non-cursive font, it's д.
>
> Just given the above, it should be clear that going by visual
> representation is NOT enough to disambiguate between these different
> letters.

It works for books. Unicode invented a problem, and came up with a thoroughly wretched "solution" that we'll be stuck with for generations. One of those bad solutions is to have the reader not know what a glyph actually is without pulling back the cover to read the code point. It's madness.


> By your argument, since lowercase Cyrillic Т is, visually,
> just m, it should be encoded the same way as lowercase Latin m. But this
> is untenable, because the letterform changes with a different font. So
> you end up with the unworkable idea of a font-dependent encoding.

Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode code point decisions.


> Or, to use an example closer to home, uppercase Latin O and the digit 0
> are visually identical. Should they be encoded as a single code point or
> two?  Worse, in some fonts, the digit 0 is rendered like Ø (to
> differentiate it from uppercase O). Does that mean that it should be
> encoded the same way as the Danish letter Ø?  Obviously not, but
> according to your "visual representation" idea, the answer should be
> yes.

Don't confuse fonts with code points. It'd be adequate if Unicode defined a canonical glyph for each code point, and let the font makers do what they wish.


>> The notion of 'case' should not be part of Unicode, as that is
>> semantic information that is beyond the scope of Unicode.
> But what should "i".toUpper return?

Not relevant to my point that Unicode shouldn't decide what "upper case" for all languages means, any more than Unicode should specify a font. Now when you argue that Unicode should make such decisions, note what a spectacularly hopeless job of it they've done.

June 03, 2016
On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:
> > It's not a hard concept, except that these different letters have lookalike forms with completely unrelated letters. Again:
> > 
> > - Lowercase Latin m looks visually the same as lowercase Cyrillic Т in cursive form. In some font renderings the two are IDENTICAL glyphs, in spite of being completely different, unrelated letters. However, in non-cursive form, Cyrillic lowercase т is visually distinct.
> > 
> > - Similarly, lowercase Cyrillic П in cursive font looks like lowercase Latin n, and in some fonts they are identical glyphs. Again, completely unrelated letters, yet they have the SAME VISUAL REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П is п, which is visually distinct from Latin n.
> > 
> > - These aren't the only ones, either.  Other Cyrillic false friends include cursive Д, which in some fonts looks like lowercase Latin g. But in non-cursive font, it's д.
> > 
> > Just given the above, it should be clear that going by visual representation is NOT enough to disambiguate between these different letters.
> 
> It works for books.

Because books don't allow their readers to change the font.


> Unicode invented a problem, and came up with a thoroughly wretched "solution" that we'll be stuck with for generations. One of those bad solutions is to have the reader not know what a glyph actually is without pulling back the cover to read the code point. It's madness.

This madness already exists *without* Unicode. If you have a page with a single glyph 'm' printed on it and show it to an English speaker, he will say it's lowercase M. Show it to a Russian speaker, and he will say it's lowercase Т.  So which letter is it, M or Т?

The fundamental problem is that writing systems for different languages interpret the same letter forms differently.  In English, lowercase g has at least two different forms (the single-storey and double-storey variants) that we recognize as the same letter. However, to a Cyrillic reader the two forms are distinct, because one of them looks like a Cyrillic letter but the other looks foreign. So should g be encoded as a single code point or two different ones?

In a similar vein, to a Cyrillic reader the glyphs т and m represent the same letter, but to an English reader they are clearly two different things.

If you're going to represent both languages, you cannot get away from needing to represent letters abstractly, rather than visually.
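
That abstract identity is exactly what the case mappings key off. A small D illustration using std.uni's default mappings: the glyph overlap between т and m is font-dependent, but the code points and their uppercase forms are not:

    import std.stdio : writefln;
    import std.uni : toUpper;

    void main()
    {
        dchar latinM    = 'm'; // U+006D LATIN SMALL LETTER M
        dchar cyrillicT = 'т'; // U+0442 CYRILLIC SMALL LETTER TE

        // Identical glyphs in some cursive fonts, yet different letters
        // with different uppercase forms:
        writefln("U+%04X -> %s", cast(uint) latinM,    toUpper(latinM));    // M
        writefln("U+%04X -> %s", cast(uint) cyrillicT, toUpper(cyrillicT)); // Т (U+0422)
    }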


> > By your argument, since lowercase Cyrillic Т is, visually, just m, it should be encoded the same way as lowercase Latin m. But this is untenable, because the letterform changes with a different font. So you end up with the unworkable idea of a font-dependent encoding.
> 
> Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.

It's not a bad font. It's standard practice for Cyrillic cursive letters to be printed with glyphs that differ from their upright forms. Russian readers can read both without any problem.  The same letter is represented by different glyphs, and therefore the abstract letter is a more fundamental unit of meaning than the glyph itself.


> > Or, to use an example closer to home, uppercase Latin O and the digit 0 are visually identical. Should they be encoded as a single code point or two?  Worse, in some fonts, the digit 0 is rendered like Ø (to differentiate it from uppercase O). Does that mean that it should be encoded the same way as the Danish letter Ø?  Obviously not, but according to your "visual representation" idea, the answer should be yes.
> 
> Don't confuse fonts with code points. It'd be adequate if Unicode defined a canonical glyph for each code point, and let the font makers do what they wish.

So should O and 0 share the same glyph or not? They're visually the same thing, even though some fonts render them differently. What should be the canonical shape of O vs. 0? If they are the same shape, then by your argument they must be the same code point, regardless of what font makers do to disambiguate them.  Good luck writing a parser that can't distinguish between an identifier that begins with O and a number literal that begins with 0.

The very fact that we distinguish between O and 0, independently of what Unicode did/does, is already proof enough that going by visual representation is inadequate.
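
Concretely, every lexer leans on that distinction. A toy D sketch, classifying a token by its leading character only, just to show the point:

    import std.ascii : isAlpha, isDigit;
    import std.stdio : writeln;

    // 'O' (U+004F) starts an identifier, '0' (U+0030) starts a number
    // literal -- distinguishable only because they are different code
    // points, however similar a font may render them.
    string classify(string token)
    {
        if (token.length == 0) return "empty";
        if (isDigit(token[0])) return "number literal";
        if (isAlpha(token[0])) return "identifier";
        return "other";
    }

    void main()
    {
        writeln(classify("O1")); // identifier
        writeln(classify("01")); // number literal
    }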


> > > The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.
> > But what should "i".toUpper return?
> 
> Not relevant to my point that Unicode shouldn't decide what "upper case" for all languages means, any more than Unicode should specify a font. Now when you argue that Unicode should make such decisions, note what a spectacularly hopeless job of it they've done.

In other words, toUpper and toLower do not belong in the standard library. Great.


T

-- 
Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be algorithms.