The Case Against Autodecode

On 06/02/2016 05:37 PM, Andrei Alexandrescu wrote:
> On 6/2/16 5:35 PM, deadalnix wrote:
>> On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
>>> On 6/2/16 5:20 PM, deadalnix wrote:
>>>> The good thing when you define works by whatever it does right now
>>>
>>> No, it works as it was designed. -- Andrei
>>
>> Nobody says it doesn't. Everybody says the design is crap.
>
> I think I like it more after this thread. -- Andrei

Well there's a fantastic argument.

June 03, 2016

Re: The Case Against Autodecode

Posted by H. S. Teoh
in reply to Vladimir Panteleev

On Fri, Jun 03, 2016 at 10:14:15AM +0000, Vladimir Panteleev via Digitalmars-d wrote:
> On Friday, 3 June 2016 at 10:08:43 UTC, Walter Bright wrote:
> > On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
> > > At the time Unicode also had to grapple with tricky issues like
> > > what to do with lookalike characters that served different
> > > purposes or had different meanings, e.g., the mu sign in the math
> > > block vs. the real letter mu in the Greek block, or the Cyrillic A
> > > which looks and behaves exactly like the Latin A, yet the Cyrillic
> > > Р, which looks like the Latin P, does *not* mean the same thing
> > > (it's the equivalent of R), or the Cyrillic В whose lowercase is в
> > > not b, and also had a different sound, but lowercase Latin b looks
> > > very similar to Cyrillic ь, which serves a completely different
> > > purpose (the uppercase is Ь, not B, you see).
> > 
> > I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs.
> 
> That's not right either. Cyrillic letters can look slightly different from their latin lookalikes in some circumstances.
> 
> I'm sure there are extremely good reasons for not using the latin lookalikes in the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use separate codes for the lookalikes. It's not restricted to Unicode.

Yeah, lowercase Cyrillic П is п, which looks like lowercase Greek π in
some fonts, but in cursive form it looks more like Latin lowercase n.
It wouldn't make sense to encode Cyrillic п the same as Greek π or Latin
lowercase n just by appearance, since logically it stands as its own
character despite its various appearances.  But it wouldn't make sense
to encode it differently just because you're using a different font!
Similarly, lowercase Cyrillic т in some cursive fonts looks like
lowercase Latin m.  I don't think it would make sense to encode
lowercase Т as Latin m just because of that.

Eventually you have no choice but to encode by logical meaning rather than by appearance, since there are many lookalikes between different languages that actually mean something completely different, and often behave completely differently.

T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
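
The lookalike point can be checked directly. The following is a minimal Python sketch, not something from the thread: Python is used only for illustration, and `describe` and `LOOKALIKES` are hypothetical names. It uses the standard `unicodedata` module to show that the Latin letters and their Cyrillic lookalikes mentioned in the posts are separate code points with their own lowercase mappings.

```python
import unicodedata


def describe(ch):
    """Return 'U+XXXX NAME' for one character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Latin letters paired with their Cyrillic lookalikes. The Cyrillic letters
# are written as escapes so the source cannot silently mix the two up.
LOOKALIKES = [
    ("A", "\u0410"),  # Cyrillic А: looks and behaves like Latin A
    ("P", "\u0420"),  # Cyrillic Р: the equivalent of Latin R
    ("B", "\u0412"),  # Cyrillic В: its lowercase is в, not b
]

for latin, cyrillic in LOOKALIKES:
    print(describe(latin), "|", describe(cyrillic),
          "| lowercase:", latin.lower(), "vs", cyrillic.lower())
```

Encoding by appearance would force each pair onto one code point, and the lowercase mapping for the B/В pair would then have two competing answers.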

On 6/3/2016 3:10 AM, Vladimir Panteleev wrote:
> I don't think it would work (or at least, the analogy doesn't hold). It would
> mean that you can't add new precomposited characters, because that means that
> previously valid sequences are now invalid.

So don't add new precomposited characters when a recognized existing sequence exists.

On 6/3/2016 3:14 AM, Vladimir Panteleev wrote:
> That's not right either. Cyrillic letters can look slightly different from their
> latin lookalikes in some circumstances.
>
> I'm sure there are extremely good reasons for not using the latin lookalikes in
> the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use
> separate codes for the lookalikes. It's not restricted to Unicode.

How did people ever get by with printed books and documents?

On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:
> Eventually you have no choice but to encode by logical meaning rather
> than by appearance, since there are many lookalikes between different
> languages that actually mean something completely different, and often
> behave completely differently.

It's almost as if printed documents and books have never existed!

On 03.06.2016 20:41, Walter Bright wrote:
> On 6/3/2016 3:14 AM, Vladimir Panteleev wrote:
>> That's not right either. Cyrillic letters can look slightly different
>> from their latin lookalikes in some circumstances.
>>
>> I'm sure there are extremely good reasons for not using the latin
>> lookalikes in the Cyrillic alphabets, because most (all?) 8-bit
>> Cyrillic encodings use separate codes for the lookalikes. It's not
>> restricted to Unicode.
>
> How did people ever get by with printed books and documents?

They can disambiguate the letters based on context well enough.

On Friday, 3 June 2016 at 18:41:36 UTC, Walter Bright wrote:
> How did people ever get by with printed books and documents?

Printed books pick one font and one layout, and are then read by people. They don't have to be represented in some format where end users can change the font, size, etc.

On 02-Jun-2016 23:27, Walter Bright wrote:
> On 6/2/2016 12:34 PM, deadalnix wrote:
>> On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
>>> Pretty much everything. Consider s and s1 string variables with possibly
>>> different encodings (UTF8/UTF16).
>>>
>>> * s.all!(c => c == 'ö') works only with autodecoding. It returns
>>> always false without.
>>>
>>
>> False. Many characters can be represented by different sequences of
>> codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier
>> followed by e. ö is one such character.
>
> There are 3 levels of Unicode support. What Andrei is talking about is
> Level 1.
>
> http://unicode.org/reports/tr18/tr18-5.1.html
>
> I wonder what rationale there is for Unicode to have two different
> sequences of codepoints be treated as the same. It's madness.

Yeah, Unicode was not meant to be easy it seems. Or this is whatever happens with evolutionary design that started with "everything is a 16-bit character".

-- 
Dmitry Olshansky
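
The disagreement over `s.all!(c => c == 'ö')` can be reproduced in a few lines. This is a minimal sketch in Python rather than D, chosen only for illustration. It assumes that iterating a Python `str` by code point is a fair stand-in for what autodecoding does to a D `string`, and the function names are hypothetical. In the encoded sequence the combining mark follows its base letter.

```python
import unicodedata

PRECOMPOSED = "\u00f6"  # ö as the single code point U+00F6
DECOMPOSED = "o\u0308"  # o followed by U+0308 COMBINING DIAERESIS


def all_code_points_equal(s, ch):
    """Compare code point by code point, roughly what
    s.all!(c => c == ch) does on an autodecoded D string."""
    return all(c == ch for c in s)


def all_equal_after_nfc(s, ch):
    """Same comparison, but after normalizing s to NFC first."""
    return all_code_points_equal(unicodedata.normalize("NFC", s), ch)


print(all_code_points_equal(PRECOMPOSED, "\u00f6"))  # True
print(all_code_points_equal(DECOMPOSED, "\u00f6"))   # False: two code points
print(all_equal_after_nfc(DECOMPOSED, "\u00f6"))     # True once normalized
```

Under these assumptions, decoding to code points is not enough to make the comparison reliable. The two spellings of ö only compare equal after normalization.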

On 6/3/2016 11:54 AM, Timon Gehr wrote:
> On 03.06.2016 20:41, Walter Bright wrote:
>> How did people ever get by with printed books and documents?
> They can disambiguate the letters based on context well enough.

Characters do not have semantic meaning. Their meaning is always inferred from the context. Unicode's troubles started the moment they stepped beyond their charter.

On Fri, Jun 03, 2016 at 11:43:07AM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:
> > Eventually you have no choice but to encode by logical meaning rather than by appearance, since there are many lookalikes between different languages that actually mean something completely different, and often behave completely differently.
>
> It's almost as if printed documents and books have never existed!

But if we were to encode appearance instead of logical meaning, that would mean the *same* lowercase Cyrillic ь would have multiple, different encodings depending on which font was in use. That doesn't seem like the right solution either. Do we really want Unicode strings to encode font information too??

'Cos by that argument, serif and sans serif letters should have different encodings, because in languages like Hebrew, a tiny little serif could mean the difference between two completely different letters.

And what of the Arabic and Indic scripts? They would need to encode the same letter multiple times, each being a variation of the physical form that changes depending on the surrounding context. Even the Greek sigma has two forms depending on whether it's at the end of a word or not -- so should it be two code points or one? If you say two, then you'd have a problem with how to search for sigma in Greek text, and you'd have to search for either medial sigma or final sigma. But if you say one, then you'd have a problem with having two different letterforms for a single codepoint.

Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not.

And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows). That sounds far worse than what we have today.

T

-- 
Let's eat some disquits while we format the biskettes.
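
The sigma and Turkish "i" cases can be sketched the same way. This is an illustrative Python sketch, not anything proposed in the thread. `contains_sigma` and `upper_turkish` are hypothetical helpers, and `upper_turkish` covers only the dotted/dotless i rule, not Turkish casing in general. Python's built-in `str.upper()` takes no locale, so the Turkish rule has to be supplied by the caller.

```python
SMALL_SIGMA = "\u03c3"  # σ, used inside a word
FINAL_SIGMA = "\u03c2"  # ς, used at the end of a word


def contains_sigma(text):
    """Find either sigma form: case folding maps ς to σ."""
    return SMALL_SIGMA in text.casefold()


def upper_turkish(text):
    """Hypothetical helper applying the Turkish i rule before uppercasing.

    i becomes İ (U+0130); ı (U+0131) already uppercases to I.
    """
    return text.replace("i", "\u0130").upper()


# Two code points for sigma, one shared uppercase form.
print(SMALL_SIGMA.upper() == FINAL_SIGMA.upper())  # True
# A search for sigma has to account for both lowercase forms.
print(contains_sigma("\u03bf\u03b4\u03bf\u03c2"))  # True (οδος ends in ς)
# The default mapping is the non-Turkish one.
print("i".upper(), upper_turkish("istanbul"))
# Separate code points keep the m/т question unambiguous.
print("m".upper(), "\u0442".upper())
```

Because Latin m and Cyrillic т are distinct code points, each has exactly one uppercase answer. The Turkish case has no such fix at the encoding level and needs language information from outside the string.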
