The Case Against Autodecode (page 8)

On 05/27/2016 09:30 PM, Andrei Alexandrescu wrote:
> It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei

I think so, yeah. Due to combining characters, code points are similar to code units: a Unicode thing that you need to know about when working below the human-perceived character (grapheme) level.

On 27-May-2016 21:11, Andrei Alexandrescu wrote:
> On 5/27/16 10:15 AM, Chris wrote:
>> It has happened to me that characters like "é" return length == 2
>
> Would normalization make length 1? -- Andrei

No, this is not the point of normalization.

-- Dmitry Olshansky
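
As a rough illustration of why "é" can report different lengths (a Python sketch added for clarity, not code from the thread; `counts` is a hypothetical helper, and whether Chris's 2 came from code units or from a decomposed sequence is not stated):

```python
import unicodedata

precomposed = "\u00e9"  # "é" as the single code point U+00E9
decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT


def counts(s):
    """Hypothetical helper: (UTF-8 code units, UTF-16 code units, code points)."""
    return (len(s.encode("utf-8")), len(s.encode("utf-16-le")) // 2, len(s))


print(counts(precomposed))  # (2, 1, 1)
print(counts(decomposed))   # (3, 2, 2)

# NFC and NFD convert between the two spellings of the same character.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Even after NFC, the UTF-8 length of "é" is still 2, so normalization does not by itself make a code-unit length match the number of perceived characters.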

May 27, 2016

Re: The Case Against Autodecode

Posted by H. S. Teoh
in reply to Andrei Alexandrescu

On Fri, May 27, 2016 at 02:42:27PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 5/27/16 12:40 PM, H. S. Teoh via Digitalmars-d wrote:
> > Exactly. And we just keep getting stuck on this point. It seems that the message just isn't getting through. The unfounded assumption continues to be made that iterating by code point is somehow "correct" by definition and nobody can challenge it.
> 
> Which languages are covered by code points, and which languages require graphemes consisting of multiple code points? How does normalization play into this? -- Andrei

This is a complicated issue; for a full explanation you'll probably want to peruse the Unicode codices. For example:

	http://www.unicode.org/faq/char_combmark.html

But in brief, it's mostly a number of common European languages that have a 1-to-1 code point to character mapping, as well as Chinese writing. Outside of this narrow set, you're on shaky ground. Examples (that I can think of; there are many others):

- Almost all Korean characters are composed of multiple code points.

- The Indic languages (which cover quite a good number of Unicode code
  pages) have ligatures that require multiple code points.

- The Thai block contains a series of combining diacritics for vowels
  and tones.

- Hebrew vowel points require multiple code points.

- A good number of native American scripts require combining marks,
  e.g., Navajo.

- International Phonetic Alphabet (primarily only for linguistic uses,
  but could be widespread because it's relevant everywhere language is
  spoken).

- Classical Greek accents (though this is less common, mostly being used
  only in academic circles).

Even within the realm of European languages and languages that use some version of the Latin script, there is an entire block of code points in Unicode (the U+0300 block) dedicated to combining diacritics. A good number of combinations do not have precomposed characters.

Now as far as normalization is concerned, it only helps if a particular combination of diacritics on a base glyph has a precomposed form. A large number of the above languages do not have precomposed characters simply because of the sheer number of combinations. The only reason the CJK block actually includes a huge number of precomposed characters is that the rules for combining the base forms are too complex to encode compositionally. Otherwise, most languages with combining diacritics would not have precomposed characters assigned to their respective blocks. In fact, a good number (all?) of precomposed Latin characters were included in Unicode only because they existed in pre-Unicode days and some form of compatibility was desired back when Unicode was still not yet widely adopted.
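
To make the normalization point concrete, here is a small Python sketch (an illustration only, not part of the original post; `nfc_len` is a hypothetical helper and the sample strings are my own choices). NFC shrinks a sequence to one code point only when Unicode happens to define a precomposed character for it:

```python
import unicodedata


def nfc_len(s):
    """Hypothetical helper: number of code points after NFC normalization."""
    return len(unicodedata.normalize("NFC", s))


e_acute = "e\u0301"               # precomposed form exists (U+00E9)
q_tilde = "q\u0303"               # no precomposed "q with tilde" in Unicode
a_ogonek_acute = "a\u0328\u0301"  # Navajo-style vowel: only a + ogonek composes

print(len(e_acute), nfc_len(e_acute))                # 2 1
print(len(q_tilde), nfc_len(q_tilde))                # 2 2
print(len(a_ogonek_acute), nfc_len(a_ogonek_acute))  # 3 2
```

In the last case NFC yields U+0105 followed by U+0301, so one perceived character still spans two code points.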

So basically, besides a small number of languages, the idea of 1 code point == 1 character is pretty unworkable. Especially in this day and age of worldwide connectivity.

T

-- 
The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!

On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 5/27/16 3:10 PM, ag0aep6g wrote:
> > I don't think there is value in distinguishing by language. The point of Unicode is that you shouldn't need to do that.
>
> It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei

That's what we've been trying to say all along! :-P They're a kind of low-level Unicode construct used for building "real" characters, i.e., what a layperson would consider to be a "character".

T

-- 
English is useful because it is a mess. Since English is a mess, it maps well onto the problem space, which is also a mess, which we call reality. Similarly, Perl was designed to be a mess, though in the nicest of all possible ways. -- Larry Wall
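
A deliberately simplified Python sketch of how code points build up "real" characters (not from the thread; `approx_clusters` is a hypothetical helper). It only attaches mark code points to the preceding base, which is far short of the full Unicode grapheme cluster rules (it ignores Hangul jamo, joiners, and more), so treat it as an approximation:

```python
import unicodedata


def approx_clusters(s):
    """Hypothetical helper: group each base code point with the marks
    (general category Mn, Mc or Me) that follow it."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # combining mark: extend the current cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters


text = "re\u0301sume\u0301"        # "résumé" spelled with combining accents
print(len(text))                   # 8
print(len(approx_clusters(text)))  # 6
```

The code point count (8) and the approximate character count (6) disagree, which is the gap the posters are describing.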

On Friday, 27 May 2016 at 19:30:53 UTC, Andrei Alexandrescu wrote:
> It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei

It might help to think of code points as being a kind of byte code for a text-representing VM.

It's not meaningless, but it also isn't trivial and relevant metrics can only be seen in application.

BTW you don't even have to get into unicode to hit complications. Tab, backspace, carriage return, these are part of ASCII but already complicate questions.

http://stackoverflow.com/questions/6792812/the-backspace-escape-character-b-in-c-unexpected-behavior

came up on a quick search. Does the backspace character reduce the length of a string? In some contexts, maybe.

On 5/27/16 3:30 PM, Andrei Alexandrescu wrote:
> On 5/27/16 3:10 PM, ag0aep6g wrote:
>> I don't think there is value in distinguishing by language. The point of Unicode is that you shouldn't need to do that.
>
> It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
>

The only unmistakably correct use I can think of is transcoding from one UTF representation to another. That is, in order to transcode from UTF8 to UTF16, I don't need to know anything about character composition.

-Steve
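
Steve's transcoding case can be sketched in Python (an illustrative addition, not code from the thread; `utf8_to_utf16_units` is a hypothetical name). The conversion works one code point at a time and never needs to know whether a code point is a base character or a combining mark:

```python
def utf8_to_utf16_units(data):
    """Hypothetical helper: transcode UTF-8 bytes to a list of UTF-16 code units."""
    units = []
    for ch in data.decode("utf-8"):  # a Python str iterates by code point
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)         # BMP code point: one code unit
        else:
            cp -= 0x10000            # supplementary plane: surrogate pair
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units


print([hex(u) for u in utf8_to_utf16_units("\U0001F600".encode("utf-8"))])
# ['0xd83d', '0xde00']
```

A combining sequence such as "e" + U+0301 passes through as two independent code units, with no composition logic involved.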

On Fri, May 27, 2016 at 07:53:30PM +0000, Adam D. Ruppe via Digitalmars-d wrote:
> On Friday, 27 May 2016 at 19:30:53 UTC, Andrei Alexandrescu wrote:
> > It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
>
> It might help to think of code points as being a kind of byte code for a text-representing VM.
>
> It's not meaningless, but it also isn't trivial and relevant metrics can only be seen in application.
>
> BTW you don't even have to get into unicode to hit complications. Tab, backspace, carriage return, these are part of ASCII but already complicate questions.
>
> http://stackoverflow.com/questions/6792812/the-backspace-escape-character-b-in-c-unexpected-behavior
>
> came up on a quick search. Does the backspace character reduce the length of a string? In some contexts, maybe.

Fun fact: on some old Unix boxen, Backspace + underscore was interpreted to mean "underline the previous character". Probably inherited from the old typewriter days. Scarily enough, some Posix terminals may still interpret this sequence this way! An early precursor of Unicode combining diacritics, perhaps? :-D

T

-- 
Everybody talks about it, but nobody does anything about it! -- Mark Twain

On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
> On 27-May-2016 21:11, Andrei Alexandrescu wrote:
>> On 5/27/16 10:15 AM, Chris wrote:
>>> It has happened to me that characters like "é" return length == 2
>>
>> Would normalization make length 1? -- Andrei
>
> No, this is not the point of normalization.

What is? -- Andrei

On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
> On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
>> On 27-May-2016 21:11, Andrei Alexandrescu wrote:
>>> On 5/27/16 10:15 AM, Chris wrote:
>>>> It has happened to me that characters like "é" return length == 2
>>>
>>> Would normalization make length 1? -- Andrei
>>
>> No, this is not the point of normalization.
>
> What is? -- Andrei

This video will be helpful :)

https://www.youtube.com/watch?v=n0GK-9f4dl8

It talks about Unicode in C++, but also explains how Unicode works.
