January 14, 2011
"Nick Sabalausky" <a@a.a> wrote in message news:igori7$1ovh$1@digitalmars.com...
> "Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message news:igoqrm$1n5r$1@digitalmars.com...
>> On 1/13/11 10:26 PM, Nick Sabalausky wrote:
>> [snip]
>>> [ 'f', {u with the umlaut}, 'n', 'f' ]
>>>
>>> Or:
>>>
>>> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
>>>
>>> Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes.
>>>
>>> Note that while some characters exist in pre-combined form (such as the
>>> {u
>>> with the umlaut} above), legend has it there are others that can only be
>>> represented using a combining character.
>>>
>>> It's also my understanding, though I'm not certain, that sometimes
>>> multiple
>>> combining characters can be used together on the same "root" character.
>>
>> Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?
>>
>
> My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain.
>
> FWIW, the Wikipedia article might help, or at least link to other things that might help: http://en.wikipedia.org/wiki/Combining_character
>
> Michel or spir might have better links though.
>

Heh, as if that wasn't bad enough, there are also digraphs which, from what I can tell, seem to be single code points that represent more than one glyph/character/grapheme:

http://en.wikipedia.org/wiki/Digraph_(orthography)#Digraphs_in_Unicode

This page may be helpful too: http://en.wikipedia.org/wiki/Precomposed_character



January 14, 2011
Am 14.01.2011 08:00, schrieb Nick Sabalausky:
> "Nick Sabalausky"<a@a.a>  wrote in message
> news:igori7$1ovh$1@digitalmars.com...
>> "Andrei Alexandrescu"<SeeWebsiteForEmail@erdani.org>  wrote in message
>> news:igoqrm$1n5r$1@digitalmars.com...
>>> On 1/13/11 10:26 PM, Nick Sabalausky wrote:
>>> [snip]
>>>> [ 'f', {u with the umlaut}, 'n', 'f' ]
>>>>
>>>> Or:
>>>>
>>>> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
>>>>
>>>> Those *both* get rendered exactly the same, and both represent the same
>>>> four-letter sequence. In the second example, the 'u' and the {umlaut
>>>> combining character} combine to form one grapheme. The f's and n's just
>>>> happen to be single-code-point graphemes.
>>>>
>>>> Note that while some characters exist in pre-combined form (such as the
>>>> {u
>>>> with the umlaut} above), legend has it there are others that can only be
>>>> represented using a combining character.
>>>>
>>>> It's also my understanding, though I'm not certain, that sometimes
>>>> multiple
>>>> combining characters can be used together on the same "root" character.
>>>
>>> Thanks. One further question is: in the above example with u-with-umlaut,
>>> there is one code point that corresponds to the entire combination. Are
>>> there combinations that do not have a unique code point?
>>>
>>
>> My understanding is "yes". At least that's what I've heard, and I've never
>> heard any claims of "no". I don't know of any specific ones offhand,
>> though. Actually, it might be possible to use any combining character with
>> any old letter or number (like maybe a 7 with an umlaut), though I'm not
>> certain.
>>
>> FWIW, the Wikipedia article might help, or at least link to other things
>> that might help: http://en.wikipedia.org/wiki/Combining_character
>>
>> Michel or spir might have better links though.
>>
>
> Heh, as if that wasn't bad enough, there are also digraphs which, from what I
> can tell, seem to be single code-points that represent more than one
> glyph/character/grapheme:
>
> http://en.wikipedia.org/wiki/Digraph_(orthography)#Digraphs_in_Unicode
>
> This page may be helpful too:
> http://en.wikipedia.org/wiki/Precomposed_character
>

OMG, this is really fucked up.
Can't we just go back to 8bit charsets like ISO 8859-* etc? :/
January 14, 2011
On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <a@a.a> wrote:

> "Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message
> news:igoqrm$1n5r$1@digitalmars.com...
>> On 1/13/11 10:26 PM, Nick Sabalausky wrote:
>> [snip]
>>> [ 'f', {u with the umlaut}, 'n', 'f' ]
>>>
>>> Or:
>>>
>>> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
>>>
>>> Those *both* get rendered exactly the same, and both represent the same
>>> four-letter sequence. In the second example, the 'u' and the {umlaut
>>> combining character} combine to form one grapheme. The f's and n's just
>>> happen to be single-code-point graphemes.
>>>
>>> Note that while some characters exist in pre-combined form (such as the
>>> {u
>>> with the umlaut} above), legend has it there are others that can only be
>>> represented using a combining character.
>>>
>>> It's also my understanding, though I'm not certain, that sometimes
>>> multiple
>>> combining characters can be used together on the same "root" character.
>>
>> Thanks. One further question is: in the above example with u-with-umlaut,
>> there is one code point that corresponds to the entire combination. Are
>> there combinations that do not have a unique code point?
>>
>
> My understanding is "yes". At least that's what I've heard, and I've never
> heard any claims of "no". I don't know of any specific ones offhand, though.
> Actually, it might be possible to use any combining character with any old
> letter or number (like maybe a 7 with an umlaut), though I'm not certain.
>
> FWIW, the Wikipedia article might help, or at least link to other things
> that might help: http://en.wikipedia.org/wiki/Combining_character

http://en.wikipedia.org/wiki/Unicode_normalization

Linked from that page, the normalization process is probably something we need to look at.  Using decomposed canonical form would mean we need more state than just which code unit we are on, plus it creates more likelihood that a match will be found with part of a grapheme (spir or Michel brought it up earlier).  So I think the correct choice is to use composed canonical form.  This is after just reading that page, so maybe I'm missing something.

Non-composable combinations would be a problem.  The string range is formed on the basis that the element type is a dchar.  If there are combinations that cannot be composed into a single dchar, then the element type has to be a dchar array (or some other type which contains all the info).  The other option is to simply leave them decomposed.  Then you risk things like partial matches.

I'm leaning towards a solution like this: While iterating a string, it should output dchars in normalized composed form.  But a specialized comparison function should be used when doing things like searches or regex, because it might not be possible to compose two combining characters.

The drawback to this is that a dchar might not be able to represent a grapheme (only if it cannot be composed), but I think it's too much of a hit in complexity and performance to make the element type of a string larger than a dchar.
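
A minimal sketch of what "normalized composed form" buys and where it stops helping, assuming a Phobos whose std.uni provides `normalize` with the `NFC` form. The helper names `canonicallyEqual` and `composedLength` are made up for illustration and are not part of any library:

```d
import std.range : walkLength;
import std.uni;

/// True if the two strings are canonically equivalent (compared after NFC).
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

/// Number of code points left after composing as far as Unicode allows.
size_t composedLength(string s)
{
    return normalize!NFC(s).walkLength;
}

void main()
{
    import std.stdio : writeln;

    string precomposed = "f\u00FCnf";  // U+00FC: u with diaeresis, one code point
    string decomposed  = "fu\u0308nf"; // 'u' followed by U+0308 combining diaeresis

    writeln(precomposed == decomposed);                 // false: raw code units differ
    writeln(canonicallyEqual(precomposed, decomposed)); // true: same text after NFC
    writeln(composedLength("t\u0303"));                 // 2: no precomposed t-with-tilde exists
}
```

The last line is the non-composable case: even in composed form the grapheme still spans two dchars, so a plain dchar comparison can still match only part of it.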

Those who wish to work with a more comprehensive string type can use a more complex string type such as the one created by spir.

Does that sound reasonable?

-Steve
January 14, 2011
Am 14.01.2011 07:26, schrieb Nick Sabalausky:
> "Andrei Alexandrescu"<SeeWebsiteForEmail@erdani.org>  wrote in message
> news:igoj6s$17r6$1@digitalmars.com...
>>
>> I'm not so sure about that. What do you base this assessment on? Denis
>> wrote a library that according to him does grapheme-related stuff nobody
>> else does. So apparently graphemes is not what people care about (although
>> it might be what they should care about).
>>
>
> It's what they want, they just don't know it.
>
> Graphemes are what many people *think* code points are.
>

Agreed. Up until spir mentioned graphemes in this newsgroup, I always thought that one Unicode code point == one character on the screen.

I guess in the majority of use cases you want to operate on user-perceived characters.
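
To make the difference concrete, here is a small sketch that counts the same text at all three levels. It assumes a Phobos whose std.uni offers `byGrapheme`; the `Counts` struct and `countLevels` function are hypothetical names used only for this example:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Sizes of one string at the three levels of abstraction.
struct Counts
{
    size_t codeUnits;  // UTF-8 bytes
    size_t codePoints; // dchars
    size_t graphemes;  // user-perceived characters
}

Counts countLevels(string s)
{
    return Counts(s.length, s.walkLength, s.byGrapheme.walkLength);
}

void main()
{
    import std.stdio : writeln;

    writeln(countLevels("f\u00FCnf"));  // Counts(5, 4, 4)
    writeln(countLevels("fu\u0308nf")); // Counts(6, 5, 4)
}
```

Only the grapheme count is the same for both spellings of the word, which is the number a user would give if asked how many characters it has.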
January 14, 2011
On Friday 14 January 2011 04:47:59 Steven Schveighoffer wrote:
> On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <a@a.a> wrote:
> > "Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message news:igoqrm$1n5r$1@digitalmars.com...
> > 
> >> On 1/13/11 10:26 PM, Nick Sabalausky wrote:
> >> [snip]
> >> 
> >>> [ 'f', {u with the umlaut}, 'n', 'f' ]
> >>> 
> >>> Or:
> >>> 
> >>> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
> >>> 
> >>> Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes.
> >>> 
> >>> Note that while some characters exist in pre-combined form (such as the
> >>> {u
> >>> with the umlaut} above), legend has it there are others that can only
> >>> be
> >>> represented using a combining character.
> >>> 
> >>> It's also my understanding, though I'm not certain, that sometimes
> >>> multiple
> >>> combining characters can be used together on the same "root" character.
> >> 
> >> Thanks. One further question is: in the above example with
> >> u-with-umlaut,
> >> there is one code point that corresponds to the entire combination. Are
> >> there combinations that do not have a unique code point?
> > 
> > My understanding is "yes". At least that's what I've heard, and I've
> > never
> > heard any claims of "no". I don't know of any specific ones offhand,
> > though.
> > Actually, it might be possible to use any combining character with any
> > old
> > letter or number (like maybe a 7 with an umlaut), though I'm not certain.
> > 
> > FWIW, the Wikipedia article might help, or at least link to other things that might help: http://en.wikipedia.org/wiki/Combining_character
> 
> http://en.wikipedia.org/wiki/Unicode_normalization
> 
> Linked from that page, the normalization process is probably something we need to look at.  Using decomposed canonical form would mean we need more state than just which code unit we are on, plus it creates more likelihood that a match will be found with part of a grapheme (spir or Michel brought it up earlier).  So I think the correct choice is to use composed canonical form.  This is after just reading that page, so maybe I'm missing something.
> 
> Non-composable combinations would be a problem.  The string range is formed on the basis that the element type is a dchar.  If there are combinations that cannot be composed into a single dchar, then the element type has to be a dchar array (or some other type which contains all the info).  The other option is to simply leave them decomposed.  Then you risk things like partial matches.
> 
> I'm leaning towards a solution like this: While iterating a string, it should output dchars in normalized composed form.  But a specialized comparison function should be used when doing things like searches or regex, because it might not be possible to compose two combining characters.
> 
> The drawback to this is that a dchar might not be able to represent a grapheme (only if it cannot be composed), but I think it's too much of a hit in complexity and performance to make the element type of a string larger than a dchar.

Well, there's plenty in std.string that already deals in strings rather than dchar, and for the most part, any case where you couldn't fit a grapheme in a dchar could be covered by using a string.
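
As a rough illustration of the "specialized comparison" idea, here is a sketch of a search that compares whole graphemes, each treated as a short string, instead of single dchars. It assumes a Phobos whose std.uni provides `normalize` and `byGrapheme`; `containsGrapheme` is a hypothetical helper, not an existing function:

```d
import std.algorithm.comparison : equal;
import std.algorithm.searching : canFind;
import std.uni;

/// True if `needle` (one user-perceived character) occurs in `haystack`
/// as a complete grapheme. Both sides are brought to NFC first so that
/// precomposed and decomposed spellings compare equal.
bool containsGrapheme(string haystack, string needle)
{
    auto target = normalize!NFC(needle);
    foreach (g; normalize!NFC(haystack).byGrapheme)
    {
        // g[] is the grapheme's code points; compare all of them at once.
        if (equal(g[], target))
            return true;
    }
    return false;
}

void main()
{
    import std.stdio : writeln;

    string word = "fu\u0308nf"; // decomposed spelling

    writeln(canFind(word, "u"));               // true: partial match on the base letter
    writeln(containsGrapheme(word, "u"));      // false: there is no bare 'u' grapheme
    writeln(containsGrapheme(word, "\u00FC")); // true: matches despite different spelling
}
```

The code-point search reports a 'u' that no reader would see in the word, which is the partial-match risk mentioned above.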

> Those who wish to work with a more comprehensive string type can use a more complex string type such as the one created by spir.
> 
> Does that sound reasonable?

We really should have something along those lines it seems. From what little _I_ know, the basic approach that you suggest seems like the correct one, but perhaps someone more knowledgeable will be able to come up with a reason why it's not a good idea. Certainly, I think that any solution that I'd come up with would be similar to what you're suggesting.

- Jonathan M Davis
January 14, 2011
On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:

>> That's forgetting that most of the time people care about graphemes
>> (user-perceived characters), not code points.
>
> I'm not so sure about that. What do you base this assessment on? Denis
> wrote a library that according to him does grapheme-related stuff nobody
> else does. So apparently graphemes is not what people care about
> (although it might be what they should care about).

I'm aware of that, and I have no definitive answer to the question. The issue *does* exist --as shown even by trivial examples such as Michel's below, not corner cases. The actual question is _not_ whether code or "grapheme" is the proper level of abstraction. To this, the answer is clear: codes are simply meaningless in 99% of cases. (All historic software deals with chars, conceptually, but they happen to be coded with single codes.)
(And what about Objective-C? Why did its designers even bother with that?).

The question is rather: why do we nearly all happily go on ignoring the issue? My present guess is a combination of factors:

* The issue is masked by the misleading use of "abstract character" in Unicode literature. "Abstract" is very correct, but they should have found a term other than "character", say "abstract scripting mark". Their deceptive terminological choice lets most programmers believe that code points encode characters, as in historic charsets.
(Even worse: some doc explicitly states that ICU's notion of character matches the programming notion of character.)
* UCS added precomposed codes for a bunch of characters, supposedly for backward compatibility with said charsets. (But where is the gain? We need to decode them anyway...) The consequence is, at the pedagogical level, very bad: most text-producing software (like editors) uses such precomposed codes when available for a given character. So programmers can happily go on believing in the code=character myth. (Note: the gain in space is ridiculous for western text.)
* Most characters that appear in western texts (at least "official" characters of natural languages) have precomposed forms.
* Programmers can very easily be unaware their code is incorrect: how do you even notice it in test output?

Thus, practically, programmers may (1) simply not know about the issue, (2) have code that really works in the typical use cases for their software, or (3) not notice that their code runs incorrectly.
There is also an intermediate situation between (2) & (3), similar to old problems with previous ASCII-only apps: they work wrongly when used in a non-English environment, but what can users do, concretely? Most often, they just have to cope with incorrectness, reinterpret outputs differently, and/or find workarounds by cheating with the interface.

The responsibility of designers of tools for programmers is, imo, important. We should first make the issue clear (very difficult, it's a ubiquitous myth to break down), and then propose services that run correctly in situations where said issue is relevant, here manipulation of universal text, even if they are not very efficient at the start.
On my side, and about D, I wish that most D programmers (1) are aware of the problem, (2) understand its whys & hows, and (3) know there is a correct solution. Then (4) whether they actually use it is their choice (and I don't care whether or not they do).

>>>> It also supports this:
>>>>
>>>> foreach(i, d; s)
>>>> {
>>>> writeln("The character in position ", i, " is ", d);
>>>> }
>>>>
>>>> where i is the index (might not be sequential)
>>>
>>> Well string supports that too, albeit with the nit that you need to
>>> specify dchar.
>>
>> Except it breaks with combining characters. For instance, take the
>> string "t̃", which is two code points -- 't' followed by combining tilde
>> (U+0303) -- and you'll get the following output:
>>
>> The character in position 0 is t
>> The character in position 1 is ̃
>>
>> (Note that the tilde becomes combined with the preceding space
>> character.)
>>
>> The conception of character that normal people have does not match the
>> notion of code points when combining characters enters the equation.
>
> This might be a good time to see whether we need to address graphemes
> systematically. Could you please post a few links that would educate me
> and others in the mysteries of combining characters?

Beware: the text is far too long. https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction
(the directory above contains the current rough implementation of Text, plus a bit of its brother package DUnicode)

> Thanks,
>
> Andrei

Denis
_________________
vita es estrany
spir.wikidot.com

January 14, 2011
On 01/14/2011 07:26 AM, Nick Sabalausky wrote:
> "Andrei Alexandrescu"<SeeWebsiteForEmail@erdani.org>  wrote in message
> news:igoj6s$17r6$1@digitalmars.com...
>>
>> I'm not so sure about that. What do you base this assessment on? Denis
>> wrote a library that according to him does grapheme-related stuff nobody
>> else does. So apparently graphemes is not what people care about (although
>> it might be what they should care about).
>>
>
> It's what they want, they just don't know it.
>
> Graphemes are what many people *think* code points are.
>
>>
>> This might be a good time to see whether we need to address graphemes
>> systematically. Could you please post a few links that would educate me
>> and others in the mysteries of combining characters?
>>
>
> Maybe someone else has a link to an explanation (I don't), but it's
> basically just this:

If anyone finds a pointer to such an explanation, bravo, and thank you. (You will certainly not find it in Unicode literature, for instance.)
Nick's explanation below is good and concise. (Just 2 notes added.)

> Three levels of abstraction from lowest to highest:
> - Code Unit (ie, encoding)
> - Code Point (ie, what Unicode assigns distinct numbers to)
> - Grapheme (ie, what we think of as a "character")
>
> A code-point can be made up of one or more code-units. Likewise, a grapheme
> can be made up of one or more code-points.
>
> There are (at least) two types of code points:
>
> - Regular ones, such as letters, digits, and punctuation.
>
> - "Combining Characters", such as accent marks (or if you're familiar with
> Japanese, the little things in the upper-right corner that change an "s" to
> a "z" or an "h" to a "p". Or like German's umlaut - the two dots above a
> vowel). Ie, things that are not characters in their own right, but merely
> modify other characters. These can often (always?) be thought of as being
> like overlays.

You can also say there are 2 kinds of characters: simple like "u" & composite "ü" or "ṵ̈̈". The former are coded with a single (base) code, the latter with one (rarely more) base codes and an arbitrary number of combining codes.

For a majority of _common_ characters made of 2 or 3 codes (western language letters, Korean Hangul syllables,...), precombined codes have been added to the set. Thus, they can be coded with a single code like simple characters.

[Also note, to keep things from being too simple ;-), some (few) combining codes called "prepend" come _before_ the base in the raw code sequence...]

> If a code point representing a "combining character" exists in a string,
> then instead of being displayed as a character it merely modifies whatever
> code-point came before it.
>
> So, for instance, if you want to store the German word for five (in all
> lower-case), there are two ways to do it:
>
> [ 'f', {u with the umlaut}, 'n', 'f' ]
>
> Or:
>
> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

Note: the second form is the base form for Unicode. There are reasons for having chosen it (see my text), and reasons why UCS does not and simply cannot propose precomposed codes for all possible composite characters.

> Those *both* get rendered exactly the same, and both represent the same
> four-letter sequence. In the second example, the 'u' and the {umlaut
> combining character} combine to form one grapheme. The f's and n's just
> happen to be single-code-point graphemes.
>
> Note that while some characters exist in pre-combined form (such as the {u
> with the umlaut} above), legend has it there are others that can only be
> represented using a combining character.
>
> It's also my understanding, though I'm not certain, that sometimes multiple
> combining characters can be used together on the same "root" character.

There is no logical limit, only practical ones, such as how to display 3 diacritics above the same base. You can invent a script for a mythical folk's language if you like :-)
Also, some examples of real language characters (Hebrew, IIRC) in Unicode test data sets hold up to 8 codes.
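
A short sketch of stacking several combining codes on one base, assuming a Phobos whose std.uni offers `byGrapheme`. The helper `stackMarks` is a made-up name, and the particular marks are arbitrary picks for illustration:

```d
import std.conv : to;
import std.range : walkLength;
import std.uni : byGrapheme;

/// Builds one composite character: a base code followed by combining codes.
string stackMarks(dchar base, const(dchar)[] marks)
{
    return to!string(base ~ marks);
}

void main()
{
    import std.stdio : writeln;

    // 'e' + combining acute (U+0301) + dot below (U+0323) + diaeresis (U+0308)
    auto s = stackMarks('e', "\u0301\u0323\u0308"d);

    writeln(s.walkLength);            // 4 code points
    writeln(s.byGrapheme.walkLength); // 1 grapheme
}
```

However many marks are appended, the segmentation still yields a single grapheme; whether a font can draw it is a separate matter.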

> Caveat: There may very well be further complications that I'm not aware of.
> Heck, knowing Unicode, there probably are.

Denis
_________________
vita es estrany
spir.wikidot.com

January 14, 2011
On Fri, 14 Jan 2011 08:14:02 -0500, spir <denis.spir@gmail.com> wrote:

> On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:
>
>>> That's forgetting that most of the time people care about graphemes
>>> (user-perceived characters), not code points.
>>
>> I'm not so sure about that. What do you base this assessment on? Denis
>> wrote a library that according to him does grapheme-related stuff nobody
>> else does. So apparently graphemes is not what people care about
>> (although it might be what they should care about).
>
> I'm aware of that, and I have no definitive answer to the question. The issue *does* exist --as shown even by trivial examples such as Michel's below, not corner cases. The actual question is _not_ whether code or "grapheme" is the proper level of abstraction. To this, the answer is clear: codes are simply meaningless in 99% of cases. (All historic software deals with chars, conceptually, but they happen to be coded with single codes.)
> (And what about Objective-C? Why did its designers even bother with that?).
>
> The question is rather: why do we nearly all happily go on ignoring the issue? My present guess is a combination of factors:
>
> * The issue is masked by the misleading use of "abstract character" in Unicode literature. "Abstract" is very correct, but they should have found a term other than "character", say "abstract scripting mark". Their deceptive terminological choice lets most programmers believe that code points encode characters, as in historic charsets.
> (Even worse: some doc explicitly states that ICU's notion of character matches the programming notion of character.)
> * UCS added precomposed codes for a bunch of characters, supposedly for backward compatibility with said charsets. (But where is the gain? We need to decode them anyway...) The consequence is, at the pedagogical level, very bad: most text-producing software (like editors) uses such precomposed codes when available for a given character. So programmers can happily go on believing in the code=character myth. (Note: the gain in space is ridiculous for western text.)
> * Most characters that appear in western texts (at least "official" characters of natural languages) have precomposed forms.
> * Programmers can very easily be unaware their code is incorrect: how do you even notice it in test output?

* I don't even know how to make a grapheme that is more than one code-unit, let alone more than one code-point :)  Every time I try, I get 'invalid utf sequence'.

I feel significantly ignorant on this issue, and I'm slowly getting enough knowledge to join the discussion, but being a dumb American who only speaks English, I have a hard time grasping how this shit all works.

-Steve
January 14, 2011
On 01/14/2011 07:33 AM, Andrei Alexandrescu wrote:
> Thanks. One further question is: in the above example with
> u-with-umlaut, there is one code point that corresponds to the entire
> combination. Are there combinations that do not have a unique code point?

See my previous follow-up to Nick's explanation. But the answer is yes, not only for usual characters, but due to the fact that a user is, theoretically and practically, totally free to combine base and combining codes --even to invent characters. The only limit is that fonts will not know how to display improbable combinations.
(See also my presentation text, which shows an example of dots below and above Greek letters.)
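
A minimal way to check whether a given base and combining code have a precomposed code point, assuming a Phobos whose std.uni provides `compose` (which returns `dchar.init` when the pair does not compose). The wrapper name `hasPrecomposedForm` is hypothetical:

```d
import std.uni : compose;

/// True if Unicode defines a single code point for base + combining mark.
bool hasPrecomposedForm(dchar base, dchar mark)
{
    return compose(base, mark) != dchar.init;
}

void main()
{
    import std.stdio : writeln;

    writeln(hasPrecomposedForm('u', '\u0308')); // true:  U+00FC exists
    writeln(hasPrecomposedForm('t', '\u0303')); // false: t + combining tilde
    writeln(hasPrecomposedForm('7', '\u0308')); // false: 7 + combining diaeresis
}
```

The two `false` results are perfectly valid text; they simply have no single-code spelling.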

Denis
_________________
vita es estrany
spir.wikidot.com

January 14, 2011
On 01/14/2011 07:44 AM, Nick Sabalausky wrote:
> "Andrei Alexandrescu"<SeeWebsiteForEmail@erdani.org>  wrote in message
> news:igoqrm$1n5r$1@digitalmars.com...
>> On 1/13/11 10:26 PM, Nick Sabalausky wrote:
>> [snip]
>>> [ 'f', {u with the umlaut}, 'n', 'f' ]
>>>
>>> Or:
>>>
>>> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
>>>
>>> Those *both* get rendered exactly the same, and both represent the same
>>> four-letter sequence. In the second example, the 'u' and the {umlaut
>>> combining character} combine to form one grapheme. The f's and n's just
>>> happen to be single-code-point graphemes.
>>>
>>> Note that while some characters exist in pre-combined form (such as the
>>> {u
>>> with the umlaut} above), legend has it there are others that can only be
>>> represented using a combining character.
>>>
>>> It's also my understanding, though I'm not certain, that sometimes
>>> multiple
>>> combining characters can be used together on the same "root" character.
>>
>> Thanks. One further question is: in the above example with u-with-umlaut,
>> there is one code point that corresponds to the entire combination. Are
>> there combinations that do not have a unique code point?
>>
>
> My understanding is "yes". At least that's what I've heard, and I've never
> heard any claims of "no". I don't know of any specific ones offhand, though.
> Actually, it might be possible to use any combining character with any old
> letter or number (like maybe a 7 with an umlaut), though I'm not certain.

The problem is then whether a font knows how to display it. My usual fonts (DejaVu series, pretty good with Unicode) show:
	7̈
meaning they do not know how to combine digits with diacritics (they do it well with other, rather strange combinations).

But: one of the relevant advantages of decomposed forms is that when fonts don't know the character, they can still show at least the component marks, here '7' & '¨'. Which is better than nothing for a user who knows the scripting system. If I try to display, for instance, a _precomposed_ syllable from a language my font does not know, I will get instead either a little square with the code point written inside in minuscule digits, or a placeholder like an inverse-video "?".


denis
_________________
vita es estrany
spir.wikidot.com