January 14, 2011
On 01/14/2011 01:52 PM, Daniel Gibson wrote:
> Am 14.01.2011 07:26, schrieb Nick Sabalausky:
>> "Andrei Alexandrescu"<SeeWebsiteForEmail@erdani.org> wrote in message
>> news:igoj6s$17r6$1@digitalmars.com...
>>>
>>> I'm not so sure about that. What do you base this assessment on? Denis
>>> wrote a library that according to him does grapheme-related stuff nobody
>>> else does. So apparently graphemes is not what people care about
>>> (although
>>> it might be what they should care about).
>>>
>>
>> It's what they want, they just don't know it.
>>
>> Graphemes are what many people *think* code points are.
>>
>
> Agreed. Up until spir mentioned graphemes in this newsgroup I always
> thought that one Unicode code point == one character on the screen.
>
> I guess in the majority of use cases you want to operate on user
> perceived characters.

That's what makes sense for the user in 99.9% of cases, thus that's what makes sense for the programmer, thus that's what makes sense for the language/type/lib designer.

denis
_________________
vita es estrany
spir.wikidot.com

January 14, 2011
On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:
>
> * I don't even know how to make a grapheme that is more than one
> code-unit, let alone more than one code-point :)  Every time I try, I
> get 'invalid utf sequence'.
>
> I feel significantly ignorant on this issue, and I'm slowly getting
> enough knowledge to join the discussion, but being a dumb American who
> only speaks English, I have a hard time grasping how this shit all works.

1. See my text at https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction

2.
    writeln ("A\u0308\u0330");
<A + umlaut above + tilde below>
If it does not display properly, either set your terminal to UTF* or use a more Unicode-aware font (e.g. the DejaVu series).
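
For anyone who would rather check this from code than by eye, here is a minimal sketch of the three levels being discussed (code units, code points, graphemes). It assumes a current Phobos, where std.uni.byGrapheme exists (it did not at the time of this thread); the helper names are made up for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helpers, one per level of abstraction.
size_t codeUnits(string s)  { return s.length; }                // UTF-8 code units
size_t codePoints(string s) { return s.walkLength; }            // decoded code points
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // user-perceived characters

unittest
{
    // 'A' + U+0308 COMBINING DIAERESIS + U+0330 COMBINING TILDE BELOW
    string s = "A\u0308\u0330";
    assert(codeUnits(s) == 5);   // 1 byte for 'A', 2 bytes for each mark
    assert(codePoints(s) == 3);
    assert(graphemes(s) == 1);
}
```

Something like `dmd -unittest -main -run file.d` should be enough to run the checks.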

The point is not to play with Unicode's flexibility like that. Rather, composite characters are just normal things in most languages of the world. On this point English is actually a rare exception (setting aside letters imported from foreign languages, like the French 'à'), to the point of being, I guess, the only Western language without any diacritics.


Denis
_________________
vita es estrany
spir.wikidot.com

January 14, 2011
On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir@gmail.com> wrote:

> On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:
>>
>> * I don't even know how to make a grapheme that is more than one
>> code-unit, let alone more than one code-point :)  Every time I try, I
>> get 'invalid utf sequence'.
>>
>> I feel significantly ignorant on this issue, and I'm slowly getting
>> enough knowledge to join the discussion, but being a dumb American who
>> only speaks English, I have a hard time grasping how this shit all works.
>
> 1. See my text at https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction

I can't read that document, it's black background with super-dark-grey text.

> 2.
>      writeln ("A\u0308\u0330");
> <A + umlaut above + tilde below>
> If it does not display properly, either set your terminal to UTF* or use a more Unicode-aware font (e.g. the DejaVu series).

OK, I'll have to remember this so I can use it to test my string type ;)

> The point is not to play with Unicode's flexibility like that. Rather, composite characters are just normal things in most languages of the world. On this point English is actually a rare exception (setting aside letters imported from foreign languages, like the French 'à'), to the point of being, I guess, the only Western language without any diacritics.

Is it common to have multiple modifiers on a single character?  The problem I see with using decomposed canonical form for strings is that we would have to return a dchar[] for each 'element', which severely complicates code that, for instance, only expects to handle English.

I was hoping to lazily transform a string into its composed canonical form, allowing the (hopefully rare) exception when a composed character does not exist.  My thinking was that this at least gives a useful string representation for 90% of usages, leaving the remaining 10% of usages to find a more complex representation (like your Text type).  If we only get like 20% or 30% there by making dchar the element type, then we haven't made it useful enough.

Either way, we need a string type that can be compared canonically for things like searches or opEquals.

-Steve

January 14, 2011
On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> On 1/13/11 7:09 PM, Michel Fortin wrote:
>> That's forgetting that most of the time people care about graphemes
>> (user-perceived characters), not code points.
> 
> I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).

Apple implemented all these things in the NSString class in Cocoa. They did all this work on Unicode at the beginning of Mac OS X, at a time when making such changes wouldn't break anything.

It's a hard thing to change later when you have code that depends on the old behaviour. It's a complicated matter and not many people understand the issues, so it's no wonder many languages just deal with code points.


> This might be a good time to see whether we need to address graphemes systematically. Could you please post a few links that would educate me and others in the mysteries of combining characters?

As usual, Wikipedia offers a good summary and a couple of references. Here's the part about combining characters: <http://en.wikipedia.org/wiki/Combining_character>.

There are basically four ranges of code points which are combining:
- Combining Diacritical Marks (0300–036F)
- Combining Diacritical Marks Supplement (1DC0–1DFF)
- Combining Diacritical Marks for Symbols (20D0–20FF)
- Combining Half Marks (FE20–FE2F)

A code point followed by one or more code points in these ranges is conceptually a single character (a grapheme).
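
As a rough illustration of that rule, here is a small sketch in D. It only encodes the four blocks listed above and treats every other code point as the start of a new grapheme, so it is a simplification and not the full segmentation algorithm from the Unicode standard; the function names are hypothetical.

```d
/// True if `c` lies in one of the four combining-mark blocks listed above.
bool inCombiningBlock(dchar c) pure nothrow @nogc @safe
{
    return (c >= 0x0300 && c <= 0x036F)   // Combining Diacritical Marks
        || (c >= 0x1DC0 && c <= 0x1DFF)   // ... Supplement
        || (c >= 0x20D0 && c <= 0x20FF)   // ... for Symbols
        || (c >= 0xFE20 && c <= 0xFE2F);  // Combining Half Marks
}

/// Counts graphemes under the simplified rule: each code point outside
/// the combining blocks starts a new grapheme.
size_t roughGraphemeCount(const(dchar)[] text) pure nothrow @nogc @safe
{
    size_t n;
    foreach (c; text)
        if (!inCombiningBlock(c))
            ++n;
    return n;
}

unittest
{
    assert(roughGraphemeCount("A\u0308\u0330"d) == 1); // base + two marks
    assert(roughGraphemeCount("expose\u0301"d) == 6);  // decomposed "exposé"
}
```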

But for comparing strings correctly, you need to determine canonical equivalence. Wikipedia describes it in its Unicode Normalization article: <http://en.wikipedia.org/wiki/Unicode_normalization>. The full algorithm specification can be found here: <http://unicode.org/reports/tr15/>. The canonical form has both a composed and a decomposed variant, the first using pre-combined characters whenever possible, the second not using any pre-combined characters. Combining marks are not the only concern: there are also a few single-code-point characters which have a duplicate somewhere else in the code point table.

Also, there are two normalizations: the canonical one (described above) and the compatibility one, which is more lax (making the ligature "ﬂ" equivalent to "fl", for instance). If a user searches for some text in a document, it's probably better to search using the compatibility normalization so that "ﬂower" (with ligature) and "flower" (without ligature) can match each other. If you want to search case-insensitively, then you'll need to implement the collation algorithm, but that's going further afield.
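
To make the difference between the two normalizations concrete, here is a short sketch. It assumes a current Phobos, whose std.uni.normalize takes the form (NFC, NFD, NFKC, NFKD) as a template argument; nothing like it was in the standard library when this was posted.

```d
import std.uni : normalize, NFC, NFD, NFKC;

unittest
{
    string precomposed = "\u00E9";  // é as a single code point
    string decomposed  = "e\u0301"; // e + COMBINING ACUTE ACCENT

    // A plain comparison sees two different code unit sequences...
    assert(precomposed != decomposed);
    // ...but both canonical forms agree on what the text is.
    assert(normalize!NFC(decomposed) == precomposed);
    assert(normalize!NFD(precomposed) == decomposed);

    // The "ﬂ" ligature (U+FB02) survives canonical normalization
    // and is only folded to "fl" by the compatibility form.
    assert(normalize!NFC("\uFB02ower") != "flower");
    assert(normalize!NFKC("\uFB02ower") == "flower");
}
```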

If you're wondering which direction to take, this official FAQ seems like a good resource (especially the first few questions):
<http://www.unicode.org/faq/normalization.html>

One important thing to note is that most of the time, strings come already in the normalized pre-composed form. So the normalization algorithm should be optimized for the case where it has nothing to do. That's what is said in section 1.3, Description of the Normalization Algorithm, of the specification: <http://www.unicode.org/reports/tr15/#Description_Norm>.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 14, 2011
On 2011-01-14 01:44:19 -0500, "Nick Sabalausky" <a@a.a> said:

> "Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message
> news:igoqrm$1n5r$1@digitalmars.com...
>> Thanks. One further question is: in the above example with u-with-umlaut,
>> there is one code point that corresponds to the entire combination. Are
>> there combinations that do not have a unique code point?
> 
> My understanding is "yes". At least that's what I've heard, and I've never
> heard any claims of "no". I don't know of any specific ones offhand, though.
> Actually, it might be possible to use any combining character with any old
> letter or number (like maybe a 7 with an umlaut), though I'm not certain.

Correct, there are a lot of combinations with no pre-combined form. This should be no surprise given that you can apply any number of combining marks to any character.

	mythical 7 with an umlaut: 7̈
	mythical 7 with umlaut, ring above, and acute accent: 7̈̊́

I can't guarantee your news reader will display the above correctly, but it works as described in mine (Unison on Mac OS X). In fact, it should work in all Cocoa-based applications. This probably includes iOS-based devices too, but I haven't tested there.
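
The lack of a pre-combined form can also be observed from code. The sketch below assumes a current Phobos (std.uni.normalize) and uses a made-up helper name: composing "u" with an umlaut collapses to one code point, while the mythical 7 keeps all of its marks.

```d
import std.range : walkLength;
import std.uni : normalize, NFC;

/// Hypothetical helper: number of code points left after composing to NFC.
size_t codePointsAfterNFC(string s)
{
    return normalize!NFC(s).walkLength;
}

unittest
{
    assert(codePointsAfterNFC("u\u0308") == 1);             // ü exists pre-combined
    assert(codePointsAfterNFC("7\u0308") == 2);             // 7 with umlaut does not
    assert(codePointsAfterNFC("7\u0308\u030A\u0301") == 4); // nor with three marks
}
```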


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 14, 2011
On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer" <schveiguy@yahoo.com> said:

> On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir@gmail.com> wrote:
> 
>> The point is not to play with Unicode's flexibility like that. Rather, composite characters are just normal things in most languages of the world. On this point English is actually a rare exception (setting aside letters imported from foreign languages, like the French 'à'), to the point of being, I guess, the only Western language without any diacritics.
> 
> Is it common to have multiple modifiers on a single character?

Not to my knowledge. But I rarely deal with non-Latin texts; there are probably some scripts out there that take advantage of this.


> The  problem I see with using decomposed canonical form for strings is that we  would have to return a dchar[] for each 'element', which severely  complicates code that, for instance, only expects to handle English.

Actually, returning a sliced char[] or wchar[] could also be valid. User-perceived characters are basically a substring of one or more code points. I'm not sure it complicates the semantics of the language that much -- what's complicated about writing str.front == "a" instead of str.front == 'a'? -- although it probably would complicate the generated code and make it a little slower.

In the case of NSString in Cocoa, you can only access the 'characters' in their UTF-16 form. But everything from comparison to search for substring is done using graphemes. It's like they implemented specialized Unicode-aware algorithms for these functions. There's no genericness about how it handles graphemes.

I'm not sure yet about what would be the right approach for D.


> I was hoping to lazily transform a string into its composed canonical  form, allowing the (hopefully rare) exception when a composed character  does not exist.  My thinking was that this at least gives a useful string  representation for 90% of usages, leaving the remaining 10% of usages to  find a more complex representation (like your Text type).  If we only get  like 20% or 30% there by making dchar the element type, then we haven't  made it useful enough.
> 
> Either way, we need a string type that can be compared canonically for  things like searches or opEquals.

I wonder if normalized string comparison shouldn't be built directly into the char[], wchar[] and dchar[] types instead. Combine that with the idea above, that iterating over a string would yield graphemes as char[], and this code would work perfectly irrespective of whether you used combining characters:

	foreach (grapheme; "exposé") {
		if (grapheme == "é")
			break;
	}

I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do things, however.
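
The proposed semantics are not what the language does, but they can be approximated as a library function. The following is only a sketch under that assumption: it relies on std.uni.graphemeStride and std.uni.normalize from a current Phobos, the helper name is hypothetical, and it normalizes every slice, so it is about as inefficient as predicted above.

```d
import std.uni : graphemeStride, normalize, NFC;

/// Hypothetical helper: true if `text` contains a grapheme that is
/// canonically equivalent to `wanted`.
bool hasGrapheme(string text, string wanted)
{
    immutable target = normalize!NFC(wanted);
    for (size_t i = 0; i < text.length; )
    {
        // Number of code units in the grapheme starting at index i.
        immutable n = graphemeStride(text, i);
        // Slice out one grapheme and compare it in normalized form.
        if (normalize!NFC(text[i .. i + n]) == target)
            return true;
        i += n;
    }
    return false;
}

unittest
{
    // Works whichever side uses combining characters.
    assert(hasGrapheme("expos\u00E9", "e\u0301"));
    assert(hasGrapheme("expose\u0301", "\u00E9"));
    assert(!hasGrapheme("expose", "\u00E9"));
}
```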

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 14, 2011
"spir" <denis.spir@gmail.com> wrote in message news:mailman.619.1295012086.4748.digitalmars-d@puremagic.com...
>
> If anyone finds a pointer to such an explanation, bravo, and thank you. (You will certainly not find it in Unicode literature, for instance.) Nick's explanation below is good and concise. (Just 2 notes added.)

Yea, most Unicode explanations seem to talk all about "code-units vs code-points" and then they'll just have a brief note like "There's also other things like digraphs and combining codes." And that'll be all they mention.

You're right about the Unicode literature. It's the usual standards-body documentation, same as W3C: "Instead of only some people understanding how this works, let's encode the documentation in legalese (and have twenty only-slightly-different versions) to make sure that nobody understands how it works."

> You can also say there are 2 kinds of characters: simple like "u" & composite "ü" or "ü??". The former are coded with a single (base) code, the latter with one (rarely more) base codes and an arbitrary number of combining codes.

Couple questions about the "more than one base codes":

- Do you know an example offhand?

- Does that mean like a ligature where the base codes form a single glyph, or does it mean that the combining code either spans or operates over multiple glyphs? Or can it go either way?

> For a majority of _common_ characters made of 2 or 3 codes (western language letters, korean Hangul syllables,...), precombined codes have been added to the set. Thus, they can be coded with a single code like simple characters.
>

Out of curiosity, how do decomposed Hangul characters work? (Or do you know?) Not actually knowing any Korean, my understanding is that they're a set of 1 to 4 phonetic glyphs that are then combined into one glyph. So, is it like a series of base codes that automatically combine, or are there combining characters involved?
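
For readers with the same question, one way to look at it from code is sketched below, assuming a current Phobos (std.uni.normalize and std.uni.byGrapheme). As far as Unicode normalization goes, a precomposed syllable decomposes canonically into a sequence of conjoining jamo (leading consonant, vowel, optional trailing consonant), and that sequence still counts as one grapheme.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC, NFD;

unittest
{
    dstring syllable = "\uD55C"d; // 한, one precomposed code point

    // Canonical decomposition yields three conjoining jamo.
    dstring jamo = normalize!NFD(syllable);
    assert(jamo == "\u1112\u1161\u11AB"d);

    // The decomposed sequence is still a single user-perceived character,
    // and composing it again gives back the original code point.
    assert(jamo.byGrapheme.walkLength == 1);
    assert(normalize!NFC(jamo) == syllable);
}
```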

> [Also note, to avoid things be too simple ;-), some (few) combining codes called "prepend" come _before_ the base in raw code sequence...]
>

Fun!



January 14, 2011
"spir" <denis.spir@gmail.com> wrote in message news:mailman.624.1295013588.4748.digitalmars-d@puremagic.com...
>
> If it does not display properly, either set your terminal to UTF* or use a more unicode-aware font (eg DejaVu series).
>

How to do that on the Windows (XP) command prompt, for anyone who doesn't know:

Step 1:

Right-click title bar, "Properties", "Font" tab, set font to "Lucida Console" (It'll look weird at first, but you get used to it.)

Step 2 (I had to google this step):

For just the current terminal session: Run "chcp 65001" (i.e. "CHange Code Page"). Also, you can run "chcp" on its own to see which code page you're currently set to.

To make it work permanently: Put "chcp 65001" into the registry key "HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun"



January 14, 2011
"Nick Sabalausky" <a@a.a> wrote in message news:igq9u6$1bqu$1@digitalmars.com...
>
> Step 2 (I had to google this step):
>
> For just the current terminal session: Run "chcp 65001" (i.e. "CHange Code Page"). Also, you can run "chcp" on its own to see which code page you're currently set to.
>
> To make it work permanently: Put "chcp 65001" into the registry key "HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun"
>

Forget that step 2, that causes "Active code page: 65001" to be sent to stdout *every* time system() is invoked. We shouldn't be relying on that. *This* is what should be done (and this really should be done in all D command line apps - or better yet, put into the runtime):

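import std.stdio;

version(Windows)
{
    // std.c.windows.windows has since been removed from Phobos;
    // core.sys.windows.windows already declares SetConsoleOutputCP.
    import core.sys.windows.windows;
}

void main()
{
    // Switch the console's output code page to UTF-8 (65001) from
    // inside the program instead of running chcp.
    version(Windows) SetConsoleOutputCP(65001);

    writeln("HuG says: Fukken Über Death Terminal");
}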

See also: http://d.puremagic.com/issues/show_bug.cgi?id=1448



January 14, 2011
On 1/14/11, Nick Sabalausky <a@a.a> wrote:
> import std.stdio;
>
> version(Windows)
> {
>     import std.c.windows.windows;
>     extern(Windows) export BOOL SetConsoleOutputCP(UINT);
> }
>
> void main()
> {
>     version(Windows) SetConsoleOutputCP(65001);
>
>     writeln("HuG says: Fukken Über Death Terminal");
> }
>

Does that work for you? I get back:
HuG says: Fukken Ãœber Death Terminal