Thread overview
The Unicode Casing Algorithms
Jun 04, 2004
Arcane Jill
Jun 04, 2004
Arcane Jill
Jun 04, 2004
Kris
Jun 04, 2004
Hauke Duden
Unicode (was The Unicode Casing Algorithms)
Jun 04, 2004
Arcane Jill
Jun 04, 2004
Ben Hinkle
Jun 04, 2004
Arcane Jill
Jun 04, 2004
Hauke Duden
Jun 04, 2004
Walter
Jun 04, 2004
Arcane Jill
Jun 04, 2004
Hauke Duden
Jun 04, 2004
Hauke Duden
Jun 04, 2004
Walter
Jun 04, 2004
Hauke Duden
Jun 04, 2004
Kris
Jun 04, 2004
Walter
Jun 05, 2004
Kris
Jun 05, 2004
Kris
Jun 05, 2004
Walter
Jun 05, 2004
Arcane Jill
Jun 05, 2004
Sean Kelly
Jun 05, 2004
Arcane Jill
Jun 05, 2004
Walter
Jun 05, 2004
Sean Kelly
Jun 05, 2004
Hauke Duden
Jun 04, 2004
David L. Davis
Jun 04, 2004
Walter
Jun 04, 2004
Walter
Jun 04, 2004
Arcane Jill
Jun 04, 2004
Walter
Jun 07, 2004
Roberto Mariottini
Jun 07, 2004
Arcane Jill
Jun 08, 2004
Roberto Mariottini
Jun 08, 2004
Arcane Jill
Jun 08, 2004
Hauke Duden
Jun 08, 2004
Arcane Jill
Jun 08, 2004
Hauke Duden
Jun 08, 2004
Arcane Jill
Jun 08, 2004
Hauke Duden
Jun 08, 2004
Arcane Jill
Jun 08, 2004
Hauke Duden
June 04, 2004
Sean makes some good points in his posts, but the D character set is Unicode by definition. Let me go through this:

>Some languages don't have upper and lowercase letters.

This is true, but it's not relevant. This was relevant back in the days of conflicting 8-bit character encoding standards, in which codepoint 0x41 didn't necessarily mean 'A'. But in Unicode this simply doesn't matter, because there is room for all the characters. '\u0416' (Cyrillic capital letter ZHE) will lowercase to '\u0436' (Cyrillic small letter ZHE) even if you don't speak Russian.
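
The locale-independence of the default case mappings is easy to demonstrate — sketched here in Python rather than D, purely to illustrate the behaviour of the Unicode default mappings:

```python
# Default Unicode case mapping is locale-independent:
# U+0416 CYRILLIC CAPITAL LETTER ZHE lowercases to U+0436
# no matter what locale the program runs under.
assert '\u0416'.lower() == '\u0436'
assert '\u0436'.upper() == '\u0416'
```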


>And many others don't
>convert properly using the default routines,

Again, this is true, if by "default routines" you mean existing C routines. But they do convert properly if you employ the Unicode casing algorithms. These guys (the Unicode Consortium) have been figuring out this stuff for the last few decades, and have knowledge and experience which encompasses pretty much all the scripts in the world.


>even if the ASCII character set
>contains all the appropriate symbols.

ASCII, of course, doesn't even contain e-acute, a symbol used, for example, in the English word "café". This symbol (having codepoint '\u00E9') exists in ISO-8859-1, but not in ASCII (whose defined codepoint range is 0x00 to 0x7F). I realise from the context that Sean did know that.



>So tolower(x)==tolower(y) may yield the
>incorrect result if the string contains characters beyond the usual 52 ASCII
>English values.

Absolutely. The existing tolower() function is not suitable for Unicode. It exists for historical reasons, and is useful in compiling legacy code. But it really should be deprecated.

Having said that, one can't deprecate a function until one has something with which to replace it. Hmmm....
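
The difference is easy to see if you put a C-locale-style tolower() next to a Unicode-aware one. A Python sketch (the ascii_tolower helper is hypothetical, mimicking C's tolower in the default "C" locale):

```python
def ascii_tolower(c):
    # Mimics C's tolower() in the default "C" locale: only A-Z change.
    return chr(ord(c) + 32) if 'A' <= c <= 'Z' else c

# Fine for ASCII...
assert ascii_tolower('A') == 'a'
# ...but E-acute passes through unchanged, while the Unicode
# default case mapping lowercases it correctly.
assert ascii_tolower('\u00C9') == '\u00C9'
assert '\u00C9'.lower() == '\u00E9'
```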




>I'd like to assume that a D string is a sequence of characters,
>unicode or otherwise, and I think it would be a mistake to provide methods that
>don't work properly outside of ASCII English. While I'm not much of an expert
>on localization, I do think that the library should be designed with
>localization in mind.

Would you like to know what the localization issues ARE?

In Turkish and Azeri, dotted lowercase i uppercases to DOTTED uppercase I, while dotless uppercase I lowercases to DOTLESS lowercase i. (So if you think about it, the Turkish system actually makes more sense). But Unicode wanted to be a superset of ASCII, so that particular casing rule did not become a part of the standard. Lithuanian retains the dot in a lowercase i when followed by accents.
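
A locale-specific casing function for Turkish/Azeri could be layered on top of the default algorithm. A rough Python sketch — turkish_upper/turkish_lower are hypothetical names, and the real tailoring (per the Unicode SpecialCasing data) is more involved than this:

```python
def turkish_upper(s):
    # Turkish: dotted i uppercases to DOTTED capital I (U+0130);
    # dotless i (U+0131) uppercases to plain I.
    return s.replace('i', '\u0130').replace('\u0131', 'I').upper()

def turkish_lower(s):
    # Turkish: plain I lowercases to DOTLESS i (U+0131);
    # dotted capital I (U+0130) lowercases to plain i.
    return s.replace('I', '\u0131').replace('\u0130', 'i').lower()

# Default .upper() would give 'ISTANBUL'; the Turkish rule keeps the dot.
assert turkish_upper('istanbul') == '\u0130STANBUL'
# Default .lower() would give 'isparta'; the Turkish rule drops the dot.
assert turkish_lower('ISPARTA') == '\u0131sparta'
```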

I believe that it would be perfectly acceptable to provide default casing algorithms which work for the whole world apart from the above exceptions. Special functions could be written for those languages if needed.

For the rest of the world, it all works smoothly, and differences in display are consigned to "font rendering issues". For example, in French, it is unusual to display an accent on an uppercase letter - but '\u00E9' (e acute) still uppercases to '\u00C9' (E acute), even in France. The decision not to DISPLAY the acute accent is considered a rendering issue, not a character issue, and is a problem which is solved very, very neatly simply by supplying specialized French fonts (in which '\u00C9' is rendered without an accent). Similarly, in tradition Irish, the letter i is written without a font - but the codepoint is still '\u0069', same as for the rest of us. Likewise with French, the decision not to display the dot is a mere rendering issue.


>For a more thorough explanation, Scott Meyers discusses the problem in one of his "Effective C++" books, the second one IIRC.

Yes, but that was then and this is now. Unicode was invented precisely to solve this kind of problem, and solve it it has. There is neither any need nor any sense in our reinventing the wheel here. To case-convert a Unicode character, one merely looks up that character in the published Unicode charts. These are purposefully in machine-readable form, and are easily parsed.

Case COMPARISONS are defined in Unicode Technical Report #30, Character Foldings (http://www.unicode.org/reports/tr30/). This is slightly more tricky, for reasons I won't go into here, but all of the algorithms are easily implementable.
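
Part of the trickiness is that some characters fold to more than one character, so a caseless comparison cannot be done character-by-character. Python's str.casefold, which implements Unicode default case folding, illustrates the point:

```python
# German sharp s (U+00DF) has no uppercase form of its own
# and case-folds to "ss".
assert '\u00DF'.casefold() == 'ss'
assert 'Stra\u00DFe'.casefold() == 'STRASSE'.casefold()
# A per-character tolower would miss this entirely:
assert '\u00DF'.lower() == '\u00DF'
```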

Collation, as we know, IS locale dependent. This is even more tricky, but everything you need to know is defined in Unicode Technical Standard #10, Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/)

If I had the time, I'd implement all of this myself, but I'm working on something else right now. I do hope, however, that D doesn't do a half-assed job and not be standards-compliant with the defined Unicode algorithms. I'm with what Walter says in the D manual on this one:

Unicode is the future.

Arcane Jill


June 04, 2004
In article <c9p8dn$2j2i$1@digitaldaemon.com>, Arcane Jill says...

Typo correction:

>in tradition Irish, the letter i is written without a font

should read:

>in traditional Irish, the letter i is written without a DOT

Sorry about that,
Jill



June 04, 2004
Just wanted to note that I have a "real" Unicode casing module in the works. In fact, it is complete but not yet well tested.

I'll try to finish it up and post it here tonight.



Arcane Jill wrote:
> Sean makes some good points in his posts, but the D character set is Unicode by
> definition. Let me go through this:
> 
> 
>>Some languages don't have upper and lowercase letters.
> 
> 
> This is true, but it's not relevant. This was relevant back in the days of
> conflicting 8-bit character encoding standards, in which codepoint 0x41 didn't
> necessarily mean 'A'. But in Unicode this simply doesn't matter, because there
> is room for all the characters. '\u0416' (Cyrillic capital letter ZHE) will
> lowercase to '\u0436' (Cyrillic small letter ZHE) even if you don't speak
> Russian.
> 
> 
> 
>>And many others don't
>>convert properly using the default routines,
> 
> 
> Again, this is true, if by "default routines" you mean existing C routines. But
> they do convert properly if you employ the Unicode casing algorithms. These guys
> (the Unicode Consortium) have been figuring out this stuff for the last few
> decades, and have knowledge and experience which encompasses pretty much all the
> scripts in the world.
> 
> 
> 
>>even if the ASCII character set
>>contains all the appropriate symbols.
> 
> 
> ASCII, of course, doesn't even contain e-acute, a symbol used, for example in
> the English word "café". This symbol (having codepoint '\u00E9') exists in
> ISO-8859-1, but not in ASCII (whose defined codepoint range is 0x00 to
> 0x7F). I realise from the context that Sean did know that.
> 
> 
> 
> 
>>So tolower(x)==tolower(y) may yield the
>>incorrect result if the string contains characters beyond the usual 52 ASCII
>>English values.
> 
> 
> Absolutely. The existing tolower() function is not suitable for Unicode. It
> exists for historical reasons, and is useful in compiling legacy code. But it
> really should be deprecated.
> 
> Having said that, one can't deprecate a function until one has something with
> which to replace it. Hmmm....
> 
> 
> 
> 
> 
>>I'd like to assume that a D string is a sequence of characters,
>>unicode or otherwise, and I think it would be a mistake to provide methods that
>>don't work properly outside of ASCII English. While I'm not much of an expert
>>on localization, I do think that the library should be designed with
>>localization in mind.
> 
> 
> Would you like to know what the localization issues ARE?
> 
> In Turkish and Azeri, dotted lowercase i uppercases to DOTTED uppercase I, while
> dotless uppercase I lowercases to DOTLESS lowercase i. (So if you think about
> it, the Turkish system actually makes more sense). But Unicode wanted to be a
> superset of ASCII, so that particular casing rule did not become a part of the
> standard. Lithuanian retains the dot in a lowercase i when followed by accents.
> 
> I believe that it would be perfectly acceptable to provide default casing
> algorithms which work for the whole world apart from the above exceptions.
> Special functions could be written for those languages if needed.
> 
> For the rest of the world, it all works smoothly, and differences in display are
> consigned to "font rendering issues". For example, in French, it is unusual to
> display an accent on an uppercase letter - but '\u00E9' (e acute) still
> uppercases to '\u00C9' (E acute), even in France. The decision not to DISPLAY
> the acute accent is considered a rendering issue, not a character issue, and is
> a problem which is solved very, very neatly simply by supplying specialized
> French fonts (in which '\u00C9' is rendered without an accent). Similarly, in
> tradition Irish, the letter i is written without a font - but the codepoint is
> still '\u0069', same as for the rest of us. Likewise with French, the decision
> not to display the dot is a mere rendering issue.
> 
> 
> 
>>For a more thorough explanation, Scott Meyers discusses the problem in one of
>>his "Effective C++" books, the second one IIRC.
> 
> 
> Yes, but that was then and this is now. Unicode was invented precisely to solve
> this kind of problem, and solve it it has. There is neither any need nor any
> sense in our reinventing the wheel here. To case-convert a Unicode character,
> one merely looks up that character in the published Unicode charts. These are
> purposefully in machine-readable form, and are easily parsed.
> 
> Case COMPARISONS are defined in Unicode Technical Report #30, Character Foldings
> (http://www.unicode.org/reports/tr30/). This is slightly more tricky, for
> reasons I won't go into here, but all of the algorithms are easily
> implementable.
> 
> Collation, as we know, IS locale dependent. This is even more tricky, but
> everything you need to know is defined in Unicode Technical Standard #10,
> Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/)
> 
> If I had the time, I'd implement all of this myself, but I'm working on
> something else right now. I do hope, however, that D doesn't do a half-assed job
> and not be standards-compliant with the defined Unicode algorithms. I'm with
> what Walter says in the D manual on this one:
> 
> Unicode is the future.
> 
> Arcane Jill
> 
> 
June 04, 2004
In article <c9pi28$jj$1@digitaldaemon.com>, Hauke Duden says...
>
>Just wanted to note that I have a "real" Unicode casing module in the works. In fact, it is complete but not yet well tested.
>
>I'll try to finish it up and post it here tonight.

Wow! I'm so impressed. How's it done? Have you defined a String class?

I ask because, as I'm sure you know, the Unicode character sequence '\u0065\u0301' (lowercase e followed by combining acute accent) should compare equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they won't compare as equal in a straightforward dchar[] == test. (Even the lengths are different). I imagined crafting a String class which knew all about Unicode normalization, so that:

>      assert(String("\u0065\u0301") == String("\u00E9"));

would hold true. And this needs to hold true even in a case-SENSITIVE compare, let alone a case-INsensitive one.

...and not forgetting the conversions:

>       // String s;
>       dchar[] a = s.nfc();
>       dchar[] b = s.nfd();
>       dchar[] c = s.nfkc();
>       dchar[] d = s.nfkd();
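
For what it's worth, the four normalization forms behave like this — a Python illustration via the standard library's unicodedata module (its implementation of the Unicode normalization algorithm):

```python
import unicodedata

decomposed = '\u0065\u0301'  # e followed by combining acute accent
composed = '\u00E9'          # precomposed e-acute

assert decomposed != composed  # naive comparison: not equal
assert unicodedata.normalize('NFC', decomposed) == composed
assert unicodedata.normalize('NFD', composed) == decomposed
# The compatibility forms additionally fold characters such as
# the fi ligature U+FB01 down to plain "fi".
assert unicodedata.normalize('NFKC', '\uFB01') == 'fi'
assert unicodedata.normalize('NFKD', '\uFB01') == 'fi'
```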

If your module is already complete, I guess it's too late for me to point you in the direction of UPR, a binary format for Unicode character properties (much easier to parse than the code-charts). Info is at: http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/. Still - you might want to bear it in mind for the future, unless you've already got your own code for parsing the code-charts (for when the next version of Unicode comes out).

Anyway, good luck. I'm really pleased to see someone taking all this seriously. There are just too many people of the "ASCII's good enough for me" ilk, and it makes a refreshing change to see D and its supporters taking the initiative here.

Arcane Jill


June 04, 2004
Arcane Jill wrote:

> In article <c9pi28$jj$1@digitaldaemon.com>, Hauke Duden says...
>>
>>Just wanted to note that I have a "real" Unicode casing module in the works. In fact, it is complete but not yet well tested.
>>
>>I'll try to finish it up and post it here tonight.
> 
> Wow! I'm so impressed. How's it done? Have you defined a String class?
> 
> I ask because, as I'm sure you know, the Unicode character sequence '\u0065\u0301' (lowercase e followed by combining acute accent) should compare equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they won't compare as equal in a straightforward dchar[] == test. (Even the lengths are different). I imagined crafting a String class which knew all about Unicode normalization, so that:
> 
>>      assert(String("\u0065\u0301") == String("\u00E9"));
> 
> would hold true. And this needs to hold true even in a case-SENSITIVE compare, let alone a case-INsensitive one.

Instead of making a String class another approach would be to write
 char[] normalize(char[])
that uses COW like std.string and use the regular comparison. That is the
model used by tolower and friends. If it is desired an equivalent to cmp
can be devised that takes normalization into account much like
std.string.icmp takes case into account.
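
The normalize-then-compare model Ben describes would look something like this — a Python sketch of the idea (the function names are just placeholders):

```python
import unicodedata

def normalize(s):
    # One-time canonical composition, analogous to the proposed
    # char[] normalize(char[]).
    return unicodedata.normalize('NFC', s)

def ncmp(a, b):
    # cmp-style comparison that takes normalization into account,
    # much as icmp takes case into account.
    a, b = normalize(a), normalize(b)
    return (a > b) - (a < b)

# e + combining acute compares equal to precomposed e-acute.
assert ncmp('\u0065\u0301', '\u00E9') == 0
```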

A class for String came up a while ago and the basic argument against it was that it wasn't needed - functions work fine. Maybe we'll get to the point where a class is needed but the mental model of <length, ptr> and COW functions is so simple it would be a big change to give it up.

-Ben
June 04, 2004
In article <c9ppdu$c90$1@digitaldaemon.com>, Ben Hinkle says...

>Instead of making a String class another approach would be to write
> char[] normalize(char[])
>that uses COW like std.string and use the regular comparison. That is the model used by tolower and friends. If it is desired an equivalent to cmp can be devised that takes normalization into account much like std.string.icmp takes case into account.

Yup, there are all sorts of possible approaches. I could think of a few more too (e.g. optimized comparisons which only need to test the start of the string instead of pre-normalizing all of it). But anyway - I'm keen to see which one Hauke Duden has come up with. I certainly look forward to it.

Jill


June 04, 2004
If it turns out that Jill is Irish, this spells "imminent joviality" to me:

The next time Matthew, Jill, and I disagree on the same thread, some canny wit is bound to make a fricking wisecrack about "There was this Englishman, Irishman, and Scotsman ...".

I'll stake ten bucks, and a slightly worn pocket-protector, that it will be Brad Anderson ... any takers?

<g>



"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:c9pa5a$2ln9$1@digitaldaemon.com...
> In article <c9p8dn$2j2i$1@digitaldaemon.com>, Arcane Jill says...
>
> Typo correction:
>
> >in tradition Irish, the letter i is written without a font
>
> should read:
>
> >in traditional Irish, the letter i is written without a DOT
>
> Sorry about that,
> Jill
>
>
>



June 04, 2004
Arcane Jill wrote:
> In article <c9pi28$jj$1@digitaldaemon.com>, Hauke Duden says...
> 
>>Just wanted to note that I have a "real" Unicode casing module in the works. In fact, it is complete but not yet well tested.
>>
>>I'll try to finish it up and post it here tonight.
> 
> 
> Wow! I'm so impressed. How's it done? Have you defined a String class?

I'm afraid I don't deserve your praise ;).

While I'm also working on a string class, the module I'm talking about is a set of simple global functions like charToLower, charToUpper, charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support for the full unicode character range.
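
The shape of such a module — per-character classification and casing over the full Unicode range — can be sketched like this in Python (the names mirror Hauke's description but are hypothetical; note that a full implementation must also decide what to do about the handful of codepoints whose lowercase form is more than one character):

```python
import unicodedata

def char_to_lower(c):
    # Simple (non-tailored) lowercase mapping of a single character.
    return c.lower()

def char_is_digit(c):
    # True for any decimal digit, not just ASCII '0'-'9'.
    return unicodedata.category(c) == 'Nd'

assert char_to_lower('\u0416') == '\u0436'  # Cyrillic ZHE
assert char_is_digit('\u0661')              # ARABIC-INDIC DIGIT ONE
assert not char_is_digit('A')
```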


> I ask because, as I'm sure you know, the Unicode character sequence
> '\u0065\u0301' (lowercase e followed by combining acute accent) should compare
> equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they
> won't compare as equal in a straightforward dchar[] == test. (Even the lengths
> are different). I imagined crafting a String class which knew all about Unicode
> normalization, so that:
> 
> 
>>     assert(String("\u0065\u0301") == String("\u00E9"));
> 
> 
> would hold true. And this needs to hold true even in a case-SENSITIVE compare,
> let alone a case-INsensitive one.

I think that Unicode is so complicated that doing the case foldings and normalizations on-the-fly for every comparison is a bit of overkill and could also introduce unnecessary performance bottlenecks. For my own programs I have long settled on only comparing strings the simple way (i.e. character for character). That's good enough if you don't have to work on strings that come from outside your program.

For all other situations you can use a normalize function that is called once when the string enters the program.


> If your module is already complete, I guess it's too late for me to point you in
> the direction of UPR, a binary format for Unicode character properties (much
> easier to parse than the code-charts). Info is at:
> http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/. Still - you might
> want to bear it in mind for the future, unless you've already got your own code
> for parsing the code-charts (for when the next version of Unicode comes out).

Thanks for that info - I will check it out. But as a matter of fact I do already have my own tool for parsing the Unicode data ;). It is more convenient for me, since the module works with static arrays that contain the data in compressed form (a relatively simple RLE algorithm, but effective enough to reduce 2 MB worth of tables to 12 KB).
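
The run-length idea works because most case mappings come in long runs with a constant offset (A-Z to a-z is a single run with delta +32). A rough Python sketch of building such a table — not Hauke's actual format, just an illustration of the principle:

```python
def build_runs(mapping):
    # mapping: codepoint -> lowercase codepoint.
    # Collapse consecutive codepoints sharing the same delta
    # into (start, length, delta) runs.
    runs = []
    for cp in sorted(mapping):
        delta = mapping[cp] - cp
        if runs and runs[-1][0] + runs[-1][1] == cp and runs[-1][2] == delta:
            start, length, d = runs[-1]
            runs[-1] = (start, length + 1, d)
        else:
            runs.append((cp, 1, delta))
    return runs

# Basic Latin A-Z and Cyrillic U+0410-U+042F: 58 mappings, only 2 runs.
table = {cp: cp + 32 for cp in range(0x41, 0x5B)}
table.update({cp: cp + 32 for cp in range(0x410, 0x430)})
assert len(build_runs(table)) == 2
```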

> Anyway, good luck. I'm really pleased to see someone taking all this seriously.
> There are just too many people of the "ASCII's good enough for me" ilk, and it
> makes a refreshing change to see D and its supporters taking the initiative
> here.

Thanks ;). I agree that far too many people ignore Unicode (right until their application needs to be translated to Japanese, for example). And D is in the position to make it easier for people to do the right thing from the start. We "only" have to make sure that Phobos implements proper Unicode support.

Hauke

June 04, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:c9p8dn$2j2i$1@digitaldaemon.com...
> If I had the time, I'd implement all of this myself, but I'm working on something else right now. I do hope, however, that D doesn't do a half-assed job and not be standards-compliant with the defined Unicode algorithms. I'm with what Walter says in the D manual on this one:
>
> Unicode is the future.

Yes. Thanks for the excellent references. Right now, the std.ctype functions all take an argument of 'dchar'. This means the interface is correct for unicode, even if the current implementation fails to work on anything but ASCII.

If an ambitious person wishes to fix the implementations so they work with unicode, I'll incorporate them.


June 04, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:c9pneo$91a$1@digitaldaemon.com...
> I ask because, as I'm sure you know, the Unicode character sequence '\u0065\u0301' (lowercase e followed by combining acute accent) should compare equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they won't compare as equal in a straightforward dchar[] == test. (Even the lengths are different).

Oh durn, even with 21-bit Unicode they are *still* having multicharacter sequences? ARRRRGGGGHHH.

