std.string.toUpper() for greek characters

On 10/03/2012 11:21 AM, Dmitry Olshansky wrote:
> On 03-Oct-12 21:10, Ali Çehreli wrote:
>> On 10/03/2012 03:56 AM, Minas wrote:
[...]
>>> map['ά'] = 'Ά';
[...]
> Glad you showed up!

Why? Do I whine better? :p

> One and by far the most useful case is case-insensitive matching.
> That being said this doesn't and shouldn't involve toLower/toUpper (and
> on the whole string) anywhere. Not only it's multipass vs single pass
> but it's also wrong. As a lot of other ASCII-minded carry-overs.

As I have written at other times, there is an experimental alphabet-aware string library (unfortunately even the code is in Turkish at this time).

That library has the following struct for order-comparing alphabet-aware strings and characters:

struct Order
{
    /**
     * Represents comparing characters at their bases.
     *
     * This value indicates that 'a' and 'b' are different. 'C' and 'c'
     * are the same according to this value. This value disregards upper
     * and lower cases.
     */
    int base;

    /**
     * Represents comparing characters by their accents.
     *
     * This value indicates that 'a' and 'â' are different. This value
     * disregards upper and lower cases.
     */
    int accent;

    /**
     * Represents comparing characters also by their upper and lower cases.
     *
     * Lower case letter comes before upper case.
     */
    int cased;
}

(Of course opCmp() cannot return that type. :( )

The idea is that only the application knows what type of comparison makes sense.

Ali

October 03, 2012

Re: std.string.toUpper() for greek characters

Posted by Dmitry Olshansky
in reply to Ali Çehreli


On 03-Oct-12 23:56, Ali Çehreli wrote:
> On 10/03/2012 11:21 AM, Dmitry Olshansky wrote:
>  > On 03-Oct-12 21:10, Ali Çehreli wrote:
>  >> On 10/03/2012 03:56 AM, Minas wrote:
> [...]
>  >>> map['ά'] = 'Ά';
> [...]
>  > Glad you showed up!
>
> Why? Do I whine better? :p
>
Well that might be the case :)
But honestly, because you pushed for some Unicode support back in the day. I'm currently looking around to see if there are obviously important things not covered in my project.

>  > One and by far the most useful case is case-insensitive matching.
>  > That being said this doesn't and shouldn't involve toLower/toUpper (and
>  > on the whole string) anywhere. Not only it's multipass vs single pass
>  > but it's also wrong. As a lot of other ASCII-minded carry-overs.
>
> As I have written at other times, there is an experimental
> alphabet-aware string library (unfortunately even the code is in Turkish
> at this time).
>
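
The quoted point about case-insensitive matching can be sketched in D. This is a minimal illustration, assuming a current Phobos where std.uni provides icmp; the helper name equalIgnoringCase is made up for this example and is not from the thread or from Phobos.

```d
import std.uni : icmp;

/// Hypothetical helper: case-insensitive equality in one pass
/// over both strings. No upper- or lower-cased copy is built.
bool equalIgnoringCase(string a, string b)
{
    // icmp compares using Unicode full case folding and
    // returns 0 when the two strings match ignoring case.
    return icmp(a, b) == 0;
}
```

With this, "ά" and "Ά" should compare equal, and so should "ΟΔΟΣ" and "οδος", since capital sigma and final sigma fold to the same character. A comparison built on lower-casing both whole strings would need a context-aware final-sigma rule to get the second pair right.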

If we are talking about the order then this is the way to go:
http://unicode.org/reports/tr10/

Looks like it's one of the things I haven't implemented yet :(

> That library has the following struct for order-comparing alphabet-aware
> strings and characters:
>
> struct Order
> {
>      /**
>       * Represents comparing characters at their bases.
>       *
>       * This value indicates that 'a' and 'b' are different. 'C' and 'c'
>       * are the same according to this value. This value disregards upper
>       * and lower cases.
>       */
>      int base;
>
>      /**
>       * Represents comparing characters by their accents.
>       *
>       * This value indicates that 'a' and 'â' are different. This value
>       * disregards upper and lower cases.
>       */
>      int accent;
>
>      /**
>       * Represents comparing characters also by their upper and lower cases.
>       *
>       * Lower case letter comes before upper case.
>       */
>      int cased;
> }
>
> (Of course opCmp() cannot return that type. :( )
>
> The idea is that only the application knows what type of comparison
> makes sense.
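
To make the quoted struct concrete, here is one possible reading of it as a small D sketch. The Weights struct, the weightsOf table, and the compareLetters function are all hypothetical and hand-filled for a handful of letters; the actual library may compute these values quite differently.

```d
/// The three comparison levels from the quoted struct.
struct Order
{
    int base;   // base letters only
    int accent; // base letters, then accents
    int cased;  // base letters, then accents, then case
}

/// Hypothetical per-letter weights for a few letters.
struct Weights
{
    int base;
    int accent;
    int cased; // 0 = lower case, 1 = upper case (lower sorts first)
}

Weights weightsOf(dchar c)
{
    switch (c)
    {
        case 'a': return Weights(1, 0, 0);
        case 'A': return Weights(1, 0, 1);
        case 'â': return Weights(1, 1, 0);
        case 'Â': return Weights(1, 1, 1);
        case 'b': return Weights(2, 0, 0);
        case 'B': return Weights(2, 0, 1);
        default:  return Weights(int.max, 0, 0); // unknown letters sort last
    }
}

/// Three-way comparison of two ints: -1, 0, or 1.
int threeWay(int x, int y)
{
    return x < y ? -1 : (x > y ? 1 : 0);
}

/// Compares two letters at all three levels at once.
Order compareLetters(dchar x, dchar y)
{
    immutable wx = weightsOf(x);
    immutable wy = weightsOf(y);

    Order result;
    result.base = threeWay(wx.base, wy.base);
    // Each finer level only breaks ties left by the coarser one.
    result.accent = result.base != 0 ? result.base : threeWay(wx.accent, wy.accent);
    result.cased = result.accent != 0 ? result.accent : threeWay(wx.cased, wy.cased);
    return result;
}
```

Under these assumed weights, compareLetters('a', 'A') gives Order(0, 0, -1): equal by base and accent, with the lower case letter first. The caller then picks whichever field matches the comparison it needs.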

So instead the library does all of them? Ouch... I'm not sure I got the idea.


-- 
Dmitry Olshansky

On 10/03/2012 01:37 PM, Dmitry Olshansky wrote:
> On 03-Oct-12 23:56, Ali Çehreli wrote:

> If we are talking about the order then this is the way to go:
> http://unicode.org/reports/tr10/

Thank you. I wasn't aware of that long read. :)

>> struct Order
>> {
>>     int base;
>>     int accent;
>>     int cased;
>> }
>>
>> (Of course opCmp() cannot return that type. :( )
>>
>> The idea is that only the application knows what type of comparison
>> makes sense.
>
> So instead the library does all of them? Ouch... I'm not sure I got the idea.

The idea was that there would be AlphabetChar and AlphabetString types that knew which writing system they belonged to: AlphabetChar!en, AlphabetChar!tr, etc.

For example, while the letter ç is a distinct letter in the Turkish alphabet, it is an accented form of c in most Latin-based alphabets. That affects the 'base' member above. On the other hand, â is an accented 'a' both in the Turkish and the Latin-based alphabets, so the 'base' comparison for â and a would be the same.

Collation takes the alphabet into account. Although AlphabetChar!en is not compatible with AlphabetChar!tr, they can be forced to be compared according to the collation information of any alphabet.

So, that experimental library provides a number of alphabets with their own collation orders. I see now that the library should have supported the Unicode document that you have linked above. I will do some reading. :)

Ali
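
The ç example above can be illustrated with a short D sketch. Everything here is an assumption made for illustration: the Alphabet enum, the baseIndex function, and the tiny letter tables are invented, and do not reflect how AlphabetChar is actually implemented.

```d
/// Hypothetical set of supported alphabets.
enum Alphabet { en, tr }

/// Hypothetical: position of a letter's base within the given
/// alphabet, for a small hand-written subset of letters.
int baseIndex(Alphabet alphabet)(dchar c)
{
    static if (alphabet == Alphabet.tr)
    {
        // Turkish: ç is a letter of its own, placed right after c.
        switch (c)
        {
            case 'a', 'â': return 0;
            case 'b':      return 1;
            case 'c':      return 2;
            case 'ç':      return 3;
            case 'd':      return 4;
            default:       return int.max; // unknown letters sort last
        }
    }
    else
    {
        // English: ç is treated as an accented c, so it shares c's base.
        switch (c)
        {
            case 'a', 'â': return 0;
            case 'b':      return 1;
            case 'c', 'ç': return 2;
            case 'd':      return 3;
            default:       return int.max; // unknown letters sort last
        }
    }
}
```

With these tables, baseIndex!(Alphabet.en)('ç') equals baseIndex!(Alphabet.en)('c'), while the two differ under Alphabet.tr. In both alphabets â shares its base with a, matching the description above.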
