October 03, 2012
std.string.toUpper() for greek characters
Currently, toUpper() (and probably toLower()) does not handle 
Greek characters correctly. I fixed toUpper() by writing another 
function for Greek characters:

// called if (c >= 0x387 && c <= 0x3CE)
dchar toUpperGreek(dchar c)
{
	if( c >= 'α' && c <= 'ω' )
	{
		if( c == 'ς' )
			c = 'Σ';
		else
			c -= 32;
	}
	else
	{
		dchar[dchar] map;
		map['ά'] = 'Ά';
		map['έ'] = 'Έ';
		map['ή'] = 'Ή';
		map['ί'] = 'Ί';
		map['ϊ'] = 'Ϊ';
		map['ΐ'] = 'Ϊ';
		map['ό'] = 'Ό';
		map['ύ'] = 'Ύ';
		map['ϋ'] = 'Ϋ';
		map['ΰ'] = 'Ϋ';
		map['ώ'] = 'Ώ';
		
		c = map[c];
	}
	
	return c;
}

Then, in toUpper()
{
   ....
   if (c >= 0x387 && c <= 0x3CE)
      c = toUpperGreek(c);
   ///
}

Do you think it should stay like that, or should I copy-paste it 
into the body of toUpper()?

I'm going to fix toLower() as well and make a pull request.
October 03, 2012
Re: std.string.toUpper() for greek characters
On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
> Currently, toUpper() (and probably toLower()) does not handle 
> greek characters correctly. I fixed toUpper() by making a 
> another function for greek characters
>
> // called if (c >= 0x387 && c <= 0x3CE)
> dchar toUpperGreek(dchar c)
> {
> 	if( c >= 'α' && c <= 'ω' )
> 	{
> 		if( c == 'ς' )
> 			c = 'Σ';
> 		else
> 			c -= 32;
> 	}
> 	else
> 	{
> 		dchar[dchar] map;
> 		map['ά'] = 'Ά';
> 		map['έ'] = 'Έ';
> 		map['ή'] = 'Ή';
> 		map['ί'] = 'Ί';
> 		map['ϊ'] = 'Ϊ';
> 		map['ΐ'] = 'Ϊ';
> 		map['ό'] = 'Ό';
> 		map['ύ'] = 'Ύ';
> 		map['ϋ'] = 'Ϋ';
> 		map['ΰ'] = 'Ϋ';
> 		map['ώ'] = 'Ώ';
> 		
> 		c = map[c];
> 	}
> 	
> 	return c;
> }
>
> Then, in toUpper()
> {
>    ....
>    if (c >= 0x387 && c <= 0x3CE)
>       c = toUpperGreek()...
>    ///
> }
>
> Do you think it should stay like that or I should copy-paste it 
> in the body of toUpper()?
>
> I'm going to fix toLower() as well and make a pull request.

A switch with 11 cases is very likely going to be a lot faster 
than the hash table approach you're using, especially since the 
AA is not cached and will be dynamically allocated on every call.
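
For illustration, the switch suggested above could look roughly like this. It is a sketch, not code from the thread: the name toUpperGreekSwitch is made up, and returning unmapped characters unchanged in the default case is an assumption (the posted version indexes the AA with c, which throws a RangeError for any key that is not in the map). The eleven accented mappings are the ones from the original post.

```d
/// Sketch only: same mappings as the posted toUpperGreek, but using a
/// switch instead of an associative array built on every call.
dchar toUpperGreekSwitch(dchar c) pure nothrow @nogc @safe
{
    // Unaccented lowercase letters sit exactly 32 code points above
    // their uppercase forms; final sigma is the one exception.
    if (c >= 'α' && c <= 'ω')
        return c == 'ς' ? 'Σ' : cast(dchar)(c - 32);

    switch (c)
    {
        case 'ά': return 'Ά';
        case 'έ': return 'Έ';
        case 'ή': return 'Ή';
        case 'ί': return 'Ί';
        case 'ϊ': return 'Ϊ';
        case 'ΐ': return 'Ϊ';
        case 'ό': return 'Ό';
        case 'ύ': return 'Ύ';
        case 'ϋ': return 'Ϋ';
        case 'ΰ': return 'Ϋ';
        case 'ώ': return 'Ώ';
        default:  return c; // assumption: leave everything else untouched
    }
}

unittest
{
    assert(toUpperGreekSwitch('α') == '\u0391'); // Greek capital alpha
    assert(toUpperGreekSwitch('ς') == 'Σ');
    assert(toUpperGreekSwitch('ώ') == 'Ώ');
    assert(toUpperGreekSwitch('a') == 'a');      // non-Greek input passes through
}
```

The function needs no allocation at all, so it can carry @nogc and nothrow, which the AA version cannot.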
October 03, 2012
Re: std.string.toUpper() for greek characters
On Wednesday, 3 October 2012 at 11:03:27 UTC, Jakob Ovrum wrote:
> On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
>> Currently, toUpper() (and probably toLower()) does not handle 
>> greek characters correctly. I fixed toUpper() by making a 
>> another function for greek characters
>>
>> // called if (c >= 0x387 && c <= 0x3CE)
>> dchar toUpperGreek(dchar c)
>> {
>> 	if( c >= 'α' && c <= 'ω' )
>> 	{
>> 		if( c == 'ς' )
>> 			c = 'Σ';
>> 		else
>> 			c -= 32;
>> 	}
>> 	else
>> 	{
>> 		dchar[dchar] map;
>> 		map['ά'] = 'Ά';
>> 		map['έ'] = 'Έ';
>> 		map['ή'] = 'Ή';
>> 		map['ί'] = 'Ί';
>> 		map['ϊ'] = 'Ϊ';
>> 		map['ΐ'] = 'Ϊ';
>> 		map['ό'] = 'Ό';
>> 		map['ύ'] = 'Ύ';
>> 		map['ϋ'] = 'Ϋ';
>> 		map['ΰ'] = 'Ϋ';
>> 		map['ώ'] = 'Ώ';
>> 		
>> 		c = map[c];
>> 	}
>> 	
>> 	return c;
>> }
>>
>> Then, in toUpper()
>> {
>>   ....
>>   if (c >= 0x387 && c <= 0x3CE)
>>      c = toUpperGreek()...
>>   ///
>> }
>>
>> Do you think it should stay like that or I should copy-paste 
>> it in the body of toUpper()?
>>
>> I'm going to fix toLower() as well and make a pull request.
>
> A switch with 11 cases is very likely going to be a lot faster 
> than the hash table approach you're using, especially since the 
> AA is not cached and will be dynamically allocated on every 
> call.


I had this in mind as well. I will change it, thanks.
October 03, 2012
Re: std.string.toUpper() for greek characters
On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
> Currently, toUpper() (and probably toLower()) does not handle 
> greek characters correctly. I fixed toUpper() by making a 
> another function for greek characters
>
> // called if (c >= 0x387 && c <= 0x3CE)
> dchar toUpperGreek(dchar c)
> {
> 	if( c >= 'α' && c <= 'ω' )
> 	{
> 		if( c == 'ς' )
> 			c = 'Σ';
> 		else
> 			c -= 32;
> 	}
> 	else
> 	{
> 		dchar[dchar] map;
> 		map['ά'] = 'Ά';
> 		map['έ'] = 'Έ';
> 		map['ή'] = 'Ή';
> 		map['ί'] = 'Ί';
> 		map['ϊ'] = 'Ϊ';
> 		map['ΐ'] = 'Ϊ';
> 		map['ό'] = 'Ό';
> 		map['ύ'] = 'Ύ';
> 		map['ϋ'] = 'Ϋ';
> 		map['ΰ'] = 'Ϋ';
> 		map['ώ'] = 'Ώ';
> 		
> 		c = map[c];
> 	}
> 	
> 	return c;
> }
>
> Then, in toUpper()
> {
>    ....
>    if (c >= 0x387 && c <= 0x3CE)
>       c = toUpperGreek()...
>    ///
> }
>
> Do you think it should stay like that or I should copy-paste it 
> in the body of toUpper()?
>
> I'm going to fix toLower() as well and make a pull request.

Regarding toLower(), a problem I see is how to handle sigma (Σ), 
because it has two possible lowercase representations depending 
on where it occurs in a word. But of course toLower() works on a 
per-character basis, so it cannot know what the receiver plans to do 
with the character.

--
Paulo
October 03, 2012
Re: std.string.toUpper() for greek characters
On Wednesday, 3 October 2012 at 13:27:25 UTC, Paulo Pinto wrote:
> On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
>> Currently, toUpper() (and probably toLower()) does not handle 
>> greek characters correctly. I fixed toUpper() by making a 
>> another function for greek characters
>>
>> // called if (c >= 0x387 && c <= 0x3CE)
>> dchar toUpperGreek(dchar c)
>> {
>> 	if( c >= 'α' && c <= 'ω' )
>> 	{
>> 		if( c == 'ς' )
>> 			c = 'Σ';
>> 		else
>> 			c -= 32;
>> 	}
>> 	else
>> 	{
>> 		dchar[dchar] map;
>> 		map['ά'] = 'Ά';
>> 		map['έ'] = 'Έ';
>> 		map['ή'] = 'Ή';
>> 		map['ί'] = 'Ί';
>> 		map['ϊ'] = 'Ϊ';
>> 		map['ΐ'] = 'Ϊ';
>> 		map['ό'] = 'Ό';
>> 		map['ύ'] = 'Ύ';
>> 		map['ϋ'] = 'Ϋ';
>> 		map['ΰ'] = 'Ϋ';
>> 		map['ώ'] = 'Ώ';
>> 		
>> 		c = map[c];
>> 	}
>> 	
>> 	return c;
>> }
>>
>> Then, in toUpper()
>> {
>>   ....
>>   if (c >= 0x387 && c <= 0x3CE)
>>      c = toUpperGreek()...
>>   ///
>> }
>>
>> Do you think it should stay like that or I should copy-paste 
>> it in the body of toUpper()?
>>
>> I'm going to fix toLower() as well and make a pull request.
>
> Regarding toLower() a problem I see is how to handle sigma 
> (Σ), because it has two possible lower case representations 
> depending where it occurs in a word. But of course toLower() is 
> working on character basis, so it cannot know what the receiver 
> plans to do with the character.
>
> --
> Paulo

Yeah, that's a problem indeed. I will make it map to 'σ', and 
the programmer can change a word-final 'σ' to 'ς' afterwards.
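
As a rough idea of what that follow-up step might look like on the caller's side, here is a sketch. The helper fixFinalSigma is hypothetical and not part of Phobos, and its notion of "word-final" (preceded by a letter, not followed by one) is a simplification of the Final_Sigma condition in Unicode's SpecialCasing.txt. It assumes std.uni.isAlpha is available.

```d
import std.uni : isAlpha;

/// Hypothetical helper: rewrites a word-final 'σ' as 'ς' in text that has
/// already been lower-cased one character at a time.
dstring fixFinalSigma(dstring s)
{
    auto result = s.dup;
    foreach (i, c; s)
    {
        if (c != 'σ')
            continue;
        // Simplified rule: a letter before, no letter after.
        immutable afterLetter  = i > 0 && isAlpha(s[i - 1]);
        immutable beforeLetter = i + 1 < s.length && isAlpha(s[i + 1]);
        if (afterLetter && !beforeLetter)
            result[i] = 'ς';
    }
    return result.idup;
}

unittest
{
    assert(fixFinalSigma("οδυσσευσ"d) == "οδυσσευς"d); // only the last sigma changes
    assert(fixFinalSigma("σ"d) == "σ"d);               // a lone sigma is left alone
}
```

The point of the sketch is that the decision needs the neighbouring characters, which is exactly the context a dchar-to-dchar toLower() does not have.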
October 03, 2012
Re: std.string.toUpper() for greek characters
On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
> Do you think it should stay like that or I should copy-paste it 
> in the body of toUpper()?
>
> I'm going to fix toLower() as well and make a pull request.

In any case, you should coordinate with Dmitry Olshansky, since 
he is (was?) working on Unicode support in Phobos.

David
October 03, 2012
Re: std.string.toUpper() for greek characters
On 10/03/2012 03:56 AM, Minas wrote:
> Currently, toUpper() (and probably toLower()) does not handle greek
> characters correctly. I fixed toUpper() by making a another function for
> greek characters
>
> // called if (c >= 0x387 && c <= 0x3CE)
> dchar toUpperGreek(dchar c)
> {
>     if( c >= 'α' && c <= 'ω' )
>     {
>         if( c == 'ς' )
>             c = 'Σ';
>         else
>             c -= 32;
>     }
>     else
>     {
>         dchar[dchar] map;
>         map['ά'] = 'Ά';
>         map['έ'] = 'Έ';
>         map['ή'] = 'Ή';
>         map['ί'] = 'Ί';
>         map['ϊ'] = 'Ϊ';
>         map['ΐ'] = 'Ϊ';
>         map['ό'] = 'Ό';
>         map['ύ'] = 'Ύ';
>         map['ϋ'] = 'Ϋ';
>         map['ΰ'] = 'Ϋ';
>         map['ώ'] = 'Ώ';
>
>         c = map[c];
>     }
>
>     return c;
> }
>
> Then, in toUpper()
> {
>    ....
>    if (c >= 0x387 && c <= 0x3CE)
>       c = toUpperGreek()...
>    ///
> }
>
> Do you think it should stay like that or I should copy-paste it in the
> body of toUpper()?
>
> I'm going to fix toLower() as well and make a pull request.

I don't want to detract from the usefulness of these functions, but 
toupper and tolower are two of the strangest functions in the history 
of computing. It is amazing that they are still accepted, because 
they are useful in very limited situations, and those situations are 
becoming rarer as more and more systems support Unicode.

Two quick examples:

1) How should this string be capitalized in a scientific article?

  "Anti-obesity effects of α-lipoic acid"

I don't think the α in there should be upper-cased.

2) How should this name be capitalized in a list of names?

  "Ali"

It completely depends on the writing system of that string itself, not 
even the current locale. (There are two uppercases that I know of, which 
can be considered as correct: "ALI" and "ALİ".)

I agree that your toUpper() and toLower() will be useful in many 
contexts but will necessarily do the wrong thing in others.

Ali
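
To make the "ALI" versus "ALİ" point concrete, here is a small sketch. The function toUpperTurkish is hypothetical and tailors only the dotted/dotless i pair; everything else falls back to std.uni.toUpper, which is assumed to be the locale-independent mapping of a current Phobos.

```d
import std.algorithm : map;
import std.array : array;
import std.uni : toUpper;

/// Hypothetical Turkish-style upper-casing of a single code point.
dchar toUpperTurkish(dchar c)
{
    switch (c)
    {
        case 'i':      return '\u0130'; // İ, capital I with dot above
        case '\u0131': return 'I';      // ı, dotless i, maps to plain I
        default:       return toUpper(c);
    }
}

unittest
{
    dchar i = 'i';
    assert(toUpper(i) == 'I');             // default, locale-independent
    assert(toUpperTurkish(i) == '\u0130'); // Turkish tailoring

    // The same input gives two different, equally defensible results.
    assert("Ali"d.map!toUpper.array == "ALI"d);
    assert("Ali"d.map!toUpperTurkish.array == "AL\u0130"d);
}
```

Nothing in the string "Ali" itself says which of the two functions is the right one to call; that knowledge has to come from outside.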
October 03, 2012
Re: std.string.toUpper() for greek characters
On 03-Oct-12 18:11, Minas wrote:
> On Wednesday, 3 October 2012 at 13:27:25 UTC, Paulo Pinto wrote:
>> On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
>>> Currently, toUpper() (and probably toLower()) does not handle greek
>>> characters correctly. I fixed toUpper() by making a another function
>>> for greek characters

And a lot of others, too. And this fix is handwritten and thus unmaintainable.

>>>
>>> // called if (c >= 0x387 && c <= 0x3CE)
>>> dchar toUpperGreek(dchar c)
>>> {
>>>     if( c >= 'α' && c <= 'ω' )
>>>     {
>>>         if( c == 'ς' )
>>>             c = 'Σ';
>>>         else
>>>             c -= 32;
>>>     }
>>>     else
>>>     {
>>>         dchar[dchar] map;
>>>         map['ά'] = 'Ά';
>>>         map['έ'] = 'Έ';
>>>         map['ή'] = 'Ή';
>>>         map['ί'] = 'Ί';
>>>         map['ϊ'] = 'Ϊ';
>>>         map['ΐ'] = 'Ϊ';
>>>         map['ό'] = 'Ό';
>>>         map['ύ'] = 'Ύ';
>>>         map['ϋ'] = 'Ϋ';
>>>         map['ΰ'] = 'Ϋ';
>>>         map['ώ'] = 'Ώ';
>>>
>>>         c = map[c];
>>>     }
>>>
>>>     return c;
>>> }
>>>
>>> Then, in toUpper()
>>> {
>>>   ....
>>>   if (c >= 0x387 && c <= 0x3CE)
>>>      c = toUpperGreek()...
>>>   ///
>>> }
>>>
>>> Do you think it should stay like that or I should copy-paste it in
>>> the body of toUpper()?
>>>
>>> I'm going to fix toLower() as well and make a pull request.

I'm *strongly* against bringing these temporary hacks into the standard 
library. The fact that toUpper/toLower are outdated is bad, but fixing it 
by piling hack after hack onto this mess of if/else branches is not the 
way out.
Also, I hope you haven't missed the few hundred other cased Greek letters listed here:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Agreek%3A%5D+%26+%5B%3ACasedLetter%3A%5D&g=

The way out is a proper implementation that is a direct derivative of 
the Unicode character database. And I've spent this summer working on this 
proper 'cure' for these kinds of problems with Unicode support in D.

Admittedly, my reworked Unicode support probably won't make the next 
release (2.061); it needs to go through review etc. But I'm determined to 
get it into 2.062.

I'd suggest keeping your personal version around for the moment and then 
just switching to the new std one. However, given our release schedule, this 
could be anywhere from 4 months to 1 year away :)

>>
>> Regarding toLower() a problem I see is how to handle sigma (Σ),
>> because it has two possible lower case representations depending where
>> it occurs in a word. But of course toLower() is working on character
>> basis, so it cannot know what the receiver plans to do with the
>> character.
>>
>> --
>> Paulo
>
> Yeah, that's a problem indeed. I will make it become 'σ', and the
> programmer can change the final'σ' to 'ς' himself.

I think this is one of a small number of special cases; see the full 
list here:
ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt
(handling these subtleties is commonly called 'tailoring' and is currently, 
I believe, out of reach for the standard library)

Currently my toLower will produce 'σ', as prescribed by the simple case 
folding rules (i.e. the ones that can only map 1:1).

I have a case-insensitive string comparison that does 1:n mappings as well 
(and is going to replace the current icmp), but it doesn't do tailoring.
One day we may add some language-specific tailoring (via locales etc.), 
but we'd better do it carefully.

-- 
Dmitry Olshansky
October 03, 2012
Re: std.string.toUpper() for greek characters
On 03-Oct-12 21:10, Ali Çehreli wrote:
> On 10/03/2012 03:56 AM, Minas wrote:
>> Currently, toUpper() (and probably toLower()) does not handle greek
>> characters correctly. I fixed toUpper() by making a another function for
>> greek characters
>>
>> // called if (c >= 0x387 && c <= 0x3CE)
>> dchar toUpperGreek(dchar c)
>> {
>>     if( c >= 'α' && c <= 'ω' )
>>     {
>>         if( c == 'ς' )
>>             c = 'Σ';
>>         else
>>             c -= 32;
>>     }
>>     else
>>     {
>>         dchar[dchar] map;
>>         map['ά'] = 'Ά';
>>         map['έ'] = 'Έ';
>>         map['ή'] = 'Ή';
>>         map['ί'] = 'Ί';
>>         map['ϊ'] = 'Ϊ';
>>         map['ΐ'] = 'Ϊ';
>>         map['ό'] = 'Ό';
>>         map['ύ'] = 'Ύ';
>>         map['ϋ'] = 'Ϋ';
>>         map['ΰ'] = 'Ϋ';
>>         map['ώ'] = 'Ώ';
>>
>>         c = map[c];
>>     }
>>
>>     return c;
>> }
>>
>> Then, in toUpper()
>> {
>>   ....
>>   if (c >= 0x387 && c <= 0x3CE)
>>      c = toUpperGreek()...
>>   ///
>> }
>>
>> Do you think it should stay like that or I should copy-paste it in the
>> body of toUpper()?
>>
>> I'm going to fix toLower() as well and make a pull request.
>
> I don't want to detract from the usefulness of these functions but
> toupper and tolower has been two of the strangests functions of the
> computer history. It is amazing that they are still accepted, because
> they are useful in very limited situations and those situations are
> becoming rarer as more and more systems support Unicode.
>
Glad you showed up!

One use case, and by far the most useful one, is case-insensitive matching.
That being said, it doesn't and shouldn't involve toLower/toUpper (least of 
all on the whole string) anywhere. Not only is it multipass vs. single pass, 
it's also wrong, like a lot of other ASCII-minded carry-overs.

Other than this, and being used as some intermediate sanitized form, I 
don't think it has much use.
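
A small sketch of the difference, assuming a Phobos whose std.uni provides icmp with full case folding (as the current documentation describes) alongside the string overload of toLower:

```d
import std.uni : icmp, toLower;

unittest
{
    // Single pass, no temporary strings: icmp folds case while comparing.
    assert(icmp("ΓΛΩΣΣΑ", "γλωσσα") == 0);

    // Full case folding covers 1:n mappings such as ß -> ss ...
    assert(icmp("Rußland", "Russland") == 0);

    // ... whereas lower-casing both sides first allocates two new strings
    // and still fails to match that pair.
    assert(toLower("Rußland") != toLower("Russland"));
}
```

Neither variant does any language-specific tailoring; the example only contrasts folding during comparison with converting whole strings up front.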

> Two quick examples:
>
> 1) How should this string be capitalized in a scientific article?
>
>    "Anti-obesity effects of α-lipoic acid"

There are a lot of lousy conversions. The basic one is defined in the 
standard; try it here:
http://unicode.org/cldr/utility/transform.jsp?a=Upper&b=Anti-obesity+effects+of+%CE%B1-lipoic+acid

> I don't think the α in there should be upper-cased.

Depends on why you are doing it in the first place :) Capitalizing a 
scientific article strikes me as kind of strange as well.


> 2) How should this name be capitalized in a list of names?
>
>    "Ali"
>
Again, what's the goal of capitalization here?
Simplifying matching afterwards? Then it doesn't matter, as long as 
its lousiness is acceptable (rarely so) and it stays within the system, 
i.e. doesn't leak out.

> It completely depends on the writing system of that string itself, not
> even the current locale. (There are two uppercases that I know of, which
> can be considered as correct: "ALI" and "ALİ".)
>
One word: tailoring. Basically any software made in Turkey has to do ALİ :)
Only half-joking.

> I agree that your toUpper() and toLower() will be useful in many
> contexts but will necessarily do the wrong thing in others.
>
> Ali


-- 
Dmitry Olshansky
October 03, 2012
Re: std.string.toUpper() for greek characters
On 03-Oct-12 20:13, David Nadlinger wrote:
> On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
>> Do you think it should stay like that or I should copy-paste it in the
>> body of toUpper()?
>>
>> I'm going to fix toLower() as well and make a pull request.
>
> In any case, you should coordinate with Dmitry Olshansky, since he is
> working on Unicode support in Phobos.

Fixed ;)

-- 
Dmitry Olshansky