Thread overview
Table of strings sorting problem
Mar 11, 2006
Aarti
Mar 11, 2006
S. Chancellor
Mar 11, 2006
Hasan Aljudy
Mar 11, 2006
James Dunne
Mar 11, 2006
John C
Mar 11, 2006
Aarti
Mar 11, 2006
John C
March 11, 2006
Hello all D-Fans!

I encountered a problem with string sorting according to Polish language rules. Here is a simple test program:

// ----------------------------------
import std.stdio;
void main() {
	char[][] table;
	table.length=15;
	
	table[0]="ą";
	table[1]="a";
	table[2]="ć";
	table[3]="c";
	table[4]="ę";
	table[5]="e";
	table[6]="ł";
	table[7]="l";
	table[8]="ó";
	table[9]="o";
	table[10]="ś";
	table[11]="s";
	table[12]="ź";
	table[13]="ż";
	table[14]="z";

	table.sort;

	foreach(char[] s; table) {
		writef(s);
	}
	writefln();
}
// ----------------------------------

Output of this test is:
aceloszóąćęłśźż

when it should be:
aącćeęlłoósśzźż

It looks like sort doesn't sort properly according to language rules.

Is it a known issue? How can I sort strings in D according to language rules?

PS. Being able to use Polish characters in class identifiers is really cool. Examples in C++ books always use Trojkat instead of Trójkąt (triangle), and it looks awful.

Regards
Marcin Kuszczak
March 11, 2006
On 2006-03-10 17:20:35 -0800, Aarti <aarti@interia.pl> said:

> It looks like sort doesn't sort properly according to language rules.
> 
> Is it a known issue? How can I sort strings in D according to language rules?

Sort works off of the binary value of a character. A sort that follows Polish rules is something you would have to implement yourself: define a map from each character to its sort order and sort based on that. I'm not sure whether the sort property takes a delegate; that was proposed before. You could say it's mostly a coincidence that the Latin characters fall in order numerically. (Though whoever assigned the ASCII character values probably did that on purpose.)
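
A minimal sketch of the "map each character to its sort order" idea, assuming a present-day D compiler with Phobos rather than the 2006 toolchain used in this thread. It relies on `sort` from `std.algorithm`, which accepts a comparison predicate; the built-in `.sort` property discussed here is not used. The names `polishOrder`, `rank` and `polishLess` are made up for illustration, and the table covers lowercase letters only, so this is not a full collation implementation.

```d
import std.algorithm.comparison : cmp;
import std.algorithm.iteration : map;
import std.algorithm.sorting : sort;
import std.array : join;
import std.stdio : writeln;

// Hypothetical rank table: a letter's position in this string is its sort rank.
immutable dstring polishOrder = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"d;

// Returns the rank of c; characters outside the table sort after it,
// ordered by code point.
int rank(dchar c)
{
    foreach (i, dchar p; polishOrder)
        if (p == c)
            return cast(int) i;
    return cast(int) polishOrder.length + cast(int) c;
}

// Compares two strings rank by rank. Mapping over a string decodes
// UTF-8 into whole characters, so "ą" is one element, not two bytes.
bool polishLess(string a, string b)
{
    return cmp(a.map!rank, b.map!rank) < 0;
}

void main()
{
    string[] table = ["ą", "a", "ć", "c", "ę", "e", "ł", "l",
                      "ó", "o", "ś", "s", "ź", "ż", "z"];
    sort!polishLess(table);
    writeln(table.join); // aącćeęlłoósśzźż
}
```

A lookup table like this only handles one alphabet; supporting several languages would need one table per language or a real collation library.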

-S.

March 11, 2006
S. Chancellor wrote:
> Sort works off of the binary value of a character.

And note that the output
>> aceloszóąćęłśźż
prints "english" characters first!! acelosz
March 11, 2006
Hasan Aljudy wrote:
> And note that the output
>  >> aceloszóąćęłśźż
> prints "english" characters first!! acelosz

Correction: ASCII characters first, because they are in the range 0-127. Look at the Unicode tables; they're publicly available. Other Latin-script languages use the ASCII characters too.
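
To make the byte-order point concrete, here is a small sketch, assuming a present-day D compiler with Phobos and UTF-8 source files. The helper `hexBytes` is a made-up name for illustration. It prints the UTF-8 code units of a few of the strings from the test program, which shows why every plain ASCII letter compares lower than any accented one.

```d
import std.array : join;
import std.format : format;
import std.stdio : writefln;
import std.string : representation;

// Hypothetical helper: the UTF-8 code units of s as hex, separated by spaces.
string hexBytes(string s)
{
    string[] parts;
    foreach (ubyte b; s.representation)
        parts ~= format("%02X", b);
    return parts.join(" ");
}

void main()
{
    // Prints:
    // z = 7A
    // ó = C3 B3
    // ą = C4 85
    // ż = C5 BC
    foreach (s; ["z", "ó", "ą", "ż"])
        writefln("%s = %s", s, hexBytes(s));

    // Built-in array comparison works on these code units, so "z" (0x7A)
    // compares lower than "ó" (0xC3 0xB3), and "ó" lower than "ą" (0xC4 0x85).
    assert("z" < "ó");
    assert("ó" < "ą");
}
```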

The problem is language and culture-specific collation.  It is a very difficult problem to solve generically, since each language has many subcultures and each subculture agrees on different rules for collating text.  See discussions on ICU in the archives.

If one is looking for an explanation of the problem along with a collation solution, I would recommend: http://www.unicode.org/reports/tr10/

-- 
Regards,
James Dunne
March 11, 2006
Aarti wrote:
> Is it a known issue? How can I sort strings in D according to language rules?

As others have implied, D's standard library isn't culturally aware.

I've been working on a locale package for Mango that will eventually allow correct string sorting for specific languages. This is how you'd sort a list of Polish characters:

const char[][] table = [ "a","ą","c","ć","e","ę" ];
Culture.current = Culture.getCulture("pl-PL");
table.sort();
March 11, 2006
John C wrote:

> As others have implied, D's standard library isn't culturally aware.
> 
> I've been working on a locale package for Mango that will eventually allow correct string sorting for specific languages. This is how you'd sort a list of Polish characters:
> 
> const char[][] table = [ "a","ą","c","ć","e","ę" ];
> Culture.current = Culture.getCulture("pl-PL");
> table.sort();

That would be really helpful! Does it already work? In particular, I don't understand how I can change the standard behaviour of the table sort property.

I think that internationalization support is one of the most important areas that could increase D's acceptance all over the world.

Although it's not as easy in C++ as it should be, it's still easier than writing my own sort function, especially when I want my program to support sorting according to the rules of _many_ different languages.

Another problem is that the D documentation doesn't mention that D sorts tables only in binary order. There should also be a hint on how to implement your own sorters for tables, because right now the language does not behave as expected for strings.

Regards
Marcin Kuszczak
March 11, 2006
Aarti wrote:
> That would be really helpful! Does it already work? In particular, I don't understand how I can change the standard behaviour of the table sort property.

Well, I've written an implementation that works, but it's not yet ready to be unleashed on the public.

It might be possible to override _adSort ... I haven't tried it yet. Currently it's just a free function, which can be called as if it were an array property.
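
For readers unfamiliar with the "free function called as if it were an array property" idiom, here is a minimal sketch, assuming a present-day D compiler with Phobos. The function `sortWith` is a made-up name and is not the Mango implementation mentioned above; it only illustrates the call syntax and how a comparison delegate could be passed in.

```d
import std.algorithm.sorting : sort;
import std.stdio : writeln;

// Hypothetical free function; not part of Phobos or Mango.
// Sorts table in place using the supplied comparison delegate.
string[] sortWith(string[] table, bool delegate(string, string) less)
{
    sort!((a, b) => less(a, b))(table);
    return table;
}

void main()
{
    string[] table = ["b", "c", "a"];

    // The array is passed as the first argument, so the call reads
    // as if sortWith were a member of the array.
    table.sortWith((a, b) => a > b);

    writeln(table); // ["c", "b", "a"]
}
```

A language-aware comparison could be supplied in place of the reverse-order delegate used here.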
