September 19, 2013 Sorting with non-ASCII characters

Short question in case anyone knows the answer straight away:

How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?

Now I'm getting this:

[wow, ara, ába, marca]

===> sort(listAbove);

[ara, marca, wow, ába]

I'd like to get:

[ ába, ara, marca, wow]

Thanks.

September 19, 2013 Re: Sorting with non-ASCII characters

Posted in reply to Chris

On Thursday, 19 September 2013 at 15:18:11 UTC, Chris wrote:
> Short question in case anyone knows the answer straight away:
>
> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?
>
> Now I'm getting this:
>
> [wow, ara, ába, marca]
>
> ===> sort(listAbove);
>
> [ara, marca, wow, ába]
>
> I'd like to get:
>
> [ ába, ara, marca, wow]
>
> Thanks.

Short answer, we currently can't, because we haven't implemented the "Unicode Collation Algorithm"
http://d.puremagic.com/issues/show_bug.cgi?id=10566

Unfortunately, I don't know of any workarounds for you :/

September 19, 2013 Re: Sorting with non-ASCII characters

Posted in reply to monarch_dodra

On Thursday, 19 September 2013 at 15:34:28 UTC, monarch_dodra wrote:
> On Thursday, 19 September 2013 at 15:18:11 UTC, Chris wrote:
>> Short question in case anyone knows the answer straight away:
>>
>> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?
>>
>> Now I'm getting this:
>>
>> [wow, ara, ába, marca]
>>
>> ===> sort(listAbove);
>>
>> [ara, marca, wow, ába]
>>
>> I'd like to get:
>>
>> [ ába, ara, marca, wow]
>>
>> Thanks.
>
> Short answer, we currently can't, because we haven't implemented the "Unicode Collation Algorithm"
> http://d.puremagic.com/issues/show_bug.cgi?id=10566
>
> Unfortunately, I don't know of any workarounds for you :/
Good that I asked! Imagine the time I would have wasted.

September 19, 2013 Re: Sorting with non-ASCII characters

Posted in reply to Chris

Chris:

> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?

The correct solution is a well implemented, well updated and well debugged Unicode Collation Algorithm, as already answered.
But if an approximation is enough then you could translate to D this Python code that converts a Unicode string to ASCII, and use it through a schwartzSort:

http://newsbruiser.tigris.org/source/browse/~checkout~/newsbruiser/nb/lib/AsciiDammit.py

(If you translate that function to D, you could also add it to Dub later.)

Bye,
bearophile
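
The "convert to an approximate key, then schwartzSort" idea can be sketched directly with std.uni, without porting the Python script. This is only a rough approximation and not the Unicode Collation Algorithm: it assumes that decomposing each string to NFD and dropping the combining marks is good enough for the data at hand. The helper name `foldKey` is made up for this illustration. Letters without a canonical decomposition (for example "ø") pass through unchanged.

```d
import std.algorithm : filter, map, schwartzSort;
import std.array : array;
import std.stdio : writeln;
import std.uni : combiningClass, normalize, NFD, toLower;

// Hypothetical helper: builds an approximate, accent-insensitive,
// case-insensitive sort key for one string.
dchar[] foldKey(string s)
{
    return s.normalize!NFD                        // "á" becomes "a" followed by U+0301
            .filter!(c => combiningClass(c) == 0) // drop the combining marks
            .map!(c => toLower(c))                // fold case, one code point at a time
            .array;
}

void main()
{
    auto words = ["wow", "ara", "ába", "marca"];
    // schwartzSort computes each key once and sorts the strings by key.
    words.schwartzSort!foldKey;
    writeln(words); // ["ába", "ara", "marca", "wow"]
}
```

Strings that fold to the same key (say "ara" and "ára") compare equal here, so their relative order is not defined by this sketch.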

September 19, 2013 Re: Sorting with non-ASCII characters

Posted in reply to bearophile

On Thursday, 19 September 2013 at 15:42:52 UTC, bearophile wrote:
> Chris:
>
>> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?
>
> The correct solution is a well implemented, well updated and well debugged Unicode Collation Algorithm, as already answered.
> But if an approximation is enough then you could translate to D this Python code that converts a Unicode string to ASCII, and use it through a schwartzSort:
>
> http://newsbruiser.tigris.org/source/browse/~checkout~/newsbruiser/nb/lib/AsciiDammit.py
>
> (If you translate that function to D, you could also add it to Dub later.)
>
> Bye,
> bearophile
Ok, thanks. We'll see.

September 19, 2013 Re: Sorting with non-ASCII characters

Posted in reply to Chris

On 09/19/2013 08:18 AM, Chris wrote:
> Short question in case anyone knows the answer straight away:
>
> How do I sort text so that non-ascii characters like "á" are treated in
> the same way as "a"?
>
> Now I'm getting this:
>
> [wow, ara, ába, marca]
>
> ===> sort(listAbove);
>
> [ara, marca, wow, ába]
>
> I'd like to get:
>
> [ ába, ara, marca, wow]
>
> Thanks.

I have a project that tries to do exactly that:

https://code.google.com/p/trileri/source/browse/trunk/tr/dizgi.d#823

However, it is in Turkish and in need of a rewrite. :/

For the whole thing to work, every character must be of a certain alphabet. Here is the English alphabet:

https://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d#747

Here is how I define e.g. á to be an accented version of a:

https://code.google.com/p/trileri/source/browse/trunk/tr/harfler.d#23

However, some characters stand individually as they are not accents but proper letters themselves (e.g. ç of the Turkish alphabet):

https://code.google.com/p/trileri/source/browse/trunk/tr/harfler.d#44

Well... I hope to get back to it at some point, taking advantage of the new std.uni as well.

Ali
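
The alphabet-based idea described above can be illustrated with a small sketch. This is not the code from the linked project. The names `alphabet`, `baseLetter`, `letterRank` and `alphabetKey` are invented for the example, and the sketch assumes lowercase input and a tiny hand-written accent table. The point it shows is the distinction made in the post: "â" is treated as an accented "a", while "ç" is a letter with its own position in the alphabet.

```d
import std.algorithm : countUntil, map, schwartzSort;
import std.array : array;
import std.stdio : writeln;

// Assumed alphabet: the Turkish lowercase letters in dictionary order.
immutable dstring alphabet = "abcçdefgğhıijklmnoöprsştuüvyz"d;

// Assumed accent table: maps an accented variant to its base letter.
// Proper letters such as 'ç' are deliberately not listed here.
dchar baseLetter(dchar c)
{
    switch (c)
    {
        case 'â': return 'a';
        case 'î': return 'i';
        case 'û': return 'u';
        default:  return c;
    }
}

// Position of a letter in the alphabet; unknown characters sort last.
size_t letterRank(dchar c)
{
    immutable i = alphabet.countUntil(baseLetter(c));
    return i < 0 ? alphabet.length : cast(size_t) i;
}

// Sort key: the sequence of alphabet positions of the word's letters.
size_t[] alphabetKey(string word)
{
    return word.map!letterRank.array;
}

void main()
{
    auto words = ["dal", "çam", "kas", "cam", "kâr"];
    words.schwartzSort!alphabetKey;
    writeln(words); // ["cam", "çam", "dal", "kâr", "kas"]
}
```

A plain code-point sort would place "çam" after "kas"; ranking by alphabet position puts it between "cam" and "dal".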

September 19, 2013 Re: Sorting with non-ASCII characters

Posted in reply to Chris

On 19-9-2013 17:18, Chris wrote:
> Short question in case anyone knows the answer straight away:
>
> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?
>
> Now I'm getting this:
>
> [wow, ara, ába, marca]
>
> ===> sort(listAbove);
>
> [ara, marca, wow, ába]
>
> I'd like to get:
>
> [ ába, ara, marca, wow]

If you only need to process extended ascii, then you could perhaps
make do with a transliterated sort, something like:

import std.stdio, std.string, std.algorithm, std.uni;

void main() {
    auto sa = ["wow", "ara", "ába", "Marca"];
    writeln(sa);
    trSort(sa);
    writeln(sa);
}

// Sorts arr in place, comparing lowercased, accent-stripped copies of the strings.
void trSort(C, alias less = "a < b")(C[] arr) {
    // Each character of c1 is replaced by the character at the same position in c2.
    static dstring c1 = "àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ";
    static dstring c2 = "aaaaaaceeeeiiiinoooooouuuuyy";
    schwartzSort!(a => tr(toLower(a), c1, c2), less)(arr);
}

September 19, 2013 Re: Sorting with non-ASCII characters

Posted in reply to Jos van Uden

On Thursday, 19 September 2013 at 18:44:54 UTC, Jos van Uden wrote:
> On 19-9-2013 17:18, Chris wrote:
>> Short question in case anyone knows the answer straight away:
>>
>> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?
>>
>> Now I'm getting this:
>>
>> [wow, ara, ába, marca]
>>
>> ===> sort(listAbove);
>>
>> [ara, marca, wow, ába]
>>
>> I'd like to get:
>>
>> [ ába, ara, marca, wow]
>
> If you only need to process extended ascii, then you could perhaps
> make do with a transliterated sort, something like:
>
> import std.stdio, std.string, std.algorithm, std.uni;
>
> void main() {
> auto sa = ["wow", "ara", "ába", "Marca"];
> writeln(sa);
> trSort(sa);
> writeln(sa);
> }
>
> void trSort(C, alias less = "a < b")(C[] arr) {
> static dstring c1 = "àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ";
> static dstring c2 = "aaaaaaceeeeiiiinoooooouuuuyy";
> schwartzSort!(a => tr(toLower(a), c1, c2), less)(arr);
> }
Ok, thanks, will try that. I'll let you know if it worked.

September 24, 2013 Re: Sorting with non-ASCII characters

Posted in reply to Jos van Uden

On Thursday, 19 September 2013 at 18:44:54 UTC, Jos van Uden wrote:
> On 19-9-2013 17:18, Chris wrote:
>> Short question in case anyone knows the answer straight away:
>>
>> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?
>>
>> Now I'm getting this:
>>
>> [wow, ara, ába, marca]
>>
>> ===> sort(listAbove);
>>
>> [ara, marca, wow, ába]
>>
>> I'd like to get:
>>
>> [ ába, ara, marca, wow]
>
> If you only need to process extended ascii, then you could perhaps
> make do with a transliterated sort, something like:
>
> import std.stdio, std.string, std.algorithm, std.uni;
>
> void main() {
> auto sa = ["wow", "ara", "ába", "Marca"];
> writeln(sa);
> trSort(sa);
> writeln(sa);
> }
>
> void trSort(C, alias less = "a < b")(C[] arr) {
> static dstring c1 = "àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ";
> static dstring c2 = "aaaaaaceeeeiiiinoooooouuuuyy";
> schwartzSort!(a => tr(toLower(a), c1, c2), less)(arr);
> }
Thanks a million, Jos! This does the trick for me.

September 24, 2013 Re: Sorting with non-ASCII characters

Posted in reply to Chris

On 24-9-2013 11:26, Chris wrote:
> On Thursday, 19 September 2013 at 18:44:54 UTC, Jos van Uden wrote:
>> On 19-9-2013 17:18, Chris wrote:
>>> Short question in case anyone knows the answer straight away:
>>>
>>> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?
>>>
>>> Now I'm getting this:
>>>
>>> [wow, ara, ába, marca]
>>>
>>> ===> sort(listAbove);
>>>
>>> [ara, marca, wow, ába]
>>>
>>> I'd like to get:
>>>
>>> [ ába, ara, marca, wow]
>>
>> If you only need to process extended ascii, then you could perhaps
>> make do with a transliterated sort, something like:
>>
>> import std.stdio, std.string, std.algorithm, std.uni;
>>
>> void main() {
>> auto sa = ["wow", "ara", "ába", "Marca"];
>> writeln(sa);
>> trSort(sa);
>> writeln(sa);
>> }
>>
>> void trSort(C, alias less = "a < b")(C[] arr) {
>> static dstring c1 = "àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ";
>> static dstring c2 = "aaaaaaceeeeiiiinoooooouuuuyy";
>> schwartzSort!(a => tr(toLower(a), c1, c2), less)(arr);
>> }
>
> Thanks a million, Jos! This does the trick for me.

Great. Be aware that the above code does a case-insensitive sort. If you need a case-sensitive one, you can use something like:

import std.stdio, std.string, std.algorithm, std.uni;

void main() {
    auto sa = ["wow", "ara", "ába", "Marca"];
    writeln(sa);
    trSort(sa, CaseSensitive.no);
    writeln(sa);

    writeln;
    sa = ["wow", "ara", "ába", "Marca"];
    writeln(sa);
    trSort(sa, CaseSensitive.yes);
    writeln(sa);
}

void trSort(C, alias less = "a < b")(C[] arr, CaseSensitive cs = CaseSensitive.yes) {
    // Each character of c1 is replaced by the character at the same position in c2.
    static c1 = "àáâãäåçèéêëìíîïñòóôõöøùúûüýÿÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝŸ"d;
    static c2 = "aaaaaaceeeeiiiinoooooouuuuyyAAAAAACEEEEIIIINOOOOOOUUUUYY"d;
    if (cs == CaseSensitive.no)
        arr.schwartzSort!(a => a.toLower.tr(c1, c2), less); // fold case, then strip accents
    else
        arr.schwartzSort!(a => a.tr(c1, c2), less); // strip accents only, keep case
}