Sorting with non-ASCII characters
September 19, 2013
Short question in case anyone knows the answer straight away:

How do I sort text so that non-ASCII characters like "á" are treated in the same way as "a"?

Now I'm getting this:

[wow, ara, ába, marca]

===> sort(listAbove);

[ara, marca, wow, ába]

I'd like to get:

[ába, ara, marca, wow]

Thanks.



September 19, 2013
On Thursday, 19 September 2013 at 15:18:11 UTC, Chris wrote:
> Short question in case anyone knows the answer straight away:
>
> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?
>
> Now I'm getting this:
>
> [wow, ara, ába, marca]
>
> ===> sort(listAbove);
>
> [ara, marca, wow, ába]
>
> I'd like to get:
>
> [ ába, ara, marca, wow]
>
> Thanks.

Short answer: we currently can't, because we haven't implemented the Unicode Collation Algorithm yet:
http://d.puremagic.com/issues/show_bug.cgi?id=10566

Unfortunately, I don't know of any workarounds for you :/
September 19, 2013
On Thursday, 19 September 2013 at 15:34:28 UTC, monarch_dodra wrote:
> Short answer, we currently can't, because we haven't implemented the "Unicode Collation Algorithm"
> http://d.puremagic.com/issues/show_bug.cgi?id=10566
>
> Unfortunately, I don't know of any workarounds for you :/

Good that I asked! Imagine the time I would have wasted.

September 19, 2013
Chris:

> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?

The correct solution is a well-implemented, well-maintained and well-debugged Unicode Collation Algorithm, as already answered.
But if an approximation is enough, then you could translate this Python code, which converts a Unicode string to ASCII, to D and use it through a schwartzSort:

http://newsbruiser.tigris.org/source/browse/~checkout~/newsbruiser/nb/lib/AsciiDammit.py

(If you translate that function to D, you could also add it to Dub later.)

Bye,
bearophile
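
For illustration, here is a minimal sketch of the "convert to an approximate ASCII key, then schwartzSort" idea. It is not a translation of AsciiDammit.py: instead of a hand-written table it assumes that decomposing each character with std.uni's normalize!NFD and dropping the combining marks (isMark) is a good enough approximation. The helper name foldKey is made up for this example.

```d
import std.algorithm : filter, schwartzSort;
import std.array : array;
import std.uni;

// Hypothetical helper (not from the thread): builds an approximate sort key.
// The string is lowercased, decomposed to NFD so that "á" becomes
// "a" + U+0301, and the combining marks are then filtered out.
dchar[] foldKey(string s)
{
    return s.toLower.normalize!NFD.filter!(c => !isMark(c)).array;
}

void main()
{
    auto words = ["wow", "ara", "ába", "marca"];
    // schwartzSort computes foldKey once per element and sorts by that key.
    words.schwartzSort!foldKey;
    assert(words == ["ába", "ara", "marca", "wow"]);
}
```

This only handles letters that decompose into a base letter plus marks; characters such as "ø" or "ß" have no such decomposition and would still need a table.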
September 19, 2013
On Thursday, 19 September 2013 at 15:42:52 UTC, bearophile wrote:
> Chris:
>
>> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?
>
> The correct solution is a well implemented, well updated and well debugged Unicode Collation Algorithm, as already answered.
> But if an approximation is enough then you could translate to D this Python code that converts a Unicode string to ASCII, and use it through a  schwartzSort:
>
> http://newsbruiser.tigris.org/source/browse/~checkout~/newsbruiser/nb/lib/AsciiDammit.py
>
> (If you translate that function to D, you could also add it to Dub later.)
>
> Bye,
> bearophile

Ok, thanks. We'll see.
September 19, 2013
On 09/19/2013 08:18 AM, Chris wrote:
> Short question in case anyone knows the answer straight away:
>
> How do I sort text so that non-ascii characters like "á" are treated in
> the same way as "a"?
>
> Now I'm getting this:
>
> [wow, ara, ába, marca]
>
> ===> sort(listAbove);
>
> [ara, marca, wow, ába]
>
> I'd like to get:
>
> [ ába, ara, marca, wow]
>
> Thanks.
>
>
>

I have a project that tries to do exactly that:

  https://code.google.com/p/trileri/source/browse/trunk/tr/dizgi.d#823

However, it is in Turkish and in need of a rewrite. :/

For the whole thing to work, every character must be of a certain alphabet. Here is the English alphabet:

  https://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d#747

Here is how I define e.g. á to be an accented version of a:

  https://code.google.com/p/trileri/source/browse/trunk/tr/harfler.d#23

However, some characters stand individually as they are not accents but proper letters themselves (e.g. ç of the Turkish alphabet):

  https://code.google.com/p/trileri/source/browse/trunk/tr/harfler.d#44

Well... I hope to get back to it at some point, taking advantage of the new std.uni as well.

Ali
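
A rough sketch of the alphabet-based idea, for readers who cannot follow the links: each letter is ranked by its position in an alphabet table, so "ç" sorts as a letter of its own between "c" and "d". This is not the code from the trileri project; the names alphabet and rankKey are hypothetical, and the table assumes the lowercase Turkish alphabet.

```d
import std.algorithm : map, schwartzSort;
import std.array : array;
import std.string : indexOf;

// Hypothetical table (an assumption for this sketch): the lowercase Turkish
// alphabet, where the position of a letter is its sort rank.
immutable dstring alphabet = "abcçdefgğhıijklmnoöprsştuüvyz"d;

// Hypothetical helper: maps every character to its rank in the alphabet.
// Characters that are not in the table sort after all known letters.
size_t[] rankKey(string s)
{
    return s.map!((dchar c) {
        auto i = alphabet.indexOf(c);
        return i < 0 ? alphabet.length : cast(size_t) i;
    }).array;
}

void main()
{
    auto words = ["çay", "dal", "cam"];
    // Sort by the rank arrays, which compare element by element.
    words.schwartzSort!rankKey;
    assert(words == ["cam", "çay", "dal"]);
}
```

An accented variant such as "á" could be given the same rank as "a" by mapping it to its base letter before the lookup, which is the distinction between accents and proper letters described above.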

September 19, 2013
On 19-9-2013 17:18, Chris wrote:
> Short question in case anyone knows the answer straight away:
>
> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?
>
> Now I'm getting this:
>
> [wow, ara, ába, marca]
>
> ===> sort(listAbove);
>
> [ara, marca, wow, ába]
>
> I'd like to get:
>
> [ ába, ara, marca, wow]

If you only need to process extended ASCII (accented Latin letters), then you could perhaps
make do with a transliterated sort, something like:

import std.stdio, std.string, std.algorithm, std.uni;

void main() {
    auto sa = ["wow", "ara", "ába", "Marca"];
    writeln(sa);
    trSort(sa);
    writeln(sa);
}

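// Sorts arr by a key in which accented letters are replaced by their
// unaccented counterparts; toLower makes the comparison case insensitive.
void trSort(C, alias less = "a < b")(C[] arr) {
    // c1[i] is transliterated to c2[i].
    static dstring c1 = "àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ";
    static dstring c2 = "aaaaaaceeeeiiiinoooooouuuuyy";
    schwartzSort!(a => tr(toLower(a), c1, c2), less)(arr);
}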

September 19, 2013
On Thursday, 19 September 2013 at 18:44:54 UTC, Jos van Uden wrote:
> If you only need to process extended ascii, then you could perhaps
> make do with a transliterated sort, something like:
> [...]

Ok, thanks, will try that. I'll let you know if it worked.
September 24, 2013
On Thursday, 19 September 2013 at 18:44:54 UTC, Jos van Uden wrote:
> If you only need to process extended ascii, then you could perhaps
> make do with a transliterated sort, something like:
> [...]

Thanks a million, Jos! This does the trick for me.
September 24, 2013
On 24-9-2013 11:26, Chris wrote:
> Thanks a million, Jos! This does the trick for me.

Great.

Be aware that the above code does a case-insensitive sort. If you need
a case-sensitive sort, you can use something like:


import std.stdio, std.string, std.algorithm, std.uni;

void main() {
    auto sa = ["wow", "ara", "ába", "Marca"];
    writeln(sa);
    trSort(sa, CaseSensitive.no);
    writeln(sa);
    writeln();
    sa = ["wow", "ara", "ába", "Marca"];
    writeln(sa);
    trSort(sa, CaseSensitive.yes);
    writeln(sa);
}

// Sorts arr with accents stripped from the sort key; cs selects whether
// the key is lowercased first (CaseSensitive.no) or compared as is.
void trSort(C, alias less = "a < b")(C[] arr,
                            CaseSensitive cs = CaseSensitive.yes) {
    // c1[i] is transliterated to c2[i].
    static c1 = "àáâãäåçèéêëìíîïñòóôõöøùúûüýÿÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝŸ"d;
    static c2 = "aaaaaaceeeeiiiinoooooouuuuyyAAAAAACEEEEIIIINOOOOOOUUUUYY"d;
    if (cs == CaseSensitive.no)
        arr.schwartzSort!(a => a.toLower.tr(c1, c2), less);
    else
        arr.schwartzSort!(a => a.tr(c1, c2), less);
}