December 06
Hi,

I'm having some trouble sorting the individual characters in a string. Searching around, I found this thread (http://forum.dlang.org/post/mailman.612.1331659665.4860.digitalmars-d-learn@puremagic.com) about a similar issue, but it feels quite old so I wanted to check if there is a clear cut way of sorting the characters of a string nowadays?

I was expecting to be able to do something like this:

string word = "longword";
writeln(sort(word));

But that doesn't work because I guess a string is not the type of range required for sort?
I tried converting it to a char[] array (to!(char[])(word)) but that doesn't seem to work either. I get:

Error: template std.algorithm.sorting.sort cannot deduce function from argument types !()(char[])

Best regards,
Fredrik
December 06
On Wednesday, December 06, 2017 08:59:09 Fredrik Boulund via Digitalmars-d-learn wrote:
> Hi,
>
> I'm having some trouble sorting the individual characters in a
> string. Searching around, I found this thread
> (http://forum.dlang.org/post/mailman.612.1331659665.4860.digitalmars-d-learn@puremagic.com) about a similar issue, but it feels quite old so I
> wanted to check if there is a clear cut way of sorting the characters of
> a string nowadays?
>
> I was expecting to be able to do something like this:
>
> string word = "longword";
> writeln(sort(word));
>
> But that doesn't work because I guess a string is not the type of
> range required for sort?
> I tried converting it to a char[] array (to!(char[])(word)) but
> that doesn't seem work either. I get:
>
> Error: template std.algorithm.sorting.sort cannot deduce function
> from argument types !()(char[])

Okay. "Narrow strings" are not considered random access and aren't considered to have length by the range API. This is because a code unit in UTF-8 or UTF-16 is not guaranteed to be a full code point (1 - 4 code units per code point for UTF-8 and 1 - 2 for UTF-16), and indexing or slicing them risks cutting into the middle of a code point. UTF-32, on the other hand, guarantees that a code unit is a full code point. So, arrays of char and wchar are not considered random access or to have length by the range API, but arrays of dchar are. As part of that, front and back return a dchar for all string types, so front, popFront, back, and popBack decode code units to code points and never cut up code points. The idea was to make it so that Unicode correctness would be guaranteed by default.
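For illustration, the range traits make this special-casing visible (a minimal sketch using std.range.primitives):

```d
import std.range.primitives : ElementType, hasLength, isRandomAccessRange;

void main()
{
    // Narrow strings: the range API hides random access and length...
    static assert(!isRandomAccessRange!string);
    static assert(!hasLength!string);
    // ...and iteration decodes to code points (dchar), not code units (char).
    static assert(is(ElementType!string == dchar));
    // UTF-32 arrays, by contrast, are full random-access ranges.
    static assert(isRandomAccessRange!(dchar[]));
}
```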

While this design was well-intended, it doesn't really work, and it was a huge mistake. While a code point is something that can actually be displayed on the screen, it's not guaranteed to be a full character (e.g. the letter a with a subscript of 2 would be multiple code points). You need a grapheme cluster to get a full character, and that potentially means multiple code points. So, having ranges of characters operate at the code point level does not guarantee Unicode correctness. You could still cut up a character (it would just be at the code point level rather than the code unit level). For full-on Unicode correctness by default, everything would have to operate at the grapheme level, which would be horribly inefficient (especially for ASCII).

In reality, whether an algorithm needs to operate at the code unit, code point, or grapheme level depends on what it's doing (e.g. so long as the Unicode is normalized properly, find can operate at the code unit level, but if you want to actually sort a range of full Unicode characters properly, you'd need to operate at the grapheme level). So, what the range API _should_ do is just treat strings as ranges of their element type and let the programmer wrap them as necessary to deal with code points or graphemes when appropriate. Ultimately, the programmer just plain needs to understand Unicode well enough to know what they're doing, and a solution that reasonably guarantees Unicode correctness really doesn't exist (not if you care about efficiency, anyway). Unfortunately, changing the range API at this point would break a lot of code, and no one has figured out how to do it without breaking code. So, it currently seems likely that we'll forever be stuck with this design mistake.

So, the best way to work around it depends on what you're doing. std.utf.byCodeUnit creates a wrapper range that turns all strings into ranges of their actual element types rather than dchar. So, that's one solution. Another is std.string.representation, which casts a string to the corresponding integral type (e.g. typeof("foo".representation) is immutable(ubyte)[]). And std.uni has stuff for better handling code points or graphemes if need be.
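For example, either wrapper gives sort mutable code units to work with (a sketch; both variants assume the text is ASCII-only):

```d
import std.algorithm.sorting : sort;
import std.stdio : writeln;
import std.string : representation;
import std.utf : byCodeUnit;

void main()
{
    char[] a = "longword".dup;  // must be mutable to sort in place
    sort(a.representation);     // sorts the underlying ubyte[] view
    writeln(a);                 // dglnoorw

    char[] b = "longword".dup;
    sort(b.byCodeUnit);         // same idea via the code-unit wrapper range
    writeln(b);                 // dglnoorw
}
```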

If you have a string, and you _know_ that it's only ASCII, then either use representation or byCodeUnit to wrap it for the call to sort, but it _will_ have to be mutable, so string won't actually work. e.g.

char[] str = "hello world".dup;
sort(str.representation);
// str is now sorted

However, if the string could actually contain Unicode, then you'll have to use std.uni's grapheme support to wrap the string in a range that operates at the grapheme level; otherwise you could rip pieces of characters apart.
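For example, std.uni.byGrapheme shows why operating below the grapheme level can split a character (using a combining-accent sequence for illustration):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "e" followed by a combining acute accent: two code points, one character.
    string s = "e\u0301";
    assert(s.walkLength == 2);            // counts code points
    assert(s.byGrapheme.walkLength == 1); // counts full characters
}
```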

- Jonathan M Davis

December 06
On Wednesday, 6 December 2017 at 08:59:09 UTC, Fredrik Boulund wrote:
> string word = "longword";
> writeln(sort(word));
>
> But that doesn't work because I guess a string is not the type of range required for sort?

Yeah, narrow (non-UTF-32) strings are not random-access, since characters like 💩 take up more than one code unit, and so "💩"[0] returns an invalid piece of a character instead of a full character.

In addition, sort does in-place sorting, so the input range is changed. Since D strings are immutable(char)[], changing the elements is disallowed. So in total, you'll need to convert from a string (immutable(char)[]) to a dchar[]. std.conv.to to the rescue:

    import std.stdio : writeln;
    import std.conv : to;
    import std.algorithm.sorting : sort;

    string word = "longword";
    writeln(sort(word.to!(dchar[]))); // dglnoorw
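Worth noting: sort both mutates its input and returns a SortedRange over it, which is why writeln can print the result directly (a small sketch):

```d
import std.algorithm.comparison : equal;
import std.algorithm.sorting : sort;
import std.conv : to;

void main()
{
    dchar[] letters = "longword".to!(dchar[]);
    auto sorted = sort(letters);       // sorts in place AND returns a SortedRange
    assert(letters == "dglnoorw"d);    // the original array was mutated
    assert(sorted.equal("dglnoorw"d)); // the return value wraps the same array
}
```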

--
  Biotronic
December 06
On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis wrote:
> UTF-32 on the other hand is guaranteed to have a code unit be a full code point.

I don't think the standard says that? Isn't this only because the current set is small enough to fit? So this may change as Unicode grows?


December 06
On Wednesday, December 06, 2017 09:34:48 Ola Fosheim Grøstad via Digitalmars-d-learn wrote:
> On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis
>
> wrote:
> > UTF-32 on the other hand is guaranteed to have a code unit be a full code point.
>
> I don't think the standard says that? Isn't this only because the current set is small enough to fit? So this may change as Unicode grows?

It's most definitely the case right now, and given how Unicode decoding works, I don't see how it could ever be the case that a UTF-32 code unit would not be a code point - not without breaking all of the Unicode handling in existence. And per wikipedia's short article on code points

----------------
The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 (= 2^16) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.
----------------

And uint.max is 4,294,967,295, leaving about 3855x space to grow into even if they kept adding more code point values by adding more planes or however that works.

I'd have to go digging through the actual standard to know for sure what it actually guarantees though.

- Jonathan M Davis


December 06
On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis wrote:

> If you have a string, and you _know_ that it's only ASCII, then either use representation or byCodeUnit to wrap it for the call to sort, but it _will_ have to be mutable, so string won't actually work. e.g.
>
> char[] str = "hello world".dup;
> sort(str.representation);
> // str is now sorted
>
> However, if the string could actually contain Unicode, then you'll have to use std.uni's grapheme support to wrap the string in a range that operates at the grapheme level; otherwise you could rip pieces of characters apart.

Thanks so much for such an elaborate reply! I was really close with str.representation in my code, and the little code example you gave helped me fix my problem! Thanks a lot
December 06
On Wednesday, 6 December 2017 at 09:25:20 UTC, Biotronic wrote:
> In addition, sort does in-place sorting, so the input range is changed. Since D strings are immutable(char)[], changing the elements is disallowed. So in total, you'll need to convert from a string (immutable(char)[]) to a dchar[]. std.conv.to to the rescue:
>
>     import std.stdio : writeln;
>     import std.conv : to;
>     import std.algorithm.sorting : sort;
>
>     string word = "longword";
>     writeln(sort(word.to!(dchar[]))); // dglnoorw


Also very useful information! Thanks. I was just realizing that sort was in-place as I finished writing my first post. It got me really confused as I expected it to return a sorted array (but I do realize that is a strange assumption to make).
December 06
On Wednesday, 6 December 2017 at 09:25:20 UTC, Biotronic wrote:
> On Wednesday, 6 December 2017 at 08:59:09 UTC, Fredrik Boulund wrote:
>> string word = "longword";
>> writeln(sort(word));
>>
>> But that doesn't work because I guess a string is not the type of range required for sort?
>
> Yeah, narrow (non-UTF-32) strings are not random-access, since characters like 💩 take up more than one code unit, and so "💩"[0] returns an invalid piece of a character instead of a  full character.
>
> In addition, sort does in-place sorting, so the input range is changed. Since D strings are immutable(char)[], changing the elements is disallowed. So in total, you'll need to convert from a string (immutable(char)[]) to a dchar[]. std.conv.to to the rescue:
>
>     import std.stdio : writeln;
>     import std.conv : to;
>     import std.algorithm.sorting : sort;
>
>     string word = "longword";
>     writeln(sort(word.to!(dchar[]))); // dglnoorw
>
> --
>   Biotronic

Or you simply do
----
writeln("longword".array.sort);
----
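(For anyone trying this at home: the one-liner relies on std.array and std.algorithm being in scope. A self-contained version might look like:)

```d
import std.algorithm.sorting : sort;
import std.array : array;
import std.stdio : writeln;

void main()
{
    // .array copies the decoded code points into a mutable dchar[],
    // which sort can then reorder in place.
    writeln("longword".array.sort); // dglnoorw
}
```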
December 06
On Wednesday, 6 December 2017 at 10:42:31 UTC, Dgame wrote:

>
> Or you simply do
> ----
> writeln("longword".array.sort);
> ----

This is so strange. I was dead sure I tried that but it failed for some reason. But after trying it just now it also seems to work just fine. Thanks! :)
December 06
On 12/6/17 4:34 AM, Ola Fosheim Grøstad wrote:
> On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis wrote:
>> UTF-32 on the other hand is guaranteed to have a code unit be a full code point.
> 
> I don't think the standard says that? Isn't this only because the current set is small enough to fit? So this may change as Unicode grows?
> 
> 

The current Unicode encoding scheme allows for about 2 million (2^21) different code points. I'd say we'll all be dead and so will our great great great grandchildren by the time Unicode amasses more than 2 billion codepoints :)

Also, UTF8 has been standardized to only have up to 4 code units per code point. The encoding scheme allows more, but the standard restricts it.
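For example, std.utf.codeLength reports how many code units a given code point needs in each encoding (a quick sketch):

```d
import std.utf : codeLength;

void main()
{
    assert(codeLength!char('a') == 1);           // ASCII: one UTF-8 code unit
    assert(codeLength!char('\u00E9') == 2);      // é takes two
    assert(codeLength!char('\U0001F4A9') == 4);  // 💩 hits the 4-unit maximum
    assert(codeLength!wchar('\U0001F4A9') == 2); // a surrogate pair in UTF-16
}
```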

-Steve