Sort characters in string (page 2)

On 12/06/2017 04:43 AM, Fredrik Boulund wrote:
> On Wednesday, 6 December 2017 at 10:42:31 UTC, Dgame wrote:
>
>>
>> Or you simply do
>> ----
>> writeln("longword".array.sort);
>> ----
>
> This is so strange. I was dead sure I tried that but it failed for some
> reason. But after trying it just now it also seems to work just fine.
> Thanks! :)

As a general comment, sorting a string does not make sense in general when Unicode is involved. For example, there may be combining diacriticals:

    // Three characters: e, a, and combining acute accent U+0301
    writeln("eá".array.sort);

prints

    aé

So, the accent moves from a to e, which probably is not the intention.

Ali

December 06, 2017

Re: Sort characters in string

Posted by H. S. Teoh
in reply to Ali Çehreli

On Wed, Dec 06, 2017 at 10:32:03AM -0800, Ali Çehreli via Digitalmars-d-learn wrote:
> On 12/06/2017 04:43 AM, Fredrik Boulund wrote:
> > On Wednesday, 6 December 2017 at 10:42:31 UTC, Dgame wrote:
> >
> >>
> >> Or you simply do
> >> ----
> >> writeln("longword".array.sort);
> >> ----
> >
> > This is so strange. I was dead sure I tried that but it failed for some reason. But after trying it just now it also seems to work just fine.  Thanks! :)
> 
> As a general comment, sorting a string does not make sense in general when Unicode is involved.
[...]

Yeah... in general, you need to decide exactly what kind of sorting you intend.  If you intend to sort individual graphemes (i.e., what we normally think of as "characters"), you need to segment the string into graphemes with .byGrapheme and then sort it as an array/range of graphemes.  Sorting Unicode code points is probably not what you want, and sorting code units is probably never what you want (unless you're doing byte frequency analysis on UTF-8 or something :-P).

Unicode is a tricky beast.

T

-- 
What do you get if you drop a piano down a mineshaft? A flat minor.
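
To make the distinction in the post above concrete, here is a minimal sketch that sorts the same string once by code point and once by grapheme. The helper names `sortByCodePoint` and `sortByGrapheme` are made up for this illustration and are not part of Phobos; `byGrapheme` and slicing a `Grapheme` come from `std.uni`. Ordering clusters by their code points is an assumption made for simplicity and is not locale-aware collation.

```d
import std.algorithm.iteration : map;
import std.algorithm.sorting : sort;
import std.array : array, join;
import std.conv : to;
import std.uni : byGrapheme;

// Hypothetical helper: sorts the individual code points of s.
string sortByCodePoint(string s)
{
    auto codePoints = s.array;   // decodes to dchar[], one element per code point
    codePoints.sort();
    return codePoints.to!string;
}

// Hypothetical helper: sorts whole grapheme clusters, so each base
// character stays together with its combining marks.
string sortByGrapheme(string s)
{
    // Copy every cluster into its own dchar[] so that clusters can be
    // compared and reordered as units.
    auto clusters = s.byGrapheme.map!(g => g[].array).array;
    clusters.sort();             // lexicographic comparison by code point
    return clusters.join.to!string;
}

void main()
{
    // e, a, combining acute accent U+0301 (the example from Ali's post)
    string s = "ea\u0301";
    assert(sortByCodePoint(s) == "ae\u0301"); // accent ends up on the e
    assert(sortByGrapheme(s) == "a\u0301e");  // accent stays on the a
}
```

Copying each cluster into a `dchar[]` trades some allocation for simplicity; for real text, normalization and a proper collation order would probably also matter.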

On Wednesday, 6 December 2017 at 12:43:09 UTC, Fredrik Boulund wrote:
> On Wednesday, 6 December 2017 at 10:42:31 UTC, Dgame wrote:
>
>>
>> Or you simply do
>> ----
>> writeln("longword".array.sort);
>> ----
>
> This is so strange. I was dead sure I tried that but it failed for some reason. But after trying it just now it also seems to work just fine. Thanks! :)

If you're like me, you probably forgot an import :)
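
For reference, a complete version of the one-liner with the imports it relies on might look like the sketch below. Which import was actually missing in the original attempt is not stated in the thread, so this only shows the three modules the snippet needs.

```d
import std.algorithm : equal, sort;   // sort (equal is only used for the check)
import std.array : array;             // .array, which turns the string into a dchar[]
import std.stdio : writeln;           // writeln

void main()
{
    // Leaving out std.array or std.algorithm typically results in a
    // compile error because .array or .sort cannot be resolved.
    auto sorted = "longword".array.sort();
    assert(sorted.equal("dglnoorw"));
    writeln(sorted);
}
```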

On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis wrote:
> a full code point (IIRC, 1 - 6 code units for UTF-8 and 1 - 2 for UTF-16),

YDNRC, 1 - 4 code units for UTF-8. Unicode is defined only up to U+10FFFF. Everything above is illegal.
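
The limits mentioned above can be checked with `std.utf`. This is a small sketch; the sample code points are arbitrary picks for each length class, not taken from the thread.

```d
import std.utf : codeLength, isValidDchar;

void main()
{
    // UTF-8: 1 to 4 code units per code point
    assert(codeLength!char('a') == 1);            // U+0061
    assert(codeLength!char('\u00E9') == 2);       // U+00E9
    assert(codeLength!char('\u20AC') == 3);       // U+20AC
    assert(codeLength!char('\U0001F600') == 4);   // U+1F600
    assert(codeLength!char('\U0010FFFF') == 4);   // last code point

    // UTF-16: 1 or 2 code units per code point
    assert(codeLength!wchar('\u20AC') == 1);
    assert(codeLength!wchar('\U0001F600') == 2);

    // Values above U+10FFFF are not valid code points
    assert(isValidDchar('\U0010FFFF'));
    assert(!isValidDchar(cast(dchar) 0x110000));
}
```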

On Wednesday, 6 December 2017 at 09:34:48 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis wrote:
>> UTF-32 on the other hand is guaranteed to have a code unit be a full code point.
>
> I don't think the standard says that? Isn't this only because the current set is small enough to fit? So this may change as Unicode grows?

No. Unicode uses only 21 bits and it is very unlikely to change anytime soon as barely 17 are really used. This means the current range can be grown by more than 16 times what it is now. So definitely, one UTF-32 codeunit is guaranteed to hold any codepoint, forever.

On Wednesday, 6 December 2017 at 15:12:22 UTC, Steven Schveighoffer wrote:
> On 12/6/17 4:34 AM, Ola Fosheim Grøstad wrote:
>> On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis wrote:
>>> UTF-32 on the other hand is guaranteed to have a code unit be a full code point.
>>
>> I don't think the standard says that? Isn't this only because the current set is small enough to fit? So this may change as Unicode grows?
>>
>>
>
> The current unicode encoding has 2 million different code points.

2,097,152 possible codepoints. As of [Unicode 10] only 136,690 codepoints have been assigned.

> I'd say we'll all be dead and so will our great great
> great grandchildren by the time unicode amasses more than 2 billion codepoints :)

So there's enough time even before the current range is even filled.

> Also, UTF8 has been standardized to only have up to 4 code units per code point. The encoding scheme allows more, but the standard restricts it.

[Unicode 10]: http://www.unicode.org/versions/Unicode10.0.0/
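
The figures in this sub-thread can be checked at compile time. Note that 2,097,152 is the number of values that fit in 21 bits; the codespace that ends at U+10FFFF (as mentioned earlier in the thread) is somewhat smaller, 17 planes of 65,536 code points each. The constant names below are made up for this sketch, and the assigned count is the Unicode 10 number quoted above.

```d
void main()
{
    enum ulong twentyOneBits = 1UL << 21;   // values representable in 21 bits
    enum ulong codespace = 0x10FFFF + 1;    // U+0000 through U+10FFFF
    enum ulong assigned = 136_690;          // Unicode 10 figure quoted in the thread

    static assert(twentyOneBits == 2_097_152);
    static assert(codespace == 17 * 65_536);
    static assert(codespace == 1_114_112);

    // Less than one eighth of the codespace was assigned as of Unicode 10.
    static assert(assigned * 8 < codespace);

    // A UTF-32 code unit (dchar) is 32 bits wide, so every code point fits.
    static assert(dchar.sizeof == 4);
    static assert(dchar.max == 0x10FFFF);
    static assert(codespace < (1UL << 32));
}
```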
