Sorting char arrays - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » Learn » Sorting char arrays

Thread overview

Sorting char arrays
Mar 12, 2012 Magnus Lie Hetland
Mar 12, 2012 Dmitry Olshansky
Mar 12, 2012 Magnus Lie Hetland
Mar 12, 2012 Jonathan M Davis
Mar 13, 2012 Magnus Lie Hetland
Mar 13, 2012 Jonathan M Davis
Mar 12, 2012 bearophile
Mar 12, 2012 Magnus Lie Hetland
Mar 12, 2012 Ali Çehreli
Mar 13, 2012 Magnus Lie Hetland

March 12, 2012

Sorting char arrays

Posted by Magnus Lie Hetland

Magnus Lie Hetland

The following fails, which I guess is a bug?

import std.algorithm;
void main() {
	char[] a = ['a', 'b', 'c'];
	sort(a);
}

I thought maybe I'd report it -- sort of surprises me that it hasn't been reported before, but I couldn't find it (although I found some similar reports) in the Bugzilla. (No biggie for me, though; the Phobos sort seems to fail with all kinds of things, so I have my own anyway... ;)

-- 
Magnus Lie Hetland
http://hetland.org

March 12, 2012

Re: Sorting char arrays

Posted by Dmitry Olshansky
in reply to Magnus Lie Hetland

Dmitry Olshansky

Posted in reply to Magnus Lie Hetland

On 12.03.2012 16:51, Magnus Lie Hetland wrote:
> The following fails, which I guess is a bug?
>
> import std.algorithm;
> void main() {
> char[] a = ['a', 'b', 'c'];
> sort(a);
> }
>
> I thought maybe I'd report it -- sort of surprises me that it hasn't
> been reported before, but I couldn't find it (although I found some
> similar reports) in the Bugzilla. (No biggie for me, though; the Phobos
> sort seems to fail with all kinds of things, so I have my own anyway... ;)
>
Mm it should perform sort on UTF-8 buffer? Tricky thing but worths an enhancement request.

If it's ASCII then try:
sort(a.representation)

representation is in std.string IRC.
-- 
Dmitry Olshansky

March 12, 2012

Re: Sorting char arrays

Posted by bearophile
in reply to Magnus Lie Hetland

bearophile

Posted in reply to Magnus Lie Hetland

Magnus Lie Hetland:

> The following fails, which I guess is a bug?
>
> import std.algorithm;
> void main() {
> 	char[] a = ['a', 'b', 'c'];
> 	sort(a);
> }

It's not a bug, char is meant to be a UTF-8. Two workarounds:

import std.algorithm;
void main() {
	dchar[] a1 = ['a', 'b', 'c'];
	sort(a1);
	char[] a2 = ['a', 'b', 'c'];
	sort(cast(ubyte[])a2);
}

Bye,
bearophile

March 12, 2012

Re: Sorting char arrays

Posted by Magnus Lie Hetland
in reply to Dmitry Olshansky

Magnus Lie Hetland

Posted in reply to Dmitry Olshansky

On 2012-03-12 13:09:18 +0000, Dmitry Olshansky said:

> Mm it should perform sort on UTF-8 buffer?

Humm -- dunno ;) The UTF-8-semantics of single characters sort of slipped my mind :)

> Tricky thing but worths an enhancement request.

I'm just thinking an array of anything that can be compared should probably be sort-able. But comparing individual UTF-8-bytes can be weird, indeed. So, yeah. I guess the weirdness follows from the fact that individual characters are considered UTF-8 :)

> If it's ASCII then try:
> sort(a.representation)
> 
> representation is in std.string IRC.

The thing is, I'm using sort() in a template, and I just happen to use char as the template parameter in a unit test. And since I have no special UTF-8 values in there, my own sort() works just fine. (Although maybe it shouldn't? ;)

-- 
Magnus Lie Hetland
http://hetland.org

March 12, 2012

Re: Sorting char arrays

Posted by Magnus Lie Hetland
in reply to bearophile

Magnus Lie Hetland

Posted in reply to bearophile

On 2012-03-12 13:56:15 +0000, bearophile said:

> It's not a bug, char is meant to be a UTF-8.

Right.

> Two workarounds:

Thanks. I'm doing the sorting in a template, so this won't work -- but I guess I just can't use char as a type parameter to my template either, then :)

-- 
Magnus Lie Hetland
http://hetland.org

March 12, 2012

Re: Sorting char arrays

Posted by Jonathan M Davis
in reply to Magnus Lie Hetland

Jonathan M Davis

Posted in reply to Magnus Lie Hetland

On Monday, March 12, 2012 16:04:37 Magnus Lie Hetland wrote:
> On 2012-03-12 13:09:18 +0000, Dmitry Olshansky said:
> > Mm it should perform sort on UTF-8 buffer?
> 
> Humm -- dunno ;) The UTF-8-semantics of single characters sort of
> slipped my mind :)
> 
> > Tricky thing but worths an enhancement request.
> 
> I'm just thinking an array of anything that can be compared should probably be sort-able. But comparing individual UTF-8-bytes can be weird, indeed. So, yeah. I guess the weirdness follows from the fact that individual characters are considered UTF-8 :)
> 
> > If it's ASCII then try:
> > sort(a.representation)
> > 
> > representation is in std.string IRC.
> 
> The thing is, I'm using sort() in a template, and I just happen to use
> char as the template parameter in a unit test. And since I have no
> special UTF-8 values in there, my own sort() works just fine. (Although
> maybe it shouldn't? ;)

The problem is that sort requires a random access range, and narrow string (string and wstring) aren't, because they're variable length encoded. I'm not sure that strings _can_ be efficiently sorted, because of that, but maybe there's a sorting algorthm that could do it reasonably efficiently, and we could special case sort for narrow strings to use that one, but it's a while since I messed with sorting algorithms, so I don't remember all of their characteristics off of the top of my head. Certainly, with how sort is currenly implemented, it can't handle any range which isn't a random access range.

- Jonathan M Davis

March 12, 2012

Re: Sorting char arrays

Posted by Ali Çehreli
in reply to Magnus Lie Hetland

Ali Çehreli

Posted in reply to Magnus Lie Hetland

On 03/12/2012 08:06 AM, Magnus Lie Hetland wrote:
> On 2012-03-12 13:56:15 +0000, bearophile said:
>
>> It's not a bug, char is meant to be a UTF-8.
>
> Right.
>
>> Two workarounds:
>
> Thanks. I'm doing the sorting in a template, so this won't work -- but I
> guess I just can't use char as a type parameter to my template either,
> then :)
>

You can use isNarrowString to either disallow or provide special implementation for narrow strings (char[] and wchar[]):

import std.traits;

void foo(T)(T t)
    if (!isNarrowString!T)
{
    // ...
}

void bar(T)(T t)
{
    static if (isNarrowString!T){
        // ...

    } else {
        // ...
    }
}

void main()
{
    char[] sc;
    dchar[] sd;
    // foo(sc);     // <-- compilation error
    foo(sd);

    bar(sc);
    bar(sd);
}

Ali

March 13, 2012

Re: Sorting char arrays

Posted by Magnus Lie Hetland
in reply to Jonathan M Davis

Magnus Lie Hetland

Posted in reply to Jonathan M Davis

On 2012-03-12 17:24:56 +0000, Jonathan M Davis said:

> The problem is that sort requires a random access range, and narrow string
> (string and wstring) aren't, because they're variable length encoded. I'm not
> sure that strings _can_ be efficiently sorted, because of that, but maybe
> there's a sorting algorthm that could do it reasonably efficiently,

I'd certainly think that'd be posible. (Might, in fact, be a nice problem for the students in my algorithms class ;)

I guess I'm just tripped up by the fact that char[] is treated as an "almost-string", and hence is "infected" by the variable-length encoding of string (i.e., const char[]).

This all makes sense, for sure -- at least as a practical tradeoff or the like. (After all, UTF-8 is a super-elegant solution to a very difficult problem.) It's just so easy to assume that T[] is a random-access range. It seems so *obvious*. And it is true for any type T except the variable-length chars... :)

In my case, I was just using char literals (and strings) as an easy way of encoding test cases for a template class, and the sorting (+ uniq) was just a way of removing duplicates. (Could've used a hash, of course.) So I was really just treating it as a ubyte[]. Slightly Evil[tm], I guess.

> and we could
> special case sort for narrow strings to use that one, but it's a while since I
> messed with sorting algorithms, so I don't remember all of their
> characteristics off of the top of my head. Certainly, with how sort is currenly
> implemented, it can't handle any range which isn't a random access range.

No, I get that. I was just assuming that any T[] could be treated as a random access range if I really wanted to ;)

-- 
Magnus Lie Hetland
http://hetland.org

March 13, 2012

Re: Sorting char arrays

Posted by Magnus Lie Hetland
in reply to Ali Çehreli

Magnus Lie Hetland

Posted in reply to Ali Çehreli

On 2012-03-12 17:33:35 +0000, Ali Çehreli said:

> You can use isNarrowString to either disallow or provide special implementation for narrow strings (char[] and wchar[]):

Ah -- useful, thanks!

-- 
Magnus Lie Hetland
http://hetland.org

March 13, 2012

Re: Sorting char arrays

Posted by Jonathan M Davis
in reply to Magnus Lie Hetland

Jonathan M Davis

Posted in reply to Magnus Lie Hetland

On Tuesday, March 13, 2012 10:13:23 Magnus Lie Hetland wrote:
> On 2012-03-12 17:24:56 +0000, Jonathan M Davis said:
> > The problem is that sort requires a random access range, and narrow string (string and wstring) aren't, because they're variable length encoded. I'm not sure that strings _can_ be efficiently sorted, because of that, but maybe there's a sorting algorthm that could do it reasonably efficiently,
> I'd certainly think that'd be posible. (Might, in fact, be a nice problem for the students in my algorithms class ;)
> 
> I guess I'm just tripped up by the fact that char[] is treated as an "almost-string", and hence is "infected" by the variable-length encoding of string (i.e., const char[]).
> 
> This all makes sense, for sure -- at least as a practical tradeoff or the like. (After all, UTF-8 is a super-elegant solution to a very difficult problem.) It's just so easy to assume that T[] is a random-access range. It seems so *obvious*. And it is true for any type T except the variable-length chars... :)
> 
> In my case, I was just using char literals (and strings) as an easy way
> of encoding test cases for a template class, and the sorting (+ uniq)
> was just a way of removing duplicates. (Could've used a hash, of
> course.) So I was really just treating it as a ubyte[]. Slightly
> Evil[tm], I guess.
> 
> > and we could
> > special case sort for narrow strings to use that one, but it's a while
> > since I messed with sorting algorithms, so I don't remember all of their
> > characteristics off of the top of my head. Certainly, with how sort is
> > currenly implemented, it can't handle any range which isn't a random
> > access range.
> No, I get that. I was just assuming that any T[] could be treated as a random access range if I really wanted to ;)

If you use ubyte[] instead of cast it to ubyte[], then _that_ is a random-
access range. So, as long as you don't mind operating on code units instead of
code points (which really only works with ASCII and nothing else for char),
then you can use functions like sort on it. But, of course, you're screwed as
soon as you have a non-ASCII character, so you have to be careful. Depending
on what your requirements are (with regards to efficiency and whatnot), it might
just be safer to either using dchar[] or to convert to dchar[] and back again
for operations that require it (such as sort). But if you _know_ that you're
only dealing with ASCII, it works just fine to cast to ubyte[] for operations
that need a random-access range.

- Jonathan M Davis

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation