Why std.algorithm.sort can't be applied to char[]? (page 2)

On Wed, 14 May 2014 08:27:45 +0000 monarch_dodra via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote: > On Monday, 12 May 2014 at 18:44:22 UTC, Jonathan M Davis via Digitalmars-d-learn wrote: > > Sure, you can cast char[] to ubyte[] and sort that if you know > > that the array > > only holds pure ASCII. In fact, you can use > > std.string.representation to do it > > - e.g. > > > > auto ascii = str.representation; > > > > and if str were mutable, then you could sort it. But that will > > only work if > > the string only contains ASCII characters. Regardless, he > > wanted to know why > > he couldn't sort char[], and I explained why - all strings are > > treated as > > ranges of dchar, making it so that if their element type is > > char or wchar, so > > they're not random access and thus can't be sorted. > > Arguably, a smart enough implementation should know how to sort a "char[]", while still preserving codepoint integrity. I don't think that that can be done at the same algorithmic complexity though. So, I don't know if that would be acceptable or not from the standpoint of std.algorithm.sort. But even if it's a good idea, someone would have to special case sort for char[], and no one has done that. > As a matter of fact, the built in "sort" property does it. > > void main() > { > char[] s = "éöeèûà".dup; > s.sort; > writeln(s); > } > //prints: > eàèéöû I'm surprised. I thought that one of Bearophile's favorite complaints was that it didn't sort unicode properly (and hence one of the reasons that it should be removed). Regardless, I do think that it should be removed. - Jonathan M Davis

May 15, 2014

Re: Why std.algorithm.sort can't be applied to char[]?

Posted by Steven Schveighoffer
in reply to Jonathan M Davis

Permalink

Steven Schveighoffer

Posted in reply to Jonathan M Davis

Permalink

On Wed, 14 May 2014 05:13:42 -0400, Jonathan M Davis via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote:

> On Wed, 14 May 2014 08:27:45 +0000
> monarch_dodra via Digitalmars-d-learn
> <digitalmars-d-learn@puremagic.com> wrote:
>
>> On Monday, 12 May 2014 at 18:44:22 UTC, Jonathan M Davis via
>> Digitalmars-d-learn wrote:
>> > Sure, you can cast char[] to ubyte[] and sort that if you know
>> > that the array
>> > only holds pure ASCII. In fact, you can use
>> > std.string.representation to do it
>> > - e.g.
>> >
>> > auto ascii = str.representation;
>> >
>> > and if str were mutable, then you could sort it. But that will
>> > only work if
>> > the string only contains ASCII characters. Regardless, he
>> > wanted to know why
>> > he couldn't sort char[], and I explained why - all strings are
>> > treated as
>> > ranges of dchar, making it so that if their element type is
>> > char or wchar, so
>> > they're not random access and thus can't be sorted.
>>
>> Arguably, a smart enough implementation should know how to sort a
>> "char[]", while still preserving codepoint integrity.
>
> I don't think that that can be done at the same algorithmic complexity though.
> So, I don't know if that would be acceptable or not from the standpoint of
> std.algorithm.sort. But even if it's a good idea, someone would have to
> special case sort for char[], and no one has done that.

I think it can be done at O(nlgn) complexity, but you must allocate a block of scratch space to do the sorting. You can't do it in-place because swapping isn't available.

Might as well convert to dchar and back explicitly.

Merge sort may allow more promising speedups, but I still think you will need to allocate extra space.

>
>> As a matter of fact, the built in "sort" property does it.
>>
>> void main()
>> {
>>      char[] s = "éöeèûà".dup;
>>      s.sort;
>>      writeln(s);
>> }
>> //prints:
>> eàèéöû
>
> I'm surprised. I thought that one of Bearophile's favorite complaints was that
> it didn't sort unicode properly (and hence one of the reasons that it should
> be removed). Regardless, I do think that it should be removed.

I can't believe this worked. I want to say that it's a freak accident for that set of characters. Looking in druntime, I don't see where the special case is.

-Steve

On Wednesday, 14 May 2014 at 09:01:23 UTC, John Colvin wrote: > Why would anyone ever want to sort code-points? Why not? To remove duplicate characters? > They might want to sort graphemes, but that's difficult to do in-place (needs O(n) memory, I think...). If out-of-place is good enough > > someStr.byGrapheme.array.sort(); The current "status quo" in D is that a dchar basically represents a character.

On Thursday, 15 May 2014 at 13:26:45 UTC, Steven Schveighoffer wrote: > On Wed, 14 May 2014 05:13:42 -0400, Jonathan M Davis via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote: > >> On Wed, 14 May 2014 08:27:45 +0000 >> monarch_dodra via Digitalmars-d-learn >> <digitalmars-d-learn@puremagic.com> wrote: >>> As a matter of fact, the built in "sort" property does it. >>> >>> void main() >>> { >>> char[] s = "éöeèûà".dup; >>> s.sort; >>> writeln(s); >>> } >>> //prints: >>> eàèéöû >> >> I'm surprised. I thought that one of Bearophile's favorite complaints was that >> it didn't sort unicode properly (and hence one of the reasons that it should >> be removed). Regardless, I do think that it should be removed. > > I can't believe this worked. I want to say that it's a freak accident for that set of characters. Looking in druntime, I don't see where the special case is. > > -Steve Must be a hell of a freak accident ;) auto s = "é東öe京ûタèワà".dup; writeln(s.sort); => eàèéöûタワ京東 It's in rt/adi.d extern (C) char[] _adSortChar(char[] a) It's basically: string=>dstring=>sort=>dstring=>string. BTW, the built in reverse also works with char[], and so does std.algorithm.reverse (and it does it pretty cleverly too, might I say). As far as I'm concerned, if we *can* do it in n.log(n), and somebody provides the implementation, then there is no reason to not offer dchar sorting for char[]/wchar.

On Thu, 15 May 2014 11:37:07 -0400, monarch_dodra <monarchdodra@gmail.com> wrote: > On Thursday, 15 May 2014 at 13:26:45 UTC, Steven Schveighoffer > wrote: >> On Wed, 14 May 2014 05:13:42 -0400, Jonathan M Davis via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote: >> >>> On Wed, 14 May 2014 08:27:45 +0000 >>> monarch_dodra via Digitalmars-d-learn >>> <digitalmars-d-learn@puremagic.com> wrote: >>>> As a matter of fact, the built in "sort" property does it. >>>> >>>> void main() >>>> { >>>> char[] s = "éöeèûà".dup; >>>> s.sort; >>>> writeln(s); >>>> } >>>> //prints: >>>> eàèéöû >>> >>> I'm surprised. I thought that one of Bearophile's favorite complaints was that >>> it didn't sort unicode properly (and hence one of the reasons that it should >>> be removed). Regardless, I do think that it should be removed. >> >> I can't believe this worked. I want to say that it's a freak accident for that set of characters. Looking in druntime, I don't see where the special case is. >> >> -Steve > > Must be a hell of a freak accident ;) > > auto s = "é東öe京ûタèワà".dup; > writeln(s.sort); > > => eàèéöûタワ京東 > > It's in rt/adi.d Seriously, I fucking hate the file names in druntime. I looked in qsort.d. > extern (C) char[] _adSortChar(char[] a) > > It's basically: string=>dstring=>sort=>dstring=>string. OK, that's what I would have expected. I don't think we should support this. > BTW, the built in reverse also works with char[], and so does std.algorithm.reverse (and it does it pretty cleverly too, might I say). This is a different algorithm. Sort requires random-access swapping. Reverse does not. > As far as I'm concerned, if we *can* do it in n.log(n), and somebody provides the implementation, then there is no reason to not offer dchar sorting for char[]/wchar. I think there is nothing wrong with requiring the steps to be explicit. We should not hide such bloat behind sort. -Steve

On Thursday, 15 May 2014 at 17:46:52 UTC, Steven Schveighoffer wrote: >> As far as I'm concerned, if we *can* do it in n.log(n), and somebody provides the implementation, then there is no reason to not offer dchar sorting for char[]/wchar. > > I think there is nothing wrong with requiring the steps to be explicit. We should not hide such bloat behind sort. > > -Steve Of course, but I meant if we could find an algoirthm O(1) (or O(log(N)) space overhead. If the only algorithm we are capable of providing just temporarily dstrings, then no.

Forums