Jump to page: 1 2 3
Thread overview
Randomness in built-in .sort
Jan 05, 2009
Bill Baxter
Jan 05, 2009
dsimcha
Jan 05, 2009
Bill Baxter
Jan 05, 2009
dsimcha
Jan 05, 2009
Don
Jan 05, 2009
dsimcha
Jan 05, 2009
Bill Baxter
Jan 05, 2009
dsimcha
Jan 05, 2009
Bill Baxter
Jan 06, 2009
bearophile
Jan 06, 2009
bearophile
Jan 06, 2009
Bill Baxter
Mar 23, 2021
mw
Mar 23, 2021
jmh530
Jan 06, 2009
Benji Smith
Jan 06, 2009
Walter Bright
Jan 05, 2009
Stewart Gordon
Jan 05, 2009
Bill Baxter
Jan 06, 2009
Walter Bright
Jan 06, 2009
Sean Kelly
Jan 06, 2009
Ary Borenszweig
Jan 06, 2009
Walter Bright
Jan 06, 2009
Bill Baxter
January 05, 2009
It seems to me that the built-in .sort uses a randomized algorithm that leads to results that differ from run-to-run when there are elements with identical keys.

This seems like a bad idea to me for the built-in algorithm to have non-deterministic behavior because it means you can have bugs that aren't consistently repeatable.

I think the built-in sort should be some kind of stable sort.   Also the stability or lack thereof is not mentioned in the spec, and it probably should be because stability is critical for some uses of sorting.  One shouldn't have to guess whether a given implementation of D will use a stable sort or not.

--bb
January 05, 2009
== Quote from Bill Baxter (wbaxter@gmail.com)'s article
> It seems to me that the built-in .sort uses a randomized algorithm
> that leads to results that differ from run-to-run when there are
> elements with identical keys.
> This seems like a bad idea to me for the built-in algorithm to have
> non-deterministic behavior because it means you can have bugs that
> aren't consistently repeatable.
> I think the built-in sort should be some kind of stable sort.   Also
> the stability or lack thereof is not mentioned in the spec, and it
> probably should be because stability is critical for some uses of
> sorting.  One shouldn't have to guess whether a given implementation
> of D will use a stable sort or not.
> --bb

IDK why sorting is builtin anyhow.  I did some benchmarks a while back, and the builtin sort is slower than std.algorithm.  IIRC the builtin sort uses function pointers for comparison, while the std.algorithm sort uses template parameters that can be inlined at compile time.  I know this was mentioned before, but it didn't really get fully discussed.  Why is sorting builtin?  With property syntax, there's not even any syntactic sugar advantage.  Furthermore, the builtin sort is substantially less flexible than a library sort.

On another note, it would be nice if std.algorithm implemented a stable O(N log N) sort, like a merge sort.  Right now, IIRC it uses an interesting stable swapping algorithm on top of a quick sort for O(N log^2 N) performance.  This is good when space is tight, but I think in most cases a merge sort is better as a stable sort.

On the other hand, if builtin sort is kept, then I agree with you:  It should
follow the D convention of doing the safe thing by default (stable sort with no
O(N^2) corner cases) and allowing less safe behavior (unstable but possibly faster
sort that may or may not have some O(N^2) corner cases) in libraries.
January 05, 2009
On Mon, Jan 5, 2009 at 12:05 PM, dsimcha <dsimcha@yahoo.com> wrote:
> == Quote from Bill Baxter (wbaxter@gmail.com)'s article
>> It seems to me that the built-in .sort uses a randomized algorithm
>> that leads to results that differ from run-to-run when there are
>> elements with identical keys.
>> This seems like a bad idea to me for the built-in algorithm to have
>> non-deterministic behavior because it means you can have bugs that
>> aren't consistently repeatable.
>> I think the built-in sort should be some kind of stable sort.   Also
>> the stability or lack thereof is not mentioned in the spec, and it
>> probably should be because stability is critical for some uses of
>> sorting.  One shouldn't have to guess whether a given implementation
>> of D will use a stable sort or not.
>> --bb
>
> IDK why sorting is builtin anyhow.  I did some benchmarks a while back, and the builtin sort is slower than std.algorithm.  IIRC the builtin sort uses function pointers for comparison, while the std.algorithm sort uses template parameters that can be inlined at compile time.  I know this was mentioned before, but it didn't really get fully discussed.  Why is sorting builtin?  With property syntax, there's not even any syntactic sugar advantage.  Furthermore, the builtin sort is substantially less flexible than a library sort.

I agree.  Builtin sort should probably just go away in D2.
But it's there in D1 and can't be removed now.  The best we can hope
for there is for the specification to be tightened up to specify that
.sort must be a stable algorithm.

> On another note, it would be nice if std.algorithm implemented a stable O(N log N) sort, like a merge sort.  Right now, IIRC it uses an interesting stable swapping algorithm on top of a quick sort for O(N log^2 N) performance.  This is good when space is tight, but I think in most cases a merge sort is better as a stable sort.
>
> On the other hand, if builtin sort is kept, then I agree with you:  It should
> follow the D convention of doing the safe thing by default (stable sort with no
> O(N^2) corner cases) and allowing less safe behavior (unstable but possibly faster
> sort that may or may not have some O(N^2) corner cases) in libraries.


For my purposes, I just found Oskar Linde's stable sort implementation
(based on merge sort), and am using that now.
It supports a delegate as a predicate, too, which made my code a lot
simpler (got rid of a dummy struct who's only purpose was to provide
the necessary opCmp I wanted).

So yeh, there's not much point for built-in .sort (and .reverse)... except the ever-present "it it's easier to get CTFE working if it's built-in", because the compiler can have a special case CTFE implementation of those operations.

--bb
January 05, 2009
dsimcha wrote:
> On another note, it would be nice if std.algorithm implemented a stable O(N log N)
> sort, like a merge sort.  Right now, IIRC it uses an interesting stable swapping
> algorithm on top of a quick sort for O(N log^2 N) performance.  This is good when
> space is tight, but I think in most cases a merge sort is better as a stable sort.

I agree. I didn't have the time to implement a very good stable sort, but I also think the extra slowness is not critical. If anyone comes with a better implementation, I'd be glad to integrate it.

Andrei
January 05, 2009
Bill Baxter wrote:
<snip>
> I think the built-in sort should be some kind of stable sort.   Also
> the stability or lack thereof is not mentioned in the spec, and it
> probably should be because stability is critical for some uses of
> sorting.  One shouldn't have to guess whether a given implementation
> of D will use a stable sort or not.
> 
> --bb

A stable sort can be slower than an unstable sort.  There are many cases where the order of equally-ranking keys does not matter or it has no effect whatsoever (e.g. sorting an array of integers).  Then why take the extra overhead of making the sort stable?

We could have an extra property .stableSort, for those cases where you need stability.  Of course, it would be valid to implement .sort and .stableSort with the same algorithm, though to do so would be naive unless someone comes up with a stable sorting algorithm with O(n log n) or better time complexity and O(log n) or better extra memory requirement.  But in the absence of such an algorithm, the compiler could still replace .stableSort with .sort in those cases where it can guarantee it will make no difference to the end result.

Or you could argue that requiring sort to be stable is sufficiently uncommon that it might as well be left to a library function.

Stewart.
January 05, 2009
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail@erdani.org)'s article
> dsimcha wrote:
> > On another note, it would be nice if std.algorithm implemented a stable O(N log N) sort, like a merge sort.  Right now, IIRC it uses an interesting stable swapping algorithm on top of a quick sort for O(N log^2 N) performance.  This is good when space is tight, but I think in most cases a merge sort is better as a stable sort.
> I agree. I didn't have the time to implement a very good stable sort,
> but I also think the extra slowness is not critical. If anyone comes
> with a better implementation, I'd be glad to integrate it.
> Andrei

You could try the merge sort implementation from my dstats library at http://dsource.org/projects/dstats/browser/trunk/sort.d.  This is very heavily tested and optimized because some important higher level dstats functionality depends on it.  You'd basically just have to add a template parameter to allow a user-provided swap function to make it conform to std.algorithm conventions. Obviously, since it's by its nature a stable sort, you don't really need the swapStrategy alias.

One thing to note, though, is that this function is designed to allow for sorting parallel arrays in lockstep by simply passing them in as additional parameters. This adds some complexity to the implementation.  Also, it uses the TempAlloc region allocator for temp space.  You could put this in Phobos too (This allocator was actually your idea a while back, though you called it a SuperStack), or you could change the temp space allocation to simply use the regular heap allocation scheme.
January 05, 2009
dsimcha wrote:
> == Quote from Andrei Alexandrescu (SeeWebsiteForEmail@erdani.org)'s article
>> dsimcha wrote:
>>> On another note, it would be nice if std.algorithm implemented a stable O(N log N)
>>> sort, like a merge sort.  Right now, IIRC it uses an interesting stable swapping
>>> algorithm on top of a quick sort for O(N log^2 N) performance.  This is good when
>>> space is tight, but I think in most cases a merge sort is better as a stable sort.
>> I agree. I didn't have the time to implement a very good stable sort,
>> but I also think the extra slowness is not critical. If anyone comes
>> with a better implementation, I'd be glad to integrate it.
>> Andrei
> 
> You could try the merge sort implementation from my dstats library at
> http://dsource.org/projects/dstats/browser/trunk/sort.d.  This is very heavily
> tested and optimized because some important higher level dstats functionality
> depends on it.  You'd basically just have to add a template parameter to allow a
> user-provided swap function to make it conform to std.algorithm conventions.
> Obviously, since it's by its nature a stable sort, you don't really need the
> swapStrategy alias.
> 
> One thing to note, though, is that this function is designed to allow for sorting
> parallel arrays in lockstep by simply passing them in as additional parameters.
> This adds some complexity to the implementation.  Also, it uses the TempAlloc
> region allocator for temp space.  You could put this in Phobos too (This allocator
> was actually your idea a while back, though you called it a SuperStack), or you
> could change the temp space allocation to simply use the regular heap allocation
> scheme.

I want TempAlloc in dcore! Eventually, anyway.
January 05, 2009
== Quote from Don (nospam@nospam.com)'s article
> I want TempAlloc in dcore! Eventually, anyway.

The only problem is that, in its current form, TempAlloc is kind of dangerous because it puts the onus on the programmer to not store the only reference to a reference type in a TempAlloc-allocated block.  If TempAlloc were more tightly integrated with the GC, this could be solved easily with negligible overhead.  One would simply have to keep a thread-local variable that tracks how many blocks are currently in use, kind of like the stack pointer in the call stack, and a bit array that tracks whether each block that is currently in use should be scanned. However, without GC integration, just making everything scanned would cause so many false pointer issues and slowdowns when GC is run that I decided that, for now, given the anticipated use cases, making it a little dangerous and scanning nothing was the lesser of two evils.
January 05, 2009
On Tue, Jan 6, 2009 at 12:16 AM, dsimcha <dsimcha@yahoo.com> wrote:
> == Quote from Andrei Alexandrescu (SeeWebsiteForEmail@erdani.org)'s article
>> dsimcha wrote:
>> > On another note, it would be nice if std.algorithm implemented a stable O(N log N) sort, like a merge sort.  Right now, IIRC it uses an interesting stable swapping algorithm on top of a quick sort for O(N log^2 N) performance.  This is good when space is tight, but I think in most cases a merge sort is better as a stable sort.
>> I agree. I didn't have the time to implement a very good stable sort,
>> but I also think the extra slowness is not critical. If anyone comes
>> with a better implementation, I'd be glad to integrate it.
>> Andrei
>
> You could try the merge sort implementation from my dstats library at http://dsource.org/projects/dstats/browser/trunk/sort.d.  This is very heavily tested and optimized because some important higher level dstats functionality depends on it.  You'd basically just have to add a template parameter to allow a user-provided swap function to make it conform to std.algorithm conventions. Obviously, since it's by its nature a stable sort, you don't really need the swapStrategy alias.

Thanks for the pointer.  Actually when I said I "found" Oskar Linde's sorting code, that really meant I found it sitting in my source tree where I had incorporated it long ago after Oskar made it available. So I just used that.

> One thing to note, though, is that this function is designed to allow for sorting parallel arrays in lockstep by simply passing them in as additional parameters. This adds some complexity to the implementation.  Also, it uses the TempAlloc region allocator for temp space.  You could put this in Phobos too (This allocator was actually your idea a while back, though you called it a SuperStack), or you could change the temp space allocation to simply use the regular heap allocation scheme.

Actually, a function to sort multiple arrays in parallel was exactly what I was implementing using .sort.  So that doesn't sound like a limitation to me at all.   :-)

--bb
January 05, 2009
On Mon, Jan 5, 2009 at 8:40 PM, Stewart Gordon <smjg_1998@yahoo.com> wrote:
> Bill Baxter wrote:
> <snip>
>>
>> I think the built-in sort should be some kind of stable sort.   Also the stability or lack thereof is not mentioned in the spec, and it probably should be because stability is critical for some uses of sorting.  One shouldn't have to guess whether a given implementation of D will use a stable sort or not.
>>
>> --bb

> A stable sort can be slower than an unstable sort.  There are many cases where the order of equally-ranking keys does not matter or it has no effect whatsoever (e.g. sorting an array of integers).  Then why take the extra overhead of making the sort stable?

> We could have an extra property .stableSort, for those cases where you need stability.  Of course, it would be valid to implement .sort and .stableSort with the same algorithm, though to do so would be naive unless someone comes up with a stable sorting algorithm with O(n log n) or better time complexity and O(log n) or better extra memory requirement.  But in the absence of such an algorithm, the compiler could still replace .stableSort with .sort in those cases where it can guarantee it will make no difference to the end result.

Actually yeh, I don't really care so much if .sort is stable or not (as long as it's clear in the spec what to expect).  But I do care that it is deterministic.  The current implementation seems to use some randomness (random partitions?) so if I run my program 5 times I get 5 different sort orders when there are duplicated keys.  This makes it really hard to debug things that only happen with certain orders.

> Or you could argue that requiring sort to be stable is sufficiently uncommon that it might as well be left to a library function.

There is no one sort algorithm that is fastest for all inputs.  So it would make more sense to me to argue that if you want the fastest possible sort for your problem then you should go with a library solution.  Built-ins should be efficient, yes, but to argue that efficiency is more important than generality in a built-in seems wrong to me.  A least for .sort.   There's no performance benefit to having a built-in sort, so it's really only there for convenience.   Using a stable sort would make .sort applicable to more problems, thus increasing the integral of convenience.

--bb
« First   ‹ Prev
1 2 3