November 17, 2013
On 11/16/13 6:42 PM, Xinok wrote:
> On Sunday, 17 November 2013 at 00:56:46 UTC, Andrei Alexandrescu wrote:
>> On 11/16/13 4:18 PM, Chris Cain wrote:
>>> You have not shown how much faster it might be (it could be
>>> only 1% faster) nor how much work it would take to discover (even an
>>> ideal pivot choice for quicksort actually cannot be as fast as Timsort
>>> on an already sorted sequence, and quicksort is not an appropriate
>>> algorithm for accepting presorted subsequences).
>>
>> There are well known tweaks to quicksort that make it operate well on
>> sorted or almost sorted sequences.
>
> If you tweak for one scenario, you're only losing in other scenarios.

I am not so sure about that.

> Timsort already performs ideally in these cases, as I've demonstrated.

Yes, it is also my understanding that Timsort does well on almost sorted data. The issue is one does not know a priori what the distribution of the data is.

> Rather than optimizing quicksort for cases in which Timsort will still
> be the victor, we should optimize for those cases in which quicksort is
> already faster.

We should have a "good unstable sort" in addition to the "good stable sort" we already have. Seeing this as a competition between quicksort and timsort would be the wrong frame.

Probably a good step forward would be to hook a sort benchmarking corpus to our unittests. What are the consecrated corpora?


Andrei

November 17, 2013
On Saturday, 16 November 2013 at 16:10:56 UTC, Andrei Alexandrescu wrote:
> On 11/16/13 6:20 AM, Jean Christophe wrote:
>> On Friday, 15 November 2013 at 21:46:26 UTC, Vladimir Panteleev wrote:
>>
>>> getPivot(0..10)
>>> 8,7,6,5,4,3,2,1,0,9 <- getPivot - before swap
>>> 9,7,6,5,4,8,2,1,0,3 <- getPivot - after swap
>>> 9,7,6,5,4,3,2,1,0,8 <- quickSortImpl - after swap
>>> 9,8,6,5,4,3,2,1,0,7 <- quickSortImpl - after partition
>>> getPivot(2..10)
>>> 6,5,4,3,2,1,0,7 <- getPivot - before swap
>>> 7,5,4,3,6,1,0,2 <- getPivot - after swap
>>> 7,5,4,3,2,1,0,6 <- quickSortImpl - after swap
>>> 7,6,4,3,2,1,0,5 <- quickSortImpl - after partition
>>> (...)
>>
>> One possible implementation swaps Left and Right immediately
>> after choosing the Pivot (if Left > Right), then places the Pivot at
>> Right-1. It seems that this option was not taken. Any reason?
>
> That may help this particular situation, but does it do anything interesting in general?

Yes. This has the extra advantage that the smallest of the three winds up in A[left], which is where the partitioning routine would put it anyway. The largest winds up at A[right], which is also the correct place, since it is larger than the Pivot. Therefore you can place the Pivot in A[right-1] and initialize i and j to (left+1) and (right-2) in the partition phase. Another benefit is that because A[left] is smaller than the Pivot, it will act as a sentinel for j: we do not need to worry about j running past the left end. Likewise, A[right] acts as a sentinel for i. Thus you can assert:

A[left] <= A[center] <= A[right]

even before you hide the Pivot.
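For the curious, the scheme above can be sketched like this (illustrative Python, not Phobos code; the helper name is made up):

```python
def median_of_three(a, left, right):
    """Order a[left], a[center], a[right], then stash the pivot
    at a[right-1]. Afterwards a[left] <= pivot <= a[right], so
    a[left] and a[right] act as sentinels for the partition scans."""
    center = (left + right) // 2
    if a[center] < a[left]:
        a[left], a[center] = a[center], a[left]
    if a[right] < a[left]:
        a[left], a[right] = a[right], a[left]
    if a[right] < a[center]:
        a[center], a[right] = a[right], a[center]
    # Invariant now: a[left] <= a[center] <= a[right]
    a[center], a[right - 1] = a[right - 1], a[center]  # hide the pivot
    return a[right - 1]
```

The partition loop can then scan with i starting at left+1 and j at right-2, with no bounds check needed on either side.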

-- JC


November 17, 2013
On Sunday, 17 November 2013 at 04:14:22 UTC, Andrei Alexandrescu wrote:
> Probably a good step forward would be to hook a sort benchmarking corpus to our unittests. What are the consecrated corpora?

I'll kick off the suggestions with this:

File to provide "real" data:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.fna

From http://www.genome.jp/kegg-bin/show_organism?org=eco -> RefSeq -> NC_000913.fna

(despite this being real data, typically you don't actually sort genomes like this ... it's simply "more interesting" than a pseudo-randomized dataset)

Code:

---
import std.range, std.algorithm, std.stdio, std.array, std.conv, std.datetime;

enum strategy = SwapStrategy.stable;

void main() {
    auto arr = File("NC_000913.fna").byLine
            .drop(1) // Trim the first line because it's just metadata
            /*.take(50)*/
            .join
            .chunks(50)
            .map!text
            .array;
    StopWatch sw = StopWatch(AutoStart.yes);
    arr.sort!("a < b", strategy)();
    sw.stop();
    writeln(text(arr.length, " elements sorted in ", sw.peek.msecs, " msecs"));
}
---

On my system, this shows unstable sort to be ~2x faster than stable sort (which is to be expected of data with no discernible patterns).

The "key" thing this is testing is the effectiveness of the sort on data that often takes longer to compare (with such a small alphabet, many strings have a common prefix).

That said, it might also be reproduced "well enough" using a random generator to create similar strings to sort, but the basic idea is there. I just like using real genomes for performance testing things :)
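For instance, a rough stand-in along those lines (illustrative Python, names made up; the 50-character chunks and 4-letter alphabet mirror the FASTA lines above):

```python
import random
import time

def synthetic_genome_chunks(n, width=50, seed=0):
    """Generate n strings of `width` characters over the 4-letter
    alphabet ACGT. The small alphabet means many shared prefixes,
    so comparisons are expensive, as with the real genome."""
    rng = random.Random(seed)  # seeded, so runs are reproducible
    return ["".join(rng.choices("ACGT", k=width)) for _ in range(n)]

chunks = synthetic_genome_chunks(100_000)
start = time.perf_counter()
chunks.sort()
elapsed = time.perf_counter() - start
print(f"{len(chunks)} elements sorted in {elapsed * 1000:.0f} msecs")
```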
November 17, 2013
>> BTW I'm very interested in finding a library which could Quicksort an array of pointers, where each pointer points to a class object (or a structure). The library would make it possible, for example, to sort the class objects using one of their members as the key. Because the swaps are done on the pointers themselves (and not the objects pointed to), the Quicksort should be very efficient. However, the algorithm wouldn't be so trivial to implement because each comparison must be done on the object's member's value:
>>
>> eg. Obj1.foo < Obj2.foo.
>>
>> Of course, the programmer can choose any relevant member property to sort the collection. And it should be as easy to use as:
>>
>> class SomeClass { string foo; double bar;}
>> SomeClass[] a;
>> // put 100 000 objects into a
>> a.sort("foo");
>>
>> Do we already have something like that available somewhere or is it possible to make one eventually?
>
> You mean, sort!`a.foo < b.foo` ?

Yes.

An indirect sort, assuming a and b to be objects of class SomePotentialyLargeClass.

Because the array to sort contains pointers only, all the data movement is essentially the same as if we were sorting integers.

-- JC
November 17, 2013
On 11/16/13 9:30 PM, Jean Christophe wrote:
> An indirect sorting, assuming a and b to be ojects of class
> SomePotentialyLargeClass.
>
> Because the array to sort contains pointers only, all the data movement
> is essentially the same as if we were sorting integer.

Maybe makeIndex may be of help.

http://dlang.org/phobos/std_algorithm.html#makeIndex
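The idea, for anyone unfamiliar with it, in illustrative Python (not the Phobos implementation; the payload field is invented to stand in for a large object body): sort a separate array of indices by the key, so only pointer-sized entries move and the records themselves stay put.

```python
from dataclasses import dataclass, field

@dataclass
class SomeClass:
    foo: str
    bar: float = 0.0
    payload: bytes = field(default=b"", repr=False)  # large object body

records = [SomeClass("delta"), SomeClass("alpha"),
           SomeClass("charlie"), SomeClass("bravo")]

# Indirect sort: only the index entries move, never the records --
# the same idea as makeIndex.
index = sorted(range(len(records)), key=lambda i: records[i].foo)

print([records[i].foo for i in index])
# -> ['alpha', 'bravo', 'charlie', 'delta']
```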


Andrei

November 17, 2013
On 11/16/13 9:21 PM, Chris Cain wrote:
> That said, it might also be reproduced "well enough" using a random
> generator to create similar strings to sort, but the basic idea is
> there. I just like using real genomes for performance testing things :)

I am hoping for some more representative corpora, along the lines of http://sortbenchmark.org/. Some data that we can use as good proxies for typical application usage.

Andrei

November 17, 2013
On Sunday, 17 November 2013 at 02:43:05 UTC, Vladimir Panteleev wrote:
> On Sunday, 17 November 2013 at 01:48:14 UTC, Craig Dillabaugh wrote:
>> 1. Find the median value in the array. This can be done
>> deterministically in linear time,
>
> My understanding is that for unordered data, there is no algorithm that runs in worst-case O(n):
>
> http://en.wikipedia.org/wiki/Selection_algorithm
> http://en.wikipedia.org/wiki/Quickselect

Unless I misunderstand something, I am pretty sure there is a
deterministic linear-time algorithm for finding the median of unsorted data:

http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15451-s07/www/lecture_notes/lect0125.pdf


>
>> so from a theoretical point of
>> view is just as fast as picking a random element since you use
>> linear time.
>
> Picking a random element is constant time, though. You mean that O(1) and O(n) are equivalent here because they're both smaller than O(n log n), QuickSort's complexity?

Yeah, since you have to do O(n) work to partition the array,
whether you do O(1) or O(n) work to find your pivot really
doesn't matter. From a theoretical perspective, anyway :o)

November 17, 2013
On Sunday, 17 November 2013 at 03:58:58 UTC, Andrei Alexandrescu wrote:
> On 11/16/13 5:07 PM, Ivan Kazmenko wrote:
>> The above is just my retelling of a great short article "A Killer
>> Adversary for Quicksort" by M. D. McIlroy here:
>> http://www.cs.dartmouth.edu/~doug/mdmspe.pdf
>
> Nice story, but the setup is a tad tenuous (albeit indeed theoretically interesting). For starters, if the attacker has a hook into the comparison function, they could trivially do a lot worse...

Actually, I was thinking about a two phase attack, and it is not at all unlikely.

0. Say, the Sorter presents a library quicksort solution.  It may be closed source, but it takes a comparison function as an argument.

1. On the preparation phase, the Attacker uses the tricky comparison function described previously.  What's important is that, besides observing Theta(n^2) behavior once, he now gets a real array a[] such that this behavior can be reproduced.

2. On the attack phase, the Attacker does not need access to the comparison function.  He just feeds the array obtained on the previous step as plain data.

Ivan Kazmenko.
November 17, 2013
On Sunday, 17 November 2013 at 01:48:14 UTC, Craig Dillabaugh wrote:
> On Sunday, 17 November 2013 at 01:07:20 UTC, Ivan Kazmenko wrote:
>> Regarding an ideal pivot choice for quicksort, I'd like to emphasize that it is simply non-existent.  Here's why.
>>
>> Let us measure quicksort performance as the number of comparisons it makes.
>>
>> Let us make some assumptions:
>> quicksort goes as
>> --(1) choose a pivot p in O(1) comparisons;
>> --(2) partition the segment into two;
>> --(3) run quicksort recursively on the two resulting segments.
>>
>> Now, (2) is effectively done by comparing every element with p, thus in Theta(n) (on the order of n) comparisons where n is the length of the current segment.
>>
>> Under these assumptions, we can construct the killer array for an arbitrary quicksort of such kind, even if we don't know how exactly the pivot is chosen (say, closed source or obfuscation), but we have the following interface to it:
>>
>> Instead of accessing the array a[], quicksort calls the function less(i,j) which tells it whether a[i]<a[j], and we control that function.
>>
>> (snip)
>
> Woah, that sounds complicated.  Not 100% sure I entirely
> understand but it seems to rely on changing the elements while
> you sort.  How would that work in real life?
>
> For example, how would this defeat the following pivoting method.
>
> 1. Find the median value in the array. This can be done
> deterministically in linear time, so from a theoretical point of
> view is just as fast as picking a random element since you use
> linear time.
> 2. Use that as the pivot.

Right, this method (median of medians is an example implementation) will not be defeated by the above algorithm.  There's nothing wrong with it, since it takes Theta(n) comparisons to choose the pivot and thus violates our attack's working conditions.  In practice, however, it means that on an average case the constant factor is very likely to be higher than for a quicksort vulnerable to our attack.

Ivan Kazmenko.
November 17, 2013
On 11/17/2013 02:07 AM, Ivan Kazmenko wrote:
> The random pick fails in the following sense: if we seed the RNG,
> construct a killer case, and then start with the same seed, we get
> Theta(n^2) behavior reproduced.

Hence, it fails in no meaningful sense: with a fixed seed, those are no longer independent uniform random picks.