November 17, 2013
On Sunday, 17 November 2013 at 01:07:20 UTC, Ivan Kazmenko wrote:
> Regarding an ideal pivot choice for quicksort, I'd like to emphasize that it is simply non-existent.  Here's why.

Oh, of course. I did that in an algorithms class taught by a professor who is into cryptography. I agree that there is essentially no way to pick pivots that avoids worst-case performance when an attacker is involved. When I mentioned an "ideal pivot", I meant an actually good one on an already-sorted sequence:

Given [1,2,3,4,5,6,7]
Picking the best pivot, 4, would still require scanning the entire array to verify that it is partitioned appropriately around the 4 (that is, if the quicksort were designed to be wise to this sort of trick; most, including D's quicksort, would actually shuffle everything around). By that point, Timsort is already done and drinking the victory champagne. And the quicksort still has to recurse on the subsequences to sort them as well!
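To see the gap concretely, here is a minimal sketch that counts predicate calls on already-sorted input. SwapStrategy.stable selects Phobos's Timsort implementation; exact counts will vary with the Phobos version:

import std.algorithm : sort, SwapStrategy;
import std.array : array;
import std.range : iota;
import std.stdio : writefln;

void main()
{
    size_t stableCalls, unstableCalls;

    // Already-sorted input: Timsort detects one long run and stops early.
    auto a = iota(1_000_000).array;
    a.sort!((x, y) { ++stableCalls; return x < y; }, SwapStrategy.stable);

    auto b = iota(1_000_000).array;
    b.sort!((x, y) { ++unstableCalls; return x < y; }, SwapStrategy.unstable);

    writefln("stable (Timsort):     %s predicate calls", stableCalls);
    writefln("unstable (quicksort): %s predicate calls", unstableCalls);
}

On sorted input the stable count comes out on the order of n, while the unstable count is on the order of n log n.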
November 17, 2013
On Sunday, 17 November 2013 at 01:07:20 UTC, Ivan Kazmenko wrote:
> On Sunday, 17 November 2013 at 00:18:24 UTC, Chris Cain wrote:
>> I think it's more complicated than that. Let's assume for a moment that you've proven that such an unstable sort must exist that is faster (I'm not convinced that it necessarily must take extra work to maintain stability). You have not shown how much faster it might be (it could be only 1% faster) nor how much work it would take to discover (even an ideal pivot choice for quicksort actually cannot be as fast as Timsort on an already sorted sequence, and quicksort is not an appropriate algorithm for accepting presorted subsequences). If it takes 2 years to come up with an unstable sort faster than Timsort, then it seems like a long time to be using something that isn't the best that we currently have. Especially if D is to be used in benchmarks against other languages.
>
> Regarding an ideal pivot choice for quicksort, I'd like to emphasize that it is simply non-existent.  Here's why.
>
> Let us measure quicksort performance as the number of comparisons it makes.
>
> Let us make some assumptions:
> quicksort goes as
> --(1) choose a pivot p in O(1) comparisons;
> --(2) partition the segment into two;
> --(3) run quicksort recursively on the two resulting segments.
>
> Now, (2) is effectively done by comparing every element with p, thus in Theta(n) (on the order of n) comparisons where n is the length of the current segment.
>
> Under these assumptions, we can construct the killer array for an arbitrary quicksort of such kind, even if we don't know how exactly the pivot is chosen (say, closed source or obfuscation), but we have the following interface to it:
>
> Instead of accessing the array a[], quicksort calls the function less(i,j) which tells it whether a[i]<a[j], and we control that function.
>
> Now, we start with all values in array a[] undefined.  With each a[i], we associate a counter c[i]: how many times it took part in a comparison.  As we can see from the above, during a single call to quicksort, in steps (1) and (2) the pivot will eventually get Theta(n) comparisons, while other elements will all get only O(1) comparisons each.
>
> So here's what we'll do.
>
> 1. When a comparison asks to relate two undefined values, we pick the one with the larger c[i] and make it the lowest number possible.  That is, the first fixed value will be 0, the next 1, and so on.  (In the end, a[] will be a permutation of 0, 1, ..., n-1.)
>
> 2. When we are asked to relate a defined value to an undefined one, the defined value will always be lower (because, when we eventually define the undefined one, it will be higher than all the values defined before it).
>
> 3. When we are asked about two known values, just tell the truth.
>
> This simple algorithm ensures that, on each call to quicksort, we quickly fix the pivot as one of the O(1) lowest values on the segment.  So, one of the next two segments will have length n-O(1), and the recursion gives us Theta(n) segments of linearly decreasing lengths, and thus Theta(n^2) total running time.
>
> Now, the assumption of picking a pivot in O(1) comparisons covers a broad variety of pivot choices, including first/last/middle/random element, median of three or five, a median of medians of a constant-size sample (e.g. the "ninther"), or any combination of these.  The random pick fails in the following sense: if we seed the RNG, construct a killer case, and then start with the same seed, we get Theta(n^2) behavior reproduced.
>
> Double-pivot quicksort can be forced to go quadratic in a similar fashion.
>
> Now, introsort escapes this fate by switching to a guaranteed-n-log-n algorithm instead of (3) when it goes too deep.  This comes at a (small) cost of checking the current depth on every call to quicksort.
>
> The above is just my retelling of a great short article "A Killer Adversary for Quicksort" by M. D. McIlroy here: http://www.cs.dartmouth.edu/~doug/mdmspe.pdf
>
> Ivan Kazmenko.

Woah, that sounds complicated. Not 100% sure I entirely understand, but it seems to rely on deciding the elements' values while you sort. How would that work in real life?
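If I've read it right, the bookkeeping is something like this rough sketch (assuming the quicksort sees elements only through a less(i, j) callback that the adversary controls; values start undefined and get fixed lazily):

import std.algorithm : sort;
import std.stdio : writeln;

enum size_t UNDEF = size_t.max;

struct Adversary
{
    size_t[] val;  // value finally assigned to each element, or UNDEF
    size_t[] cnt;  // c[i]: how many comparisons element i took part in
    size_t next;   // smallest value not yet handed out

    this(size_t n)
    {
        val = new size_t[n];
        val[] = UNDEF;
        cnt = new size_t[n];
    }

    bool less(size_t i, size_t j)
    {
        ++cnt[i]; ++cnt[j];
        if (val[i] == UNDEF && val[j] == UNDEF)      // rule 1: freeze the
            val[cnt[i] >= cnt[j] ? i : j] = next++;  // busier element low
        if (val[i] == UNDEF) return false;  // rule 2: defined < undefined
        if (val[j] == UNDEF) return true;
        return val[i] < val[j];             // rule 3: tell the truth
    }
}

void main()
{
    enum n = 10_000;
    auto adv = Adversary(n);
    auto idx = new size_t[n];
    foreach (k, ref x; idx) x = k;

    // Sort indices through the adversarial predicate and count calls.
    idx.sort!((a, b) => adv.less(a, b));

    size_t total;
    foreach (c; adv.cnt) total += c;
    writeln("predicate calls: ", total / 2); // each call bumps two counters
}

If the pivot really is chosen in O(1) comparisons, each pass pins the pivot near the bottom of its segment, and per the article the final val[] (after freezing any leftover undefined slots) is itself a killer input that can be replayed against the same deterministic sort.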

For example, how would this defeat the following pivoting method?

1. Find the median value in the array. This can be done deterministically in linear time (see the sketch after this list), so from a theoretical point of view it is just as fast as picking a random element, since you use linear time.
2. Use that median as the pivot.
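For reference, the deterministic selection I mean is the classic median-of-medians; here is a rough, untuned sketch (it allocates freely, whereas real implementations partition in place):

import std.algorithm : sort;
import std.stdio : writeln;

// k-th smallest element (0-based) of a[], worst-case O(n).
int select(int[] a, size_t k)
{
    if (a.length <= 5)
    {
        sort(a);
        return a[k];
    }

    // 1. Take the median of each group of five elements.
    auto medians = new int[(a.length + 4) / 5];
    foreach (g; 0 .. medians.length)
    {
        auto lo = g * 5;
        auto hi = lo + 5 < a.length ? lo + 5 : a.length;
        auto group = a[lo .. hi].dup;
        sort(group);
        medians[g] = group[$ / 2];
    }

    // 2. The pivot is the median of those medians, found recursively.
    immutable pivot = select(medians, medians.length / 2);

    // 3. Three-way partition around the pivot, then recurse on one side.
    int[] lt, eq, gt;
    foreach (x; a)
    {
        if (x < pivot) lt ~= x;
        else if (pivot < x) gt ~= x;
        else eq ~= x;
    }
    if (k < lt.length) return select(lt, k);
    if (k < lt.length + eq.length) return pivot;
    return select(gt, k - lt.length - eq.length);
}

void main()
{
    auto a = [9, 1, 8, 2, 7, 3, 6, 4, 5, 0];
    writeln(select(a, a.length / 2)); // prints 5
}

The worst-case guarantee holds because each recursion discards at least roughly 30% of the elements; the constant factor is the catch.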
November 17, 2013
On Sunday, 17 November 2013 at 01:39:43 UTC, Chris Cain wrote:
> (that is, if the quicksort were designed to be wise to this sort of trick; most, including D's quicksort, would actually shuffle everything around)

Actually, not everything. It'd swap the pivot and the last element (4 and 7 in the first iteration) and then swap them back after it figures out that the two partitions are good. There was a case where it ended up swapping everything around unnecessarily, but it wasn't this one... But I digress; the point remains that it would be slower than Timsort despite having a good pivot.
November 17, 2013
On Sunday, 17 November 2013 at 00:56:46 UTC, Andrei Alexandrescu wrote:
> On 11/16/13 4:18 PM, Chris Cain wrote:
>> You have not shown how much faster it might be (it could be
>> only 1% faster) nor how much work it would take to discover (even an
>> ideal pivot choice for quicksort actually cannot be as fast as Timsort
>> on an already sorted sequence, and quicksort is not an appropriate
>> algorithm for accepting presorted subsequences).
>
> There are well known tweaks to quicksort that make it operate well on sorted or almost sorted sequences.

If you tweak for one scenario, you're only losing in other scenarios. Timsort already performs ideally in these cases, as I've demonstrated. Rather than optimizing quicksort for cases in which Timsort will still be the victor, we should optimize for those cases in which quicksort is already faster.
November 17, 2013
On Sunday, 17 November 2013 at 01:48:14 UTC, Craig Dillabaugh wrote:
> 1. Find the median value in the array. This can be done
> deterministically in linear time,

My understanding is that for unordered data there is no practical algorithm that runs in worst-case O(n); quickselect is linear only on average, and the deterministic median-of-medians, though worst-case linear on paper, has too large a constant factor to be a sensible pivot step:

http://en.wikipedia.org/wiki/Selection_algorithm
http://en.wikipedia.org/wiki/Quickselect

> so from a theoretical point of view it is just as fast as picking a random element, since you use linear time.

Picking a random element is constant time, though. Do you mean that O(1) and O(n) are equivalent here because they're both smaller than O(n log n), quicksort's overall complexity?
November 17, 2013
On Saturday, 16 November 2013 at 22:11:46 UTC, Xinok wrote:
> And the results (last number is predicate calls):
>
> Current Unstable Sort  50ms  32783474
> New Unstable Sort      69ms  21503542
> Timsort                35ms  3905887

For the record, I tried both SwapStrategy options with my data (the data that got me to start this thread), and although Timsort did fewer comparisons (predicate calls), it ran about 20% slower.
November 17, 2013
On Friday, 15 November 2013 at 22:56:58 UTC, Craig Dillabaugh wrote:
> What are the arguments against using a randomized algorithm?

(1) Sort is capable of being marked pure, depending on the type being sorted and the predicate. But choosing random pivots means introducing side effects.

(2) Random pivots give you average-case performance, whereas choosing the pivot as a median of N has the potential to achieve better performance on partially sorted lists (fewer reads and writes).

(3) I've tested random pivots and found that choosing a median of three or five is often the better choice.
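For illustration, here is a minimal median-of-three index pick (a sketch, not Phobos's actual code). Unlike an RNG call, nothing here blocks purity inference, which ties back to point (1):

// Returns which of the three indices holds the median of the three values.
size_t medianOf3(T)(const(T)[] a, size_t i, size_t j, size_t k)
{
    if (a[i] < a[j])
        return a[j] < a[k] ? j : (a[i] < a[k] ? k : i);
    else
        return a[i] < a[k] ? i : (a[j] < a[k] ? k : j);
}

unittest
{
    auto a = [30, 10, 20];
    assert(medianOf3(a, 0, 1, 2) == 2); // 20 is the median, at index 2
}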
November 17, 2013
On Sunday, 17 November 2013 at 02:44:45 UTC, Vladimir Panteleev wrote:
> On Saturday, 16 November 2013 at 22:11:46 UTC, Xinok wrote:
>> And the results (last number is predicate calls):
>>
>> Current Unstable Sort  50ms  32783474
>> New Unstable Sort      69ms  21503542
>> Timsort                35ms  3905887
>
> For the record, I tried both SwapStrategy options with my data (the data that got me to start this thread), and although Timsort did fewer comparisons (predicate calls), it ran about 20% slower.

Could you try running a benchmark on the same data using this instead?

https://github.com/Xinok/XSort/blob/master/unstablesort.d

My implementation of quicksort has some odd tweaks which make it slower in some cases, but I've never invoked worst-case behavior by accident.
November 17, 2013
On 11/16/13 5:07 PM, Ivan Kazmenko wrote:
> The above is just my retelling of a great short article "A Killer
> Adversary for Quicksort" by M. D. McIlroy here:
> http://www.cs.dartmouth.edu/~doug/mdmspe.pdf

Nice story, but the setup is a tad tenuous (albeit indeed theoretically interesting). For starters, if the attacker has a hook into the comparison function, they could trivially do a lot worse...

Andrei

November 17, 2013
On 11/16/13 5:39 PM, Chris Cain wrote:
> Given [1,2,3,4,5,6,7]
> Picking the best pivot, 4, would still require scanning the entire array to verify that it is partitioned appropriately around the 4 (that is, if the quicksort were designed to be wise to this sort of trick; most, including D's quicksort, would actually shuffle everything around).

I keep thinking of a sort with delayed pivot selection that starts scanning the array from the left and the right and stops only when it detects that the range is unpartitioned. At that point it proceeds with pivot selection, but if the array is already sorted there's no need to partition the left and right portions that were already explored. Essentially, pivot selection would be preceded by a sampling of the left and right ends of the array.
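Something like these hypothetical helpers for the sampling step (a sketch of the idea, not anything in Phobos):

// Length of the maximal non-decreasing prefix of a[].
size_t sortedPrefix(T)(const(T)[] a)
{
    if (a.length == 0) return 0;
    size_t i = 1;
    while (i < a.length && !(a[i] < a[i - 1]))
        ++i;
    return i;
}

// Index at which the maximal non-decreasing suffix of a[] begins.
size_t sortedSuffix(T)(const(T)[] a)
{
    if (a.length == 0) return 0;
    size_t i = a.length - 1;
    while (i > 0 && !(a[i] < a[i - 1]))
        --i;
    return i;
}

When sortedPrefix(a) == a.length the sort can return immediately; otherwise the two explored ends provide a cheap sample to inform the pivot choice.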

> By that point, Timsort is already done and drinking the victory champagne. And the quicksort still has to recurse on the subsequences to sort them as well!

Yah, already sorted data is ideal for Timsort.


Andrei