January 18, 2016
On Monday, 18 January 2016 at 01:38:16 UTC, Andrei Alexandrescu wrote:
> On 1/17/16 8:07 PM, deadalnix wrote:
>> A common way to do it is to go quicksort, but only recurse on one side
>> of the set. That should give log(n)^2 complexity on average.
>
> Yah, that's quickselect (which this work started from). It's linear, and you can't get top n in sublinear time because you need to look at all elements. -- Andrei

Yeah; forget about me, I was dumb on that one.
January 18, 2016
On Sunday, 17 January 2016 at 22:20:30 UTC, Andrei Alexandrescu wrote:
> On 01/17/2016 03:32 PM, Ivan Kazmenko wrote:
>> Here is a more verbose version.
>
> OK, very nice. Thanks! I've modified topN to work as follows. In a loop:
>
> * If nth <= r.length / log2(r.length)^^2 (or is similarly close to r.length), use topNHeap with one heap and stop
> * If nth <= r.length / log2(r.length) (or is similarly close to r.length), use topNHeap with two heaps and stop
> * Take a quickselect step, adjust r and nth, and continue
>
> That seems to work nicely for all values involved.
>
> All - let me know how things can be further improved. Thx!
>
>
> Andrei
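
For reference, here is a minimal sketch in D of the loop described above. The names topNSketch and partitionStep and the exact threshold arithmetic are my own illustration, not the actual Phobos internals; the two-heap branch and the symmetric "close to r.length" cases are reduced to a comment, so only the shape of the control flow is shown.

import std.algorithm.mutation : swapAt;
import std.container.binaryheap : heapify;
import std.math : log2;

// Sketch of the hybrid strategy: small nth is handled with a heap,
// everything else with one quickselect partition step per iteration.
// (The real code also has a two-heap branch for nth <= n / log2(n)
// and mirrors both branches when nth is close to r.length.)
void topNSketch(int[] r, size_t nth)
{
    while (r.length > 1)
    {
        immutable n = r.length;
        immutable lg = cast(size_t) log2(cast(double) n);
        if (lg >= 2 && nth + 1 <= n / (lg * lg))
        {
            // Heap branch: max-heap over r[0 .. nth + 1]; every later
            // element smaller than the heap's maximum is swapped in.
            auto h = heapify(r[0 .. nth + 1]);
            foreach (i; nth + 1 .. n)
                if (r[i] < h.front)
                {
                    immutable tmp = h.front;
                    h.replaceFront(r[i]);
                    r[i] = tmp;
                }
            // The heap's front (at index 0) is the nth order statistic;
            // move it to its final position.
            r.swapAt(0, nth);
            return;
        }
        // Otherwise take one quickselect (partition) step and continue
        // on the side containing nth.
        immutable p = partitionStep(r);
        if (nth == p) return;
        if (nth < p) r = r[0 .. p];
        else { nth -= p + 1; r = r[p + 1 .. $]; }
    }
}

size_t partitionStep(int[] r)
{
    // Lomuto partition around the last element (deterministic pivot).
    immutable pivot = r[$ - 1];
    size_t store = 0;
    foreach (i; 0 .. r.length - 1)
        if (r[i] < pivot) r.swapAt(i, store++);
    r.swapAt(store, r.length - 1);
    return store;
}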

Here goes the test which shows quadratic behavior for the new version:
http://dpaste.dzfl.pl/e4b3bc26c3cf
(dpaste kills the slow code before it completes the task)

My local timings with "-O -release -noboundscheck -inline":
building Element array: 4989 msecs
checking Element array: 5018 msecs
checking int array: 626 msecs

With "-debug -unittest":
building Element array: 8384 msecs
checking Element array: 8380 msecs
checking int array: 2987 msecs

If we change the length MAX_N to something small enough to inspect, say, 50, the array is:
[0, 536870911, 2, 536870911, 4, 536870911, 6, 36, 8, 33, 10, 35, 12, 536870911, 14, 32, 16, 34, 18, 536870911, 536870911, 22, 23, 21, 20, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 31, 30, 536870911, 29, 536870911, 28, 536870911, 27, 536870911, 26, 536870911, 25, 536870911, 24, 536870911]

The old version (2.070.0-b2) could not be tricked by this; does it use randomization?

The inspiration is the paper "A Killer Adversary for Quicksort":
http://www.cs.dartmouth.edu/~doug/mdmspe.pdf
(I already mentioned it on the forums a while ago)

Ivan Kazmenko.

January 18, 2016
On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
> On Sunday, 17 January 2016 at 22:20:30 UTC, Andrei Alexandrescu wrote:
>> All - let me know how things can be further improved. Thx!
>>
> Here goes the test which shows quadratic behavior for the new version:
> http://dpaste.dzfl.pl/e4b3bc26c3cf
> (dpaste kills the slow code before it completes the task)
>
> The inspiration is the paper "A Killer Adversary for Quicksort":
> http://www.cs.dartmouth.edu/~doug/mdmspe.pdf
> (I already mentioned it on the forums a while ago)
>
> Ivan Kazmenko.

Perhaps I should include a textual summary as well.

The code on DPaste starts by constructing an array of Elements of size MAX_N; in the code, MAX_N is 50_000.  After that, we run the victim function on our array.  Here, the victim is topN(array, MAX_N / 2); it could be sort(array) or something else.

An Element contains, or rather, pretends to contain, an int value.  Here is how an Element is special: the result of comparing two Elements is decided on the fly.  An Element either is UNDECIDED or has a fixed value.  Initially, all elements are UNDECIDED.  When we compare two Elements and at least one of them has a fixed value, the comparison is resolved naturally, and an UNDECIDED element is greater than any fixed element.  When we compare two UNDECIDED elements, the one which has participated in more comparisons so far gets a fixed value: greater than any value fixed so far, but still less than UNDECIDED.  This way, the results of old comparisons remain consistent with the newly fixed value.
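
In condensed form, such an Element might look like this in D (an illustrative sketch, not the DPaste code; the sentinel int.max / 4 = 536870911 is chosen to match the array dump above):

enum UNDECIDED = int.max / 4; // sentinel: value not fixed yet
int nextValue = 0;            // lowest value not yet handed out

struct Element
{
    int value = UNDECIDED; // fixed value, or the UNDECIDED sentinel
    int uses;              // comparisons this element has taken part in
    size_t originalIndex;  // to restore the original order later

    int opCmp(ref Element rhs) // deliberately impure: it mutates state
    {
        uses++;
        rhs.uses++;
        if (value == UNDECIDED && rhs.value == UNDECIDED)
        {
            // Fix the more frequently compared element (likely the
            // current pivot) to the lowest value not yet handed out:
            // greater than everything fixed so far, below UNDECIDED.
            if (uses >= rhs.uses) value = nextValue++;
            else rhs.value = nextValue++;
        }
        // An UNDECIDED element now compares greater than any fixed one.
        return (value > rhs.value) - (value < rhs.value);
    }
}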

Now, what do we achieve by running the victim function?  It turns out that algorithms based on the idea of QuickSort or QuickSelect tend to make most of their comparisons against the current pivot value.  Our Element responds by fixing the pivot to one of the lowest available values.  After that, a partition around such a pivot will have only a few, O(1), elements before the pivot, and the rest after it.  In total, this leads to quadratic performance.

After running the victim function on our array of Elements (which, careful here, already takes quadratic time), we restore them to their original order (to do that, each Element also stores its original index).

Now, we can re-run the algorithm on the array obtained so far.  If the victim function is (strongly) pure, it will inevitably make the same comparisons in the same order.  The only difference is that their results will already be decided.

Alternatively, we can make an int array of the current values in our array of Elements (also in their original order).  Running the victim function on the int array must also make the same (quadratic number of) comparisons in the same order.
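
Schematically, assuming the Element sketch above (extractKiller is a made-up name):

import std.algorithm.sorting : topN;

// Turn the adversarial Elements back into a plain int[] killer input.
int[] extractKiller(Element[] elements)
{
    auto killer = new int[](elements.length);
    foreach (ref e; elements)
        killer[e.originalIndex] = e.value; // UNDECIDED slots keep the sentinel
    return killer;
}

// Re-running the victim on the plain array then replays the same
// quadratic sequence of comparisons:
//     auto killer = extractKiller(elements);
//     topN(killer, killer.length / 2);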

Ivan Kazmenko.

January 18, 2016
On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
> Here goes the test which shows quadratic behavior for the new version:
> http://dpaste.dzfl.pl/e4b3bc26c3cf
> (dpaste kills the slow code before it completes the task)

Correction: this is the result of removing a uniform call in pull 3921.  Since we are supposed to discuss the improvements related to pull 3934 here, I created a separate entry for the issue: https://issues.dlang.org/show_bug.cgi?id=15583.

January 18, 2016
On 01/18/2016 01:00 PM, Ivan Kazmenko wrote:
>
> The old version (2.070.0-b2) could not be tricked by this; does it use
> randomization?

Yes, it selected the pivot uniformly at random using the global RNG. (This is also why the documentation says topN is O(n) in expectation.)
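
In sketch form, that amounts to one draw per partition step, along the lines of the following (the actual helper is internal to std.algorithm; choosePivot is a stand-in name):

import std.random : uniform;

// Randomized pivot choice: uniform draws from the default
// (thread-global) RNG, which is what made the expected cost O(n).
size_t choosePivot(size_t length)
{
    return uniform(size_t(0), length);
}
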
January 18, 2016
On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:
> On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
>> Here goes the test which shows quadratic behavior for the new version:
>> http://dpaste.dzfl.pl/e4b3bc26c3cf
>> (dpaste kills the slow code before it completes the task)
>
> Correction: this is the result of removing a uniform call in pull 3921.  Since we are supposed to discuss the improvements related to pull 3934 here, I created a separate entry for the issue: https://issues.dlang.org/show_bug.cgi?id=15583.

An RNG doesn't improve the worst case. It only changes which permutation triggers the worst case. --Ilya
January 18, 2016
On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
> An RNG doesn't improve the worst case. It only changes which permutation triggers the worst case. --Ilya

Still, use of an RNG makes it impossible to construct the worst case beforehand, once and for all.  In that sense, this is a regression.

January 18, 2016
On 1/18/16 6:18 PM, Ilya wrote:
> On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:
>> On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
>>> Here goes the test which shows quadratic behavior for the new version:
>>> http://dpaste.dzfl.pl/e4b3bc26c3cf
>>> (dpaste kills the slow code before it completes the task)
>>
>> Correction: this is the result of removing a uniform call in pull
>> 3921.  Since we are supposed to discuss the improvements related to
>> pull 3934 here, I created a separate entry for the issue:
>> https://issues.dlang.org/show_bug.cgi?id=15583.
>
> An RNG doesn't improve the worst case. It only changes which
> permutation triggers the worst case. --Ilya

Well, it does improve things. The probability of hitting the worst case repeatedly is practically zero, and it's impossible to create an attack input in advance.

I'm not sure whether we should worry about this. Probably we should, because sort attacks are a press favorite.

The nice thing about not relying on randomness is that pure functions can call sort, topN, etc.

As a sort of compromise, I was thinking of seeding the RNG with not only the length of the range but also the integral representation of the address of its first element. This would still allow an attack if the input is always at the same address.
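
A sketch of that compromise (illustrative only; the engine choice and the way the two values are combined into a seed are arbitrary here):

import std.random : Mt19937, uniform;

// Seed a local PRNG from the range's length and the address of its
// first element, then pick the pivot with it.
size_t seededPivot(T)(T[] r)
{
    assert(r.length > 0);
    immutable seed = cast(uint) (r.length ^ cast(size_t) r.ptr);
    auto rng = Mt19937(seed);
    return uniform(size_t(0), r.length, rng);
}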


Thoughts?

Andrei
January 18, 2016
On 1/18/16 6:27 PM, Ivan Kazmenko wrote:
> On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
>> An RNG doesn't improve the worst case. It only changes which
>> permutation triggers the worst case. --Ilya
>
> Still, use of an RNG makes it impossible to construct the worst case
> beforehand, once and for all.  In that sense, this is a regression.

BTW I forgot to thank you for this analysis. This is good stuff. -- Andrei

January 18, 2016
On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:
> On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
>> An RNG doesn't improve the worst case. It only changes which permutation triggers the worst case. --Ilya
>
> Still, use of an RNG makes it impossible to construct the worst case beforehand, once and for all.  In that sense, this is a regression.

No, it is definitely possible, because the RNGs are pseudo-RNGs. --Ilya