January 16, 2016
On 1/16/16 9:37 PM, Timon Gehr wrote:
> Ivan's analysis suggests that even something significantly larger, like
> n/log(n)² might work as an upper bound for k.

I'm not clear on how you arrived at that bound. There are a few implementation details to be mindful of as well (quickly and carefully approximating the log, etc.). Could you please make a commented pull request against my own? Thanks! -- Andrei

January 17, 2016
On Sunday, 17 January 2016 at 03:26:54 UTC, Andrei Alexandrescu wrote:
> On 1/16/16 9:37 PM, Timon Gehr wrote:
>> Ivan's analysis suggests that even something significantly larger, like
>> n/log(n)² might work as an upper bound for k.
>
> I'm not clear on how you arrived at that bound. There are a few implementation details to be mindful of as well (quickly and carefully approximating the log, etc.). Could you please make a commented pull request against my own? Thanks! -- Andrei

size_t.sizeof * 8 - bsr(n) should be a good approximation for sorting. --Ilya
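For concreteness, a minimal sketch of a bsr-based integer log (the helper name integerLog2 is illustrative only, not something from the PR):

import core.bitop : bsr;

// bsr returns the index of the most significant set bit,
// so for n > 0 this is floor(log2(n)).
size_t integerLog2(size_t n)
{
    assert(n > 0);
    return bsr(n);
}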
January 17, 2016
On Sunday, 17 January 2016 at 03:26:54 UTC, Andrei Alexandrescu wrote:
> On 1/16/16 9:37 PM, Timon Gehr wrote:
>> Ivan's analysis suggests that even something significantly larger, like
>> n/log(n)² might work as an upper bound for k.
>
> I'm not clear on how you arrived at that bound. There are a few implementation details to be mindful of as well (quickly and carefully approximating the log, etc.). Could you please make a commented pull request against my own? Thanks! -- Andrei

The average case is O(n + (k log n log k)) for small enough k.  So, any k below roughly n / log^2 (n) makes the second summand less than the first.

Of course, the constants hidden by asymptotic analysis, as well as handling the worst case, may make this notion impractical.  Measuring is the way :).  The worst case is just a decreasing sequence, or an almost decreasing one if we want to play against branch prediction as well.
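For instance, the decreasing input can be generated like this (a throwaway snippet, not from the PR):

import std.array : array;
import std.range : iota, retro;

// The adversarial input described above: a strictly decreasing sequence.
auto decreasing(size_t n)
{
    return iota(n).retro.array;   // n-1, n-2, ..., 1, 0
}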

January 17, 2016
On Sunday, 17 January 2016 at 02:37:48 UTC, Timon Gehr wrote:
> On 01/17/2016 03:09 AM, Andrei Alexandrescu wrote:
>> On 1/16/16 8:00 PM, Timon Gehr wrote:
>>> The implementation falls back to topNHeap whenever k is within the first
>>> or last ~n/8 elements and therefore is Ω(n log n) on average.
>>
>> Correct. The pedantic approach would be to only do that when k is up to
>> log(n). -- Andrei
>>
>
> Ivan's analysis suggests that even something significantly larger, like n/log(n)² might work as an upper bound for k.
> I don't think that meeting the documented runtime bounds amounts to pedantry. (Having linear average running time of course does not even imply linear expected running time, as is still written in the documentation.)

I'd suggest being able to switch between implementations.

Recall that, when sorting, we have SwapStrategy.stable which is, well, stable, but additionally guarantees O(n log n), with a larger constant (MergeSort).

And we have SwapStrategy.unstable, which uses Introsort and, in turn, has a smaller constant on sane inputs.

Here, Andrei's suggestion is to also have two implementations, let us call them TopNStrategy.quickSelect and TopNStrategy.heap.

The quickSelect one is O(n) on sane inputs but O(n^2) on an artificial worst case.

The heap implementation is also O(n) when k is close to 0 or n, but O(n log n) otherwise.  What's important is that it is also O(n log n) worst case.

The current proposal switches between strategies based on a heuristic (k < n/8 or k > 7n/8 if I understand correctly), which may not be the best strategy in all cases.

So, what can be done is to introduce TopNStrategy.auto which is the default and uses the (current or better) heuristic to switch between strategies, but also leave a way to explicitly select one of the two strategies, just like one can do now with sort!(..., SwapStrategy.whatWeExplicitlyWant).
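A rough sketch of what such an interface could look like (TopNStrategy is hypothetical; the member is spelled automatic below because auto is a D keyword):

// Hypothetical interface sketch only; TopNStrategy does not exist in Phobos.
enum TopNStrategy
{
    automatic,    // default: pick a strategy via a heuristic on k and n
    quickSelect,  // O(n) on sane inputs, O(n^2) on an artificial worst case
    heap          // O(n log n) worst case, O(n) for k close to 0 or n
}

// A possible signature, analogous to sort!(..., SwapStrategy.xxx):
// auto topN(alias less = "a < b",
//           TopNStrategy strategy = TopNStrategy.automatic,
//           Range)(Range r, size_t nth);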

All the code for both strategies is already there.  Each strategy has its benefits not covered by the other (quickSelect is faster for k~n and sane inputs, heap is faster for k close to boundary and special inputs).  So, provide the default but let the user choose.

Ivan Kazmenko.

January 17, 2016
On 01/17/2016 06:55 AM, Ivan Kazmenko wrote:
> So, what can be done is to introduce TopNStrategy.auto which is the
> default and uses the (current or better) heuristic to switch between
> strategies, but also leave a way to explicitly select one of the two
> strategies, just like one can do now with sort!(...,
> SwapStrategy.whatWeExplicitlyWant).

A nice idea, but it seems a bit overengineered. The differences among approaches are rather subtle and explaining the circumstances under which one does better than the other is about as difficult as making the choice in the implementation.

BTW there's yet another approach:

1. Create a max heap for r[0 .. nth]

2. Create a min heap for r[nth .. $]

3. As long as r[nth] < r[0], swap them and restore the heap property in the two heaps

At the end of the process, the smallest element of r[nth .. $] sits exactly at position r[nth] and is greater than or equal to the largest element of r[0 .. nth], which is exactly what the doctor prescribed.

The complexity of this turns out rather nice. Step 1 is O(k), step 2 is O(n - k). In the worst case (r is sorted descending), step 3 stops after k iterations and does k heap insertions in each of the two heaps. So the overall complexity is O(n + k log(k) + k log(n)). Since k <= n, keep the largest terms: O(n + k log(n)). So if k is proportional to n / log(n), we get O(n). And that's the worst case!
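A rough sketch of that loop on top of std.container's binary heaps (the function name is a placeholder and this is not the code from the PR):

import std.container.binaryheap : heapify;

// r[0 .. nth] becomes a max-heap, r[nth .. $] a min-heap; while the smallest
// element of the top part is below the largest of the bottom part, exchange
// the two heap fronts and let replaceFront restore both heap properties.
void topNTwoHeaps(T)(T[] r, size_t nth)
{
    auto lo = heapify(r[0 .. nth]);           // max-heap (default "a < b")
    auto hi = heapify!"a > b"(r[nth .. $]);   // min-heap
    while (!lo.empty && !hi.empty && hi.front < lo.front)
    {
        auto a = lo.front;
        auto b = hi.front;
        lo.replaceFront(b);   // O(log k) sift-down
        hi.replaceFront(a);   // O(log (n - k)) sift-down
    }
}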

BTW, I figured out how to do stable partition. That'll come in a distinct PR.


Andrei

January 17, 2016
On 01/17/2016 06:41 AM, Ivan Kazmenko wrote:
> The average case is O(n + (k log n log k)) for small enough k. So, any k
> below roughly n / log^2 (n) makes the second summand less than the first.

I don't understand how you derived the average case complexity, and I don't understand how you derived the bound for k.

Regarding the complexity: for all k (small or large), it takes O(k) to build a heap. Then in the worst case we do (n - k) heap insertions, i.e. total complexity is O(k + (n - k) * log(k)). Is that correct?

If k is any fraction of n, the worst case complexity comes to O(n + n log n) i.e. O(n log n). By your calculation, the average complexity comes to O(n log^2 n), i.e. greater than the worst case. So there's a mistake somewhere.

Could you please clarify matters? Thanks!


Andrei

January 17, 2016
On Sunday, 17 January 2016 at 16:06:31 UTC, Andrei Alexandrescu wrote:
> On 01/17/2016 06:41 AM, Ivan Kazmenko wrote:
>> The average case is O(n + (k log n log k)) for small enough k. So, any k
>> below roughly n / log^2 (n) makes the second summand less than the first.
>
> I don't understand how you derived the average case complexity, and I don't understand how you derived the bound for k.
>
> Regarding the complexity: for all k (small or large), it takes O(k) to build a heap. Then in the worst case we do (n - k) heap insertions, i.e. total complexity is O(k + (n - k) * log(k)). Is that correct?

Yeah, I have the same notion.

> If k is any fraction of n, the worst case complexity comes to O(n + n log n) i.e. O(n log n). By your calculation, the average complexity comes to O(n log^2 n), i.e. greater than the worst case. So there's a mistake somewhere.

The upper bound O(n + k log k log n) is correct, but it's tight only for small k.  I'll explain below.

> Could you please clarify matters? Thanks!

Here is a more verbose version.

Suppose for simplicity that the array consists of real numbers drawn independently from the same continuous random distribution.
(A discrete random distribution would complicate the analysis a bit, since the probability of two elements being equal would become greater than zero.)
Then, element a_i is the minimum of a_1, a_2, ..., a_i with probability 1/i.
Moreover, it is the second minimum with probability 1/i, the third minimum with probability 1/i, ..., the maximum of a_1, a_2, ..., a_i with the same probability 1/i.
To prove this, we can show that, if we look at the ordering permutation p_1, p_2, ..., p_i such that a_{p_1} < a_{p_2} < ... < a_{p_i}, each of the i! possible permutations is equally probable.

What immediately follows is that when we track the k smallest elements and consider element i > k, the probability that it will change the k smallest elements encountered so far is k/i.
Summing that for i from k+1 to n, we get the expected number of heap insertions:
(k/(k+1)) + (k/(k+2)) + ... + k/n.

Now, this is just k multiplied by:
(1/1 + 1/2 + 1/3 + ... + 1/n) *minus*
(1/1 + 1/2 + 1/3 + ... + 1/k).
As these are harmonic series, the first line is asymptotically equal to log n, and the second line is asymptotically equal to log k.
To summarize, the expected number of heap insertions for an array of random reals is k (log n - log k), and the complexity is O(n + k log k (log n - log k)).

What I did here for simplicity is just omit the "minus logarithm of k" part to get an upper bound.
For small enough k, the bound is tight.
For example, when k=sqrt(n), log n - log k = log(n/k) = (log n) / 2.
On the other hand, when k=Theta(n), log n - log k = log(n/k) = Theta(1).
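As a quick sanity check of the k (log n - log k) figure, a small standalone simulation (illustrative only, not from the PR) that counts the heap updates over a random shuffle:

import std.array : array;
import std.container.binaryheap : heapify;
import std.math : log;
import std.random : randomShuffle;
import std.range : iota;
import std.stdio : writefln;

void main()
{
    enum n = 100_000;
    enum k = 100;
    auto a = iota(n).array;
    randomShuffle(a);

    // Track the k smallest elements seen so far in a max-heap and count
    // how often a later element displaces the current maximum.
    auto heap = heapify(a[0 .. k].dup);
    size_t updates = 0;
    foreach (x; a[k .. $])
        if (x < heap.front)
        {
            heap.replaceFront(x);
            ++updates;
        }

    writefln("observed updates: %s, predicted k*(ln n - ln k) ~ %.0f",
             updates, k * (log(cast(real) n) - log(cast(real) k)));
}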

Ivan Kazmenko.

January 17, 2016
On 01/17/2016 03:32 PM, Ivan Kazmenko wrote:
> Here is a more verbose version.

OK, very nice. Thanks! I've modified topN to work as follows. In a loop:

* If nth <= r.length / log2(r.length)^^2 (or is similarly close to r.length), use topNHeap with one heap and stop
* If nth <= r.length / log2(r.length) (or is similarly close to r.length), use topNHeap with two heaps and stop
* Take a quickselect step, adjust r and nth, and continue

That seems to work nicely for all values involved.
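In rough code, the dispatch rule per iteration reads like this (chooseStep and the Step enum are placeholders for illustration, not the actual names in the PR):

import core.bitop : bsr;
import std.algorithm.comparison : min;

enum Step { oneHeap, twoHeaps, quickSelectStep }

// Decide what to do for the current subrange of length n and index nth.
Step chooseStep(size_t n, size_t nth)
{
    assert(n > 0 && nth < n);
    immutable size_t lg = n > 1 ? bsr(n) : 1;   // ~ floor(log2(n))
    immutable dist = min(nth, n - 1 - nth);     // distance of nth to the nearer end
    if (dist <= n / (lg * lg)) return Step.oneHeap;   // heap covering the short side
    if (dist <= n / lg)        return Step.twoHeaps;  // heaps on both sides
    return Step.quickSelectStep;                // one partition step, then loop again
}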

All - let me know how things can be further improved. Thx!


Andrei
January 18, 2016
On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
> https://github.com/D-Programming-Language/phobos/pull/3934
>
> So, say you're looking for the smallest 10 elements out of 100_000. The quickselect algorithm (which topN currently uses) will successively partition the set into (assuming the pivot choice works well) 50_000, 25_000, etc. chunks all the way down to finding the smallest 10.
>
> That's quite a bit of work, so 3934 uses an alternate strategy for finding the smallest 10:
>
> 1. Organize the first 11 elements into a max heap
>
> 2. Scan all other elements progressively. Whenever an element is found that is smaller than the largest in the heap, swap it with the largest in the heap then restore the heap property.
>
> 3. At the end, swap the largest in the heap with the 10th and you're done!
>
> This is very effective, and is seldom referred to in the selection literature. In fact, a less efficient approach (heapify the entire range) is discussed more often.
>
>
> Destroy!
>
> Andrei

A common way to do it is to go quicksort, but only recurse on one side of the set. That should give log(n)^2 complexity on average.
January 17, 2016
On 1/17/16 8:07 PM, deadalnix wrote:
> A common way to do it is to go quicksort, but only recurse on one side
> of the set. That should give log(n)^2 complexity on average.

Yah, that's quickselect (which this work started from). It's linear, and you can't get top n in sublinear time because you need to look at all elements. -- Andrei
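For readers following along, a minimal quickselect sketch (Lomuto partition with the first element as pivot, for brevity; not the Phobos implementation, and exactly the variant that degrades to O(n^2) on the decreasing inputs discussed earlier):

import std.algorithm.mutation : swap;

// After the call, r[nth] holds the element that would be at index nth if r
// were fully sorted, with no larger element before it and no smaller one after.
void quickSelect(T)(T[] r, size_t nth)
{
    assert(nth < r.length);
    while (r.length > 1)
    {
        // Lomuto partition around r[0].
        size_t store = 1;
        foreach (i; 1 .. r.length)
            if (r[i] < r[0])
            {
                swap(r[i], r[store]);
                ++store;
            }
        immutable p = store - 1;
        swap(r[0], r[p]);            // pivot lands at its final index p
        if (nth == p) return;
        if (nth < p)
            r = r[0 .. p];
        else
        {
            nth -= p + 1;
            r = r[p + 1 .. $];
        }
    }
}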