topN using a heap

https://github.com/D-Programming-Language/phobos/pull/3934 So, say you're looking for the smallest 10 elements out of 100_000. The quickselect algorithm (which topN currently uses) will successively partition the set in (assuming the pivot choice works well) 50_000, 25_000, etc chunks all the way down to finding the smallest 10. That's quite a bit of work, so 3934 uses an alternate strategy for finding the smallest 10: 1. Organize the first 11 elements into a max heap 2. Scan all other elements progressively. Whenever an element is found that is smaller than the largest in the heap, swap it with the largest in the heap then restore the heap property. 3. At the end, swap the largest in the heap with the 10th and you're done! This is very effective, and is seldom referred in the selection literature. In fact, a more inefficient approach (heapify the entire range) is discussed more often. Destroy! Andrei

January 16, 2016

Re: topN using a heap

Posted by Ivan Kazmenko
in reply to Andrei Alexandrescu

Permalink

Ivan Kazmenko

Posted in reply to Andrei Alexandrescu

Permalink

On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
> That's quite a bit of work, so 3934 uses an alternate strategy for finding the smallest 10:
>
> 1. Organize the first 11 elements into a max heap
>
> 2. Scan all other elements progressively. Whenever an element is found that is smaller than the largest in the heap, swap it with the largest in the heap then restore the heap property.
>
> 3. At the end, swap the largest in the heap with the 10th and you're done!
>
> This is very effective, and is seldom referred in the selection literature. In fact, a more inefficient approach (heapify the entire range) is discussed more often.

Yeah, this sounds nice.  Suppose we have an array of length n and want to find the smallest k elements.  The first step (heapify) is O(k) operations.

1. Let us analyze the rest for "small enough" k.

1a. If the array is "random", the expected number of insertions is the sum of probabilities that each new element is inserted.  This number is (k/(k+1)) + (k/(k+2)) + ... + k/n, which is k times a part of harmonic series, that is, on the order of k log n.  In each case, O(log k) operations are required to process the new element.  In total, the "average" case takes (n-k) + (k log n log k), a total of O(n) for small enough values of k.

1b. In the worst case (each element is less than the current top), this still requires (n-k) log k operations, so we can say the total is O(n log k).

2. For k of the same order as n, the algorithm is O(n log n), just as a regular HeapSort.

To summarize:
For k<<n we have O(n) average case and O(n log k) worst case.
For k~n we have O(n log n) as HeapSort.

On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote: > 3. At the end, swap the largest in the heap with the 10th and you're done! And why this? Do we additionally require the k-th element to arrive exactly on k-th place?

On 01/16/2016 10:28 PM, Ivan Kazmenko wrote: > On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote: >> ... > > To summarize: > For k<<n we have O(n) average case and O(n log k) worst case. > For k~n we have O(n log n) as HeapSort. > The implementation falls back to topNHeap whenever k is within the first or last ~n/8 elements and therefore is Ω(n log n) on average.

On 1/16/16 8:00 PM, Timon Gehr wrote: > On 01/16/2016 10:28 PM, Ivan Kazmenko wrote: >> On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote: >>> ... >> >> To summarize: >> For k<<n we have O(n) average case and O(n log k) worst case. >> For k~n we have O(n log n) as HeapSort. >> > > The implementation falls back to topNHeap whenever k is within the first > or last ~n/8 elements and therefore is Ω(n log n) on average. Correct. The pedantic approach would be to only do that when k is up to log(n). -- Andrei

On 1/16/16 5:27 PM, Ivan Kazmenko wrote: > On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote: >> 3. At the end, swap the largest in the heap with the 10th and you're >> done! > > And why this? Do we additionally require the k-th element to arrive > exactly on k-th place? Yes. -- Andrei

On 1/16/16 4:58 PM, Ziad Hatahet via Digitalmars-d wrote: > On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d > <digitalmars-d@puremagic.com <mailto:digitalmars-d@puremagic.com>> wrote: > > > 1. Organize the first 11 elements into a max heap > > > Why not the first 10? So you get to put the appropriate element in the 11th position. -- Andrei

On 1/16/16 9:12 PM, Andrei Alexandrescu wrote: > On 1/16/16 4:58 PM, Ziad Hatahet via Digitalmars-d wrote: >> On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d >> <digitalmars-d@puremagic.com <mailto:digitalmars-d@puremagic.com>> wrote: >> >> >> 1. Organize the first 11 elements into a max heap >> >> >> Why not the first 10? > > So you get to put the appropriate element in the 11th position. -- Andrei To clarify this using a degenerate case: say someone calls topN(r, 0) i.e. find the minimum. Then you'd need a mini-heap of exactly one element to track the smallest element found so far. -- Andrei

On 01/17/2016 03:09 AM, Andrei Alexandrescu wrote: > On 1/16/16 8:00 PM, Timon Gehr wrote: >> On 01/16/2016 10:28 PM, Ivan Kazmenko wrote: >>> On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote: >>>> ... >>> >>> To summarize: >>> For k<<n we have O(n) average case and O(n log k) worst case. >>> For k~n we have O(n log n) as HeapSort. >>> >> >> The implementation falls back to topNHeap whenever k is within the first >> or last ~n/8 elements and therefore is Ω(n log n) on average. > > Correct. The pedantic approach would be to only do that when k is up to > log(n). -- Andrei > Ivan's analysis suggests that even something significantly larger, like n/log(n)² might work as an upper bound for k. I don't think that meeting the documented runtime bounds amounts to pedantry. (Having linear average running time of course does not even imply linear expected running time, as is still written in the documentation.)

Forums