Jump to page: 1 28  
Page
Thread overview
topN using a heap
Jan 16, 2016
Ivan Kazmenko
Jan 17, 2016
Timon Gehr
Jan 17, 2016
Timon Gehr
Jan 17, 2016
Ilya Yaroshenko
Jan 17, 2016
Ivan Kazmenko
Jan 17, 2016
Ivan Kazmenko
Jan 18, 2016
Ivan Kazmenko
Jan 18, 2016
Ivan Kazmenko
Jan 18, 2016
Ivan Kazmenko
Jan 18, 2016
Ilya
Jan 18, 2016
Ivan Kazmenko
Jan 18, 2016
Ilya
Jan 18, 2016
Ilya
Jan 18, 2016
Timon Gehr
Jan 18, 2016
Ilya
Jan 19, 2016
Ilya
Jan 19, 2016
Ivan Kazmenko
Jan 20, 2016
Ivan Kazmenko
Jan 19, 2016
Xinok
Jan 19, 2016
Ivan Kazmenko
Jan 19, 2016
Timon Gehr
Sep 21, 2016
Jon Degenhardt
Sep 21, 2016
Jon Degenhardt
Sep 23, 2016
Stefan Koch
Sep 23, 2016
Jon Degenhardt
Sep 25, 2016
Jon Degenhardt
Jan 18, 2016
deadalnix
Jan 19, 2016
deadalnix
Jan 18, 2016
Ilya
Jan 18, 2016
Timon Gehr
Jan 18, 2016
Ilya
Jan 19, 2016
Timon Gehr
Jan 19, 2016
Ilya
Jan 19, 2016
Timon Gehr
Jan 19, 2016
Ilya
Jan 19, 2016
Ilya
Jan 19, 2016
Ilya
Jan 19, 2016
Ilya
Jan 18, 2016
Timon Gehr
Jan 19, 2016
Ivan Kazmenko
Jan 18, 2016
Timon Gehr
Jan 17, 2016
Ivan Kazmenko
Jan 16, 2016
Ziad Hatahet
Jan 16, 2016
Ivan Kazmenko
Jan 18, 2016
deadalnix
Jan 18, 2016
deadalnix
January 16, 2016
https://github.com/D-Programming-Language/phobos/pull/3934

So, say you're looking for the smallest 10 elements out of 100_000. The quickselect algorithm (which topN currently uses) will successively partition the set in (assuming the pivot choice works well) 50_000, 25_000, etc chunks all the way down to finding the smallest 10.

That's quite a bit of work, so 3934 uses an alternate strategy for finding the smallest 10:

1. Organize the first 11 elements into a max heap

2. Scan all other elements progressively. Whenever an element is found that is smaller than the largest in the heap, swap it with the largest in the heap then restore the heap property.

3. At the end, swap the largest in the heap with the 10th and you're done!

This is very effective, and is seldom referred in the selection literature. In fact, a more inefficient approach (heapify the entire range) is discussed more often.


Destroy!

Andrei
January 16, 2016
On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
> That's quite a bit of work, so 3934 uses an alternate strategy for finding the smallest 10:
>
> 1. Organize the first 11 elements into a max heap
>
> 2. Scan all other elements progressively. Whenever an element is found that is smaller than the largest in the heap, swap it with the largest in the heap then restore the heap property.
>
> 3. At the end, swap the largest in the heap with the 10th and you're done!
>
> This is very effective, and is seldom referred in the selection literature. In fact, a more inefficient approach (heapify the entire range) is discussed more often.

Yeah, this sounds nice.  Suppose we have an array of length n and want to find the smallest k elements.  The first step (heapify) is O(k) operations.

1. Let us analyze the rest for "small enough" k.

1a. If the array is "random", the expected number of insertions is the sum of probabilities that each new element is inserted.  This number is (k/(k+1)) + (k/(k+2)) + ... + k/n, which is k times a part of harmonic series, that is, on the order of k log n.  In each case, O(log k) operations are required to process the new element.  In total, the "average" case takes (n-k) + (k log n log k), a total of O(n) for small enough values of k.

1b. In the worst case (each element is less than the current top), this still requires (n-k) log k operations, so we can say the total is O(n log k).

2. For k of the same order as n, the algorithm is O(n log n), just as a regular HeapSort.

To summarize:
For k<<n we have O(n) average case and O(n log k) worst case.
For k~n we have O(n log n) as HeapSort.

January 16, 2016
On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d < digitalmars-d@puremagic.com> wrote:

>
> 1. Organize the first 11 elements into a max heap
>
>
Why not the first 10?


January 16, 2016
On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
> 3. At the end, swap the largest in the heap with the 10th and you're done!

And why this?  Do we additionally require the k-th element to arrive exactly on k-th place?

January 17, 2016
On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:
> On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
>> ...
>
> To summarize:
> For k<<n we have O(n) average case and O(n log k) worst case.
> For k~n we have O(n log n) as HeapSort.
>

The implementation falls back to topNHeap whenever k is within the first or last ~n/8 elements and therefore is Ω(n log n) on average.
January 16, 2016
On 1/16/16 8:00 PM, Timon Gehr wrote:
> On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:
>> On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
>>> ...
>>
>> To summarize:
>> For k<<n we have O(n) average case and O(n log k) worst case.
>> For k~n we have O(n log n) as HeapSort.
>>
>
> The implementation falls back to topNHeap whenever k is within the first
> or last ~n/8 elements and therefore is Ω(n log n) on average.

Correct. The pedantic approach would be to only do that when k is up to log(n). -- Andrei

January 16, 2016
On 1/16/16 5:27 PM, Ivan Kazmenko wrote:
> On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
>> 3. At the end, swap the largest in the heap with the 10th and you're
>> done!
>
> And why this?  Do we additionally require the k-th element to arrive
> exactly on k-th place?

Yes. -- Andrei

January 16, 2016
On 1/16/16 4:58 PM, Ziad Hatahet via Digitalmars-d wrote:
> On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d
> <digitalmars-d@puremagic.com <mailto:digitalmars-d@puremagic.com>> wrote:
>
>
>     1. Organize the first 11 elements into a max heap
>
>
> Why not the first 10?

So you get to put the appropriate element in the 11th position. -- Andrei

January 16, 2016
On 1/16/16 9:12 PM, Andrei Alexandrescu wrote:
> On 1/16/16 4:58 PM, Ziad Hatahet via Digitalmars-d wrote:
>> On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d
>> <digitalmars-d@puremagic.com <mailto:digitalmars-d@puremagic.com>> wrote:
>>
>>
>>     1. Organize the first 11 elements into a max heap
>>
>>
>> Why not the first 10?
>
> So you get to put the appropriate element in the 11th position. -- Andrei

To clarify this using a degenerate case: say someone calls topN(r, 0) i.e. find the minimum. Then you'd need a mini-heap of exactly one element to track the smallest element found so far. -- Andrei

January 17, 2016
On 01/17/2016 03:09 AM, Andrei Alexandrescu wrote:
> On 1/16/16 8:00 PM, Timon Gehr wrote:
>> On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:
>>> On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
>>>> ...
>>>
>>> To summarize:
>>> For k<<n we have O(n) average case and O(n log k) worst case.
>>> For k~n we have O(n log n) as HeapSort.
>>>
>>
>> The implementation falls back to topNHeap whenever k is within the first
>> or last ~n/8 elements and therefore is Ω(n log n) on average.
>
> Correct. The pedantic approach would be to only do that when k is up to
> log(n). -- Andrei
>

Ivan's analysis suggests that even something significantly larger, like n/log(n)² might work as an upper bound for k.
I don't think that meeting the documented runtime bounds amounts to pedantry. (Having linear average running time of course does not even imply linear expected running time, as is still written in the documentation.)
« First   ‹ Prev
1 2 3 4 5 6 7 8