Jump to page: 1 25  
Page
Thread overview
how to get top N distinct elements from range?
Mar 08, 2013
Andrea Fontana
Mar 08, 2013
Ary Borenszweig
Mar 08, 2013
bearophile
Mar 08, 2013
Andrea Fontana
Mar 08, 2013
jerro
Mar 08, 2013
Andrea Fontana
Mar 08, 2013
Ivan Kazmenko
Mar 08, 2013
Andrea Fontana
Mar 08, 2013
bearophile
Mar 08, 2013
bearophile
Mar 08, 2013
Chris Cain
Mar 09, 2013
bearophile
Mar 08, 2013
bearophile
Mar 08, 2013
bearophile
Mar 08, 2013
Andrea Fontana
Mar 08, 2013
bearophile
Mar 08, 2013
bearophile
Mar 08, 2013
Andrea Fontana
Mar 08, 2013
Andrea Fontana
Mar 08, 2013
bearophile
Mar 08, 2013
bearophile
Mar 09, 2013
Andrea Fontana
Mar 09, 2013
bearophile
Mar 09, 2013
Andrea Fontana
Mar 09, 2013
bearophile
Mar 09, 2013
bearophile
Mar 09, 2013
bearophile
Mar 09, 2013
Andrea Fontana
Mar 09, 2013
Andrea Fontana
Mar 09, 2013
Andrea Fontana
Mar 09, 2013
bearophile
Mar 09, 2013
Timon Gehr
Mar 09, 2013
bearophile
Mar 08, 2013
bearophile
Mar 08, 2013
Andrea Fontana
Mar 08, 2013
jerro
Mar 08, 2013
Ivan Kazmenko
Mar 08, 2013
Chris Cain
Mar 08, 2013
Ivan Kazmenko
Mar 08, 2013
Jacob Carlborg
March 08, 2013
I wonder if exists a way to get top n distinct elements from a range (infinite too!)

A (not efficient) way to to this is range.array.sort.uniq.take(n) but it's a bit overkill, it sorts elements, and of course doesn't work with infinite ranges. Am i missing any function?



March 08, 2013
On 3/8/13 10:33 AM, Andrea Fontana wrote:
> I wonder if exists a way to get top n distinct elements from a range
> (infinite too!)

Top from an infinite range? o_O
March 08, 2013
Andrea Fontana:

> I wonder if exists a way to get top n distinct elements from a range (infinite too!)
>
> A (not efficient) way to to this is range.array.sort.uniq.take(n) but it's a bit overkill, it sorts elements, and of course doesn't work with infinite ranges. Am i missing any function?

What does it mean "top"?

I think you are not missing functions.

One solution (not tested):


bool[ForeachType!(typeof(myRange))] mySet;
foreach (item; myRange) {
    mySet[item] = true;
    if (mySet.length >= n)
        break;
}
auto top = mySet.byKey;


Bye,
bearophile
March 08, 2013
On Friday, 8 March 2013 at 14:10:55 UTC, bearophile wrote:
> Andrea Fontana:
>
>> I wonder if exists a way to get top n distinct elements from a range (infinite too!)
>>
>> A (not efficient) way to to this is range.array.sort.uniq.take(n) but it's a bit overkill, it sorts elements, and of course doesn't work with infinite ranges. Am i missing any function?
>
> What does it mean "top"?
>
> I think you are not missing functions.
>
> One solution (not tested):
>
>
> bool[ForeachType!(typeof(myRange))] mySet;
> foreach (item; myRange) {
>     mySet[item] = true;
>     if (mySet.length >= n)
>         break;
> }
> auto top = mySet.byKey;
>
>
> Bye,
> bearophile


Yes I wonder if a lazy function like this function was in phobos. It's quite common, and I didn't want to reinvent wheel :)

March 08, 2013
On Friday, 8 March 2013 at 13:33:24 UTC, Andrea Fontana wrote:
> I wonder if exists a way to get top n distinct elements from a range (infinite too!)

It's impossible to do that for infinite ranges

> A (not efficient) way to to this is range.array.sort.uniq.take(n) but it's a bit overkill, it sorts elements, and of course doesn't work with infinite ranges. Am i missing any function?

You could do something like (untested code):

for(size_t m = 0, mnext = n; ;
    n = mnext, mnext = min(2 * mnext, range.length))
{
    partialSort!condition(range[m .. $], mnext - m);
    if(range[0 .. mnext].uniq.count >= n)
        break;

    assert(mnext < range.length);
}

auto topN = range.uniq.take(n);

You will need a random access range with slicing for that, and it will modify it. If I am not mistaken, this has time complexity O(log(ndup / n) * range.length), where ndup is the number of all elements (including duplicates) with values among the top n.
March 08, 2013
> I wonder if exists a way to get top n distinct elements from a range (infinite too!)

> A (not efficient) way to to this is range.array.sort.uniq.take(n) but it's a bit overkill, it sorts elements, and of course doesn't work with infinite ranges. Am i missing any function?

For an arbitrary infinite range, it is impossible from the mathematical point of view.  An impossible example would be doing that with the infinite range of natural numbers, stored in bignums.

For a finite range, you can iterate over the range while maintaining a collection of the top n elements and discarding duplicates as you find them.  For example, using a RedBlackTree as the collection, you will get O(log(n) * range.length) total time and O(n) total memory used.  If n is small (on the order of 10s), an array will suffice, resulting in O(n * range.length) total time but with lower constant factor.
March 08, 2013
On Friday, 8 March 2013 at 15:04:32 UTC, Ivan Kazmenko wrote:
>
> For a finite range, you can iterate over the range while maintaining a collection of the top n elements and discarding duplicates as you find them.  For example, using a RedBlackTree as the collection, you will get O(log(n) * range.length) total time and O(n) total memory used.  If n is small (on the order of 10s), an array will suffice, resulting in O(n * range.length) total time but with lower constant factor.

You can also use an array if you use a BinaryHeap. If you use a heap then it should have a much lower constant factor than a red-black tree so, even though they would both have O(m*log(n)) (where n is the number of elements you want the top of and m is the size of the collection) running time, a heap would be faster.
March 08, 2013
On Friday, 8 March 2013 at 14:43:29 UTC, jerro wrote:
> On Friday, 8 March 2013 at 13:33:24 UTC, Andrea Fontana wrote:
>> I wonder if exists a way to get top n distinct elements from a range (infinite too!)
>
> It's impossible to do that for infinite ranges

Why?

sequence!"n*2".myTopDistinct!"a==b"(3);

will give [2,4,6]

We just need to keep a list of "just-seen" elements
But I don't know how to make it lazy
March 08, 2013
>>> I wonder if exists a way to get top n distinct elements from a range (infinite too!)
>>
>> It's impossible to do that for infinite ranges
>
> Why?
>
> sequence!"n*2".myTopDistinct!"a==b"(3);
>
> will give [2,4,6]

I don't quite get it, do you want the least n elements?  What result would you want if the sequence was the following:

sequence!"-n*2".myTopDistinct!"a==b"(3);

The sequence goes like [-2, -4, -6, -8, -10, ...] ad infinitum.

To clarify a doubt, do you mean "top" as in "front", or "top" as the "greatest" or "least" in terms of comparison?  Or maybe "the first n distinct elements found"?  If so, why "sort" in the original post?  It could bring middle elements to front.
March 08, 2013
>> For a finite range, you can iterate over the range while maintaining a collection of the top n elements and discarding duplicates as you find them.  For example, using a RedBlackTree as the collection, you will get O(log(n) * range.length) total time and O(n) total memory used.  If n is small (on the order of 10s), an array will suffice, resulting in O(n * range.length) total time but with lower constant factor.
>
> You can also use an array if you use a BinaryHeap. If you use a heap then it should have a much lower constant factor than a red-black tree so, even though they would both have O(m*log(n)) (where n is the number of elements you want the top of and m is the size of the collection) running time, a heap would be faster.

My first thought was about a heap, too, but it does not allow to search for a duplicate in O(log(n)) as it guarantees only partial ordering.
« First   ‹ Prev
1 2 3 4 5