March 08, 2013
On Friday, 8 March 2013 at 16:12:37 UTC, Ivan Kazmenko wrote:
>>>> I wonder if exists a way to get top n distinct elements from a range (infinite too!)
>>>
>>> It's impossible to do that for infinite ranges
>>
>> Why?
>>
>> sequence!"n*2".myTopDistinct!"a==b"(3);
>>
>> will give [2,4,6]
>
> I don't quite get it, do you want the least n elements?  What
> result would you want if the sequence was the following:
>
> sequence!"-n*2".myTopDistinct!"a==b"(3);

Infinite loop as:

sequence!"n*2".filter!"a>1";
and a lot of other cases.

>
> The sequence goes like [-2, -4, -6, -8, -10, ...] ad infinitum.
>
> To clarify a doubt, do you mean "top" as in "front", or "top" as the "greatest" or "least" in terms of comparison?  Or maybe "the first n distinct elements found"?  If so, why "sort" in the original post?  It could bring middle elements to front.

I mean "the first n distinct elements found". Example was a mistake. Uniq works with sorted range (if you want to get distinct element) so I sort it to make it works. But I should not :) Whoops. Bearophile got it btw.

So, is there any lazy way to do it?


March 08, 2013
On Friday, 8 March 2013 at 15:53:56 UTC, Andrea Fontana wrote:
> On Friday, 8 March 2013 at 14:43:29 UTC, jerro wrote:
>> On Friday, 8 March 2013 at 13:33:24 UTC, Andrea Fontana wrote:
>>> I wonder if exists a way to get top n distinct elements from a range (infinite too!)
>>
>> It's impossible to do that for infinite ranges
>
> Why?
>
> sequence!"n*2".myTopDistinct!"a==b"(3);
>
> will give [2,4,6]

I didn't understand you correctly then. I thought that top n distinct elements meant n largest distinct elements. Now, I'm assuming that you just need first n distinct elements.

> We just need to keep a list of "just-seen" elements
> But I don't know how to make it lazy

You could write your own range that uses an associative array to check for duplicates:

struct Unique(R)
{
    R r;
    bool[elementType!R] seen;

    @property front(){ return r.front; }
    @property empty(){ return r.empty; }
    void popFront()
    {
        seen[r] = bool.init;
        while(r.front in seen)
            r.popFront();
    }
}

I don't think there's anything like this in Phobos.
March 08, 2013
Andrea Fontana:

> So, is there any lazy way to do it?

Take my code and put it inside a struct, add empty, popFront and front.

Otherwise return an impure take of a filter closure from a function that keeps the set. Something like (untested):

auto firstDistinct(Range)(Range r, size_t n) {
    bool[ForeachType!(Range)] mySet;
    return f.filter!((k) {
        if (k in mySet)
            return false;
        mySet[k] = true;
        return true;
    }).take(n);
}

Bye,
bearophile
March 08, 2013
> Otherwise return an impure take of a filter closure from a function that keeps the set. Something like (untested):

It seems to work:



import std.stdio, std.range, std.algorithm, std.traits;

auto firstDistinct(Range)(Range r, in size_t n) {
    bool[ForeachType!Range] mySet;

    return r.filter!((k) {
        if (k in mySet)
            return false;
        mySet[k] = true;
        return true;
    }).take(n);
}

void main() {
    immutable data = [1, 0, 0, 0, 0, 3, 3, 0, 3, 3, 0, 2, 3, 2, 3, 2,
                      3, 1, 1, 3, 2, 1, 0, 0, 0, 1, 0, 3, 0, 3];
    foreach (i; 0 .. 6)
        data.firstDistinct(i).writeln;
}


Bye,
bearophile
March 08, 2013
On Friday, 8 March 2013 at 18:17:22 UTC, bearophile wrote:
>> Otherwise return an impure take of a filter closure from a function that keeps the set. Something like (untested):
>
> It seems to work:
>
>
>
> import std.stdio, std.range, std.algorithm, std.traits;
>
> auto firstDistinct(Range)(Range r, in size_t n) {
>     bool[ForeachType!Range] mySet;
>
>     return r.filter!((k) {
>         if (k in mySet)
>             return false;
>         mySet[k] = true;
>         return true;
>     }).take(n);
> }
>

I have to say, that's a great example of beautiful simplicity in programming. Bravo.

March 08, 2013
It needs a constraint:


auto firstDistinct(Range)(Range r, in size_t n)
if (isInputRange!Range) {
    bool[ForeachType!Range] mySet;

    return r.filter!((k) {
        if (k in mySet)
            return false;
        mySet[k] = true;
        return true;
    }).take(n);
}


Bye,
bearophile

March 08, 2013
> auto firstDistinct(Range)(Range r, in size_t n)
> if (isInputRange!Range) {
>     bool[ForeachType!Range] mySet;
>
>     return r.filter!((k) {
>         if (k in mySet)
>             return false;
>         mySet[k] = true;
>         return true;
>     }).take(n);
> }

I think the standard library of Ada2012 has bounded containers, so you can use them to stack-allocate a hash set that can contain up to n items (plus some more free cells to keep its load factor low enough).

In Rust there are fully safe stack-allocated closures. Combining the two things it's possible to create a function like that firstDistinct() with zero heap allocations, that is probably fast.

Bye,
bearophile
March 08, 2013
On 2013-03-08 14:33, Andrea Fontana wrote:
> I wonder if exists a way to get top n distinct elements from a range
> (infinite too!)
>
> A (not efficient) way to to this is range.array.sort.uniq.take(n) but
> it's a bit overkill, it sorts elements, and of course doesn't work with
> infinite ranges. Am i missing any function?

There's a function called "topN" in std.algorithm, I don't know if that's what you're looking for.

http://dlang.org/phobos/std_algorithm.html#topN

-- 
/Jacob Carlborg
March 08, 2013
On Friday, 8 March 2013 at 19:36:04 UTC, bearophile wrote:
>> auto firstDistinct(Range)(Range r, in size_t n)
>> if (isInputRange!Range) {
>>    bool[ForeachType!Range] mySet;
>>
>>    return r.filter!((k) {
>>        if (k in mySet)
>>            return false;
>>        mySet[k] = true;
>>        return true;
>>    }).take(n);
>> }
>
> I think the standard library of Ada2012 has bounded containers, so you can use them to stack-allocate a hash set that can contain up to n items (plus some more free cells to keep its load factor low enough).
>
> In Rust there are fully safe stack-allocated closures. Combining the two things it's possible to create a function like that firstDistinct() with zero heap allocations, that is probably fast.
>
> Bye,
> bearophile

Something goes wrong by the way. I removed "n" template params and "take(n)" (i can do after distinct() call, isn't it the same?).

Try this (sorted for better reading):

iota(100).map!((x) => uniform(0,10))().distinct().array.sort.writeln;
iota(100).map!((x) => uniform(0,10))().array.distinct().array.sort.writeln;

output:
[0, 0, 3, 3, 4, 4, 4, 6, 7, 8]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

March 08, 2013
Andrea Fontana:

> Something goes wrong by the way. I removed "n" template params and "take(n)" (i can do after distinct() call, isn't it the same?).

I think take() is not the cause.

See this program:



import std.stdio, std.range, std.algorithm, std.traits, std.random;

auto distinct(Range)(Range r) if (isInputRange!Range) {
    bool[ForeachType!Range] mySet;

    return r.filter!((k) {
        if (k in mySet)
            return false;
        mySet[k] = true;
        return true;
    });
}

void main() {
    5.iota.map!((_) {
        auto x = uniform(0, 10);
        write("*");
        return x;
    }).distinct.writeln;
}


It outputs something like:

*[*2*, *0*, *2**, *1]


You expect that output to contain only 5 stars.

Maybe map() or filter() is the cause of this.

Bye,
bearophile