October 09, 2008
bearophile wrote:
> Third solution, this requires a storage of k lines (but you can keep this storage on disk):
> 
> from sys import argv
> from random import random, randrange
> # randrange gives a random integer in [0, n)
> 
> filename = argv[1]
> k = int(argv[2])
> assert k > 0
> 
> chosen_lines = []
> for i, line in enumerate(file(filename)):
>     if i < k:
>         chosen_lines.append(line)
>     else:
>         if random() < (1.0 / (i+1)):
>             chosen_lines[randrange(k)] = line
> 
> print chosen_lines

We have a winner!!! There is actually a very simple proof on how and why this works.

Andrei

October 09, 2008
Kirk McDonald wrote:
> Andrei Alexandrescu wrote:
>> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>>
>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>
>>
>> Andrei
> 
> This is a classic interview question. The solution for k == 1 is easy:
> 
> from random import randint
> 
> chosen_line = None
> 
> for i, line in enumerate(open('filename')):
>     if randint(0, i) == 0:
>         chosen_line = line
> 
> print chosen_line
> 
> (It is worth noting that randint() operates over an inclusive range.)
> 
> If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line.
> 
> Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.)
> 
> import itertools
> from random import randint
> 
> f = open('filename')
> 
> k_lines = list(itertools.islice(f, k))
> 
> # Next we just iterate over the rest of the file. If we have exhausted
> # the file, then the loop terminates immediately.
> for i, line in enumerate(f, start=k):
>     if randint(0, i) == 0:
>         k_lines[randint(0, k-1)] = line
> 
> for line in k_lines:
>     print line
> 
> This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.

This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1.

But where's the D code you guys? D is better than Python at scripting. Ahem.


Andrei
October 09, 2008
Kirk McDonald wrote:
> Andrei Alexandrescu wrote:
>> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>>
>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>
>>
>> Andrei
> 
> This is a classic interview question. The solution for k == 1 is easy:
> 
> from random import randint
> 
> chosen_line = None
> 
> for i, line in enumerate(open('filename')):
>     if randint(0, i) == 0:
>         chosen_line = line
> 
> print chosen_line
> 
> (It is worth noting that randint() operates over an inclusive range.)
> 
> If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line.
> 
> Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.)
> 
> import itertools
> from random import randint
> 
> f = open('filename')
> 
> k_lines = list(itertools.islice(f, k))
> 
> # Next we just iterate over the rest of the file. If we have exhausted
> # the file, then the loop terminates immediately.
> for i, line in enumerate(f, start=k):
>     if randint(0, i) == 0:
>         k_lines[randint(0, k-1)] = line
> 
> for line in k_lines:
>     print line
> 
> This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.
> 

Oh forgot to mention this is superior numerically to bearophile's
because it's not affected by floating point quantization vagaries. But it came later too...

Andrei

October 09, 2008
Andrei Alexandrescu wrote:
> Kirk McDonald wrote:
>> Andrei Alexandrescu wrote:
>>> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>>>
>>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>>
>>>
>>> Andrei
>>
>> This is a classic interview question. The solution for k == 1 is easy:
>>
>> from random import randint
>>
>> chosen_line = None
>>
>> for i, line in enumerate(open('filename')):
>>     if randint(0, i) == 0:
>>         chosen_line = line
>>
>> print chosen_line
>>
>> (It is worth noting that randint() operates over an inclusive range.)
>>
>> If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line.
>>
>> Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.)
>>
>> import itertools
>> from random import randint
>>
>> f = open('filename')
>>
>> k_lines = list(itertools.islice(f, k))
>>
>> # Next we just iterate over the rest of the file. If we have exhausted
>> # the file, then the loop terminates immediately.
>> for i, line in enumerate(f, start=k):
>>     if randint(0, i) == 0:
>>         k_lines[randint(0, k-1)] = line
>>
>> for line in k_lines:
>>     print line
>>
>> This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.
> 
> This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1.
> 
> But where's the D code you guys? D is better than Python at scripting. Ahem.
> 
> 
> Andrei

No, i should start at k (which it does, that is what the start=k parameter to enumerate() is for). This is because the range passed to randint() starts at 0. The islice function (like slices in D and indeed regular slices in Python) uses an [inclusive, exclusive) range. Thus, itertools.islice(f, k) will grab the first k lines from the file. (The opening index of 0 is implied.)

islice is a function for "slicing" arbitrary iterable objects, particularly ones that can't normally be sliced. By slicing the first k lines from the file in this way, we also consume the first k lines from the file object, and so the for loop starts on line k, which is why I pass start=k to enumerate(). (Remember, I am counting lines from 0.)

-- 
Kirk McDonald
October 09, 2008
Andrei Alexandrescu wrote:
> Kirk McDonald wrote:
>> Andrei Alexandrescu wrote:
>>> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>>>
>>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>>
>>>
>>> Andrei
>>
>> This is a classic interview question. The solution for k == 1 is easy:
>>
>> from random import randint
>>
>> chosen_line = None
>>
>> for i, line in enumerate(open('filename')):
>>     if randint(0, i) == 0:
>>         chosen_line = line
>>
>> print chosen_line
>>
>> (It is worth noting that randint() operates over an inclusive range.)
>>
>> If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line.
>>
>> Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.)
>>
>> import itertools
>> from random import randint
>>
>> f = open('filename')
>>
>> k_lines = list(itertools.islice(f, k))
>>
>> # Next we just iterate over the rest of the file. If we have exhausted
>> # the file, then the loop terminates immediately.
>> for i, line in enumerate(f, start=k):
>>     if randint(0, i) == 0:
>>         k_lines[randint(0, k-1)] = line
>>
>> for line in k_lines:
>>     print line
>>
>> This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.
>>
> 
> Oh forgot to mention this is superior numerically to bearophile's
> because it's not affected by floating point quantization vagaries. But it came later too...
> 
> Andrei
> 

To be fair, I started writing it before bearophile posted his, and indeed did not see his solution until after I had posted it. :-)

-- 
Kirk McDonald
October 09, 2008
Kirk McDonald wrote:
> Andrei Alexandrescu wrote:
>> Kirk McDonald wrote:
>>> Andrei Alexandrescu wrote:
>>>> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>>>>
>>>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>>>
>>>>
>>>> Andrei
>>>
>>> This is a classic interview question. The solution for k == 1 is easy:
>>>
>>> from random import randint
>>>
>>> chosen_line = None
>>>
>>> for i, line in enumerate(open('filename')):
>>>     if randint(0, i) == 0:
>>>         chosen_line = line
>>>
>>> print chosen_line
>>>
>>> (It is worth noting that randint() operates over an inclusive range.)
>>>
>>> If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line.
>>>
>>> Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.)
>>>
>>> import itertools
>>> from random import randint
>>>
>>> f = open('filename')
>>>
>>> k_lines = list(itertools.islice(f, k))
>>>
>>> # Next we just iterate over the rest of the file. If we have exhausted
>>> # the file, then the loop terminates immediately.
>>> for i, line in enumerate(f, start=k):
>>>     if randint(0, i) == 0:
>>>         k_lines[randint(0, k-1)] = line
>>>
>>> for line in k_lines:
>>>     print line
>>>
>>> This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.
>>
>> This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1.
>>
>> But where's the D code you guys? D is better than Python at scripting. Ahem.
>>
>>
>> Andrei
> 
> No, i should start at k (which it does, that is what the start=k parameter to enumerate() is for). This is because the range passed to randint() starts at 0. The islice function (like slices in D and indeed regular slices in Python) uses an [inclusive, exclusive) range. Thus, itertools.islice(f, k) will grab the first k lines from the file. (The opening index of 0 is implied.)
> 
> islice is a function for "slicing" arbitrary iterable objects, particularly ones that can't normally be sliced. By slicing the first k lines from the file in this way, we also consume the first k lines from the file object, and so the for loop starts on line k, which is why I pass start=k to enumerate(). (Remember, I am counting lines from 0.)
> 

Sounds good.

Andrei
October 09, 2008
Kirk McDonald wrote:
> To be fair, I started writing it before bearophile posted his, and indeed did not see his solution until after I had posted it. :-)

I think you forgot to attach the D solution. ;o)

Andrei
October 09, 2008
Andrei Alexandrescu:
> But where's the D code you guys?

My third converted to D, using my libs:

import std.conv, d.all;

void main(string[] args) {
    assert(args.length == 3);
    string filename = args[1];
    int k = toInt(args[2]);
    assert (k > 0);

    string[] chosen_lines;
    foreach (i, line; xfile(filename))
        if (i < k)
            chosen_lines ~= line;
        else
            if (fastRandom() < (1.0 / (i+1)))
                chosen_lines[randInt(k-1)] = line;

    putr(chosen_lines);
}


> D is better than Python at scripting. Ahem.

I know many languages, and so far I have never found something better than Python to create working prototypes.

Bye,
bearophile
October 09, 2008
bearophile wrote:
> Andrei Alexandrescu:
>> But where's the D code you guys?
> 
> My third converted to D, using my libs:
> 
> import std.conv, d.all;
> 
> void main(string[] args) { assert(args.length == 3); string filename
> = args[1]; int k = toInt(args[2]); assert (k > 0);
> 
> string[] chosen_lines; foreach (i, line; xfile(filename)) if (i < k) chosen_lines ~= line; else if (fastRandom() < (1.0 / (i+1))) chosen_lines[randInt(k-1)] = line;
> 
> putr(chosen_lines); }
> 
> 
>> D is better than Python at scripting. Ahem.
> 
> I know many languages, and so far I have never found something better
> than Python to create working prototypes.

True men only write D. Your solution looks eerily similar to mine.

#!/usr/bin/rdmd

import std.stdio, std.contracts, std.conv, std.random;

void main(string[] args) {
    invariant k = parse!(uint)(args[1]);
    enforce(k > 0, "Must pass a strictly positive selection size");
    string[] selection;
    auto gen = Random(unpredictableSeed);

    foreach (ulong tally, char[] line; lines(stdin)) {
        if (selection.length < k) {
            // Selection not full; add to it
            selection ~= line.idup;
        } else {
            auto t = uniform(gen, 0, tally + 1);
            if (t > k) continue; // no luck
            // Replace a random element in the selection
            selection[uniform(gen, 0, k - 1)] = line.idup;
        }
    }

    // Print selection
    foreach (s; selection) write(s);
}
October 09, 2008
Reply to Andrei,

> I just ran across a nice little problem that I wanted to share as an
> exercise for the interested in futile pastimes.
> 
> The challenge, should you accept it, is to write a program that given
> a number k and a file, outputs k lines picked uniformly at random from
> the file. (If the file has less than k lines, it should all be
> output.) The file's size is unbounded so loading it in memory is not
> an option. How'd you go about it?
> 
> Andrei
> 

struct S
{
char[] line;
real v;
int opCmp(S s)
{
 if(s.v < this.v) return -1;
 else if(s.v > this.v) return +1;
 return 0;
}
}

int k;
S[] buf = new S[k+1];

foreach(s;buf) s.v = real.max;

foreach(char[] line; stream)
{
   delete buf[$-1].line;
   buf[$-1].line = line.dup;;
   buf[$-1].v = reaqlRand();  // returns v > real.min and v < real.max
   buf.sort;
}

foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);