October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to bearophile | bearophile wrote:
> Third solution, this requires a storage of k lines (but you can keep this storage on disk):
>
> from sys import argv
> from random import random, randrange
> # randrange gives a random integer in [0, n)
>
> filename = argv[1]
> k = int(argv[2])
> assert k > 0
>
> chosen_lines = []
> for i, line in enumerate(file(filename)):
> if i < k:
> chosen_lines.append(line)
> else:
> if random() < (1.0 / (i+1)):
> chosen_lines[randrange(k)] = line
>
> print chosen_lines
We have a winner!!! There is actually a very simple proof on how and why this works.
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Kirk McDonald | Kirk McDonald wrote:
> Andrei Alexandrescu wrote:
>> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>>
>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>
>>
>> Andrei
>
> This is a classic interview question. The solution for k == 1 is easy:
>
> from random import randint
>
> chosen_line = None
>
> for i, line in enumerate(open('filename')):
> if randint(0, i) == 0:
> chosen_line = line
>
> print chosen_line
>
> (It is worth noting that randint() operates over an inclusive range.)
>
> If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line.
>
> Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.)
>
> import itertools
> from random import randint
>
> f = open('filename')
>
> k_lines = list(itertools.islice(f, k))
>
> # Next we just iterate over the rest of the file. If we have exhausted
> # the file, then the loop terminates immediately.
> for i, line in enumerate(f, start=k):
> if randint(0, i) == 0:
> k_lines[randint(0, k-1)] = line
>
> for line in k_lines:
> print line
>
> This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.
This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1.
But where's the D code you guys? D is better than Python at scripting. Ahem.
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Kirk McDonald | Kirk McDonald wrote:
> Andrei Alexandrescu wrote:
>> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>>
>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>
>>
>> Andrei
>
> This is a classic interview question. The solution for k == 1 is easy:
>
> from random import randint
>
> chosen_line = None
>
> for i, line in enumerate(open('filename')):
> if randint(0, i) == 0:
> chosen_line = line
>
> print chosen_line
>
> (It is worth noting that randint() operates over an inclusive range.)
>
> If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line.
>
> Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.)
>
> import itertools
> from random import randint
>
> f = open('filename')
>
> k_lines = list(itertools.islice(f, k))
>
> # Next we just iterate over the rest of the file. If we have exhausted
> # the file, then the loop terminates immediately.
> for i, line in enumerate(f, start=k):
> if randint(0, i) == 0:
> k_lines[randint(0, k-1)] = line
>
> for line in k_lines:
> print line
>
> This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.
>
Oh forgot to mention this is superior numerically to bearophile's
because it's not affected by floating point quantization vagaries. But it came later too...
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote: > Kirk McDonald wrote: >> Andrei Alexandrescu wrote: >>> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. >>> >>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? >>> >>> >>> Andrei >> >> This is a classic interview question. The solution for k == 1 is easy: >> >> from random import randint >> >> chosen_line = None >> >> for i, line in enumerate(open('filename')): >> if randint(0, i) == 0: >> chosen_line = line >> >> print chosen_line >> >> (It is worth noting that randint() operates over an inclusive range.) >> >> If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. >> >> Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) >> >> import itertools >> from random import randint >> >> f = open('filename') >> >> k_lines = list(itertools.islice(f, k)) >> >> # Next we just iterate over the rest of the file. If we have exhausted >> # the file, then the loop terminates immediately. >> for i, line in enumerate(f, start=k): >> if randint(0, i) == 0: >> k_lines[randint(0, k-1)] = line >> >> for line in k_lines: >> print line >> >> This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to. > > This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1. > > But where's the D code you guys? D is better than Python at scripting. Ahem. > > > Andrei No, i should start at k (which it does, that is what the start=k parameter to enumerate() is for). This is because the range passed to randint() starts at 0. The islice function (like slices in D and indeed regular slices in Python) uses an [inclusive, exclusive) range. Thus, itertools.islice(f, k) will grab the first k lines from the file. (The opening index of 0 is implied.) islice is a function for "slicing" arbitrary iterable objects, particularly ones that can't normally be sliced. By slicing the first k lines from the file in this way, we also consume the first k lines from the file object, and so the for loop starts on line k, which is why I pass start=k to enumerate(). (Remember, I am counting lines from 0.) -- Kirk McDonald | |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote: > Kirk McDonald wrote: >> Andrei Alexandrescu wrote: >>> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. >>> >>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? >>> >>> >>> Andrei >> >> This is a classic interview question. The solution for k == 1 is easy: >> >> from random import randint >> >> chosen_line = None >> >> for i, line in enumerate(open('filename')): >> if randint(0, i) == 0: >> chosen_line = line >> >> print chosen_line >> >> (It is worth noting that randint() operates over an inclusive range.) >> >> If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. >> >> Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) >> >> import itertools >> from random import randint >> >> f = open('filename') >> >> k_lines = list(itertools.islice(f, k)) >> >> # Next we just iterate over the rest of the file. If we have exhausted >> # the file, then the loop terminates immediately. >> for i, line in enumerate(f, start=k): >> if randint(0, i) == 0: >> k_lines[randint(0, k-1)] = line >> >> for line in k_lines: >> print line >> >> This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to. >> > > Oh forgot to mention this is superior numerically to bearophile's > because it's not affected by floating point quantization vagaries. But it came later too... > > Andrei > To be fair, I started writing it before bearophile posted his, and indeed did not see his solution until after I had posted it. :-) -- Kirk McDonald | |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Kirk McDonald | Kirk McDonald wrote:
> Andrei Alexandrescu wrote:
>> Kirk McDonald wrote:
>>> Andrei Alexandrescu wrote:
>>>> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>>>>
>>>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>>>
>>>>
>>>> Andrei
>>>
>>> This is a classic interview question. The solution for k == 1 is easy:
>>>
>>> from random import randint
>>>
>>> chosen_line = None
>>>
>>> for i, line in enumerate(open('filename')):
>>> if randint(0, i) == 0:
>>> chosen_line = line
>>>
>>> print chosen_line
>>>
>>> (It is worth noting that randint() operates over an inclusive range.)
>>>
>>> If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line.
>>>
>>> Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.)
>>>
>>> import itertools
>>> from random import randint
>>>
>>> f = open('filename')
>>>
>>> k_lines = list(itertools.islice(f, k))
>>>
>>> # Next we just iterate over the rest of the file. If we have exhausted
>>> # the file, then the loop terminates immediately.
>>> for i, line in enumerate(f, start=k):
>>> if randint(0, i) == 0:
>>> k_lines[randint(0, k-1)] = line
>>>
>>> for line in k_lines:
>>> print line
>>>
>>> This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.
>>
>> This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1.
>>
>> But where's the D code you guys? D is better than Python at scripting. Ahem.
>>
>>
>> Andrei
>
> No, i should start at k (which it does, that is what the start=k parameter to enumerate() is for). This is because the range passed to randint() starts at 0. The islice function (like slices in D and indeed regular slices in Python) uses an [inclusive, exclusive) range. Thus, itertools.islice(f, k) will grab the first k lines from the file. (The opening index of 0 is implied.)
>
> islice is a function for "slicing" arbitrary iterable objects, particularly ones that can't normally be sliced. By slicing the first k lines from the file in this way, we also consume the first k lines from the file object, and so the for loop starts on line k, which is why I pass start=k to enumerate(). (Remember, I am counting lines from 0.)
>
Sounds good.
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Kirk McDonald | Kirk McDonald wrote:
> To be fair, I started writing it before bearophile posted his, and indeed did not see his solution until after I had posted it. :-)
I think you forgot to attach the D solution. ;o)
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu: > But where's the D code you guys? My third converted to D, using my libs: import std.conv, d.all; void main(string[] args) { assert(args.length == 3); string filename = args[1]; int k = toInt(args[2]); assert (k > 0); string[] chosen_lines; foreach (i, line; xfile(filename)) if (i < k) chosen_lines ~= line; else if (fastRandom() < (1.0 / (i+1))) chosen_lines[randInt(k-1)] = line; putr(chosen_lines); } > D is better than Python at scripting. Ahem. I know many languages, and so far I have never found something better than Python to create working prototypes. Bye, bearophile | |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to bearophile | bearophile wrote:
> Andrei Alexandrescu:
>> But where's the D code you guys?
>
> My third converted to D, using my libs:
>
> import std.conv, d.all;
>
> void main(string[] args) { assert(args.length == 3); string filename
> = args[1]; int k = toInt(args[2]); assert (k > 0);
>
> string[] chosen_lines; foreach (i, line; xfile(filename)) if (i < k) chosen_lines ~= line; else if (fastRandom() < (1.0 / (i+1))) chosen_lines[randInt(k-1)] = line;
>
> putr(chosen_lines); }
>
>
>> D is better than Python at scripting. Ahem.
>
> I know many languages, and so far I have never found something better
> than Python to create working prototypes.
True men only write D. Your solution looks eerily similar to mine.
#!/usr/bin/rdmd
import std.stdio, std.contracts, std.conv, std.random;
void main(string[] args) {
invariant k = parse!(uint)(args[1]);
enforce(k > 0, "Must pass a strictly positive selection size");
string[] selection;
auto gen = Random(unpredictableSeed);
foreach (ulong tally, char[] line; lines(stdin)) {
if (selection.length < k) {
// Selection not full; add to it
selection ~= line.idup;
} else {
auto t = uniform(gen, 0, tally + 1);
if (t > k) continue; // no luck
// Replace a random element in the selection
selection[uniform(gen, 0, k - 1)] = line.idup;
}
}
// Print selection
foreach (s; selection) write(s);
}
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Reply to Andrei,
> I just ran across a nice little problem that I wanted to share as an
> exercise for the interested in futile pastimes.
>
> The challenge, should you accept it, is to write a program that given
> a number k and a file, outputs k lines picked uniformly at random from
> the file. (If the file has less than k lines, it should all be
> output.) The file's size is unbounded so loading it in memory is not
> an option. How'd you go about it?
>
> Andrei
>
struct S
{
char[] line;
real v;
int opCmp(S s)
{
if(s.v < this.v) return -1;
else if(s.v > this.v) return +1;
return 0;
}
}
int k;
S[] buf = new S[k+1];
foreach(s;buf) s.v = real.max;
foreach(char[] line; stream)
{
delete buf[$-1].line;
buf[$-1].line = line.dup;;
buf[$-1].v = reaqlRand(); // returns v > real.min and v < real.max
buf.sort;
}
foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);
| |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply