October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to bearophile | I am not reading the anwers written by others, of course :-) With the help of "Programming Pearls" here is my second version, that spares the memory required for the chosen ones, so this code runs with very little memory:
from sys import argv
from random import random
filename = argv[1]
k = int(argv[2])
nlines = sum(1 for _ in file(filename))
if k >= nlines:
for line in file(filename):
print line
else:
select = k
remaining = nlines
for line in file(filename):
if random() < float(select) / remaining:
print line
select -= 1
remaining -= 1
I'll think for a solution that avoids reading the file twice then...
Bye,
bearophile
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Reply to Andrei,
> Walter Bright wrote:
>
>> Andrei Alexandrescu wrote:
>>
>>> The challenge, should you accept it, is to write a program that
>>> given a number k and a file, outputs k lines picked uniformly at
>>> random from the file. (If the file has less than k lines, it should
>>> all be output.) The file's size is unbounded so loading it in memory
>>> is not an option. How'd you go about it?
>>>
>> 1. read the file, counting lines => nlines
>> 2. select k random numbers 0..nlines
>> 3. read the file line by line again, outputting the line if it's one
>> of
>> the selected lines
> No go if the file is a stream. Can it be done in one pass?
>
> Andrei
>
read and store the first k lines with there line number, if no lines remain, output stored lines
read another line replace an existing line with a probability based on the current line and the replaced line's location
if no lines remain ...
Now all that remains is to figure out that function
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to dsimcha | dsimcha wrote: > == Quote from Andrei Alexandrescu (SeeWebsiteForEmail@erdani.org)'s article >> I just ran across a nice little problem that I wanted to share as an >> exercise for the interested in futile pastimes. >> The challenge, should you accept it, is to write a program that given a >> number k and a file, outputs k lines picked uniformly at random from the >> file. (If the file has less than k lines, it should all be output.) The >> file's size is unbounded so loading it in memory is not an option. How'd >> you go about it? >> Andrei > > Pretty simple, actually. Read the file one character at a time. Get the offset > of the beginning of each line as a ulong or something. Store this information in > another file essentially as an on-disk array, where, for example, the first > ulong.sizeof bytes represent the bit pattern for the offset of the first line, the > next ulong.sizeof bytes represent the bit pattern for the offset of the second > line, etc. Use plenty of casting tricks. So this is essentially an index of the file's lines? > Shuffle the ulong.sizeof chunks of this > file the same way you would shuffle an ordinary array, using seeking as the > equivalent to array indexing. So this would shuffle that index. > (I never said this was going to be efficient.) For > each of the first k offsets after shuffling, print everything in the original file > from that offset to the first EOL character encountered. Can it be done faster and maybe in one pass? Andrei | |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to bearophile | bearophile wrote:
> I am not reading the anwers written by others, of course :-) With the help of "Programming Pearls" here is my second version, that spares the memory required for the chosen ones, so this code runs with very little memory:
>
> from sys import argv
> from random import random
>
> filename = argv[1]
> k = int(argv[2])
> nlines = sum(1 for _ in file(filename))
>
> if k >= nlines:
> for line in file(filename):
> print line
> else:
> select = k
> remaining = nlines
>
> for line in file(filename):
> if random() < float(select) / remaining:
> print line
> select -= 1
> remaining -= 1
>
> I'll think for a solution that avoids reading the file twice then...
>
> Bye,
> bearophile
This is unfair. First line in the original file is included with probability k / n. The probability of including the second line is dependent on the event of including the first line. If it wasn't included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines should be given equal chance.
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu Wrote:
> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>
> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>
>
> Andrei
Based on your later stated requirements, use a heap. Insert each line into the heap using a random number as the key. After pre-loading the first k elements, each insert into the heap is followed by the removal of the head element.
Performance is O(characters in file + log2(k)*lines in file)
Obviously, the random number generator needs a range of values that exceeds the lines in the file by a good margin, or else you'll get biases in what gets kept.
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to bearophile | Third solution, this requires a storage of k lines (but you can keep this storage on disk):
from sys import argv
from random import random, randrange
# randrange gives a random integer in [0, n)
filename = argv[1]
k = int(argv[2])
assert k > 0
chosen_lines = []
for i, line in enumerate(file(filename)):
if i < k:
chosen_lines.append(line)
else:
if random() < (1.0 / (i+1)):
chosen_lines[randrange(k)] = line
print chosen_lines
Now I'll look for possible bugs in this third version and to the problem Andrei has just told me :-)
Bye,
bearophile
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | "Andrei Alexandrescu" wrote
> Walter Bright wrote:
>> Andrei Alexandrescu wrote:
>>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>
>> 1. read the file, counting lines => nlines
>> 2. select k random numbers 0..nlines
>> 3. read the file line by line again, outputting the line if it's one of
>> the selected lines
>
> No go if the file is a stream. Can it be done in one pass?
Any bounds on the input k? If not, then no. In that case, you must do one pass to count the lines, otherwise, you would end up possibly throwing away lines that needed to be output to satisfy k.
-Steve
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | Steven Schveighoffer wrote:
> "Andrei Alexandrescu" wrote
>> Walter Bright wrote:
>>> Andrei Alexandrescu wrote:
>>>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>> 1. read the file, counting lines => nlines
>>> 2. select k random numbers 0..nlines
>>> 3. read the file line by line again, outputting the line if it's one of the selected lines
>> No go if the file is a stream. Can it be done in one pass?
>
> Any bounds on the input k? If not, then no. In that case, you must do one pass to count the lines, otherwise, you would end up possibly throwing away lines that needed to be output to satisfy k.
k is a "small" number in the sense that you can keep something proportional to k in working memory. You shouldn't keep something proportional to n in working memory.
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu:
> This is unfair. First line in the original file is included with probability k / n. The probability of including the second line is dependent on the event of including the first line. If it wasn't included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines should be given equal chance.
If each of them have the same chance, then you can't guarantee the result has k items. So you must have a biased choice.
You can run this code:
from random import random
from collections import defaultdict
def main(freqs, nrepeats):
for i in xrange(nrepeats):
chosen = []
select = k
remaining = n
for value in values:
if random() < (float(select) / remaining):
chosen.append(value)
select -= 1
remaining -= 1
for c in chosen:
freqs[c] += 1
return freqs
import psyco; psyco.full()
k = 15
n = 100
values = range(n)
freqs = main(freqs=defaultdict(int), nrepeats=10000)
print [freqs.get(i, 0) for i in xrange(n)]
Its output:
[1549, 1519, 1494, 1467, 1485, 1482, 1481, 1526, 1473, 1523, 1466, 1451, 1475, 1482, 1504, 1551, 1504, 1514, 1510, 1541, 1525, 1550, 1531, 1488, 1529, 1467, 1460, 1506, 1544, 1535, 1455, 1468, 1507, 1505, 1490, 1519, 1489, 1507, 1439, 1561, 1499, 1497, 1534, 1512, 1529, 1442, 1437, 1545, 1547, 1621, 1477, 1525, 1492, 1497, 1492, 1533, 1504, 1482, 1489, 1463, 1479, 1488, 1498, 1502, 1503, 1496, 1468, 1458, 1538, 1536, 1479, 1496, 1477, 1514, 1464, 1526, 1451, 1507, 1527, 1415, 1520, 1476, 1526, 1508, 1479, 1504, 1493, 1466, 1490, 1497, 1472, 1548, 1512, 1535, 1503, 1533, 1467, 1471, 1463, 1526]
It shows the values are balanced.
If that little benchmark isn't enough, my code is based on algorithm S by Knuth, you can take a look on its Art of computer programming for a demonstration that it work correctly.
Bye,
bearophile
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote: > I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. > > The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? > > > Andrei This is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) # Next we just iterate over the rest of the file. If we have exhausted # the file, then the loop terminates immediately. for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to. -- Kirk McDonald | |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply