October 09, 2008 random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? Andrei | ||||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | == Quote from Andrei Alexandrescu (SeeWebsiteForEmail@erdani.org)'s article > I just ran across a nice little problem that I wanted to share as an > exercise for the interested in futile pastimes. > The challenge, should you accept it, is to write a program that given a > number k and a file, outputs k lines picked uniformly at random from the > file. (If the file has less than k lines, it should all be output.) The > file's size is unbounded so loading it in memory is not an option. How'd > you go about it? > Andrei Pretty simple, actually. Read the file one character at a time. Get the offset of the beginning of each line as a ulong or something. Store this information in another file essentially as an on-disk array, where, for example, the first ulong.sizeof bytes represent the bit pattern for the offset of the first line, the next ulong.sizeof bytes represent the bit pattern for the offset of the second line, etc. Use plenty of casting tricks. Shuffle the ulong.sizeof chunks of this file the same way you would shuffle an ordinary array, using seeking as the equivalent to array indexing. (I never said this was going to be efficient.) For each of the first k offsets after shuffling, print everything in the original file from that offset to the first EOL character encountered. | |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote:
> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
1. read the file, counting lines => nlines
2. select k random numbers 0..nlines
3. read the file line by line again, outputting the line if it's one of the selected lines
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | Walter Bright wrote:
> Andrei Alexandrescu wrote:
>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>
> 1. read the file, counting lines => nlines
> 2. select k random numbers 0..nlines
> 3. read the file line by line again, outputting the line if it's one of the selected lines
No go if the file is a stream. Can it be done in one pass?
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | Walter Bright wrote:
> Andrei Alexandrescu wrote:
>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>
> 1. read the file, counting lines => nlines
> 2. select k random numbers 0..nlines
> 3. read the file line by line again, outputting the line if it's one of the selected lines
Can't this be done in one single pass?
My thoughts were:
1. read the first k lines and assign them to the ones to return (the "stored lines")
2. probability of replacing a line: p = ??
3. for each new line, with probability p, replace a stored line (I don't know which), and decrease probability (p -= ??)
If you don't decrease the value of p in each iterations, the probability of keeping a line in the first group tends to zero.
I don't know the values of ??... maybe someone else can find them?
(I don't know how to randomly choose k elements of n, n >= k).
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Ary Borenszweig | On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary@esperanto.org.ar> wrote: > Walter Bright wrote: >> Andrei Alexandrescu wrote: >>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? >> 1. read the file, counting lines => nlines >> 2. select k random numbers 0..nlines >> 3. read the file line by line again, outputting the line if it's one of the selected lines > > Can't this be done in one single pass? > > My thoughts were: > > 1. read the first k lines and assign them to the ones to return (the "stored lines") > What if there is only one line in a file? Andrei told that > The file's size is unbounded so loading it in memory is not an option. | |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu:
> I just ran across a nice little problem that I wanted to share as an
> exercise for the interested in futile pastimes.
> The challenge, should you accept it, is to write a program that given a
> number k and a file, outputs k lines picked uniformly at random from the
> file. (If the file has less than k lines, it should all be output.) The
> file's size is unbounded so loading it in memory is not an option. How'd
> you go about it?
It's an interesting problem, I can offer two solutions, the first one is the one I use in practice, that is use a scripting language (but the same exact program can be written in D, using my libs it becomes very similar) that loads the file two times, the first to count the lines, then creates an array of randomly chosen indexes (without replacement!) and uses them to print the chosen ones:
from sys import argv
from random import sample
filename = argv[1]
k = int(argv[2])
nlines = sum(1 for _ in file(filename))
if k >= nlines:
for line in file(filename):
print line
else:
samples = set(sample(xrange(nlines), k))
for i, line in enumerate(file(filename)):
if i in samples:
print line
If your practical tests show you that reading your file twice is too much slow, then you may want to look for a more complex solution.
I'll think about the more complex problem in the next post.
Later,
bearophile
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Denis Koroskin | Denis Koroskin wrote:
> On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary@esperanto.org.ar> wrote:
>
>> Walter Bright wrote:
>>> Andrei Alexandrescu wrote:
>>>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>> 1. read the file, counting lines => nlines
>>> 2. select k random numbers 0..nlines
>>> 3. read the file line by line again, outputting the line if it's one of the selected lines
>>
>> Can't this be done in one single pass?
>>
>> My thoughts were:
>>
>> 1. read the first k lines and assign them to the ones to return (the "stored lines")
>>
>
> What if there is only one line in a file? Andrei told that
>
>> The file's size is unbounded so loading it in memory is not an option.
Keeping k of the lines in memory is fine.
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Ary Borenszweig | Ary Borenszweig wrote:
> Walter Bright wrote:
>> Andrei Alexandrescu wrote:
>>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>
>> 1. read the file, counting lines => nlines
>> 2. select k random numbers 0..nlines
>> 3. read the file line by line again, outputting the line if it's one of the selected lines
>
> Can't this be done in one single pass?
>
> My thoughts were:
>
> 1. read the first k lines and assign them to the ones to return (the "stored lines")
> 2. probability of replacing a line: p = ??
> 3. for each new line, with probability p, replace a stored line (I don't know which), and decrease probability (p -= ??)
>
> If you don't decrease the value of p in each iterations, the probability of keeping a line in the first group tends to zero.
>
> I don't know the values of ??... maybe someone else can find them?
>
> (I don't know how to randomly choose k elements of n, n >= k).
I guess p depends on the number of lines unfortunately.
Because p needs to be decreasing with line count, otherwise the first few line will almost always be replaced (not uniform then).
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Denis Koroskin | Denis Koroskin wrote: > On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary@esperanto.org.ar> wrote: > >> Walter Bright wrote: >>> Andrei Alexandrescu wrote: >>>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? >>> 1. read the file, counting lines => nlines >>> 2. select k random numbers 0..nlines >>> 3. read the file line by line again, outputting the line if it's one of the selected lines >> >> Can't this be done in one single pass? >> >> My thoughts were: >> >> 1. read the first k lines and assign them to the ones to return (the "stored lines") >> > > What if there is only one line in a file? Andrei told that If you can't read more lines in this step, the answer are all the lines. > >> The file's size is unbounded so loading it in memory is not an option. | |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply