Jump to page: 1 28  
Page
Thread overview
random k-sample of a file
Oct 09, 2008
dsimcha
Oct 09, 2008
Walter Bright
Oct 09, 2008
BCS
Oct 09, 2008
Walter Bright
Oct 09, 2008
Walter Bright
Oct 10, 2008
BCS
Oct 09, 2008
Ary Borenszweig
Oct 09, 2008
Denis Koroskin
Oct 09, 2008
Ary Borenszweig
Oct 09, 2008
KennyTM~
Oct 09, 2008
bearophile
Oct 09, 2008
bearophile
Oct 09, 2008
bearophile
Oct 09, 2008
bearophile
Oct 09, 2008
Ary Borenszweig
Oct 09, 2008
Ary Borenszweig
Oct 10, 2008
Ary Borenszweig
Oct 10, 2008
Ary Borenszweig
Oct 10, 2008
bearophile
Oct 10, 2008
Sergey Gromov
Oct 09, 2008
Kirk McDonald
Oct 09, 2008
Jason House
Oct 09, 2008
Kirk McDonald
Oct 09, 2008
Kirk McDonald
Oct 09, 2008
bearophile
Oct 10, 2008
Christopher Wright
Oct 11, 2008
Christopher Wright
Oct 09, 2008
Kirk McDonald
Oct 09, 2008
BCS
Oct 09, 2008
Jason House
Oct 09, 2008
BCS
Oct 09, 2008
Russell Lewis
Re: CORRECTION: random k-sample of a file
Oct 09, 2008
Russell Lewis
Oct 10, 2008
Christopher Wright
Oct 10, 2008
Carlos
Oct 10, 2008
Sergey Gromov
Oct 10, 2008
Denis Koroskin
Oct 10, 2008
Denis Koroskin
Oct 10, 2008
Sergey Gromov
Oct 10, 2008
bearophile
Oct 10, 2008
Michel Fortin
October 09, 2008
I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.

The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?


Andrei
October 09, 2008
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail@erdani.org)'s article
> I just ran across a nice little problem that I wanted to share as an
> exercise for the interested in futile pastimes.
> The challenge, should you accept it, is to write a program that given a
> number k and a file, outputs k lines picked uniformly at random from the
> file. (If the file has less than k lines, it should all be output.) The
> file's size is unbounded so loading it in memory is not an option. How'd
> you go about it?
> Andrei

Pretty simple, actually.  Read the file one character at a time.  Get the offset of the beginning of each line as a ulong or something.  Store this information in another file essentially as an on-disk array, where, for example, the first ulong.sizeof bytes represent the bit pattern for the offset of the first line, the next ulong.sizeof bytes represent the bit pattern for the offset of the second line, etc.  Use plenty of casting tricks.  Shuffle the ulong.sizeof chunks of this file the same way you would shuffle an ordinary array, using seeking as the equivalent to array indexing.  (I never said this was going to be efficient.)  For each of the first k offsets after shuffling, print everything in the original file from that offset to the first EOL character encountered.

October 09, 2008
Andrei Alexandrescu wrote:
> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?

1. read the file, counting lines => nlines
2. select k random numbers 0..nlines
3. read the file line by line again, outputting the line if it's one of the selected lines
October 09, 2008
Walter Bright wrote:
> Andrei Alexandrescu wrote:
>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
> 
> 1. read the file, counting lines => nlines
> 2. select k random numbers 0..nlines
> 3. read the file line by line again, outputting the line if it's one of the selected lines

No go if the file is a stream. Can it be done in one pass?

Andrei
October 09, 2008
Walter Bright wrote:
> Andrei Alexandrescu wrote:
>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
> 
> 1. read the file, counting lines => nlines
> 2. select k random numbers 0..nlines
> 3. read the file line by line again, outputting the line if it's one of the selected lines

Can't this be done in one single pass?

My thoughts were:

1. read the first k lines and assign them to the ones to return (the "stored lines")
2. probability of replacing a line: p = ??
3. for each new line, with probability p, replace a stored line (I don't know which), and decrease probability (p -= ??)

If you don't decrease the value of p in each iterations, the probability of keeping a line in the first group tends to zero.

I don't know the values of ??... maybe someone else can find them?

(I don't know how to randomly choose k elements of n, n >= k).
October 09, 2008
On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary@esperanto.org.ar> wrote:

> Walter Bright wrote:
>> Andrei Alexandrescu wrote:
>>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>  1. read the file, counting lines => nlines
>> 2. select k random numbers 0..nlines
>> 3. read the file line by line again, outputting the line if it's one of the selected lines
>
> Can't this be done in one single pass?
>
> My thoughts were:
>
> 1. read the first k lines and assign them to the ones to return (the "stored lines")
>

What if there is only one line in a file? Andrei told that

> The file's size is unbounded so loading it in memory is not an option.
October 09, 2008
Andrei Alexandrescu:
> I just ran across a nice little problem that I wanted to share as an
> exercise for the interested in futile pastimes.
> The challenge, should you accept it, is to write a program that given a
> number k and a file, outputs k lines picked uniformly at random from the
> file. (If the file has less than k lines, it should all be output.) The
> file's size is unbounded so loading it in memory is not an option. How'd
> you go about it?

It's an interesting problem, I can offer two solutions, the first one is the one I use in practice, that is use a scripting language (but the same exact program can be written in D, using my libs it becomes very similar) that loads the file two times, the first to count the lines, then creates an array of randomly chosen indexes (without replacement!) and uses them to print the chosen ones:

from sys import argv
from random import sample

filename = argv[1]
k = int(argv[2])
nlines = sum(1 for _ in file(filename))

if k >= nlines:
    for line in file(filename):
        print line
else:
    samples = set(sample(xrange(nlines), k))
    for i, line in enumerate(file(filename)):
        if i in samples:
            print line

If your practical tests show you that reading your file twice is too much slow, then you may want to look for a more complex solution.

I'll think about the more complex problem in the next post.

Later,
bearophile
October 09, 2008
Denis Koroskin wrote:
> On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary@esperanto.org.ar> wrote:
> 
>> Walter Bright wrote:
>>> Andrei Alexandrescu wrote:
>>>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>>  1. read the file, counting lines => nlines
>>> 2. select k random numbers 0..nlines
>>> 3. read the file line by line again, outputting the line if it's one of the selected lines
>>
>> Can't this be done in one single pass?
>>
>> My thoughts were:
>>
>> 1. read the first k lines and assign them to the ones to return (the "stored lines")
>>
> 
> What if there is only one line in a file? Andrei told that
> 
>> The file's size is unbounded so loading it in memory is not an option.

Keeping k of the lines in memory is fine.

Andrei
October 09, 2008
Ary Borenszweig wrote:
> Walter Bright wrote:
>> Andrei Alexandrescu wrote:
>>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>
>> 1. read the file, counting lines => nlines
>> 2. select k random numbers 0..nlines
>> 3. read the file line by line again, outputting the line if it's one of the selected lines
> 
> Can't this be done in one single pass?
> 
> My thoughts were:
> 
> 1. read the first k lines and assign them to the ones to return (the "stored lines")
> 2. probability of replacing a line: p = ??
> 3. for each new line, with probability p, replace a stored line (I don't know which), and decrease probability (p -= ??)
> 
> If you don't decrease the value of p in each iterations, the probability of keeping a line in the first group tends to zero.
> 
> I don't know the values of ??... maybe someone else can find them?
> 
> (I don't know how to randomly choose k elements of n, n >= k).

I guess p depends on the number of lines unfortunately.
Because p needs to be decreasing with line count, otherwise the first few line will almost always be replaced (not uniform then).
October 09, 2008
Denis Koroskin wrote:
> On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary@esperanto.org.ar> wrote:
> 
>> Walter Bright wrote:
>>> Andrei Alexandrescu wrote:
>>>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>>  1. read the file, counting lines => nlines
>>> 2. select k random numbers 0..nlines
>>> 3. read the file line by line again, outputting the line if it's one of the selected lines
>>
>> Can't this be done in one single pass?
>>
>> My thoughts were:
>>
>> 1. read the first k lines and assign them to the ones to return (the "stored lines")
>>
> 
> What if there is only one line in a file? Andrei told that

If you can't read more lines in this step, the answer are all the lines.

> 
>> The file's size is unbounded so loading it in memory is not an option.
« First   ‹ Prev
1 2 3 4 5 6 7 8