October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu escribió: > bearophile wrote: >> Third solution, this requires a storage of k lines (but you can keep this storage on disk): >> >> from sys import argv >> from random import random, randrange >> # randrange gives a random integer in [0, n) >> >> filename = argv[1] >> k = int(argv[2]) >> assert k > 0 >> >> chosen_lines = [] >> for i, line in enumerate(file(filename)): >> if i < k: >> chosen_lines.append(line) >> else: >> if random() < (1.0 / (i+1)): >> chosen_lines[randrange(k)] = line >> >> print chosen_lines > > We have a winner!!! There is actually a very simple proof on how and why this works. Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-( > > Andrei > | |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to BCS | BCS Wrote:
> Reply to Andrei,
>
> > I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
> >
> > The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
> >
> > Andrei
> >
>
> struct S
> {
> char[] line;
> real v;
> int opCmp(S s)
> {
> if(s.v < this.v) return -1;
> else if(s.v > this.v) return +1;
> return 0;
> }
> }
>
> int k;
> S[] buf = new S[k+1];
>
> foreach(s;buf) s.v = real.max;
>
> foreach(char[] line; stream)
> {
> delete buf[$-1].line;
> buf[$-1].line = line.dup;;
> buf[$-1].v = reaqlRand(); // returns v > real.min and v < real.max
> buf.sort;
> }
>
> foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);
>
>
You should check if your new random number is good enough to keep before doing the array manipulation and string dup... That small change should yield an near-optimal big O. A heap should give slightly faster performance for moderate values of k.
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Ary Borenszweig | Ary Borenszweig wrote:
> Andrei Alexandrescu escribió:
>> bearophile wrote:
>>> Third solution, this requires a storage of k lines (but you can keep this storage on disk):
>>>
>>> from sys import argv
>>> from random import random, randrange
>>> # randrange gives a random integer in [0, n)
>>>
>>> filename = argv[1]
>>> k = int(argv[2])
>>> assert k > 0
>>>
>>> chosen_lines = []
>>> for i, line in enumerate(file(filename)):
>>> if i < k:
>>> chosen_lines.append(line)
>>> else:
>>> if random() < (1.0 / (i+1)):
>>> chosen_lines[randrange(k)] = line
>>>
>>> print chosen_lines
>>
>> We have a winner!!! There is actually a very simple proof on how and why this works.
>
> Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c.
>
> What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3.
>
> What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6.
>
> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.
>
> What am I doing wrong? :-(
Nothing except you stop the induction at step 3...
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu escribió: > Ary Borenszweig wrote: >> Andrei Alexandrescu escribió: >>> bearophile wrote: >>>> Third solution, this requires a storage of k lines (but you can keep this storage on disk): >>>> >>>> from sys import argv >>>> from random import random, randrange >>>> # randrange gives a random integer in [0, n) >>>> >>>> filename = argv[1] >>>> k = int(argv[2]) >>>> assert k > 0 >>>> >>>> chosen_lines = [] >>>> for i, line in enumerate(file(filename)): >>>> if i < k: >>>> chosen_lines.append(line) >>>> else: >>>> if random() < (1.0 / (i+1)): >>>> chosen_lines[randrange(k)] = line >>>> >>>> print chosen_lines >>> >>> We have a winner!!! There is actually a very simple proof on how and why this works. >> >> Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. >> >> What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. >> >> What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. >> >> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. >> >> What am I doing wrong? :-( > > Nothing except you stop the induction at step 3... ... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not? > > Andrei | |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Jason House | Reply to Jason,
> BCS Wrote:
>
>> Reply to Andrei,
>>
>>> I just ran across a nice little problem that I wanted to share as an
>>> exercise for the interested in futile pastimes.
>>>
>>> The challenge, should you accept it, is to write a program that
>>> given a number k and a file, outputs k lines picked uniformly at
>>> random from the file. (If the file has less than k lines, it should
>>> all be output.) The file's size is unbounded so loading it in memory
>>> is not an option. How'd you go about it?
>>>
>>> Andrei
>>>
>> struct S
>> {
>> char[] line;
>> real v;
>> int opCmp(S s)
>> {
>> if(s.v < this.v) return -1;
>> else if(s.v > this.v) return +1;
>> return 0;
>> }
>> }
>> int k;
>> S[] buf = new S[k+1];
>> foreach(s;buf) s.v = real.max;
>>
>> foreach(char[] line; stream)
>> {
>> delete buf[$-1].line;
>> buf[$-1].line = line.dup;;
>> buf[$-1].v = reaqlRand(); // returns v > real.min and v < real.max
>> buf.sort;
>> }
>> foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);
>>
> You should check if your new random number is good enough to keep
> before doing the array manipulation and string dup... That small
> change should yield an near-optimal big O. A heap should give slightly
> faster performance for moderate values of k.
>
I was going for simplicity and ease of reading within a given big-O class.
Switching to insertion sort or reusing data buffers would also do better.
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Ary Borenszweig | Ary Borenszweig wrote:
> Andrei Alexandrescu escribió:
>> Ary Borenszweig wrote:
>>> Andrei Alexandrescu escribió:
>>>> bearophile wrote:
>>>>> Third solution, this requires a storage of k lines (but you can keep this storage on disk):
>>>>>
>>>>> from sys import argv
>>>>> from random import random, randrange
>>>>> # randrange gives a random integer in [0, n)
>>>>>
>>>>> filename = argv[1]
>>>>> k = int(argv[2])
>>>>> assert k > 0
>>>>>
>>>>> chosen_lines = []
>>>>> for i, line in enumerate(file(filename)):
>>>>> if i < k:
>>>>> chosen_lines.append(line)
>>>>> else:
>>>>> if random() < (1.0 / (i+1)):
>>>>> chosen_lines[randrange(k)] = line
>>>>>
>>>>> print chosen_lines
>>>>
>>>> We have a winner!!! There is actually a very simple proof on how and why this works.
>>>
>>> Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c.
>>>
>>> What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3.
>>>
>>> What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6.
>>>
>>> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.
>>>
>>> What am I doing wrong? :-(
>>
>> Nothing except you stop the induction at step 3...
>
> ... which is the last step in this case. There are only three lines.
>
> p(a) = 5/6
> p(b) = 5/6
> p(c) = 1/3
>
> That doesn't seem uniform.
>
> In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?
Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it.
Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Ary Borenszweig | Ary Borenszweig wrote: > Andrei Alexandrescu escribió: >> Ary Borenszweig wrote: >>> Andrei Alexandrescu escribió: >>>> bearophile wrote: >>>>> Third solution, this requires a storage of k lines (but you can keep this storage on disk): >>>>> >>>>> from sys import argv >>>>> from random import random, randrange >>>>> # randrange gives a random integer in [0, n) >>>>> >>>>> filename = argv[1] >>>>> k = int(argv[2]) >>>>> assert k > 0 >>>>> >>>>> chosen_lines = [] >>>>> for i, line in enumerate(file(filename)): >>>>> if i < k: >>>>> chosen_lines.append(line) >>>>> else: >>>>> if random() < (1.0 / (i+1)): >>>>> chosen_lines[randrange(k)] = line >>>>> >>>>> print chosen_lines >>>> >>>> We have a winner!!! There is actually a very simple proof on how and why this works. >>> >>> Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. >>> >>> What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. >>> >>> What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. >>> >>> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. >>> >>> What am I doing wrong? :-( >> >> Nothing except you stop the induction at step 3... > > ... which is the last step in this case. There are only three lines. > > p(a) = 5/6 > p(b) = 5/6 > p(c) = 1/3 > > That doesn't seem uniform. > > In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not? > To reiterate the code I was talking about (the solution for k == 1): from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line Assume there are three lines in the file. The algorithm churns through to the last line. We have the following state: i == 2 chosen_line = <one of the first two lines> Then we call randint(0, 2). This gives us a 1 out of 3 chance of chosen_line being the last line (if the random number is 0). Therefore, there is also a 2 out of 3 chance of it being some *other* line (if the random number is 1 or 2). It turns out there is an equal chance for it being either of the two lines. The algorithm starts on the first line, and does randint(0, 0), meaning chosen_line is always set to the first line initially. On the second line, it does randint(0, 1), which means it has a 50% chance of changing chosen_line to the second line. The math continues to work out in this way as the number of lines increases without bound. -- Kirk McDonald | |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Kirk McDonald | Kirk McDonald wrote:
> Ary Borenszweig wrote:
>> Andrei Alexandrescu escribió:
>>> Ary Borenszweig wrote:
>>>> Andrei Alexandrescu escribió:
>>>>> bearophile wrote:
>>>>>> Third solution, this requires a storage of k lines (but you can keep this storage on disk):
>>>>>>
>>>>>> from sys import argv
>>>>>> from random import random, randrange
>>>>>> # randrange gives a random integer in [0, n)
>>>>>>
>>>>>> filename = argv[1]
>>>>>> k = int(argv[2])
>>>>>> assert k > 0
>>>>>>
>>>>>> chosen_lines = []
>>>>>> for i, line in enumerate(file(filename)):
>>>>>> if i < k:
>>>>>> chosen_lines.append(line)
>>>>>> else:
>>>>>> if random() < (1.0 / (i+1)):
>>>>>> chosen_lines[randrange(k)] = line
>>>>>>
>>>>>> print chosen_lines
>>>>>
>>>>> We have a winner!!! There is actually a very simple proof on how and why this works.
>>>>
>>>> Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c.
>>>>
>>>> What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3.
>>>>
>>>> What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6.
>>>>
>>>> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.
>>>>
>>>> What am I doing wrong? :-(
>>>
>>> Nothing except you stop the induction at step 3...
>>
>> ... which is the last step in this case. There are only three lines.
>>
>> p(a) = 5/6
>> p(b) = 5/6
>> p(c) = 1/3
>>
>> That doesn't seem uniform.
>>
>> In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?
>>
>
> To reiterate the code I was talking about (the solution for k == 1):
>
> from random import randint
>
> chosen_line = None
>
> for i, line in enumerate(open('filename')):
> if randint(0, i) == 0:
> chosen_line = line
>
> print chosen_line
>
> Assume there are three lines in the file. The algorithm churns through to the last line. We have the following state:
>
> i == 2
> chosen_line = <one of the first two lines>
>
> Then we call randint(0, 2). This gives us a 1 out of 3 chance of chosen_line being the last line (if the random number is 0). Therefore, there is also a 2 out of 3 chance of it being some *other* line (if the random number is 1 or 2).
>
> It turns out there is an equal chance for it being either of the two lines. The algorithm starts on the first line, and does randint(0, 0), meaning chosen_line is always set to the first line initially. On the second line, it does randint(0, 1), which means it has a 50% chance of changing chosen_line to the second line.
>
> The math continues to work out in this way as the number of lines increases without bound.
That's why I thought it's such a beautiful problem. My proof is inductive over n. The invariant is that the k lines held represent a uniform selection of the n lines seen so far at any moment.
Base: For n <= k, vacuously satisfied.
Inductive step: We have a random selection of n lines so far in array selection, and we add one more line. We need to prove that the new selection is also uniform. The new line has probability k / (n + 1) of being part of the selection. We toss a skewed coin with that probability. If it turns heads, we pick the new element for keeping. Any element in the array should go with equal probability (by the inductive assumption), so we roll a k-sided dice to choose one element to throw away.
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | BEGIN CODE
char[][] choose(int k)
{
char[][] bufs; bufs.length = k;
for(int i=0; i<k; i++)
bufs[i] = ReadALine();
while(!feof(stdin))
{
int i = random(0,k);
bufs[i] = ReadALine();
}
return bufs;
}
END CODE
Andrei Alexandrescu wrote:
> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>
> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>
>
> Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Russell Lewis | Russell Lewis wrote:
> BEGIN CODE
> char[][] choose(int k)
> {
> char[][] bufs; bufs.length = k;
>
> for(int i=0; i<k; i++)
> bufs[i] = ReadALine();
>
> while(!feof(stdin))
> {
> int i = random(0,k);
> bufs[i] = ReadALine();
> }
>
> return bufs;
> }
>
> END CODE
Then the last line will be part of the selection with probability 1.
Andrei
| |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply