October 09, 2008 Re: CORRECTION: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Russell Lewis | Sorry, I overlooked the (obvious) fact that this biases things towards the end of the file. Change the while loop as follows:
BEGIN CODE
int count = k;
while(!feof(stdin))
{
if(random(0,count) >= k)
ReadALine(); // this discards the line
else
{
int i = random(0,k);
bufs[i] = ReadALine();
}
count++;
}
END CODE
Note that I'm assuming that random(a,b) is a function that randomly returns an integer in the range [a,b).
This new version works because the most-recently-read line has a k/count chance of getting picked up into the current working set, and when we do choose to discard some line, we give all of the old lines an equal chance of survival.
I don't have to to formally work out the math right now, but I'm pretty sure that that would give each line in the entire file a k/count chance of being in the set at any given point in the process.
Russ
Russell Lewis wrote:
> BEGIN CODE
> char[][] choose(int k)
> {
> char[][] bufs; bufs.length = k;
>
> for(int i=0; i<k; i++)
> bufs[i] = ReadALine();
>
> while(!feof(stdin))
> {
> int i = random(0,k);
> bufs[i] = ReadALine();
> }
>
> return bufs;
> }
>
> END CODE
>
> Andrei Alexandrescu wrote:
>> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>>
>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>>
>>
>> Andrei
| |||
October 09, 2008 Re: CORRECTION: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Russell Lewis | Russell Lewis wrote:
> Sorry, I overlooked the (obvious) fact that this biases things towards the end of the file. Change the while loop as follows:
>
> BEGIN CODE
> int count = k;
> while(!feof(stdin))
> {
> if(random(0,count) >= k)
> ReadALine(); // this discards the line
> else
> {
> int i = random(0,k);
> bufs[i] = ReadALine();
> }
> count++;
> }
> END CODE
Yah that works. Nice!
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote:
> No go if the file is a stream. Can it be done in one pass?
No, because without knowing the number of lines, you cannot produce a uniform distribution.
Unless you're willing to make a guess based on the file size.
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | Walter Bright wrote:
> Andrei Alexandrescu wrote:
>> No go if the file is a stream. Can it be done in one pass?
>
> No, because without knowing the number of lines, you cannot produce a uniform distribution.
>
> Unless you're willing to make a guess based on the file size.
I am getting revenge for "I never use anything in std.algorithm" :o).
Andrei
| |||
October 09, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote:
> Walter Bright wrote:
>> Andrei Alexandrescu wrote:
>>> No go if the file is a stream. Can it be done in one pass?
>>
>> No, because without knowing the number of lines, you cannot produce a uniform distribution.
>>
>> Unless you're willing to make a guess based on the file size.
>
> I am getting revenge for "I never use anything in std.algorithm" :o).
Get in line for everyone getting their revenge on me!
| |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote:
> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>
> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>
>
> Andrei
You can't do a uniform random distribution without knowing the length.
Your options are:
- Hold the entire file in memory.
- Hold the entire file so far in memory. Once you get over a certain threshold, remove lines randomly.
You have to tweak the random distribution for the axing method to make sure it's uniform. If this is going to be used often on huge data sets, that is a reasonable effort.
How would you need to tweak the random distribution? The easiest way is to get two random numbers: what section to remove lines from and which line in that section to remove stuff from. You can work out some annoying recurrence to find out how to get the probability of axing from one section given that it has been available for k rounds.
I might work out the recurrence tonight, but I don't have a chalkboard.
| |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Reply to Andrei,
> Walter Bright wrote:
>
>> Andrei Alexandrescu wrote:
>>
>>> No go if the file is a stream. Can it be done in one pass?
>>>
>> No, because without knowing the number of lines, you cannot produce a
>> uniform distribution.
>>
>> Unless you're willing to make a guess based on the file size.
>>
> I am getting revenge for "I never use anything in std.algorithm" :o).
>
> Andrei
>
"Et tu, Brute!" - (Julius Caesar, Act III, Scene I).
| |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Christopher Wright | : You can't do a uniform random distribution without knowing the length. Probably true for other distribution. Most certainly not true for uniform distribution (with a raisonable k) You can work on a subset of the file. Let say 1000 records. The distribution being uniform, you can select (or eliminate), a percentage of each subset and the results for the whole file will be ok. | |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Carlos | Carlos wrote:
> : You can't do a uniform random distribution without knowing the length.
>
> Probably true for other distribution.
> Most certainly not true for uniform distribution (with a raisonable k)
>
> You can work on a subset of the file. Let say 1000 records.
> The distribution being uniform, you can select (or eliminate),
> a percentage of each subset and the results for the whole file
> will be ok.
I think you can do even nonuniform distributions. The number of samples seen should not influence your subsampling decision.
Andrei
| |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote:
> Carlos wrote:
>> : You can't do a uniform random distribution without knowing the length.
>>
>> Probably true for other distribution.
>> Most certainly not true for uniform distribution (with a raisonable k)
>>
>> You can work on a subset of the file. Let say 1000 records.
>> The distribution being uniform, you can select (or eliminate),
>> a percentage of each subset and the results for the whole file
>> will be ok.
>
> I think you can do even nonuniform distributions. The number of samples seen should not influence your subsampling decision.
I think "the number of total samples..." is the correct statement.
Andrei
| |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply