October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Thu, 09 Oct 2008 13:56:48 -0500,
Andrei Alexandrescu wrote:
> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>
> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
Here's mine
module k_sample;
import std.conv: to;
import std.stream: BufferedFile;
import std.random: Random, uniform, unpredictableSeed;
import std.stdio: writefln;
void main(string[] args)
{
auto f = new BufferedFile(args[1]);
auto k = to!(int)(args[2]);
assert(k > 0);
struct Sample
{
string text;
ulong line;
this(string s, ulong n)
{
text = s;
line = n;
}
}
Sample[] samples;
auto gen = Random(unpredictableSeed);
foreach (ulong n, char[] s; f)
{
if (samples.length < k)
samples ~= Sample(s.idup, n+1);
else if (uniform(gen, 0., 1.) < cast(double) k / (n+1))
samples[uniform(gen, 0, k)] = Sample(s.idup, n+1);
}
foreach (sample; samples)
writefln("%8s: %s", sample.line, sample.text);
}
Usage: k-sample <file> <k>
It's easy to prove that a probability of any particular line appearing in the samples is k/n where n is the total number of lines in the text.
| |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Sergey Gromov | Sergey Gromov wrote:
> Thu, 09 Oct 2008 13:56:48 -0500,
> Andrei Alexandrescu wrote:
>> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes.
>>
>> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?
>
> Here's mine
>
> module k_sample;
>
> import std.conv: to;
> import std.stream: BufferedFile;
> import std.random: Random, uniform, unpredictableSeed;
> import std.stdio: writefln;
>
> void main(string[] args)
> {
> auto f = new BufferedFile(args[1]);
> auto k = to!(int)(args[2]);
> assert(k > 0);
>
> struct Sample
> {
> string text;
> ulong line;
> this(string s, ulong n)
> {
> text = s;
> line = n;
> }
> }
>
> Sample[] samples;
> auto gen = Random(unpredictableSeed);
>
> foreach (ulong n, char[] s; f)
> {
> if (samples.length < k)
> samples ~= Sample(s.idup, n+1);
> else if (uniform(gen, 0., 1.) < cast(double) k / (n+1))
> samples[uniform(gen, 0, k)] = Sample(s.idup, n+1);
> }
>
> foreach (sample; samples)
> writefln("%8s: %s", sample.line, sample.text);
> }
>
> Usage: k-sample <file> <k>
>
> It's easy to prove that a probability of any particular line appearing in the samples is k/n where n is the total number of lines in the text.
Sweet! Suggestion: replace Sample's definition with:
alias Tuple!(string, "text", ulong, "line") Sample;
On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts?
Andrei
| |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote: > On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts? ..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be. | |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Jarrett Billingsley | Jarrett Billingsley wrote:
> On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On a different topic, your use of std.stream reminded me. I was thinking of
>> yanking it from Phobos. Thoughts?
>
> ..what? You want to remove most of Phobos' IO capabilities?
>
> I'm guessing there's something else you'd like to replace it with, but
> I wonder what it could be.
Well I'd like to extend std.stdio with capabilities that meet and exceed those in std.stream. The design of std.stream is ok, but makes it hard to compose streams, and requires a lot of hand coding to define new streams.
Andrei
| |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Jarrett Billingsley | On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote: > On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu > <SeeWebsiteForEmail@erdani.org> wrote: > >> On a different topic, your use of std.stream reminded me. I was thinking of >> yanking it from Phobos. Thoughts? > > ..what? You want to remove most of Phobos' IO capabilities? > > I'm guessing there's something else you'd like to replace it with, but > I wonder what it could be. Yeah, why need them when Tango is soon to be used with Phobos side-by-side ;) | |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On 2008-10-09 14:56:48 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said: > I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. > > The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? My attempt. Works in one pass, and distributes uniformly but only when k is a factor of the number of lines in the file. At least, I believe the distribution is uniform, but feel free to prove me wrong. /* Uniform random value r, with r >= min && r < max. */ real random(uint min, uint max) { ... } /* Shuffle elements uniformly in the given string array. */ void shuffle(ref string[] array) { ... } string[] challenge(alias R)(R input, uint k) { string[] result; uint line; real pickFactor = 1; // Initial stage: fill initial k strings while (!input.eof && line < k) { result ~= input.readLine; ++line; } shuffle(result); while (!input.eof) { // Next stage: half the chances of picking a candidate. pickFactor /= 2; string[] candidates; line = 0; while (!input.eof && line < k) { candidates ~= input.readLine; ++line; } shuffle(candidates); for (uint i; i < candidates.length; ++i) if (random(0, 1) < pickFactor) result[i] = candidates[i]; // Note: if candidate.length != result.length // we don't have uniformity. } return result; } -- Michel Fortin michel.fortin@michelf.com http://michelf.com/ | |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Thu, 09 Oct 2008 22:28:51 -0500, Andrei Alexandrescu wrote: > Sergey Gromov wrote: > > Thu, 09 Oct 2008 13:56:48 -0500, > > Andrei Alexandrescu wrote: > >> I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. > >> > >> The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? > > > > Here's mine > > > > module k_sample; > > > > import std.conv: to; > > import std.stream: BufferedFile; > > import std.random: Random, uniform, unpredictableSeed; > > import std.stdio: writefln; > > > > void main(string[] args) > > { > > auto f = new BufferedFile(args[1]); > > auto k = to!(int)(args[2]); > > assert(k > 0); > > > > struct Sample > > { > > string text; > > ulong line; > > this(string s, ulong n) > > { > > text = s; > > line = n; > > } > > } > > > > Sample[] samples; > > auto gen = Random(unpredictableSeed); > > > > foreach (ulong n, char[] s; f) > > { > > if (samples.length < k) > > samples ~= Sample(s.idup, n+1); > > else if (uniform(gen, 0., 1.) < cast(double) k / (n+1)) > > samples[uniform(gen, 0, k)] = Sample(s.idup, n+1); > > } > > > > foreach (sample; samples) > > writefln("%8s: %s", sample.line, sample.text); > > } > > > > Usage: k-sample <file> <k> > > > > It's easy to prove that a probability of any particular line appearing in the samples is k/n where n is the total number of lines in the text. > > Sweet! Suggestion: replace Sample's definition with: > > alias Tuple!(string, "text", ulong, "line") Sample; Good advice, thanks. The Sample definition is unimportant for the solution but eats up a lot of visual space. > On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts? While experimenting with Bearophile's loader > http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=13402 I've found out that BufferedFile is the fastest available way to process a file line-by-line. So the alternative should be as fast. | |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote:
> enforce(k > 0, "Must pass a strictly positive selection size");
Out of curiosity, why are you using enforce rather than assert?
I've written an assertions framework, but it provides some more options than assert:
expect.because ("Must pass a strictly positive selection size").that (k).greaterThan (0);
If that failed, it would produce:
AssertionException: Expected: greater than 0 but was: -4 -- Must pass a strictly positive selection size
| |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Christopher Wright | Christopher Wright wrote: > Andrei Alexandrescu wrote: >> enforce(k > 0, "Must pass a strictly positive selection size"); > > Out of curiosity, why are you using enforce rather than assert? Enforce is never ignored, even in production builds. You wouldn't want a release build of your app to skip checking command line arguments. > I've written an assertions framework, but it provides some more options than assert: > > expect.because ("Must pass a strictly positive selection size").that (k).greaterThan (0); > > If that failed, it would produce: > AssertionException: Expected: greater than 0 but was: -4 -- Must pass a strictly positive selection size That looks pretty darn nice. Andrei | |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Denis Koroskin | On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden@gmail.com> wrote:
> On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote:
>
>> On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
>>
>>> On a different topic, your use of std.stream reminded me. I was thinking
>>> of
>>> yanking it from Phobos. Thoughts?
>>
>> ..what? You want to remove most of Phobos' IO capabilities?
>>
>> I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.
>
> Yeah, why need them when Tango is soon to be used with Phobos side-by-side ;)
>
That's kind of what I'm thinking. It seems like a waste of time for _another_ IO framework to be developed when we have two already. Priorities, priorities. How about fixing frontend bugs instead?
| |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply