October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Jarrett Billingsley | Jarrett Billingsley wrote:
> On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden@gmail.com> wrote:
>> On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley
>> <jarrett.billingsley@gmail.com> wrote:
>>
>>> On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
>>> <SeeWebsiteForEmail@erdani.org> wrote:
>>>
>>>> On a different topic, your use of std.stream reminded me. I was thinking
>>>> of
>>>> yanking it from Phobos. Thoughts?
>>> ..what? You want to remove most of Phobos' IO capabilities?
>>>
>>> I'm guessing there's something else you'd like to replace it with, but
>>> I wonder what it could be.
>> Yeah, why need them when Tango is soon to be used with Phobos side-by-side
>> ;)
>>
>
> That's kind of what I'm thinking. It seems like a waste of time for
> _another_ IO framework to be developed when we have two already.
> Priorities, priorities. How about fixing frontend bugs instead?
Well Walter wouldn't be working on it anyway. The thing is, if we want
e.g. std.algorithm to work on I/O, there must be integration of I/O with
the range concept. As far as I can see things, ranges will be quite a
powerful currency allowing many std components to trade data with one
another.
Andrei
| |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Sergey Gromov | Sergey Gromov:
> While experimenting with Bearophile's loader
> > http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=13402
> I've found out that BufferedFile is the fastest available way to process a file line-by-line. So the alternative should be as fast.
BufferedFile seems the faster in Phobos, but it's 2-3 times slower than the line-by-line reading you have in Python, and I think it has 4 bugs, so I don't use it much, that's why I use the xfile() of my libs, that is a little slower, but without those bugs.
Few bugs of BufferedFile can be seen if you try to load a very large file made of a single line (crashes D every time), or you try reading lines with \0 at the end of the line, etc. They are corner cases, usually.
I suggest reading the C sources of the I/O of Python (or Perl, if you are able to read the sources. I think the differences in clarity between Python and Perl can be seen even in their C sources).
Bye,
bearophile
| |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Fri, 10 Oct 2008 16:53:29 +0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote: > Jarrett Billingsley wrote: >> On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden@gmail.com> wrote: >>> On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley >>> <jarrett.billingsley@gmail.com> wrote: >>> >>>> On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu >>>> <SeeWebsiteForEmail@erdani.org> wrote: >>>> >>>>> On a different topic, your use of std.stream reminded me. I was thinking >>>>> of >>>>> yanking it from Phobos. Thoughts? >>>> ..what? You want to remove most of Phobos' IO capabilities? >>>> >>>> I'm guessing there's something else you'd like to replace it with, but >>>> I wonder what it could be. >>> Yeah, why need them when Tango is soon to be used with Phobos side-by-side >>> ;) >>> >> That's kind of what I'm thinking. It seems like a waste of time for >> _another_ IO framework to be developed when we have two already. >> Priorities, priorities. How about fixing frontend bugs instead? > > Well Walter wouldn't be working on it anyway. The thing is, if we want > e.g. std.algorithm to work on I/O, there must be integration of I/O with > the range concept. As far as I can see things, ranges will be quite a > powerful currency allowing many std components to trade data with one > another. > > Andrei > Great! Then, how about getting familiar with Tango IO and building your concept on top of it? ;) | |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote: > Ary Borenszweig wrote: >> Andrei Alexandrescu escribió: >>> Ary Borenszweig wrote: >>>> Andrei Alexandrescu escribió: >>>>> bearophile wrote: >>>>>> Third solution, this requires a storage of k lines (but you can keep this storage on disk): >>>>>> >>>>>> from sys import argv >>>>>> from random import random, randrange >>>>>> # randrange gives a random integer in [0, n) >>>>>> >>>>>> filename = argv[1] >>>>>> k = int(argv[2]) >>>>>> assert k > 0 >>>>>> >>>>>> chosen_lines = [] >>>>>> for i, line in enumerate(file(filename)): >>>>>> if i < k: >>>>>> chosen_lines.append(line) >>>>>> else: >>>>>> if random() < (1.0 / (i+1)): >>>>>> chosen_lines[randrange(k)] = line >>>>>> >>>>>> print chosen_lines >>>>> >>>>> We have a winner!!! There is actually a very simple proof on how and why this works. >>>> >>>> Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. >>>> >>>> What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. >>>> >>>> What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. >>>> >>>> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. >>>> >>>> What am I doing wrong? :-( >>> >>> Nothing except you stop the induction at step 3... >> >> ... which is the last step in this case. There are only three lines. >> >> p(a) = 5/6 >> p(b) = 5/6 >> p(c) = 1/3 >> >> That doesn't seem uniform. >> >> In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not? > > Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it. > > Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should. Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution) > > > Andrei | |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Ary Borenszweig | Ary Borenszweig wrote:
> Andrei Alexandrescu wrote:
>> Ary Borenszweig wrote:
>>> Andrei Alexandrescu escribió:
>>>> Ary Borenszweig wrote:
>>>>> Andrei Alexandrescu escribió:
>>>>>> bearophile wrote:
>>>>>>> Third solution, this requires a storage of k lines (but you can keep this storage on disk):
>>>>>>>
>>>>>>> from sys import argv
>>>>>>> from random import random, randrange
>>>>>>> # randrange gives a random integer in [0, n)
>>>>>>>
>>>>>>> filename = argv[1]
>>>>>>> k = int(argv[2])
>>>>>>> assert k > 0
>>>>>>>
>>>>>>> chosen_lines = []
>>>>>>> for i, line in enumerate(file(filename)):
>>>>>>> if i < k:
>>>>>>> chosen_lines.append(line)
>>>>>>> else:
>>>>>>> if random() < (1.0 / (i+1)):
>>>>>>> chosen_lines[randrange(k)] = line
>>>>>>>
>>>>>>> print chosen_lines
>>>>>>
>>>>>> We have a winner!!! There is actually a very simple proof on how and why this works.
>>>>>
>>>>> Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c.
>>>>>
>>>>> What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3.
>>>>>
>>>>> What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6.
>>>>>
>>>>> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.
>>>>>
>>>>> What am I doing wrong? :-(
>>>>
>>>> Nothing except you stop the induction at step 3...
>>>
>>> ... which is the last step in this case. There are only three lines.
>>>
>>> p(a) = 5/6
>>> p(b) = 5/6
>>> p(c) = 1/3
>>>
>>> That doesn't seem uniform.
>>>
>>> In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?
>>
>> Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it.
>>
>> Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.
>
> Ah, ok. It works if you do that. So bearophile's code must say
>
> if random() > (1.0 / (i+1)):
>
> instead of
>
> if random() < (1.0 / (i+1)):
>
> (that is, if random() returns a real number between 0 and 1 with uniform distribution)
His prize gets retired. Now how does Kirk's code fare? :o)
Andrei
| |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Denis Koroskin | Denis Koroskin wrote:
> On Fri, 10 Oct 2008 16:53:29 +0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
>
>> Jarrett Billingsley wrote:
>>> On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden@gmail.com> wrote:
>>>> On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley
>>>> <jarrett.billingsley@gmail.com> wrote:
>>>>
>>>>> On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
>>>>> <SeeWebsiteForEmail@erdani.org> wrote:
>>>>>
>>>>>> On a different topic, your use of std.stream reminded me. I was thinking
>>>>>> of
>>>>>> yanking it from Phobos. Thoughts?
>>>>> ..what? You want to remove most of Phobos' IO capabilities?
>>>>>
>>>>> I'm guessing there's something else you'd like to replace it with, but
>>>>> I wonder what it could be.
>>>> Yeah, why need them when Tango is soon to be used with Phobos side-by-side
>>>> ;)
>>>>
>>> That's kind of what I'm thinking. It seems like a waste of time for
>>> _another_ IO framework to be developed when we have two already.
>>> Priorities, priorities. How about fixing frontend bugs instead?
>>
>> Well Walter wouldn't be working on it anyway. The thing is, if we want
>> e.g. std.algorithm to work on I/O, there must be integration of I/O with
>> the range concept. As far as I can see things, ranges will be quite a
>> powerful currency allowing many std components to trade data with one
>> another.
>>
>> Andrei
>>
>
> Great! Then, how about getting familiar with Tango IO and building your concept on top of it? ;)
That would make Phobos dependent on Tango, thus running into those licensing issues I have little understanding of.
Andrei
| |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote:
> Ary Borenszweig wrote:
>> Andrei Alexandrescu wrote:
>>> Ary Borenszweig wrote:
>>>> Andrei Alexandrescu escribió:
>>>>> Ary Borenszweig wrote:
>>>>>> Andrei Alexandrescu escribió:
>>>>>>> bearophile wrote:
>>>>>>>> Third solution, this requires a storage of k lines (but you can keep this storage on disk):
>>>>>>>>
>>>>>>>> from sys import argv
>>>>>>>> from random import random, randrange
>>>>>>>> # randrange gives a random integer in [0, n)
>>>>>>>>
>>>>>>>> filename = argv[1]
>>>>>>>> k = int(argv[2])
>>>>>>>> assert k > 0
>>>>>>>>
>>>>>>>> chosen_lines = []
>>>>>>>> for i, line in enumerate(file(filename)):
>>>>>>>> if i < k:
>>>>>>>> chosen_lines.append(line)
>>>>>>>> else:
>>>>>>>> if random() < (1.0 / (i+1)):
>>>>>>>> chosen_lines[randrange(k)] = line
>>>>>>>>
>>>>>>>> print chosen_lines
>>>>>>>
>>>>>>> We have a winner!!! There is actually a very simple proof on how and why this works.
>>>>>>
>>>>>> Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c.
>>>>>>
>>>>>> What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3.
>>>>>>
>>>>>> What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6.
>>>>>>
>>>>>> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.
>>>>>>
>>>>>> What am I doing wrong? :-(
>>>>>
>>>>> Nothing except you stop the induction at step 3...
>>>>
>>>> ... which is the last step in this case. There are only three lines.
>>>>
>>>> p(a) = 5/6
>>>> p(b) = 5/6
>>>> p(c) = 1/3
>>>>
>>>> That doesn't seem uniform.
>>>>
>>>> In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?
>>>
>>> Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it.
>>>
>>> Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.
>>
>> Ah, ok. It works if you do that. So bearophile's code must say
>>
>> if random() > (1.0 / (i+1)):
>>
>> instead of
>>
>> if random() < (1.0 / (i+1)):
>>
>> (that is, if random() returns a real number between 0 and 1 with uniform distribution)
>
> His prize gets retired. Now how does Kirk's code fare? :o)
I don't have Kirk's code here at work (I told Thunderbird not to download everything I had in my home)... but if I remember correctly, it had the same problem, it was something like:
if (0 == uniform(0, k)):
or something like that, and should be 0 != ...
I wonder why both used the wrong probability...
(that's why I made the math, because I wasn't sure with that probability it worked)
I really liked the problem! It has a very simple and elegant solution... I thought the probability would be something more complex, but it turned out to be simple. :-)
| |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu: > > Ah, ok. It works if you do that. So bearophile's code must say > > > > if random() > (1.0 / (i+1)): > > > > instead of > > > > if random() < (1.0 / (i+1)): > > > > (that is, if random() returns a real number between 0 and 1 with uniform distribution) > > His prize gets retired. Now how does Kirk's code fare? :o) I don't like contests... This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc. The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335 from random import random, randrange from collections import defaultdict def main(k, values, freqs, nrepeats): for i in xrange(nrepeats): chosen = [None] * k for i, line in enumerate(values): if i < k: chosen.append(line) else: if random() < (1.0 / (i+1)): chosen[randrange(k)] = line for c in chosen: freqs[c] += 1 return freqs import psyco; psyco.full() k = 15 n = 100 values = range(n) freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000) print [freqs.get(i, 0) for i in xrange(n)] My first version works quite probably, and probably the second one too. I think there are ways to solve this problem keeping only k lines, but the code isn't obvious. Bye, bearophile | |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to bearophile | bearophile wrote: > Andrei Alexandrescu: >>> Ah, ok. It works if you do that. So bearophile's code must say >>> >>> if random() > (1.0 / (i+1)): >>> >>> instead of >>> >>> if random() < (1.0 / (i+1)): >>> >>> (that is, if random() returns a real number between 0 and 1 with uniform distribution) >> His prize gets retired. Now how does Kirk's code fare? :o) > > I don't like contests... > > This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc. > > The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335 No, that's a different problem. They want to extract n random samples from N records *without replacement*. It's a very different ballgame. See http://www.cs.duke.edu/~jsv/Papers/catalog/node139.html The problem as I stated is reasonably easy and my implementation should be correct. Andrei | |||
October 10, 2008 Re: random k-sample of a file | ||||
|---|---|---|---|---|
| ||||
Posted in reply to bearophile | Fri, 10 Oct 2008 10:55:51 -0400,
bearophile wrote:
> Andrei Alexandrescu:
> > > Ah, ok. It works if you do that. So bearophile's code must say
> > >
> > > if random() > (1.0 / (i+1)):
> > >
> > > instead of
> > >
> > > if random() < (1.0 / (i+1)):
> > >
> > > (that is, if random() returns a real number between 0 and 1 with uniform distribution)
> >
> > His prize gets retired. Now how does Kirk's code fare? :o)
>
> I don't like contests...
>
> This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc.
>
> The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known:
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335
>
> from random import random, randrange
> from collections import defaultdict
>
> def main(k, values, freqs, nrepeats):
> for i in xrange(nrepeats):
> chosen = [None] * k
> for i, line in enumerate(values):
> if i < k:
> chosen.append(line)
> else:
> if random() < (1.0 / (i+1)):
> chosen[randrange(k)] = line
>
> for c in chosen:
> freqs[c] += 1
>
> return freqs
>
> import psyco; psyco.full()
> k = 15
> n = 100
> values = range(n)
> freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000)
> print [freqs.get(i, 0) for i in xrange(n)]
>
>
> My first version works quite probably, and probably the second one too.
>
> I think there are ways to solve this problem keeping only k lines, but the code isn't obvious.
I think your algorithm has one mistake. The last line of your text gets into the k with probability 1/n while it should be k/n. Replace 1.0 with 1.0*k and the distribution will be absolutely (theoretically) uniform.
| |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply