October 10, 2008
Jarrett Billingsley wrote:
> On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden@gmail.com> wrote:
>> On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley
>> <jarrett.billingsley@gmail.com> wrote:
>>
>>> On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
>>> <SeeWebsiteForEmail@erdani.org> wrote:
>>>
>>>> On a different topic, your use of std.stream reminded me. I was thinking
>>>> of
>>>> yanking it from Phobos. Thoughts?
>>> ..what?  You want to remove most of Phobos' IO capabilities?
>>>
>>> I'm guessing there's something else you'd like to replace it with, but
>>> I wonder what it could be.
>> Yeah, why need them when Tango is soon to be used with Phobos side-by-side
>> ;)
>>
> 
> That's kind of what I'm thinking.  It seems like a waste of time for
> _another_ IO framework to be developed when we have two already.
> Priorities, priorities.  How about fixing frontend bugs instead?

Well Walter wouldn't be working on it anyway. The thing is, if we want
e.g. std.algorithm to work on I/O, there must be integration of I/O with
the range concept. As far as I can see things, ranges will be quite a
powerful currency allowing many std components to trade data with one
another.

Andrei

October 10, 2008
Sergey Gromov:
> While experimenting with Bearophile's loader
> > http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=13402
> I've found out that BufferedFile is the fastest available way to process a file line-by-line.  So the alternative should be as fast.

BufferedFile seems the faster in Phobos, but it's 2-3 times slower than the line-by-line reading you have in Python, and I think it has 4 bugs, so I don't use it much, that's why I use the xfile() of my libs, that is a little slower, but without those bugs.

Few bugs of BufferedFile can be seen if you try to load a very large file made of a single line (crashes D every time), or you try reading lines with \0 at the end of the line, etc. They are corner cases, usually.

I suggest reading the C sources of the I/O of Python (or Perl, if you are able to read the sources. I think the differences in clarity between Python and Perl can be seen even in their C sources).

Bye,
bearophile
October 10, 2008
On Fri, 10 Oct 2008 16:53:29 +0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> Jarrett Billingsley wrote:
>> On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden@gmail.com> wrote:
>>> On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley
>>> <jarrett.billingsley@gmail.com> wrote:
>>>
>>>> On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
>>>> <SeeWebsiteForEmail@erdani.org> wrote:
>>>>
>>>>> On a different topic, your use of std.stream reminded me. I was thinking
>>>>> of
>>>>> yanking it from Phobos. Thoughts?
>>>> ..what?  You want to remove most of Phobos' IO capabilities?
>>>>
>>>> I'm guessing there's something else you'd like to replace it with, but
>>>> I wonder what it could be.
>>> Yeah, why need them when Tango is soon to be used with Phobos side-by-side
>>> ;)
>>>
>>  That's kind of what I'm thinking.  It seems like a waste of time for
>> _another_ IO framework to be developed when we have two already.
>> Priorities, priorities.  How about fixing frontend bugs instead?
>
> Well Walter wouldn't be working on it anyway. The thing is, if we want
> e.g. std.algorithm to work on I/O, there must be integration of I/O with
> the range concept. As far as I can see things, ranges will be quite a
> powerful currency allowing many std components to trade data with one
> another.
>
> Andrei
>

Great! Then, how about getting familiar with Tango IO and building your concept on top of it? ;)
October 10, 2008
Andrei Alexandrescu wrote:
> Ary Borenszweig wrote:
>> Andrei Alexandrescu escribió:
>>> Ary Borenszweig wrote:
>>>> Andrei Alexandrescu escribió:
>>>>> bearophile wrote:
>>>>>> Third solution, this requires a storage of k lines (but you can keep this storage on disk):
>>>>>>
>>>>>> from sys import argv
>>>>>> from random import random, randrange
>>>>>> # randrange gives a random integer in [0, n)
>>>>>>
>>>>>> filename = argv[1]
>>>>>> k = int(argv[2])
>>>>>> assert k > 0
>>>>>>
>>>>>> chosen_lines = []
>>>>>> for i, line in enumerate(file(filename)):
>>>>>>     if i < k:
>>>>>>         chosen_lines.append(line)
>>>>>>     else:
>>>>>>         if random() < (1.0 / (i+1)):
>>>>>>             chosen_lines[randrange(k)] = line
>>>>>>
>>>>>> print chosen_lines
>>>>>
>>>>> We have a winner!!! There is actually a very simple proof on how and why this works.
>>>>
>>>> Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c.
>>>>
>>>> What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3.
>>>>
>>>> What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6.
>>>>
>>>> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.
>>>>
>>>> What am I doing wrong? :-(
>>>
>>> Nothing except you stop the induction at step 3...
>>
>> ... which is the last step in this case. There are only three lines.
>>
>> p(a) = 5/6
>> p(b) = 5/6
>> p(c) = 1/3
>>
>> That doesn't seem uniform.
>>
>> In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?
> 
> Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it.
> 
> Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.

Ah, ok. It works if you do that. So bearophile's code must say

if random() > (1.0 / (i+1)):

instead of

if random() < (1.0 / (i+1)):

(that is, if random() returns a real number between 0 and 1 with uniform distribution)

> 
> 
> Andrei
October 10, 2008
Ary Borenszweig wrote:
> Andrei Alexandrescu wrote:
>> Ary Borenszweig wrote:
>>> Andrei Alexandrescu escribió:
>>>> Ary Borenszweig wrote:
>>>>> Andrei Alexandrescu escribió:
>>>>>> bearophile wrote:
>>>>>>> Third solution, this requires a storage of k lines (but you can keep this storage on disk):
>>>>>>>
>>>>>>> from sys import argv
>>>>>>> from random import random, randrange
>>>>>>> # randrange gives a random integer in [0, n)
>>>>>>>
>>>>>>> filename = argv[1]
>>>>>>> k = int(argv[2])
>>>>>>> assert k > 0
>>>>>>>
>>>>>>> chosen_lines = []
>>>>>>> for i, line in enumerate(file(filename)):
>>>>>>>     if i < k:
>>>>>>>         chosen_lines.append(line)
>>>>>>>     else:
>>>>>>>         if random() < (1.0 / (i+1)):
>>>>>>>             chosen_lines[randrange(k)] = line
>>>>>>>
>>>>>>> print chosen_lines
>>>>>>
>>>>>> We have a winner!!! There is actually a very simple proof on how and why this works.
>>>>>
>>>>> Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c.
>>>>>
>>>>> What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3.
>>>>>
>>>>> What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6.
>>>>>
>>>>> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.
>>>>>
>>>>> What am I doing wrong? :-(
>>>>
>>>> Nothing except you stop the induction at step 3...
>>>
>>> ... which is the last step in this case. There are only three lines.
>>>
>>> p(a) = 5/6
>>> p(b) = 5/6
>>> p(c) = 1/3
>>>
>>> That doesn't seem uniform.
>>>
>>> In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?
>>
>> Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it.
>>
>> Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.
> 
> Ah, ok. It works if you do that. So bearophile's code must say
> 
> if random() > (1.0 / (i+1)):
> 
> instead of
> 
> if random() < (1.0 / (i+1)):
> 
> (that is, if random() returns a real number between 0 and 1 with uniform distribution)

His prize gets retired. Now how does Kirk's code fare? :o)

Andrei
October 10, 2008
Denis Koroskin wrote:
> On Fri, 10 Oct 2008 16:53:29 +0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> 
>> Jarrett Billingsley wrote:
>>> On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden@gmail.com> wrote:
>>>> On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley
>>>> <jarrett.billingsley@gmail.com> wrote:
>>>>
>>>>> On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu
>>>>> <SeeWebsiteForEmail@erdani.org> wrote:
>>>>>
>>>>>> On a different topic, your use of std.stream reminded me. I was thinking
>>>>>> of
>>>>>> yanking it from Phobos. Thoughts?
>>>>> ..what?  You want to remove most of Phobos' IO capabilities?
>>>>>
>>>>> I'm guessing there's something else you'd like to replace it with, but
>>>>> I wonder what it could be.
>>>> Yeah, why need them when Tango is soon to be used with Phobos side-by-side
>>>> ;)
>>>>
>>>  That's kind of what I'm thinking.  It seems like a waste of time for
>>> _another_ IO framework to be developed when we have two already.
>>> Priorities, priorities.  How about fixing frontend bugs instead?
>>
>> Well Walter wouldn't be working on it anyway. The thing is, if we want
>> e.g. std.algorithm to work on I/O, there must be integration of I/O with
>> the range concept. As far as I can see things, ranges will be quite a
>> powerful currency allowing many std components to trade data with one
>> another.
>>
>> Andrei
>>
> 
> Great! Then, how about getting familiar with Tango IO and building your concept on top of it? ;)

That would make Phobos dependent on Tango, thus running into those licensing issues I have little understanding of.

Andrei
October 10, 2008
Andrei Alexandrescu wrote:
> Ary Borenszweig wrote:
>> Andrei Alexandrescu wrote:
>>> Ary Borenszweig wrote:
>>>> Andrei Alexandrescu escribió:
>>>>> Ary Borenszweig wrote:
>>>>>> Andrei Alexandrescu escribió:
>>>>>>> bearophile wrote:
>>>>>>>> Third solution, this requires a storage of k lines (but you can keep this storage on disk):
>>>>>>>>
>>>>>>>> from sys import argv
>>>>>>>> from random import random, randrange
>>>>>>>> # randrange gives a random integer in [0, n)
>>>>>>>>
>>>>>>>> filename = argv[1]
>>>>>>>> k = int(argv[2])
>>>>>>>> assert k > 0
>>>>>>>>
>>>>>>>> chosen_lines = []
>>>>>>>> for i, line in enumerate(file(filename)):
>>>>>>>>     if i < k:
>>>>>>>>         chosen_lines.append(line)
>>>>>>>>     else:
>>>>>>>>         if random() < (1.0 / (i+1)):
>>>>>>>>             chosen_lines[randrange(k)] = line
>>>>>>>>
>>>>>>>> print chosen_lines
>>>>>>>
>>>>>>> We have a winner!!! There is actually a very simple proof on how and why this works.
>>>>>>
>>>>>> Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c.
>>>>>>
>>>>>> What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3.
>>>>>>
>>>>>> What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6.
>>>>>>
>>>>>> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.
>>>>>>
>>>>>> What am I doing wrong? :-(
>>>>>
>>>>> Nothing except you stop the induction at step 3...
>>>>
>>>> ... which is the last step in this case. There are only three lines.
>>>>
>>>> p(a) = 5/6
>>>> p(b) = 5/6
>>>> p(c) = 1/3
>>>>
>>>> That doesn't seem uniform.
>>>>
>>>> In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?
>>>
>>> Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it.
>>>
>>> Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.
>>
>> Ah, ok. It works if you do that. So bearophile's code must say
>>
>> if random() > (1.0 / (i+1)):
>>
>> instead of
>>
>> if random() < (1.0 / (i+1)):
>>
>> (that is, if random() returns a real number between 0 and 1 with uniform distribution)
> 
> His prize gets retired. Now how does Kirk's code fare? :o)

I don't have Kirk's code here at work (I told Thunderbird not to download everything I had in my home)... but if I remember correctly, it had the same problem, it was something like:

if (0 == uniform(0, k)):

or something like that, and should be 0 != ...

I wonder why both used the wrong probability...

(that's why I made the math, because I wasn't sure with that probability it worked)

I really liked the problem! It has a very simple and elegant solution... I thought the probability would be something more complex, but it turned out to be simple. :-)
October 10, 2008
Andrei Alexandrescu:
> > Ah, ok. It works if you do that. So bearophile's code must say
> > 
> > if random() > (1.0 / (i+1)):
> > 
> > instead of
> > 
> > if random() < (1.0 / (i+1)):
> > 
> > (that is, if random() returns a real number between 0 and 1 with uniform distribution)
> 
> His prize gets retired. Now how does Kirk's code fare? :o)

I don't like contests...

This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc.

The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335

from random import random, randrange
from collections import defaultdict

def main(k, values, freqs, nrepeats):
    for i in xrange(nrepeats):
        chosen = [None] * k
        for i, line in enumerate(values):
            if i < k:
                chosen.append(line)
            else:
                if random() < (1.0 / (i+1)):
                    chosen[randrange(k)] = line

        for c in chosen:
            freqs[c] += 1

    return freqs

import psyco; psyco.full()
k = 15
n = 100
values = range(n)
freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000)
print [freqs.get(i, 0) for i in xrange(n)]


My first version works quite probably, and probably the second one too.

I think there are ways to solve this problem keeping only k lines, but the code isn't obvious.

Bye,
bearophile
October 10, 2008
bearophile wrote:
> Andrei Alexandrescu:
>>> Ah, ok. It works if you do that. So bearophile's code must say
>>>
>>> if random() > (1.0 / (i+1)):
>>>
>>> instead of
>>>
>>> if random() < (1.0 / (i+1)):
>>>
>>> (that is, if random() returns a real number between 0 and 1 with uniform distribution)
>> His prize gets retired. Now how does Kirk's code fare? :o)
> 
> I don't like contests...
> 
> This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc.
> 
> The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known:
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335

No, that's a different problem. They want to extract n random samples from N records *without replacement*. It's a very different ballgame. See http://www.cs.duke.edu/~jsv/Papers/catalog/node139.html

The problem as I stated is reasonably easy and my implementation should be correct.


Andrei

October 10, 2008
Fri, 10 Oct 2008 10:55:51 -0400,
bearophile wrote:
> Andrei Alexandrescu:
> > > Ah, ok. It works if you do that. So bearophile's code must say
> > > 
> > > if random() > (1.0 / (i+1)):
> > > 
> > > instead of
> > > 
> > > if random() < (1.0 / (i+1)):
> > > 
> > > (that is, if random() returns a real number between 0 and 1 with uniform distribution)
> > 
> > His prize gets retired. Now how does Kirk's code fare? :o)
> 
> I don't like contests...
> 
> This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc.
> 
> The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known:
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335
> 
> from random import random, randrange
> from collections import defaultdict
> 
> def main(k, values, freqs, nrepeats):
>     for i in xrange(nrepeats):
>         chosen = [None] * k
>         for i, line in enumerate(values):
>             if i < k:
>                 chosen.append(line)
>             else:
>                 if random() < (1.0 / (i+1)):
>                     chosen[randrange(k)] = line
> 
>         for c in chosen:
>             freqs[c] += 1
> 
>     return freqs
> 
> import psyco; psyco.full()
> k = 15
> n = 100
> values = range(n)
> freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000)
> print [freqs.get(i, 0) for i in xrange(n)]
> 
> 
> My first version works quite probably, and probably the second one too.
> 
> I think there are ways to solve this problem keeping only k lines, but the code isn't obvious.

I think your algorithm has one mistake.  The last line of your text gets into the k with probability 1/n while it should be k/n.  Replace 1.0 with 1.0*k and the distribution will be absolutely (theoretically) uniform.