View mode: basic / threaded / horizontal-split · Log in · Help
April 14, 2012
Random sampling in Phobos
Hello all,

This is a follow-up to a discussion on the d-learn mailing list:
http://forum.dlang.org/thread/mailman.1652.1334241994.4860.digitalmars-d-learn@puremagic.com

In short, I've been implementing some random sampling algorithms in D to help 
teach myself how to use the language effectively:
https://github.com/WebDrake/SampleD

This is an adaptation of some code I wrote in C a couple of years ago to help 
deal with simulations where I had to take a very large random sample from an 
even larger total[1].  The question came up whether this might be useful for the 
random sampling tools provided in Phobos.

Phobos' std.random currently contains a RandomSample struct and randomSample 
wrapper function which output a random subsample of the input data.  I would 
identify 2 problems with the current implementation.

  (1) The algorithm used is highly inefficient.  As far as I can see the code
      implements Algorithm S as described in vol. 2 of Knuth's "The Art of
      Computer Programming" (pp. 142-143).  This algorithm sequentially goes
      over each of the possible data elements deciding whether to accept or
      reject it.  Consequently, it's of o(N) complexity (where N is data size)
      and requires o(N) random variates to be generated.

      Algorithms developed by Jeffrey Scott Vitter in the 1980s allow this to be
      reduced to o(n) where n is sample size.  These are the algorithms I've
      implemented.

  (2) RandomSample requires the whole data (and whole sample) to be loaded into
      memory while you're doing the sampling.  Great if you can do it, but it's
      sometimes prohibitive[1].  Often you don't _need_ the data to be loaded --
      just the index values of the selected data -- and you may not even need
      to store those as a collection.

      Example: you have a very big database and you want to estimate some
      statistics, so you try and take a representative but tractable subsample
      of the data.  Your code could go something like this:

          foreach sample point wanted
          {
              i = selectNextSamplePointIndex;
              data = loadDBRecord(i);
              // now add data to statistical aggregators
              ...
          }
          // and at the end of the loop you have your aggregate statistics but
          // you haven't needed to store either the index or data values of the
          // selected sample points.

So, I would suggest 2 possibilities for improvement, one minor, one more extensive.

The former is just to rewrite RandomSample to use Jeffrey Vitter's Algorithm D 
(nice coincidence:-).  This should be fairly easy and I'm happy to prepare 
patches for this.

The latter would be to separate out the generation of sample index values into a 
new class that relies on knowing only the total number of data points and the 
required sample size.  RandomSample could then make use of this class (in 
practice there could be several possibilities to choose from), but it would also 
be directly available to the user if they want to carry out sampling that does 
not rely on the data being loaded into memory.

My code as stands provides the latter, but as a novice D user I'm not sure that 
it comes up to the standard required for the standard library; there are also 
improvements I'd like to make, which I'm not sure how best to achieve and will 
need advice on.  I'm happy to work and improve it to the extent needed, but may 
need a fair amount of hand-holding along the way (people on the d-learn list 
have already been excellent).

The current code is on GitHub at https://github.com/WebDrake/SampleD and 
implements the sampling algorithm classes together with some testing/demo code. 
 To see the demonstration, just compile & run the code[2].

Anyway, what are everyone's thoughts on this proposal?

Thanks and best wishes,

    -- Joe


-- Footnote --

[1] I was simulating a sparse user-object network such as one might find in an 
online store.  This was a subset of the complete bipartite graph between U users 
and O objects.  Storing that complete graph in program memory would have become 
prohibitive for large values of U and O, e.g. for U, O = 10_000.

[2] As written it takes about 2 minutes to run on my system when compiled with 
GDC (gdmd -O -release -inline), and about 5 minutes when compiled with dmd and 
the same flags.  If runtime turns out to be prohibitive I suggest reducing the 
value of trialRepeats on line 273.
April 17, 2012
Re: Random sampling in Phobos
On 4/14/12 7:39 AM, Joseph Rushton Wakeling wrote:
> In short, I've been implementing some random sampling algorithms in D to
> help teach myself how to use the language effectively:
> https://github.com/WebDrake/SampleD
>
> This is an adaptation of some code I wrote in C a couple of years ago to
> help deal with simulations where I had to take a very large random
> sample from an even larger total[1]. The question came up whether this
> might be useful for the random sampling tools provided in Phobos.

Absolutely. We could and should integrate Vitter's algorithm; the only 
reason it's not there yet is because I didn't have the time.

> Phobos' std.random currently contains a RandomSample struct and
> randomSample wrapper function which output a random subsample of the
> input data. I would identify 2 problems with the current implementation.
>
> (1) The algorithm used is highly inefficient. As far as I can see the code
> implements Algorithm S as described in vol. 2 of Knuth's "The Art of
> Computer Programming" (pp. 142-143). This algorithm sequentially goes
> over each of the possible data elements deciding whether to accept or
> reject it. Consequently, it's of o(N) complexity (where N is data size)
> and requires o(N) random variates to be generated.

That is correct. However the distinction is often academic because the 
cost of generating random numbers is lower than the cost of scanning the 
data, and skipping N steps in the data is comparable to the cost of 
skipping one step N times.

> Algorithms developed by Jeffrey Scott Vitter in the 1980s allow this to be
> reduced to o(n) where n is sample size. These are the algorithms I've
> implemented.
>
> (2) RandomSample requires the whole data (and whole sample) to be loaded
> into
> memory while you're doing the sampling. Great if you can do it, but it's
> sometimes prohibitive[1]. Often you don't _need_ the data to be loaded --
> just the index values of the selected data -- and you may not even need
> to store those as a collection.

Actually that's not correct. RandomSample works fine with an input range 
and does not keep it in memory.

> The former is just to rewrite RandomSample to use Jeffrey Vitter's
> Algorithm D (nice coincidence:-). This should be fairly easy and I'm
> happy to prepare patches for this.

This would be great, but you'd need to show with benchmarks that the 
proposed implementation does better than the extant one for most cases 
that matter.



Thanks,

Andrei
April 17, 2012
Re: Random sampling in Phobos
On 17/04/12 17:31, Andrei Alexandrescu wrote:
> Absolutely. We could and should integrate Vitter's algorithm; the only reason
> it's not there yet is because I didn't have the time.

Then I'll get to work on an implementation.

> That is correct. However the distinction is often academic because the cost of
> generating random numbers is lower than the cost of scanning the data, and
> skipping N steps in the data is comparable to the cost of skipping one step N
> times.

Are we assuming here data with forward access, rather than random access?

My view on the whole setup has probably been skewed by the fact that my sampling 
requirement was based on a setup like,

  (1) generate new sample point index value (doesn't require any data scanning,
      just add skip value to last selected index value);

  (2) do something with that index value.

> Actually that's not correct. RandomSample works fine with an input range and
> does not keep it in memory.

Ahh, OK.  I should have anticipated this as the output also works as a range. 
In that case my "I just need the index values" concern is moot, as I can do,

   randomSample(iota(0,N), n);

I'm starting to appreciate more and more why you emphasize so much the cool-ness 
of ranges in your talks on D. :-)

> This would be great, but you'd need to show with benchmarks that the proposed
> implementation does better than the extant one for most cases that matter.

Well, if you can give me an idea of the cases that matter to you, I'll benchmark 
'em. :-)  In any case, I'll get on with the implementation.

Thanks and best wishes,

    -- Joe
April 18, 2012
Re: Random sampling in Phobos
On 17/04/12 17:31, Andrei Alexandrescu wrote:
> This would be great, but you'd need to show with benchmarks that the proposed
> implementation does better than the extant one for most cases that matter.

OK, I've uploaded a prototype version in a GitHub repo:
git://github.com/WebDrake/RandomSample.git

This isn't a patch against Phobos, but a copy-paste and rename of the 
RandomSample struct/randomSample functions that lets me do some side-by-side 
comparison to current Phobos implementation.  It does need a fully up-to-date 
DMD, druntime and Phobos to compile.

I've put in place a handful of very simple tests just to give an idea of some of 
the speed differences.  The difference probably isn't so great for most trivial 
use-cases, but if you're doing a LOT of sampling, or you're sampling from a very 
large set of records, the difference becomes quite apparent.

I'm not sure what use-cases you have in mind for the benchmarks -- can you give 
me some idea, so I can go ahead and implement them?

Thanks and best wishes,

    -- Joe
April 18, 2012
Re: Random sampling in Phobos
On 17/04/12 18:25, Joseph Rushton Wakeling wrote:
> On 17/04/12 17:31, Andrei Alexandrescu wrote:
>> Actually that's not correct. RandomSample works fine with an input range and
>> does not keep it in memory.
>
> Ahh, OK. I should have anticipated this as the output also works as a range.

A query on this point, as it looks like this can have some unfortunate side-effects.

If I write e.g.

    auto s = randomSample(iota(0,100), 5);

    foreach(uint i; s)
        writeln(i);

    writeln();

    foreach(uint i; s)
        writeln(i);

... then it's noticeable that the _first_ element of the two sample lists is 
identical, while the others change.  If I put in place a specified source of 
randomness, e.g.

    auto s = randomSample(iota(0,100), 5, rndGen);

... then the numbers stay the same for both lists.

I'm presuming that the first element remains identical because it's defined in 
the constructor rather than elsewhere, but it's not clear why passing a defined 
RNG produces identical output on both occasions.

The effect can be worse with Vitter's Algorithm D because often, having defined 
the first element, the second may be derived deterministically -- meaning the 
first 2 elements of the sample are identical.

I'm sure that the above use of the RandomSample struct is not the advised use, 
but it's still disconcerting to see this.
April 18, 2012
Re: Random sampling in Phobos
On Wednesday, 18 April 2012 at 01:17:18 UTC, Joseph Rushton 
Wakeling wrote:
> On 17/04/12 18:25, Joseph Rushton Wakeling wrote:
>> On 17/04/12 17:31, Andrei Alexandrescu wrote:
>>> Actually that's not correct. RandomSample works fine with an 
>>> input range and
>>> does not keep it in memory.
>>
>> Ahh, OK. I should have anticipated this as the output also 
>> works as a range.
>
> A query on this point, as it looks like this can have some 
> unfortunate side-effects.
>
> If I write e.g.
>
>     auto s = randomSample(iota(0,100), 5);
>
>     foreach(uint i; s)
>         writeln(i);
>
>     writeln();
>
>     foreach(uint i; s)
>         writeln(i);
>
> ... then it's noticeable that the _first_ element of the two 
> sample lists is identical, while the others change.  If I put 
> in place a specified source of randomness, e.g.
>
>     auto s = randomSample(iota(0,100), 5, rndGen);
>
> ... then the numbers stay the same for both lists.
>
> I'm presuming that the first element remains identical because 
> it's defined in the constructor rather than elsewhere, but it's 
> not clear why passing a defined RNG produces identical output 
> on both occasions.
>
> The effect can be worse with Vitter's Algorithm D because 
> often, having defined the first element, the second may be 
> derived deterministically -- meaning the first 2 elements of 
> the sample are identical.
>
> I'm sure that the above use of the RandomSample struct is not 
> the advised use, but it's still disconcerting to see this.

I've run into this trap more than once. :)  You have to pass the 
random number generator by ref, otherwise you are just generating 
two identical sequences of random numbers.  Just change the 
randomSample signature to:

    auto randomSampleVitter(R, Random)(R r, size_t n, ref Random 
gen)

-Lars
April 18, 2012
Re: Random sampling in Phobos
On Wednesday, 18 April 2012 at 05:45:11 UTC, Lars T. Kyllingstad 
wrote:
> On Wednesday, 18 April 2012 at 01:17:18 UTC, Joseph Rushton 
> Wakeling wrote:
>> On 17/04/12 18:25, Joseph Rushton Wakeling wrote:
>>> On 17/04/12 17:31, Andrei Alexandrescu wrote:
>>>> Actually that's not correct. RandomSample works fine with an 
>>>> input range and
>>>> does not keep it in memory.
>>>
>>> Ahh, OK. I should have anticipated this as the output also 
>>> works as a range.
>>
>> A query on this point, as it looks like this can have some 
>> unfortunate side-effects.
>>
>> If I write e.g.
>>
>>    auto s = randomSample(iota(0,100), 5);
>>
>>    foreach(uint i; s)
>>        writeln(i);
>>
>>    writeln();
>>
>>    foreach(uint i; s)
>>        writeln(i);
>>
>> ... then it's noticeable that the _first_ element of the two 
>> sample lists is identical, while the others change.  If I put 
>> in place a specified source of randomness, e.g.
>>
>>    auto s = randomSample(iota(0,100), 5, rndGen);
>>
>> ... then the numbers stay the same for both lists.
>>
>> I'm presuming that the first element remains identical because 
>> it's defined in the constructor rather than elsewhere, but 
>> it's not clear why passing a defined RNG produces identical 
>> output on both occasions.
>>
>> The effect can be worse with Vitter's Algorithm D because 
>> often, having defined the first element, the second may be 
>> derived deterministically -- meaning the first 2 elements of 
>> the sample are identical.
>>
>> I'm sure that the above use of the RandomSample struct is not 
>> the advised use, but it's still disconcerting to see this.
>
> I've run into this trap more than once. :)  You have to pass 
> the random number generator by ref, otherwise you are just 
> generating two identical sequences of random numbers.  Just 
> change the randomSample signature to:
>
>     auto randomSampleVitter(R, Random)(R r, size_t n, ref 
> Random gen)
>
> -Lars

Hmmm... I see now that randomSample() is defined without ref in 
Phobos as well.  Would you care to check whether my suggestion 
fixes your problem?  If so, it is a bug in Phobos that should be 
fixed.

-Lars
April 18, 2012
Re: Random sampling in Phobos
On 18/04/12 07:50, Lars T. Kyllingstad wrote:
> Hmmm... I see now that randomSample() is defined without ref in Phobos as well.
> Would you care to check whether my suggestion fixes your problem? If so, it is a
> bug in Phobos that should be fixed.

There's an ongoing discussion on precisely this point:
http://d.puremagic.com/issues/show_bug.cgi?id=7067

As you can see from there, it's non-trivial to just pass the RNG by ref.  I'll 
hold off on a decision on that before I tweak my own code one way or another.

There is also an additional RNG-related bug,
http://d.puremagic.com/issues/show_bug.cgi?id=7936

... which Jerro has nicely provided a fix for:
https://github.com/D-Programming-Language/phobos/pull/542
April 18, 2012
Re: Random sampling in Phobos
> Hmmm... I see now that randomSample() is defined without ref in 
> Phobos as well.  Would you care to check whether my suggestion 
> fixes your problem?  If so, it is a bug in Phobos that should 
> be fixed.
>
> -Lars

It is a bug in Phobos but it isn't related to passing by value
(see the thread in D.learn).

I agree that copying random generators is error prone but
randomSampler taking the generator by ref wouldn't be enough
to fix that problem. The generator gets saved in a member in
a RandomSampler struct, so that member would have to be a
pointer to avoid copying (so RandomSampler would have to be
instantiated with Random* instead of Random). I usually avoid
this kind of problems by using pointers to random generators
instead of random generators everywhere.
Top | Discussion index | About this forum | D home