Thread overview
Re: Random sampling in Phobos
Update on random sampling in Phobos
April 17, 2012
Hello,

Just checking, since my earlier message on this subject[1] seems to have been lost in the ether.

I've been looking at std.random's implementation of random sampling, and I think there are some improvements that can be made.

The simple way is just to tweak the RandomSample struct to use a more sophisticated algorithm that scales with sample size instead of input data size.  I think it should be fairly easy to provide a patch based on the code I've already written.

The more complicated way would be to separate out the random selection process into several different structs/classes, so that it's possible to do sampling without having to have the input data actually loaded in memory.  Again, I think I can provide code for this, but I'd like to discuss the design and requirements before jumping in.

Anyone interested? :-)

Thanks and best wishes,

    -- Joe

--------
[1] http://forum.dlang.org/thread/mailman.1737.1334433366.4860.digitalmars-d@puremagic.com
April 17, 2012
On 4/17/12 10:26 AM, Joseph Rushton Wakeling wrote:
> Hello,
>
> Just checking, since my earlier message on this subject[1] seems to have
> been lost in the ether.

Actually mine has. I replied the day you posted. Let me check.

Andrei

April 21, 2012
Hello all,

An update on the proposed replacement for Phobos' RandomSample struct.

A demo copy of the code is on GitHub at:
https://github.com/WebDrake/RandomSample

This is a little standalone D file that can be compiled and run and which goes through several simple benchmark comparisons of the current Phobos RandomSample compared to this new version based on J.S. Vitter's superior algorithm.

General conclusions:

   * The speed difference is quite apparent whenever there is a significant
     difference (2+ orders of magnitude) between the sample and data sizes.
     It's apparent, but less marked, for smaller ratios.

   * For many practical cases (e.g. if you want one sample from a not very
     big dataset) the difference would not be noticed.  However, if you are
     doing a lot of sampling, or if you are sampling from a very large dataset,
     the difference can matter.

   * Where data records have to be read from an external source, e.g. a file,
     the time needed to do this overwhelms any difference in sampling speed.
     The new algorithm is still faster, but you don't really notice the
     difference.

If there's a desire for more benchmarks or other demonstrations of the code's effectiveness, please suggest what they might be and I'll go ahead and implement them[*].

If on the basis of the above you'd like a patch for Phobos proper, let me know and I'll go ahead and prepare one.  Note that before I do that, I'd like a decision on Jerro's tweak to the randomSample function:
https://github.com/D-Programming-Language/phobos/pull/542

... which is designed to fix bug 7936:
http://d.puremagic.com/issues/show_bug.cgi?id=7936

Thanks and best wishes,

    -- Joe

-------
[*] Just checking -- no more emails on this topic have got lost in the ether, have they?