Jump to page: 1 2
Thread overview
Sampling algorithms for D
Apr 12, 2012
Dmitry Olshansky
Apr 12, 2012
bearophile
Apr 12, 2012
bearophile
Apr 12, 2012
bearophile
Apr 13, 2012
bearophile
Apr 13, 2012
Ali Çehreli
Apr 13, 2012
Dmitry Olshansky
Apr 13, 2012
bearophile
April 12, 2012
Hello all,

My programming experience is mostly in C with some C++, so much of the programming style of D is not familiar to me.  With that in mind, I thought I'd set up a little pet project to try and learn good D programming style.

What I thought I'd do is implement some clever algorithms for random sampling which I've already done in a C based on the GNU Scientific Library.

The D implementation is (in very primitive form) at:
https://github.com/WebDrake/SampleD

... and I'm using GDC 4.6.3 on Ubuntu 12.04 as my compiler.

The basic challenge here is that, given a collection of N records, we want to take a sample of size n < N such that the chance of any given record being picked is equal.  The algorithms implemented are based on the notion of generating a series of random "skips" across the sequence of records, picking the records you land on.

The idea is that one initializes the algorithm by giving the total number of records and the sample size required.  The sampler provides a method, select(), which when called picks the next entry in the sample, up to the total number required; if you keep calling it when the sample is complete, it throws an error.

Consequently, these algorithms don't store the entire sample, or the collection of records (which are just assumed to be labelled 0, 1, 2, ..., N-1).  Rather, it just outputs, one by one as called upon, the indices of the selected records.

So, introduction away, here are the main questions I came up with in creating the code.

(1) Effective handling of random numbers.
----
The design concept was for the sampler classes to use a specified uniform random number generator if provided, otherwise to use rndGen.  This turned out to be trickier than anticipated, and I'm not very happy with the current solution.

My first attempt just templatized class methods select() and skip().  However, one of the algorithms (class VitterD) requires random initialization as well, and trying to template this() ran into this bug: http://d.puremagic.com/issues/show_bug.cgi?id=435

I'm really not too happy with turning the whole class into a template just for the sake of the random number generator, so can anyone suggest an effective way to resolve the problem?

(I may be biased here by C/GSL experience, where I've got used to passing the RNG as a pointer, meaning runtime polymorphism.)

(2) init() function?
----
I was unsure whether it would be good design practice for the sampler classes to be throwaway (1 instance = 1 use) or to put in place an init() function that would reset it with new record and sample sizes.  Any thoughts?

(3) Uniform random number on (0,1)
----
The algorithms' specification explicitly refers to uniform random numbers on the open interval, which I take to mean (0,1) i.e. excluding zero.  Phobos currently provides only a half-open uniform distribution on [a,b).

I've implemented a simple uniform_open() function which should return a uniform distribution on (a,b).  The question is, are there any plans for an open-interval uniform distribution in Phobos?  Can I rely on this functionality being provided in the long term in D's standard library?

(4) Truncation
----
The code uses C's trunc() function to return the integer part of a floating point number.  D doesn't seem to like it very much when I then set a size_t as equal to the result of trunc(), and I have to use a type cast.  Is there a more elegant D-ish way of achieving the same result?  roundTo() isn't adequate, as it rounds to the _nearest_ integer rather than the integer part.

(5) ... any general stylistic comments? :-)

Thanks and best wishes,

    -- Joe
April 12, 2012
On 12.04.2012 18:45, Joseph Rushton Wakeling wrote:
> Hello all,

Hello there,

[snip]

>
> So, introduction away, here are the main questions I came up with in
> creating the code.
>
> (1) Effective handling of random numbers.
> ----
> The design concept was for the sampler classes to use a specified
> uniform random number generator if provided, otherwise to use rndGen.
> This turned out to be trickier than anticipated, and I'm not very happy
> with the current solution.
>
> My first attempt just templatized class methods select() and skip().
> However, one of the algorithms (class VitterD) requires random
> initialization as well, and trying to template this() ran into this bug:
> http://d.puremagic.com/issues/show_bug.cgi?id=435
>
> I'm really not too happy with turning the whole class into a template
> just for the sake of the random number generator, so can anyone suggest
> an effective way to resolve the problem?
>
> (I may be biased here by C/GSL experience, where I've got used to
> passing the RNG as a pointer, meaning runtime polymorphism.)

I think that if you like function pointer thing, you'd love D's delegates and lambdas. If templates are too pervasive it might be better just use little bit of dynamic stuff.
e.g.

Mt19937 gen;
double delegate() rng = (){
	auto x = gen.front;
	gen.popFron();
	return x;
}

use it somewhere later on:
writeln(rng(), rng()); // print 2 random numbers

>
> (2) init() function?
> ----
> I was unsure whether it would be good design practice for the sampler
> classes to be throwaway (1 instance = 1 use) or to put in place an
> init() function that would reset it with new record and sample sizes.
> Any thoughts?

Use structs? They are cheap throwway value types allocated on stack (by default).

>
> (3) Uniform random number on (0,1)
> ----
> The algorithms' specification explicitly refers to uniform random
> numbers on the open interval, which I take to mean (0,1) i.e. excluding
> zero. Phobos currently provides only a half-open uniform distribution on
> [a,b).
>
> I've implemented a simple uniform_open() function which should return a
> uniform distribution on (a,b). The question is, are there any plans for
> an open-interval uniform distribution in Phobos? Can I rely on this
> functionality being provided in the long term in D's standard library?
>
> (4) Truncation
> ----
> The code uses C's trunc() function to return the integer part of a
> floating point number. D doesn't seem to like it very much when I then
> set a size_t as equal to the result of trunc(), and I have to use a type
> cast. Is there a more elegant D-ish way of achieving the same result?
> roundTo() isn't adequate, as it rounds to the _nearest_ integer rather
> than the integer part.
>

It's OK. there are no implicit conversions of FP--> integer in the language I think.

> (5) ... any general stylistic comments? :-)
>

Looks fine ;)

> Thanks and best wishes,
>
> -- Joe


-- 
Dmitry Olshansky
April 12, 2012
Hi Dmitry,

Thanks for your thoughts and for pointing me at some aspects of D I'd not come across before.

> I think that if you like function pointer thing, you'd love D's delegates and
> lambdas. If templates are too pervasive it might be better just use little bit
> of dynamic stuff.

I was not talking about function pointers.  In GSL there is a struct type, gsl_rng, which contains the RNG data; it's this that you pass around.

So e.g. in a C implementation I might have a function like

  size_t sampler_select(sampler *s, gsl_rng *r) {
      x = gsl_rng_uniform(r);
      // do something with x ...
  }

What I would like is for the sampler classes to be able to take an arbitrary generator as input, or to use rndGen if one is not provided, analogous to what the uniform() function in D's std.random does.

The current implementation of my classes seems imperfect to me (it ties the class to using only one type of RNG, it means the whole class has to become templated to the RNG type, ...).

> Use structs? They are cheap throwway value types allocated on stack (by default).

It's not a matter of cheap/expensive, just a matter of good design style (I seem to recall seeing somewhere that init() functions were frowned upon).  But if there's no objection, I'll add an init function.

> It's OK. there are no implicit conversions of FP--> integer in the language I
> think.

Fine, then. :-)

>> (5) ... any general stylistic comments? :-)
>>
>
> Looks fine ;)

:-)
April 12, 2012
Joseph Rushton Wakeling:

> https://github.com/WebDrake/SampleD

Some comments on the details of your code:

> import std.c.time;

In D there there is also the syntax:
> import std.c.time: foo, bar, baz;

That tells the person that reads the code what names are used.

----------------------

> interface Sampler(UniformRandomNumberGenerator) {

The "UniformRNG" name induces shorter lines in your module and I think it's easy enough to understand.

----------------------

> double uniform_open(UniformRandomNumberGenerator)
> (double a, double b, ref UniformRandomNumberGenerator urng)
> {

In D function/method names are pascalCased. I also suggest to indent the second line, and unless you plan to modifying a and b inside that, to always mark them as "in" (or const, or even immutable):

double uniformOpen(UniformRNG)
                  (double a, double b, ref UniformRNG urng)
{

----------------------

>    do {
>        x = uniform(a,b,urng);
>    } while (x==a);

I suggest to put a space after a comma and before and after an operator:

    do {
        x = uniform(a, b, urng);
    } while (x == a);

----------------------

> class Skipper(UniformRandomNumberGenerator) : Sampler!(UniformRandomNumberGenerator) {

When there is only one template argument you are allowed to omit the (), and generally in English there is no need for a space before a ":":

class Skipper(UniformRNG): Sampler!UniformRNG {

----------------------

>        size_t S = skip(urng);
>
>        sampleRemaining--;
>        recordsRemaining -= (S+1);
>        currentRecord += S;

Unless you plan to modify a variable, then on default define it as immutable, and where that's not possible, as const, this avoids mistakes later, makes the code simpler to understand, and helps the compiler too:

        immutable size_t S = skip(urng);

        sampleRemaining--;
        recordsRemaining -= S + 1;
        currentRecord += S;

----------------------

>     final size_t records_remaining() { return recordsRemaining; }
>     final size_t records_total() { return recordsTotal; }

Generally where possible add "const", "pure" and "nothrow" attributes to functions/methods every where you can (and in some cases @safe too, if you want):

     final size_t records_remaining() const pure nothrow { return recordsRemaining; }
     final size_t records_total() const pure nothrow { return recordsTotal; }

----------------------

> protected:
>     size_t currentRecord = 0;
>     size_t recordsRemaining;

All D variables are initialized (unless you ask them to not be initialized, using "= void"), so the first "= 0" is not needed (but initializing D floating point values to zero is often necessary):

protected:
    size_t currentRecord;
    size_t recordsRemaining;

----------------------

>                 y1 = pow( (uniform_open(0.0,1.0, urng) * (cast(double) recordsRemaining) / qu1),
>                           (1.0/(sampleRemaining - 1)) );

In D there is also the ^^ operator.

----------------------

>                     for( t=recordsRemaining-1; t>=limit; --t)
>                         y2 *= top--/bottom--;

Generally packing mutation of variables inside expressions is quite bad style. It makes code less easy to understand and translate, and currently it's not defined, just as in C (it will be defined, but it will keep being bad style).

                    for (t = recordsRemaining - 1; t >= limit; t--) {
                        y2 *= top / bottom;
                        top--;
                        bottom--;
                    }


Also in D there is (I hope it will not be deprecated) foreach_reverse(), maybe to replace this for().

----------------------

>     writeln("Hello!  I'm a very simple and stupid implementation of some clever");
>     writeln("sampling algorithms.");
>     writeln();
>     writeln("What I'm going to do first is just take a sample of 5 records out of");
>     writeln("100 using each of the algorithms.");
>     writeln();

D string literals are multi-line too:

writeln(
"Hello!  I'm a very simple and stupid implementation of some clever
sampling algorithms.

What I'm going to do first is just take a sample of 5 records out of
100 using each of the algorithms.
");

----------------------

>     sampling_test_simple!(VitterA!Random,Random)(100,5,urng);

Currently your code doesn't work if you want to use a Xorshift generator.

----------------------

>     sampling_test_aggregate!(VitterA!Random,Random)(100,5,urng,10000000,true);
>     writeln();
>     sampling_test_aggregate!(VitterD!Random,Random)(100,5,urng,10000000,true);
>     writeln();

I suggest to define one enum inside the main(), like this, with underscores too, to improve readability, and use it in all the function calls:

    enum int N = 10_000_000;
    samplingTestAggregate!(VitterA!Random, Random)(100, 5, urng, N, true);

Bye,
bearophile
April 12, 2012
On 12/04/12 21:54, bearophile wrote:
> Some comments on the details of your code:

Thanks ever so much for the extensive review.

>> import std.c.time;
>
> In D there there is also the syntax:
>> import std.c.time: foo, bar, baz;
>
> That tells the person that reads the code what names are used.

Is this advised for all D modules, or just for stuff using the C standard library?

> Generally where possible add "const", "pure" and "nothrow" attributes to functions/methods every where you can (and in some cases @safe too, if you want):
>
>       final size_t records_remaining() const pure nothrow { return recordsRemaining; }
>       final size_t records_total() const pure nothrow { return recordsTotal; }

Whether I can do this would depend on whether I put in place a re-initialization function to allow the sampler to be reset with new values, no?

> All D variables are initialized (unless you ask them to not be initialized, using "= void"), so the first "= 0" is not needed (but initializing D floating point values to zero is often necessary):

I want to be _certain_ that value is zero. ;-)

> In D there is also the ^^ operator.

Ahh, another thing I once knew but had forgotten since the last time I looked at D.

>>                      for( t=recordsRemaining-1; t>=limit; --t)
>>                          y2 *= top--/bottom--;
>
> Generally packing mutation of variables inside expressions is quite bad style. It makes code less easy to understand and translate, and currently it's not defined, just as in C (it will be defined, but it will keep being bad style).

OK.  There's one other that should be a point of concern, where I have

   return currentRecord++;

... in one function; I've added a selectedRecord variable to get round this.

      final size_t select(ref UniformRNG urng)
      {
            assert(_recordsRemaining > 0);
            assert(_sampleRemaining > 0);

            immutable size_t S = skip(urng);
            immutable size_t selectedRecord = _currentRecord + S;

            _sampleRemaining--;
            _recordsRemaining -= (S+1);
            _currentRecord += (S+1);

            return selectedRecord;
      }

> Also in D there is (I hope it will not be deprecated) foreach_reverse(), maybe to replace this for().

I'll see what I can do with that.  There may be a better way of expressing this anyway -- as is it's pretty much a literal transcription of the Pascal-esque pseudo-code in the original research paper describing the algorithm.

>>      sampling_test_simple!(VitterA!Random,Random)(100,5,urng);
>
> Currently your code doesn't work if you want to use a Xorshift generator.

Ack, enabling that is precisely why I bothered templatizing it in the first place.  Can you suggest a fix ... ?  I don't understand why as-is it wouldn't work.

> ----------------------
>
>>      sampling_test_aggregate!(VitterA!Random,Random)(100,5,urng,10000000,true);
>>      writeln();
>>      sampling_test_aggregate!(VitterD!Random,Random)(100,5,urng,10000000,true);
>>      writeln();
>
> I suggest to define one enum inside the main(), like this, with underscores too, to improve readability, and use it in all the function calls:
>
>      enum int N = 10_000_000;
>      samplingTestAggregate!(VitterA!Random, Random)(100, 5, urng, N, true);

Fair enough.  I'd forgotten about the underscore notation for large numbers. This stuff I'm not too worried about, because it's only a stupid demo thing, not the important code ... :-)

Thanks again for the careful review, it's been very useful.

Best wishes,

    -- Joe

April 12, 2012
On 12/04/12 21:54, bearophile wrote:
>>      sampling_test_simple!(VitterA!Random,Random)(100,5,urng);
>
> Currently your code doesn't work if you want to use a Xorshift generator.

Ahhh, I see what you mean now -- the sampler classes are fine, but the way the main() function is written means you can't just tweak the RNG.

The original code I tried writing was something like,

    sampling_test_simple!(VitterA, Random)(100, 5, urng);

... with the sampler object being created in the function with,

    auto s = new SamplerType!UniformRNG(records, samples, urng);

... but the compiler objected to that with the error,

    sampled.d:216: Error: template instance
    sampling_test_simple!(VitterA,XorshiftEngine!(uint,128,11,8,19))
    does not match template declaration
    sampling_test_simple(SamplerType,UniformRNG)

Here's the complete (wrong) function:

void sampling_test_simple(SamplerType, UniformRNG)
                         (size_t records, size_t samples, ref UniformRNG urng)
{
      auto s = new SamplerType!UniformRNG(records,samples,urng);
      while(s.sampleRemaining > 0) {
            write("\trecord selected: ", s.select(urng), ".");
            write("\trecords remaining: ", s.recordsRemaining, ".");
            writeln("\tstill to sample: ", s.sampleRemaining, ".");
      }
}

It's not clear to me why this fails or what I can do that would equate to this.
April 12, 2012
Joseph Rushton Wakeling:

> Thanks ever so much for the extensive review.

They are shallow comments, mostly about syntax, etc. So writing them requires little time.


> Is this advised for all D modules, or just for stuff using the C standard library?

I have said "there is also the syntax" instead "it's better to do this" because some D programmers don't like to do that. You have to be aware that syntax exists, and that's it's better to write:
import std.foo: bar, baz;
Than:
import std.foo; // for bar and baz functions

then personally I sometimes like to qualify my imports (writing what I import from them), for various reasons, but it also increases the amount of work to do when you want to modify the code. So it's a personal choice.
Also, the more important is for you to not put bugs in a specific module, the more "tight"/strongly statically typed is better to write the code (like Ada language code). If for you the correctness of a specific module is not very important, then using a lazier coding style is acceptable :-) If you write code lot of money/hardware or even people lives will depend on, you have to adopt a progressively more and more error-proof style of coding, eventually even using software to proof the correctness of critical routines.


> Whether I can do this would depend on

It depends on many things, including DMD bugs, D language design faults, Phobos incomplete coverage of those annotations, and of course on the semantics of your code, etc.


> I want to be _certain_ that value is zero. ;-)

Then assign on more time, to be really sure :-)
int x = 0;
x = 0; // It must be ZERO, dammit.


>        final size_t select(ref UniformRNG urng)
>        {
>              assert(_recordsRemaining > 0);
>              assert(_sampleRemaining > 0);

Probably it's better to move those asserts in preconditions/postconditions or in class/struct invariants.
And I think your code needs some unittests{} too.


> Ack, enabling that is precisely why I bothered templatizing it in the first place.  Can you suggest a fix ... ?  I don't understand why as-is it wouldn't work.

If you try to use a Xorshift the compiler will tell you the problems, I think with some work you will be able to fix them and create a more flexible module:

void main() {
    auto urng = Xorshift(23);


> This stuff I'm not too worried about, because it's only a stupid demo thing, not the important code ... :-)

In my opinion demo code is a bit like unittests. And unittests require the same coding precision as the rest of the code :-)

Bye,
bearophile
April 12, 2012
On 13/04/12 00:48, bearophile wrote:
>> final size_t select(ref UniformRNG urng)
>> {
>> assert(_recordsRemaining > 0);
>> assert(_sampleRemaining > 0);
>
> Probably it's better to move those asserts in preconditions/postconditions or in
> class/struct invariants.

Those asserts are deliberately intended for the function.  To see why it matters, consider setting up an instance of the class to select a sample of size 5, and then calling select() 6 times.

Of course, there may be a better way of handling that which I'm not aware of :-)

> And I think your code needs some unittests{} too.

Agreed.

>> Ack, enabling that is precisely why I bothered templatizing it in the first
>> place. Can you suggest a fix ... ? I don't understand why as-is it wouldn't work.
>
> If you try to use a Xorshift the compiler will tell you the problems, I think
> with some work you will be able to fix them and create a more flexible module:
>
> void main() {
> auto urng = Xorshift(23);

I realized afterwards what you were talking about, but am not sure how to fix (cf. my other email).  My first coding attempt was designed to be more flexible but the compiler didn't like it. ;-)

> In my opinion demo code is a bit like unittests. And unittests require the same
> coding precision as the rest of the code :-)

Agree :-)  They are imperfect tests for now, IMO, which is why I'm being lazy...

April 12, 2012
Joseph Rushton Wakeling:

> >> final size_t select(ref UniformRNG urng)
> >> {
> >> assert(_recordsRemaining > 0);
> >> assert(_sampleRemaining > 0);
> >
> > Probably it's better to move those asserts in preconditions/postconditions or in class/struct invariants.
> 
> Those asserts are deliberately intended for the function.

Then use a function/method precondition:

final size_t select(ref UniformRNG urng)
in {
    assert(_recordsRemaining > 0);
    assert(_sampleRemaining > 0);
} body {
    ...
}

Bye,
bearophile
April 13, 2012
On 13.04.2012 1:48, Joseph Rushton Wakeling wrote:
> On 12/04/12 21:54, bearophile wrote:

>>> for( t=recordsRemaining-1; t>=limit; --t)
>>> y2 *= top--/bottom--;
>>
>> Generally packing mutation of variables inside expressions is quite
>> bad style. It makes code less easy to understand and translate, and
>> currently it's not defined, just as in C (it will be defined, but it
>> will keep being bad style).
>
> OK. There's one other that should be a point of concern, where I have
>
> return currentRecord++;
>
> ... in one function; I've added a selectedRecord variable to get round
> this.

I believe it's something that reasonable people may disagree on. To me it's perfectly easy to see what return x++; does. So is arr[i++] = ..., so is arr1[i++] = arr2[j++]. But the downward loops look hairy all the time ;)

>
>> Also in D there is (I hope it will not be deprecated)
>> foreach_reverse(), maybe to replace this for().
>

or use some std.range primitives ( I think iota does a [begin, end) range)
foreach( x ; iota(recordRemaining-1, limit+1, -1)){
	y2 *= top--/bottom--;
}

[snip]


-- 
Dmitry Olshansky
« First   ‹ Prev
1 2