June 18, 2013
Hello all,

The previous discussion on random ranges -- that is, ranges where popFront() calls a random number generator (real or pseudo) -- seems to have come to a nice conclusion: namely, that the problems identified can be resolved by ensuring that random number generators operate as reference rather than value types.

So now, let's turn to a different set of problems that arise with random numbers and ranges.  In this case I don't have a concrete proposal or solution, but I'd like to present the problems and invite discussion.

Let's take as read that a conversion to reference types has been made (e.g. the attached Mersenne Twister variant).  Now consider the following random range, which generates a sequence of n random numbers u_1, u_2, ..., u_n, uniformly distributed in [0, 1) and also calculates the remainder of n - \sum_{i}u_i.

    struct Remainder(RNG)
        if(isUniformRNG!RNG)
    {
        private real _remaining, _u;
        private size_t _n;
        private RNG _gen;

        this(size_t n, RNG gen)
        {
            assert(n > 0);
            _n = n;
            _remaining = n;
            _gen = gen;
            prime();       // sets value of _u.
        }

        this(typeof(this) source)
        {
            this._n = source._n;
            this._remaining = source._remaining;
            this._gen = source._gen;
            this._u = source._u;
        }

        bool empty() @property pure nothrow const
        {
            return _n == 0;
        }

        real front() @property pure nothrow const
        {
            return _u;
        }

        void popFront()
        {
            assert(_n > 0);
            --_n;
            if(!empty)
            {
                prime();
            }
        }

        typeof(this) save() @property
        {
            auto ret = typeof(this)(this);
            ret._gen = this._gen.save;
            return ret;
        }

        real remainder() @property const pure nothrow
        {
            return _remaining;
        }

        private void prime()
        {
            _u = uniform(0.0L, 1.0L, _gen);
            _remaining -= _u;
        }
    }

    auto remaining()(size_t n)
    {
        return Remainder!Random(n, rndGen);
        /* Caution!! Won't really work with current Phobos
         * as rndGen is value type.  This is how you'd do it
         * with all PRNGs corrected to being reference types. */
    }

    auto remainder(RNG)(size_t n, RNG gen)
        if(isUniformRNG!RNG)
    {
        return Remainder!RNG(n, gen);
    }

So, let's initialize one of these ranges with a reference-type PRNG, and see what comes out if it.

    auto gen = new MtClass19937(unpredictableSeed);
    auto r = remainder(5, gen);
    writeln(r);
    writeln(r);

... which gives us for example:

    [0.686448, 0.206227, 0.331488, 0.558315, 0.255811]
    [0.686448, 0.596189, 0.602239, 0.704929, 0.740741]

What do we see?  Well, the two different readings from the range are different -- as they should be (see previous discussion) -- but _not_ statistically independent: the first value is identical.  This will be the case no matter how many times we consume the range, because the very first value of .front is determined in the constructor:

    this(size_t n, RNG gen)
    {
        assert(n > 0);
        _n = n;
        _remaining = n;
        _gen = gen;
        prime();    // <------ sets first value of _u;
    }

Obviously this is an unacceptable violation of statistical independence.  The issue here is that the initial value of .front shouldn't actually be determined until we start consuming the range.  So for example we might add a boolean variable to control this, and rewrite the constructor and front:

    private bool _first;

    this(size_t n, RNG gen)
    {
        assert(n > 0);
        _n = n;
        _remaining = n;
        _gen = gen;
        _first = true;
    }

    real front() @property
    {
        if(_first)
        {
            _first = false;
            prime();
        }
        return _u;
    }

Now if we do the same as before,

    auto gen = new MtClass19937(unpredictableSeed);
    auto r = remainder(5, gen);
    writeln(r);
    writeln(r);

... we get statistically independent sequences as desired, e.g.:

    [0.745698, 0.737766, 0.126828, 0.00599556, 0.180587]
    [0.248489, 0.357781, 0.848302, 0.418882, 0.160556]

On the other hand, we have some annoyances.  First, it means that our .front property can no longer be pure, const or nothrow.  Ouch!

Secondly, there are some rather more subtle issues that emerge.  First, let's consider the .remaining property.

    auto gen = new MtClass19937(unpredictableSeed);
    auto r = remainder(5, gen);
    assert(5 - r.front == r.remaining);

This passes, as we expect.  But suppose we do instead,

    auto gen = new MtClass19937(unpredictableSeed);
    auto r = remainder(5, gen);
    assert(r.remaining == 5 - r.front);

Now we get an assertion failure -- because if we call .remaining before the first call to .front, then .remaining will have the value it was given in the constructor (i.e., 5) whereas 5 - r.front will initialize .front and -- too late -- correspondingly update the value of .remaining.

OK, another correction should fix that:

    real remaining() @property
    {
        if(_first)
        {
            _first = false;
            prime();
        }
        return _remaining;
    }

... and sure enough, now the assert(r.remaining == 5 - r.front) passes.  We
should probably also put a similar check into popFront():

    void popFront()
    {
        assert(_n > 0);
        --_n;
        if(!empty)
        {
            if(_first)
            {
                _first = false;
                prime();
            }
            prime();
        }
    }

This last one might seem superfluous for the current example, but there is a real and serious issue with it in the current std.random.RandomSample: http://d.puremagic.com/issues/show_bug.cgi?id=10322#c3

Anyway, with these checks in place, let's try some alternative little bits of code:

    auto gen = new MtClass19937(unpredictableSeed);

    auto r1 = remainder(5, gen);
    writeln(r1);
    writeln(r1);
    writeln;

... gives us two independent sequences as we'd expect:

    [0.712801, 0.869297, 0.984699, 0.921241, 0.942269]
    [0.103303, 0.463341, 0.266899, 0.773202, 0.92338]

but this:

    auto r2 = remainder(5, gen);
    r2.front;
    writeln(r2);
    writeln(r2);
    writeln;

... gives us two sequences with the first entry the same,

    [0.516372, 0.551643, 0.935778, 0.0585611, 0.128367]
    [0.516372, 0.352255, 0.736399, 0.735018, 0.442041]

... as does this:

    auto r3 = remainder(5, gen);
    r3.remaining;
    writeln(r3);
    writeln(r3);
    writeln;

In other words, by reading from .front or .remaining, we've fixed the value of front for all time and killed the statistical independence of different readings from the range.

Something similar happens with popFront():

    auto r4 = remainder(5, gen);
    r4.popFront();
    writeln(r4);
    writeln(r4);

... which gives a sequence of 4 values with the first fixed.  We can correct
this by tweaking how popFront() relates to prime():

    void popFront()
    {
        assert(_n > 0);
        --_n;
        if(!empty)
        {
            if(_first)
            {
                prime();
            }
            _first = true;
        }
    }

... which means that each successive value of .front is calculated in .front
itself (or in .remaining) at the time of first access; popFront() only calls
prime() if the current value of .front is not yet initialized.

However, we remain with the following issues.

    * We can't set the initial value of .front in the constructor, because it
      will destroy statistical independence.

    * We can initialize the value of .front in .front itself upon the first
      call, and continue to do so each time .front is popped.  But this kills
      some of the nice attributes we'd like to apply to .front (const, pure,
      nothrow), and we also have to take care of other functions/properties
      that rely on .front being initialized in order to return correct values.

    * If we follow the initialize-on-first-call approach described here, then
      we have created a new way to kill statistical independence -- namely, by
      accessing .front or one of the other corresponding properties before we
      start iterating across the range.

A concrete existing example of this is std.random.RandomSample, where .front cannot be initialized in the constructor, and another property (.index) depends on .front being initialized: see, http://d.puremagic.com/issues/show_bug.cgi?id=10322

As I said, I don't have a firm sense of how to resolve this.  I'm inclining towards saying, "Just recognize that calling .front or similar properties will fix .front's value, and don't do that unless you mean to."  But I'd appreciate others' input and thoughts.

A "final" copy of the example code, with all the stated tweaks, is attached if anybody wants to do any testing or suggest any solutions.