Ranges and random numbers -- again

June 17, 2013
Hello all,

I think my last post on this topic may have been too convoluted, so here's another go.  The goal here is to try and derive some well-defined strategies for designing ranges that make use of random number generators.

The purpose of this first email is to justify what I think should be the first basic rule of random ranges [that is, ranges where popFront() makes use of random or pseudo-random numbers to generate the new value of front()].  That rule is:

    **************************************************************************
    * Reading multiple times from the start of the same random range, should *
    * produce different (and statistically independent) results each time.   *
    **************************************************************************

To see why, let's start by considering a simple input range that provides an infinite stream of random numbers in the range [min, max).

    import std.random : isUniformRNG, uniform;

    struct SimpleRandomRange(RNG = void)
        if (isUniformRNG!RNG || is(RNG == void))
    {
        private real _min, _max, _u;

        static if (!is(RNG == void))
        {
            private RNG _gen;

            this(real min, real max, RNG gen)
            {
                _min = min;
                _max = max;
                _gen = gen;
                _u = uniform(_min, _max, _gen);
            }
        }
        else
        {
            this(real min, real max)
            {
                _min = min;
                _max = max;
                _u = uniform(_min, _max);
            }
        }

        immutable bool empty = false;

        real front() @property pure nothrow const
        {
            return _u;
        }

        void popFront()
        {
            static if (is(RNG == void))
                _u = uniform(_min, _max);
            else
                _u = uniform(_min, _max, _gen);
        }
    }

    auto simpleRandomRange()(real min, real max)
    {
        return SimpleRandomRange!void(min, max);
    }

    auto simpleRandomRange(RNG)(real min, real max, RNG gen)
        if (isUniformRNG!RNG)
    {
        return SimpleRandomRange!RNG(min, max, gen);
    }


Let's consider the basics of how this works.  If we don't specify a generator,

    auto r1 = simpleRandomRange(0.0, 1.0);

... then the call to uniform() will make use of the thread-local default random number generator, rndGen.  If we do specify an RNG:

    Random gen;
    gen.seed(unpredictableSeed);
    auto r2 = simpleRandomRange(0.0, 1.0, gen);

... then SimpleRandomRange will store a copy of it to use when generating the random numbers.

So, what happens when we try and read numbers from this?  Let's try and generate some output from this range:

    auto r1 = simpleRandomRange(0.0L, 1.0L);

    auto take1 = r1.take(5);
    auto take2 = r1.take(5);

    writeln(take1);
    writeln(take1);
    writeln(take2);
    writeln(take2);

... which might give us for example:

    [0.816207, 0.40483, 0.638285, 0.00737022, 0.467244]
    [0.816207, 0.902301, 0.0961124, 0.611573, 0.579579]
    [0.816207, 0.788726, 0.614613, 0.613375, 0.136722]
    [0.816207, 0.0942176, 0.894744, 0.751959, 0.77816]

We notice immediately that these sequences are all different, except for the very first number, which is always identical.  What's more, it's identical for each of the two 'takes'.

What's going on here?  Well, the first value is being set in the constructor; subsequent values are being set by calls to the global RNG rndGen, and because its state is being updated, that affects all subsequent values generated.

So, it means that

    (1) Different 'takes' from the same range will be different;

    (2) Reading twice from the same 'take' will also generate a different
        sequence.

That in itself is not so bad -- but the always-the-same first value is a problem, as it will bias any statistics we generate.

What happens instead if we specify a generator?

    Random gen;
    gen.seed(unpredictableSeed);
    auto r2 = simpleRandomRange(0.0L, 1.0L, gen);

    auto take3 = r2.take(5);
    writeln(take3);
    writeln(take3);

... then we may get for example:

    [0.833161, 0.11342, 0.872063, 0.661881, 0.039937]
    [0.833161, 0.11342, 0.872063, 0.661881, 0.039937]

i.e., two identical sequences.  Superficially this looks nice -- a random sequence that is consistent within the same 'take'.  But now let's take another 5 numbers:

    auto take4 = r2.take(5);
    writeln(take4);
    writeln(take4);

... and we get again:

    [0.833161, 0.11342, 0.872063, 0.661881, 0.039937]
    [0.833161, 0.11342, 0.872063, 0.661881, 0.039937]

Why is this the same?  When we pass gen to the constructor of SimpleRandomRange, it copies it by value -- meaning that updates to the internally-stored _gen don't propagate back to the source RNG.  This is extremely undesirable, because it means that we can generate very strong correlations between what ought to be independent random sequences.
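
To make the copy-by-value behaviour concrete, here is a minimal standalone sketch (mine, not taken from the original post) showing that a struct copy of a generator replays the same stream rather than sharing state:

```d
import std.random;

void main()
{
    auto gen = Random(42);
    auto copy = gen;                   // struct copy: the entire state is duplicated
    auto a = uniform(0.0, 1.0, gen);   // advances gen, but not copy
    auto b = uniform(0.0, 1.0, copy);  // copy still starts from the old state
    assert(a == b);                    // two supposedly independent draws coincide
}
```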

How can we resolve this dilemma?  One thought would be to ensure that any RNG stored inside SimpleRandomRange is seeded unpredictably.  Manually, without changing SimpleRandomRange, we can do this by making sure that we do something like:

    auto r3 = simpleRandomRange(0.0L, 1.0L, Random(unpredictableSeed));
    auto take5 = r3.take(5);

    writeln(take5);
    writeln(take5);

    auto take6 = r3.take(5);
    writeln(take6);
    writeln(take6);

... but in fact this again produces identical output all round.

The other alternative is for random number generators to be redefined as reference types.  A simple experiment is to redefine MersenneTwisterEngine as a final class rather than a struct [N.B. this is probably _not_ the optimal way to turn RNGs into reference types, but is quick and easy to try out!]:

    auto gen2 = new MtClass19937(unpredictableSeed);
    auto r5 = simpleRandomRange(0.0L, 1.0L, gen2);
    auto take9 = r5.take(5);
    auto take10 = r5.take(5);

    writeln(take9);
    writeln(take9);
    writeln(take10);
    writeln(take10);

... which gives us output similar to the original example using rndGen:

    [0.844254, 0.136284, 0.0805684, 0.351376, 0.830413]
    [0.844254, 0.863786, 0.983373, 0.389967, 0.257907]
    [0.844254, 0.825822, 0.176731, 0.397709, 0.470349]
    [0.844254, 0.349027, 0.118591, 0.707107, 0.893357]

i.e., an identical first value, but otherwise different.
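
For reference, a rough sketch of what such a class-based generator might look like (hypothetical; the actual experiment redefined MersenneTwisterEngine itself as a final class, whereas this merely wraps Mt19937):

```d
import std.random;

// Hypothetical reference-type wrapper: all copies of a MtClass19937
// reference share one underlying generator state.
final class MtClass19937
{
    private Mt19937 impl;

    enum bool isUniformRandom = true;   // advertise as a uniform RNG
    enum bool empty = false;            // infinite range of random numbers

    this(uint seed) { impl = Mt19937(seed); }

    @property uint front() { return impl.front; }
    void popFront() { impl.popFront(); }

    @property uint min() { return Mt19937.min; }
    @property uint max() { return Mt19937.max; }
}
```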

One might object that the previous example -- the one where simpleRandomRange was passed Random(unpredictableSeed) -- offers a superior outcome.  After all, it generates a reproducible pseudo-random sequence that (theoretically) shouldn't generate correlations with other (pseudo-)random activity in the program.  The fact that two different .take commands generate identical output is a feature, not a problem -- each time you're reading from the start of the range, so you'd expect to get the same values!

To this I'd offer three responses.  The first is that enforcing an unpredictable seed for the random range (in effect, making it operate like an independent thread in terms of its internal RNG) makes things difficult to debug: for debugging purposes, you want to be able to use predictable seeds for streams of pseudo-random numbers.  The second is that a multiplicity of separate RNGs risks undermining statistical reliability: it's difficult to be certain that two different "unpredictable" seeds won't in fact produce correlated sequences, and this becomes harder still the more separately seeded RNGs you have.

Finally, note that even if you could set aside these issues, the reproducibility of the sequence rests on the fact that your RNG is a pseudo-random number generator.  If instead you pass SimpleRandomRange an RNG that reads from a true source of randomness (say, /dev/random) it will be impossible to generate predictable sequences.

Hence the rule which I posited at the beginning of this email, and which I'll restate here, which seems to me the only way to ensure consistent behaviour of these entities regardless of the source of randomness used:

    **************************************************************************
    * Reading multiple times from the start of the same random range, should *
    * produce different (and statistically independent) results each time.   *
    **************************************************************************

As a corollary, this means that pseudo-random number generators (in Phobos and
elsewhere) should be redefined as reference types, which has already been
pointed out in D bugzilla:
http://d.puremagic.com/issues/show_bug.cgi?id=7067

However, that's not the end of the story.  Remember that first value from the range being always the same?  This also needs dealing with, and (I think) is non-trivial.  I will discuss this in the next email.

In the meantime -- I'm interested in feedback on the proposed rule. :-)

Thanks and best wishes,

    -- Joe
June 17, 2013
On 6/17/13 3:32 PM, Joseph Rushton Wakeling wrote:
>      **************************************************************************
>      * Reading multiple times from the start of the same random range, should *
>      * produce different (and statistically independent) results each time.   *
>      **************************************************************************

I don't think that's a good rule.

A range that does that should be an input range, not a forward range. Calling .save on a random range and tracing the steps again should produce the same values.


Andrei

June 17, 2013
On Monday, 17 June 2013 at 19:32:20 UTC, Joseph Rushton Wakeling wrote:
> In the meantime -- I'm interested in feedback on the proposed rule. :-)
>
> Thanks and best wishes,
>
>     -- Joe

Good analysis but (sorry) I think you are on the wrong track.

One of the major problems in std.random is that the ranges use value semantics. This means they are *saved* whenever they are copied, as opposed to just referenced. This creates the problems you have mentioned, and even more.

I have tried to fix it before:
http://forum.dlang.org/thread/oiczxzkzxketxitncghl@forum.dlang.org
FWIW, I gave up on the project because it was too complex for me to handle an entire module. But there was no reason for it not to work.

In any case, you can *also* take a look at these two other conversations; they sum up the problem pretty well (especially the second):
http://forum.dlang.org/thread/kuvhrfnrrbhzpyoirqgt@forum.dlang.org
http://forum.dlang.org/thread/mailman.259.1357667544.22503.digitalmars-d@puremagic.com

So I don't agree with your entire point:
   **************************************************************************
   * Reading multiple times from the start of the same random range, should *
   * produce different (and statistically independent) results each time.   *
   **************************************************************************

A random range should be viewed (IMO) as nothing more than a range that "was" (conceptually) simply filled with random numbers. Calling front on the same range twice in a row *should* produce the same result: No call to popFront => no change to the range. If it did change, it'd be a blatant violation of the range concept. It also means you can't have safe/nothrow/pure/const "front".

Being able to *save* a random range (which your proposal would prevent) can have useful applications too. It means you can iterate on the same random number sequence several times, lazily, without having to store the results in a buffer.

The problems you are trying to fix can (imo) be better fixed with the reference approach.
June 17, 2013
On Mon, Jun 17, 2013 at 04:22:38PM -0400, Andrei Alexandrescu wrote:
> On 6/17/13 3:32 PM, Joseph Rushton Wakeling wrote:
> >     **************************************************************************
> >     * Reading multiple times from the start of the same random range, should *
> >     * produce different (and statistically independent) results each time.   *
> >     **************************************************************************
> 
> I don't think that's a good rule.
> 
> A range that does that should be an input range, not a forward range. Calling .save on a random range and tracing the steps again should produce the same values.
[...]

Yeah, if a range encapsulates a non-repeatable RNG (e.g., a "true" RNG that, say, reads environmental variables to randomize its output, or reads from a hardware RNG), then it should be an input range. Such ranges cannot have any meaningful .save semantics, so they should not be forward ranges.

OTOH, one of the main issues in the current std.random is that RNGs are implemented as pass-by-value structs, which is a bad choice IMO. If we change them to have reference semantics instead, many of these issues would be solved.


T

-- 
2+2=4. 2*2=4. 2^2=4. Therefore, +, *, and ^ are the same operation.
June 17, 2013
On 06/17/2013 09:22 PM, Andrei Alexandrescu wrote:
> On 6/17/13 3:32 PM, Joseph Rushton Wakeling wrote:
>>      **************************************************************************
>>      * Reading multiple times from the start of the same random range, should *
>>      * produce different (and statistically independent) results each time.   *
>>      **************************************************************************
> 
> I don't think that's a good rule.
> 
> A range that does that should be an input range, not a forward range. Calling .save on a random range and tracing the steps again should produce the same values.

Let me clarify.

I'm _not_ talking about pseudo-random number generating ranges such as
MersenneTwisterEngine, which can indeed be forward ranges and in fact are
deterministic.  What I mean by a "random range" is one where popFront() calls a
source of randomness (real or pseudo) in order to update the value of front().

So, I don't consider MersenneTwister to be a random range -- but RandomSample
is, because it calls uniform() within popFront().

When saying "reading from the start", I also didn't mean to imply that successive calls to .front should produce different values.  By "reading from the start" I meant something like this:

    SomeRandomRange r;

    foreach(x; r)
        writeln(x);
    writeln();

    foreach(x; r)
        writeln(x);

... should produce two statistically independent sequences.

RandomSample is interesting to consider in light of your remarks -- its .save can't in general guarantee the same values moving forward, because it can't guarantee that the source of randomness can reproduce the same values:

        @property typeof(this) save()
        {
            auto ret = this;
            ret._input = _input.save;
            return ret;
        }

If (and only if) the RandomSample instance contains an internal generator, then the auto ret = this; line will copy it (by value) and so the saved copy will reproduce the sequence that would have been produced by the original.

However, if the RandomSample instance was instantiated with Random == void in the template parameters, then it will use rndGen to generate random numbers, and so the forward sequence will be different each time.  So in this sense the "save" is meaningless.

Now, we could and maybe should deal with that by forbidding save() in the case
where Random == void, but on the other hand, the idea of a random number
generator copied by value creates other problems to do with statistical safety.
If I do this:

    auto sample = randomSample(iota(100), 5, rndGen);

... which is an "obvious" thing to do, then the results of that will be correlated with subsequent things I do involving rndGen -- all because the generator gets copied by value.
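
The correlation is easy to demonstrate in isolation (again a sketch of mine, not from the thread):

```d
import std.random;

void main()
{
    auto copy = rndGen;                // value copy of the global generator's state
    auto a = uniform(0.0, 1.0, copy);  // advances only the copy
    auto b = uniform(0.0, 1.0);        // advances the global rndGen itself
    assert(a == b);                    // the same number comes out twice
}
```
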

On the other hand, if we assume that the generator is a reference type, then it will inevitably produce different samples each time it's iterated over.
June 17, 2013
On 06/17/2013 09:36 PM, monarch_dodra wrote:
> Good analysis but (sorry) I think you are on the wrong track.
> 
> One of the major problems in std.random is that the ranges use value semantics. This means they are *saved* whenever they are copied, as opposed to just referenced. This creates the problems you have mentioned, and even more.

I agree that the fact that pseudo-random number generators use value semantics is a serious problem, and I was thinking of our previous discussions in preparing these remarks.

However, I'll stress again for the avoidance of doubt that when referring to "random ranges" I was not referring to pseudo-random number generators, but to ranges that _call_ random-number generators (real or pseudo) inside popFront().

> I have tried to fix it before:
> http://forum.dlang.org/thread/oiczxzkzxketxitncghl@forum.dlang.org
> FWIW, I gave up on the project because it was too complex for me to handle an
> entire module. But there was no reason for it not to work.

I remember your work and was sad to see that it was not accepted -- actually one reason to start this discussion was to try and push awareness back to your contributions :-)

> A random range should be viewed (IMO) as nothing more than a range that "was" (conceptually) simply filled with random numbers. Calling front on the same range twice in a row *should* produce the same result: No call to popFront => no change to the range. If it did change, it'd be a blatant violation of the range concept. It also means you can't have safe/nothrow/pure/const "front".

Completely agree, and I don't think this is in contradiction with what I've proposed.  My proposed "rule" might be better stated to clarify this.

> Being able to *save* a random range (which your proposal would prevent) can have useful applications too. It means you can iterate on the same random number sequence several times, lazily, without having to store the results in a buffer.

I agree, but there are broader issues to do with .save and ranges that call random number generators in popFront().  If they are calling rndGen() or any other "global" random number generator, then it's impossible for them to reproduce the same sequence over and over.

> The problems you are trying to fix can (imo) be better fixed with the
> reference approach.

On the contrary, consider the SimpleRandomRange -- or RandomSample -- with a specified RNG that has reference semantics.  You'll get a different sequence each time you iterate over it, and .save will not enable you to reproduce the sequence.

I'd be happy to see a way in which one _could_ reproduce the sequence with .save, but it would have to be statistically reliable.  (That is, if I pass it a given random number generator, then the results of the sequence would have to be uncorrelated with other subsequent calls to that generator in other parts of the program.)

June 17, 2013
On 06/17/2013 11:18 PM, Joseph Rushton Wakeling wrote:
>> A random range should be viewed (IMO) as nothing more than a range that "was" (conceptually) simply filled with random numbers. Calling front on the same range twice in a row *should* produce the same result: No call to popFront => no change to the range. If it did change, it'd be a blatant violation of the range concept. It also means you can't have safe/nothrow/pure/const "front".
> 
> Completely agree, and I don't think this is in contradiction with what I've proposed.  My proposed "rule" might be better stated to clarify this.

Perhaps this would be a better statement:

    ************************************************************************
    * Iterating fully over a given random range should produce a different *
    * sequence for each such complete iteration.                           *
    ************************************************************************

So, if you do,

    SomeRandomRange r;
    auto x = r.front;
    auto y = r.front;
    assert(x == y);  // Passes!!

But

    SomeRandomRange r;
    auto arr1 = array(r);
    auto arr2 = array(r);
    assert(arr1 != arr2);  // the two arrays are filled with different sequences.
June 17, 2013
On Monday, 17 June 2013 at 22:18:44 UTC, Joseph Rushton Wakeling wrote:
> On 06/17/2013 09:36 PM, monarch_dodra wrote:
>> Good analysis but (sorry) I think you are on the wrong track.
>> 
>> One of the major problems in std.random is that the ranges use value semantics.
>> This means they are *saved* whenever they are copied, as opposed to just
>> referenced. This creates the problems you have mentioned, and even more.
>
> I agree that the fact that pseudo-random number generators use value semantics
> is a serious problem, and I was thinking of our previous discussions in
> preparing these remarks.

I didn't even realize you were part of that thread :/ Sorry about that.

> On the contrary, consider the SimpleRandomRange -- or RandomSample -- with a
> specified RNG that has reference semantics.  You'll get a different sequence
> each time you iterate over it, and .save will not enable you to reproduce the
> sequence.

I don't think you should pass rndGen to adapter ranges if you want any kind of deterministic behavior. You should really give such ranges their own new generator, eg:

   auto sample = randomSample(iota(100), 5, Random(unpredictableSeed));

That said, it would help even more if we had something *even* simpler, such as:
"Random random()" or "Random newRandomRange()", and then have the more idiomatic:

   auto sample = randomSample(iota(100), 5, newRandomRange());
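
Such a helper could be as simple as the following sketch (the name follows the suggestion above; the body is an assumption of mine):

```d
import std.random;

// Hypothetical convenience factory: a fresh, unpredictably seeded
// generator on every call.
Random newRandomRange()
{
    return Random(unpredictableSeed);
}

// Intended usage:
//     auto sample = randomSample(iota(100), 5, newRandomRange());
```
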
June 17, 2013
On Monday, 17 June 2013 at 22:29:34 UTC, Joseph Rushton Wakeling wrote:
> On 06/17/2013 11:18 PM, Joseph Rushton Wakeling wrote:
>>> A random range should be viewed (IMO) as nothing more than a range that "was"
>>> (conceptually) simply filled with random numbers. Calling front on the same
>>> range twice in a row *should* produce the same result: No call to popFront => no
>>> change to the range. If it did change, it'd be a blatant violation of the range
>>> concept. It also means you can't have safe/nothrow/pure/const "front".
>> 
>> Completely agree, and I don't think this is in contradiction with what I've
>> proposed.  My proposed "rule" might be better stated to clarify this.
>
> Perhaps this would be a better statement:
>
>     ************************************************************************
>     * Iterating fully over a given random range should produce a different *
>     * sequence for each such complete iteration.                           *
>     ************************************************************************

I'm not sure what that means but...

> So, if you do,
>
>     SomeRandomRange r;
>     x = r.front;
>     y = r.front;
>     assert(x == y);  // Passes!!
>
> But
>
>     SomeRandomRange r;
>     auto arr1 = array(r);
>     auto arr2 = array(r);
>     assert(arr1 != arr2);  // the two arrays are filled with different sequences.

With reference ranges, that is indeed what you'd get.

I could even add:

     SomeRandomForwardRange r;
     auto arr1 = array(r.save);
     auto arr2 = array(r);
     assert(arr1 == arr2); // This time both are the same
June 17, 2013
On 06/17/2013 11:34 PM, monarch_dodra wrote:
> I don't think you should pass rndGen to adapter ranges if you want any kind of deterministic behavior.

Actually, _passing_ rndGen with the current value semantics _will_ produce deterministic behaviour.  The problem is that it's statistically unsafe as the copy-by-value means that the results of the adapter ranges will correlate with subsequent other uses of rndGen.

> You should really give such ranges their own new
> generator, eg:
> 
>    auto sample = randomSample(iota(100), 5, Random(unpredictableSeed));

I did discuss that in my longer email.  I consider this to be statistically unsound because you can't guarantee that the results of Random(unpredictableSeed) will be statistically independent from the results of other random number generation activities in your program (a risk which increases with the number of random samples you generate in your program).

It's also undesirable because relying on the user to be virtuous and generate a new, unpredictably seeded RNG is (to my mind) too trusting.

If we consider D's goal to be safe-by-default, where statistics are concerned that should mean that the easy thing to do is also statistically "safe".