Ranges and random numbers -- initializing .front and related values

Re: Ranges and random numbers -- initializing .front and related values

Jun 18, 2013

H. S. Teoh

Jun 18, 2013

Joseph Rushton Wakeling

Jun 19, 2013

Joseph Rushton Wakeling

June 18, 2013

Re: Ranges and random numbers -- initializing .front and related values

Posted by H. S. Teoh

Permalink

H. S. Teoh

Permalink

On Tue, Jun 18, 2013 at 03:17:01PM +0100, Joseph Rushton Wakeling wrote:
> Hello all,
> 
> The previous discussion on random ranges -- that is, ranges where popFront() calls a random number generator (real or pseudo) -- seems to have come to a nice conclusion: namely, that the problems identified can be resolved by ensuring that random number generators operate as reference rather than value types.
> 
> So now, let's turn to a different set of problems that arise with random numbers and ranges.  In this case I don't have a concrete proposal or solution, but I'd like to present the problems and invite discussion.
> 
> Let's take as read that a conversion to reference types has been made (e.g. the attached Mersenne Twister variant).  Now consider the following random range, which generates a sequence of n random numbers u_1, u_2, ..., u_n, uniformly distributed in [0, 1) and also calculates the remainder of n - \sum_{i}u_i.
> 
>     struct Remainder(RNG)
>         if(isUniformRNG!RNG)
>     {
>         private real _remaining, _u;
>         private size_t _n;
>         private RNG _gen;
> 
>         this(size_t n, RNG gen)
>         {
>             assert(n > 0);
>             _n = n;
>             _remaining = n;
>             _gen = gen;
>             prime();       // sets value of _u.
>         }
[...]
>         private void prime()
>         {
>             _u = uniform(0.0L, 1.0L, _gen);
>             _remaining -= _u;
>         }
>     }
[...]
> So, let's initialize one of these ranges with a reference-type PRNG, and see what comes out if it.
> 
>     auto gen = new MtClass19937(unpredictableSeed);
>     auto r = remainder(5, gen);
>     writeln(r);
>     writeln(r);
> 
> ... which gives us for example:
> 
>     [0.686448, 0.206227, 0.331488, 0.558315, 0.255811]
>     [0.686448, 0.596189, 0.602239, 0.704929, 0.740741]
> 
> What do we see?  Well, the two different readings from the range are different -- as they should be (see previous discussion) -- but _not_ statistically independent: the first value is identical.  This will be the case no matter how many times we consume the range, because the very first value of .front is determined in the constructor:
> 
>     this(size_t n, RNG gen)
>     {
>         assert(n > 0);
>         _n = n;
>         _remaining = n;
>         _gen = gen;
>         prime();    // <------ sets first value of _u;
>     }
> 
> Obviously this is an unacceptable violation of statistical independence.  The issue here is that the initial value of .front shouldn't actually be determined until we start consuming the range. So for example we might add a boolean variable to control this, and rewrite the constructor and front:
[...]

Actually, I question the test code:

     auto gen = new MtClass19937(unpredictableSeed);
     auto r = remainder(5, gen);
     writeln(r);
     writeln(r);

I think there may be some misunderstanding of the semantics of r here. This may have been caused by the fact that arrays are "anomalous" ranges (as Jonathan put it), in that you can iterate over them multiple times and get the same results. However, generally, wrapper ranges are not designed to be iterated more than once; once the range has been consumed, no assumption can be made about it afterwards. That is to say, unless the range-wrapping function returns a forward range, they can only be iterated over once. The second writeln above is therefore treading on dangerous ground; the correct approach is to decide whether you want to iterate over the same range twice:

     auto gen = new MtClass19937(unpredictableSeed);
     auto r = remainder(5, gen);
     writeln(r.save);		// assuming r is a forward range
     writeln(r);

or, if you want two *different* sequences to be extracted from gen:

     auto gen = new MtClass19937(unpredictableSeed);
     writeln(remainder(5, gen));
     writeln(remainder(5, gen));	// call remainder() twice to
					// extract 5 elements from gen
					// twice

IOW, the range returned by remainder() is not to be treated as some kind of "repeatable data source" in the sense that iterating it multiple times will repeat the behaviour of taking 5 elements from gen and doing stuff to them.  Instead, it should be treated as a single-use data source, that gets consumed by the first writeln(), so the second writeln() is, strictly speaking, invalid.  Technically, the second writeln should just print an empty range; the reason it doesn't, is because r is a value type.  Arguably, it should be a reference type, but I suppose your typical range-wrapping function in std.algorithm / std.range assumes that the user understands that the resulting range is intended to be iterated only once, so they return value types (structs) as an optimization (and also to avoid GC allocations).

But r being a value type is really a question of implementation; from an abstract POV, the user shouldn't need to know whether it's a value type or reference type, and thus should *not* assume one way or another. So passing r to writeln twice is, strictly speaking, incorrect code. The correct way to achieve what you intend in this case, is to call remainder() twice, as in my second code snippet above.

This is applicable not only to remainder(), but to many of the wrapper ranges returned by std.range / std.algorithm. Attempting to pass them around (by value, because they're generally structs) and iterating them multiple times tend to produce odd results. The bad thing is that most test cases use arrays as convenient underlying data sources, which masks this problem because arrays have stronger semantics than your general input/forward range, so issues like the above are often overlooked.

T

-- 
"Real programmers can write assembly code in any language. :-)" -- Larry Wall

On 06/18/2013 05:29 PM, H. S. Teoh wrote: > Actually, I question the test code: > > auto gen = new MtClass19937(unpredictableSeed); > auto r = remainder(5, gen); > writeln(r); > writeln(r); > > I think there may be some misunderstanding of the semantics of r here. This may have been caused by the fact that arrays are "anomalous" ranges (as Jonathan put it), in that you can iterate over them multiple times and get the same results. However, generally, wrapper ranges are not designed to be iterated more than once; once the range has been consumed, no assumption can be made about it afterwards. That is to say, unless the range-wrapping function returns a forward range, they can only be iterated over once. I can accept that as a design principle, that iterating over a range twice is in general considered as undefined behaviour. It's certainly something that I've implicitly accepted in my own code, as I don't think I use anything akin to the above in "real" applications. Now that said, I think there's also a value in the definition of a random range as being one where iterating over it multiple times generates statistically independent outcomes, and I'm curious whether one can reliably achieve that design goal. > The second writeln above is therefore treading on dangerous ground; the correct approach is to decide whether you want to iterate over the same range twice: > > auto gen = new MtClass19937(unpredictableSeed); > auto r = remainder(5, gen); > writeln(r.save); // assuming r is a forward range > writeln(r); > > or, if you want two *different* sequences to be extracted from gen: > > auto gen = new MtClass19937(unpredictableSeed); > writeln(remainder(5, gen)); > writeln(remainder(5, gen)); // call remainder() twice to > // extract 5 elements from gen > // twice > > IOW, the range returned by remainder() is not to be treated as some kind of "repeatable data source" in the sense that iterating it multiple times will repeat the behaviour of taking 5 elements from gen and doing stuff to them. Instead, it should be treated as a single-use data source, that gets consumed by the first writeln(), so the second writeln() is, strictly speaking, invalid. Your second example corresponds to what I would do in practice if I were, for example, generating many random samples. However, suppose that you have an entity that is heavy to construct. Then there may be a value in being able to do something like, auto r = someRandomRange(.....); foreach(i; iota(1_000_000)) { processResultsOf(r); } ... as opposed to, foreach(i; iota(1_000_000)) { auto r = someRandomRange(.....); processResultsOf(r); } Despite the difficulties I've identified in my previous email, I still think it ought to be possible to derive a random range design that can do that reliably. > But r being a value type is really a question of implementation; from an abstract POV, the user shouldn't need to know whether it's a value type or reference type, and thus should *not* assume one way or another. So passing r to writeln twice is, strictly speaking, incorrect code. The correct way to achieve what you intend in this case, is to call remainder() twice, as in my second code snippet above. Your point that the repeatable behaviour is an artifact of the range being a value type is a very good one. Could a reasonable conclusion be that ranges should be designed to behave _like_ reference types even if they're technically implemented as structs, precisely in order to prevent the user from doing unsafe things like multiple iteration passes? On the other hand, is there anything particularly wrong with designing a range to be able to handle multiple iteration passes?

On 06/18/2013 07:21 PM, Joseph Rushton Wakeling wrote: > I can accept that as a design principle, that iterating over a range twice is in general considered as undefined behaviour. It's certainly something that I've implicitly accepted in my own code, as I don't think I use anything akin to the above in "real" applications. > > Now that said, I think there's also a value in the definition of a random range as being one where iterating over it multiple times generates statistically independent outcomes, and I'm curious whether one can reliably achieve that design goal. I'm a little concerned here because when the idea of an "if (_first)" lazy evaluation was put in RandomSample.front, nobody dissented: https://github.com/D-Programming-Language/phobos/pull/553#commits-pushed-205c3bf I might not have made that particular change if I'd been advised that random samples should be treated as consume-once entities. That said, I think there's a benefit to doing it like this. If we start with the assumption that we're generating a lazily-evaluated random sequence, consider the following, using the simpleRandomRange I described in the previous discussion: auto gen1 = new MtClass19937(1001); real x1 = uniform(0.0L, 1.0L, gen1); auto r1 = simpleRandomRange(0.0L, 1.0L, gen1); writeln(r1.take(5)); auto gen2 = new MtClass19937(1001); auto r2 = simpleRandomRange(0.0L, 1.0L, gen2); real x2 = uniform(0.0L, 1.0L, gen2); writeln(r2.take(5)); These will produce sequences that differ only in the _first_ entry: [0.379529, 0.265064, 0.551418, 0.19606, 0.227684] [0.306232, 0.265064, 0.551418, 0.19606, 0.227684] ... because the initial value gets set in the constructor, and so it depends whether the call to uniform() happens before or after. It's maybe a small thing, but it feels more natural to me that all the randomness should happen at the point where the range is consumed, rather than at its point of construction. On the other hand, I don't like the idea of front losing its const pure nothrow properties. :-( A while ago, in response to a remark of Andrei's about the possibility of a .finish() method for output ranges, I made an off the cuff suggestion that random ranges might have a .start() method that would be called at the beginning of the consumption of the range: http://forum.dlang.org/post/mailman.1227.1344816494.31962.digitalmars-d@puremagic.com Could such a thing be feasible?

Forums