Jump to page: 1 24  
Page
Thread overview
Re: RandomSample with specified random number generator
Jun 12, 2012
Jens Mueller
Jun 12, 2012
Jens Mueller
Jun 12, 2012
Jens Mueller
Jun 12, 2012
Jens Mueller
Jun 13, 2012
jerro
Jun 14, 2012
jerro
Jun 14, 2012
Dmitry Olshansky
Jun 14, 2012
Jonathan M Davis
Jun 14, 2012
Jonathan M Davis
Jun 15, 2012
David Nadlinger
Jun 15, 2012
Jens Mueller
Jun 14, 2012
Jens Mueller
Jun 15, 2012
Artur Skawina
Jun 15, 2012
jerro
Jun 17, 2012
Artur Skawina
Jun 17, 2012
Artur Skawina
Jun 18, 2012
Artur Skawina
Jun 15, 2012
Jonathan M Davis
June 12, 2012
Joseph Rushton Wakeling wrote:
> Hello all,
> 
> An interesting challenge with the randomSample functionality in random.d.
> 
> randomSample takes as input a range, a number of points to sample from that range, and (optionally) a random number generator.  It returns a range corresponding to a lazily-calculated sample of the desired size.
> 
> If you don't pass it a specific random number generator, then it just uses the thread-global random number generator.  A consequence of this is that the sample changes each time you use it, i.e. if you write,
> 
>       auto sample1 = randomSample(iota(0, 100), 5);
>       writeln(sample1);
>       writeln(sample1);
>       writeln(sample1);
> 
> ... you get 3 different random samples, e.g.
> 
>       [0, 17, 18, 43, 77]
>       [20, 25, 34, 53, 97]
>       [6, 12, 28, 42, 87]
> 
> Conversely, if you _do_ pass randomSample a specific random number generator, a copy of it is stored inside the RandomSample struct. This is necessary because, as the sample is lazily evaluated, you need to guarantee the RNG's existence at the point when you evaluate.
> 
> A consequence of this is that if you evaluate the sample multiple times, you get the same result, i.e. if you do:
> 
>       auto sample2 = randomSample(iota(0, 100), 5, rndGen);
>       writeln("Sample with specified RNG:");
>       writeln(sample2);
>       writeln(sample2);
>       writeln(sample2);
> 
> then you get something like e.g.
> 
>       [33, 35, 54, 55, 94]
>       [33, 35, 54, 55, 94]
>       [33, 35, 54, 55, 94]
> 
> ... which is problematic because it's different behaviour from the case without a specific RNG being used, but not problematic per se. One can see a logic for either case (a sample range always giving the same result, or always giving a different and independent result), it's just a matter of being clear which will happen.
> 
> However, a real problem arises if you want to have multiple samples each being passed a specific source of randomness.  If you do:
> 
>       auto sample3 = randomSample(iota(0, 100), 5, rndGen);
>       writeln(sample3);
>       auto sample4 = randomSample(iota(0, 100), 5, rndGen);
>       writeln(sample4);
>       auto sample5 = randomSample(iota(0, 100), 5, rndGen);
>       writeln(sample5);
> 
> ... then you'd expect in principle to get 3 different samples. However, because rndGen is _copied_ rather than used directly, its state does not get updated when the samples are evaluated.  As a consequence, each of the samples above gets passed _the same random number generator in the same state_, and you get out the same samples, e.g.
> 
>       [33, 35, 54, 55, 94]
>       [33, 35, 54, 55, 94]
>       [33, 35, 54, 55, 94]
> 
> (... yes, the same as sample2, assuming we're still in the same program and haven't made any other uses of rndGen in the meantime).
> 
> What this means is that randomSample is impossible to use effectively with a specified random number generator.  It's not just that successive "different" samples produce the same output, it's also that if you generate a random sample and then generate other random numbers afterwards, they will be correlated.
> 
> I can see only two possible resolutions of this issue, but I'm curious if anyone else can think of something different.
> 
>   (i) store a reference to the random number generator instead of a copy
>       (is this even possible?).  The trouble is, this isn't safe, because
>       what if e.g. you have a function which does something like
> 
>           auto generateSample()
>           {
>               auto rng = Random(unpredictableSeed);
>               auto sample = randomSample(iota(0, 100), 5, rng);
>               return sample;
>           }
> 
>           void main()
>           {
>               auto sample = generateSample();
>               writeln(sample);  // Won't work, because rng will no longer exist.
>           }
> 
>       ... so this option seems impermissible.
> 
>  (ii) when copying the random number generator, seed it with an unpredictable
>       seed before generating the first point of the random sample.  This will
>       effectively make the sample an independent thread with respect to random
>       number generation.  The disadvantage is that you lose the reproducibility
>       of the sampling behaviour (e.g. computer game where you want to avoid the
>       player being able to save/reload/try again and get a different outcome)
>       and there might be unexpected correlations that occur if the seeding or
>       the random number generation algorithm are done poorly.
> 
> The second seems the only really valid option to me, and has the advantage of making identical the behaviour of the sample whether or not it's given a specific RNG (i.e. using the sample 3 different times gets you 3 different samples).
> 
> ... any thoughts on this and on possible alternative options?

Probably I'm not seeing the issue

auto sample3 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
writeln(sample3);
auto sample4 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
writeln(sample4);
auto sample5 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
writeln(sample5);

outputs

[21, 38, 57, 86, 90]
[21, 39, 68, 79, 94]
[21, 22, 57, 86, 92]

Besides the fact that the ranges always contain 21 (a bug?) this looks
fine to me, doesn't it?

Jens
June 12, 2012
On 12/06/12 13:46, Jens Mueller wrote:
> Probably I'm not seeing the issue
>
> auto sample3 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
> writeln(sample3);
> auto sample4 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
> writeln(sample4);
> auto sample5 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
> writeln(sample5);
>
> outputs
>
> [21, 38, 57, 86, 90]
> [21, 39, 68, 79, 94]
> [21, 22, 57, 86, 92]

Yes, because you're passing it each time a new RNG seeded with an unpredictable seed :-)

Try instead

   auto rng = Random(unpredictableSeed)
   auto sample3 = randomSample(iota(0, 100), 5, rng);
   writeln(sample3);
   auto sample4 = randomSample(iota(0, 100), 5, rng);
   writeln(sample4);
   auto sample5 = randomSample(iota(0, 100), 5, rng);
   writeln(sample5);

... and you'll get out each time the same values.  (Or instead of a newly-defined generator you could just use rndGen as in my code examples.)

> Besides the fact that the ranges always contain 21 (a bug?) this looks
> fine to me, doesn't it?

The first-sample-point issue is a bug which I fixed in a recent pull request:
https://github.com/D-Programming-Language/phobos/pull/553#issuecomment-5608813
June 12, 2012
Joseph Rushton Wakeling wrote:
> On 12/06/12 13:46, Jens Mueller wrote:
> >Probably I'm not seeing the issue
> >
> >auto sample3 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
> >writeln(sample3);
> >auto sample4 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
> >writeln(sample4);
> >auto sample5 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
> >writeln(sample5);
> >
> >outputs
> >
> >[21, 38, 57, 86, 90]
> >[21, 39, 68, 79, 94]
> >[21, 22, 57, 86, 92]
> 
> Yes, because you're passing it each time a new RNG seeded with an unpredictable seed :-)
> 
> Try instead
> 
>    auto rng = Random(unpredictableSeed)
>    auto sample3 = randomSample(iota(0, 100), 5, rng);
>    writeln(sample3);
>    auto sample4 = randomSample(iota(0, 100), 5, rng);
>    writeln(sample4);
>    auto sample5 = randomSample(iota(0, 100), 5, rng);
>    writeln(sample5);
> 
> ... and you'll get out each time the same values.  (Or instead of a newly-defined generator you could just use rndGen as in my code examples.)

Of course. But this behavior is clear from the documentation. Did it surprise you? Passing a rng by value is fine I think.

> >Besides the fact that the ranges always contain 21 (a bug?) this looks
> >fine to me, doesn't it?
> 
> The first-sample-point issue is a bug which I fixed in a recent pull request: https://github.com/D-Programming-Language/phobos/pull/553#issuecomment-5608813

I see.

Jens
June 12, 2012
On 12/06/12 14:28, Jens Mueller wrote:
>> ... and you'll get out each time the same values.  (Or instead of a
>> newly-defined generator you could just use rndGen as in my code
>> examples.)
>
> Of course. But this behavior is clear from the documentation. Did it
> surprise you? Passing a rng by value is fine I think.

_Is_ it clear?  I would expect that if I create and use 3 samples as in my example code, then I would get 3 _different_ samples.  But perhaps I'm too used to a C-style past where a single RNG is passed around to different functions within the thread.

Bear in mind that the behaviour is different depending on whether you pass an RNG or not.  i.e. if you do

    sample = randomSample(iota(0, 100), 5);
    writeln(sample);
    writeln(sample);
    writeln(sample);

... you'll get 3 _different_ sets of values, whereas if you do

    sample = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
    writeln(sample);
    writeln(sample);
    writeln(sample);

... you'll get out 3 times the same sample.

This could be resolved by making it so that if RandomSample is not passed a specified RNG, it sets the internally-stored RNG to be Random(unpredictableSeed).
June 12, 2012
Joseph Rushton Wakeling wrote:
> On 12/06/12 14:28, Jens Mueller wrote:
> >>... and you'll get out each time the same values.  (Or instead of a newly-defined generator you could just use rndGen as in my code examples.)
> >
> >Of course. But this behavior is clear from the documentation. Did it surprise you? Passing a rng by value is fine I think.
> 
> _Is_ it clear?  I would expect that if I create and use 3 samples as in my example code, then I would get 3 _different_ samples.  But perhaps I'm too used to a C-style past where a single RNG is passed around to different functions within the thread.
> 
> Bear in mind that the behaviour is different depending on whether you pass an RNG or not.  i.e. if you do
> 
>     sample = randomSample(iota(0, 100), 5);
>     writeln(sample);
>     writeln(sample);
>     writeln(sample);
> 
> ... you'll get 3 _different_ sets of values, whereas if you do
> 
>     sample = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
>     writeln(sample);
>     writeln(sample);
>     writeln(sample);
> 
> ... you'll get out 3 times the same sample.
> 
> This could be resolved by making it so that if RandomSample is not passed a specified RNG, it sets the internally-stored RNG to be Random(unpredictableSeed).

Right. These are inconsistent. This should be fixed. Can't we just use a
default argument like
auto randomSample(R, Random)(R r, size_t n, Random gen = Random(unpredictableSeed));

Currently the documentation does not even state what is difference between the versions accepting a random generator vs. the ones without a random generator.

Jens
June 12, 2012
On 12/06/12 14:58, Jens Mueller wrote:
> Right. These are inconsistent. This should be fixed. Can't we just use a
> default argument like
> auto randomSample(R, Random)(R r, size_t n, Random gen = Random(unpredictableSeed));
>
> Currently the documentation does not even state what is difference
> between the versions accepting a random generator vs. the ones without a
> random generator.

... so can we agree that a given random sample should _always_ lazily evaluate to the same output, whether or not it's been given a specified RNG?

i.e. that

      auto sample = randomSample(iota(0, 100), 5);
      writeln(sample);
      writeln(sample);
      writeln(sample);

... should produce 3 times the same output?

If so I'll get to work on a bug report and a patch set and unittest.
June 12, 2012
Joseph Rushton Wakeling wrote:
> On 12/06/12 14:58, Jens Mueller wrote:
> >Right. These are inconsistent. This should be fixed. Can't we just use a
> >default argument like
> >auto randomSample(R, Random)(R r, size_t n, Random gen = Random(unpredictableSeed));
> >
> >Currently the documentation does not even state what is difference between the versions accepting a random generator vs. the ones without a random generator.
> 
> ... so can we agree that a given random sample should _always_ lazily evaluate to the same output, whether or not it's been given a specified RNG?
> 
> i.e. that
> 
>       auto sample = randomSample(iota(0, 100), 5);
>       writeln(sample);
>       writeln(sample);
>       writeln(sample);
> 
> ... should produce 3 times the same output?
> 
> If so I'll get to work on a bug report and a patch set and unittest.

Yes, I'll agree with you. But I don't know about others. It'll be nice if others share their opinion such that your efforts won't be wasted. That code was mostly written by Andrei and David.

Jens
June 13, 2012
> Yes, I'll agree with you. But I don't know about others. It'll be nice
> if others share their opinion such that your efforts won't be wasted.
> That code was mostly written by Andrei and David.
>
> Jens

I, for one, agree that this:

> auto randomSample(R, Random)(R r, size_t n, Random gen = Random(unpredictableSeed))

is the best way to solve this. The current implementation where
Random can be void and then RandomSample uses the global rng and
everything behaves slightly differently just seems like an endless
source of bugs.
June 14, 2012
On 13/06/12 03:15, jerro wrote:
>> Yes, I'll agree with you. But I don't know about others. It'll be nice
>> if others share their opinion such that your efforts won't be wasted.
>> That code was mostly written by Andrei and David.
>>
>> Jens
>
> I, for one, agree that this:
>
>> auto randomSample(R, Random)(R r, size_t n, Random gen =
>> Random(unpredictableSeed))
>
> is the best way to solve this. The current implementation where
> Random can be void and then RandomSample uses the global rng and
> everything behaves slightly differently just seems like an endless
> source of bugs.

I've filed a bug report relating to this.
http://d.puremagic.com/issues/show_bug.cgi?id=8247

Assuming API changes are permissible (a big IF), I think the way to do it might be something like

auto randomSample(R, UniformRNG = Random, seed)(R r, size_t n, seed s = unpredictableSeed)

... meaning that the user has an option to specify an RNG _type_ without specifying a seed, and that the seed can be either unpredictable or specific.

The user could also not specify a RNG type, in which case the default RNG type would be used, but would be seeded unpredictably.
June 14, 2012
Joseph Rushton Wakeling wrote:
> On 13/06/12 03:15, jerro wrote:
> >>Yes, I'll agree with you. But I don't know about others. It'll be nice if others share their opinion such that your efforts won't be wasted. That code was mostly written by Andrei and David.
> >>
> >>Jens
> >
> >I, for one, agree that this:
> >
> >>auto randomSample(R, Random)(R r, size_t n, Random gen =
> >>Random(unpredictableSeed))
> >
> >is the best way to solve this. The current implementation where Random can be void and then RandomSample uses the global rng and everything behaves slightly differently just seems like an endless source of bugs.
> 
> I've filed a bug report relating to this. http://d.puremagic.com/issues/show_bug.cgi?id=8247

Thanks.

> Assuming API changes are permissible (a big IF), I think the way to
> do it might be something like
> 
> auto randomSample(R, UniformRNG = Random, seed)(R r, size_t n, seed
> s = unpredictableSeed)
> 
> ... meaning that the user has an option to specify an RNG _type_ without specifying a seed, and that the seed can be either unpredictable or specific.
> 
> The user could also not specify a RNG type, in which case the default RNG type would be used, but would be seeded unpredictably.

Why don't you want to pass an instance of some RNG?
Then there are two places to change when you want to change generated
random numbers.

What are the advantages of the above in regard to
auto randomSample(R, Random)(R r, size_t n, Random r = Random(unpredictableSeed))
?
Is it because you got bitten that the generators are passed by value?

Jens
« First   ‹ Prev
1 2 3 4