June 15, 2012
On Friday, 15 June 2012 at 06:00:43 UTC, Artur Skawina wrote:
> On 06/15/12 00:55, Joseph Rushton Wakeling wrote:
>>     sample = randomSample(iota(0, 100), 5, rndGen);
>> 
>> ... should probably be disallowed on grounds of safety.
>
> Considering the output of this program:
>
>    import std.stdio;
>    import std.random;
>
>    void main() {
>       foreach (i; 0..20)
>          writeln(randomSample([0,1,2,4,5,6,7,8,9], 3, Random(unpredictableSeed)));
>    }
>
> I'd say the use of std.random should be disallowed on grounds of safety...
>
> Does it work for someone else? (JIC it's only my old GDC installation that fails)
>
> artur

Joseph's pull request already contains a fix for this bug but
I'm guessing it won't be merged until other issues with
randomSample are resolved.
June 15, 2012
On 15/06/12 12:48, jerro wrote:
> Joseph's pull request already contains a fix for this bug but
> I'm guessing it won't be merged until other issues with
> randomSample are resolved.

Assuming that everyone agrees that a random sample should always lazily evaluate to the same values, a different fix will be needed; probably that if no RNG is specified, RandomSample should contain a generator set to Random(unpredictableSeed).

On the other hand, if RNGs are to be passed to randomSample by reference, it might make more sense to always have the sample evaluate differently.
June 17, 2012
On 15/06/12 06:58, Artur Skawina wrote:
> Considering the output of this program:
>
>     import std.stdio;
>     import std.random;
>
>     void main() {
>        foreach (i; 0..20)
>           writeln(randomSample([0,1,2,4,5,6,7,8,9], 3, Random(unpredictableSeed)));
>     }
>
> I'd say the use of std.random should be disallowed on grounds of safety...

Yea, it's for a mix of reasons which Jerro and I prepared fixes for.

In the current RandomSample implementation the first value of the sample is set in the constructor so that when you call .front() there will be a value waiting.  That means that whatever the subsequent lazy evaluation of the sample, the first point will _always_ be the same.

You wouldn't expect that to be a problem here because you're generating 20 different samples.  But -- nastily -- in the current setup, if you specify a RNG, its value is set only _after_ the RandomSample object has been constructed, meaning this RNG does not affect the generation of the first value.  Hence the output you see.

Jerro's fix corrected the copying of RNG.  Mine instead delayed the calculation of the first value so that it was always part of the lazy evaluation.  Which is the better option depends on how you prefer to resolve the inconsistency identified in this discussion thread, i.e. whether the lazy evaluation should always come out to the same set of values, or always different ones.

In any case, I think the pull request I have here:
https://github.com/D-Programming-Language/phobos/pull/553

... improves the safety (as well as the speed) of RandomSample, and that pulling it in would improve the situation without breaking backwards compatibility, but that still leaves unresolved the inconsistent behaviour identified in this thread.
June 17, 2012
On 15/06/12 12:48, jerro wrote:
> Joseph's pull request already contains a fix for this bug but
> I'm guessing it won't be merged until other issues with
> randomSample are resolved.

FWIW, I think my updates can be merged first without problem -- yes, they don't address the bigger design issues, but they don't make matters _worse_ and in all other respects they improve things.
June 17, 2012
On 06/17/12 16:37, Joseph Rushton Wakeling wrote:
> On 15/06/12 06:58, Artur Skawina wrote:
>> Considering the output of this program:
>>
>>     import std.stdio;
>>     import std.random;
>>
>>     void main() {
>>        foreach (i; 0..20)
>>           writeln(randomSample([0,1,2,4,5,6,7,8,9], 3, Random(unpredictableSeed)));
>>     }
>>
>> I'd say the use of std.random should be disallowed on grounds of safety...
> 
> Yea, it's for a mix of reasons which Jerro and I prepared fixes for.
> 
> In the current RandomSample implementation the first value of the sample is set in the constructor so that when you call .front() there will be a value waiting.  That means that whatever the subsequent lazy evaluation of the sample, the first point will _always_ be the same.
> 
> You wouldn't expect that to be a problem here because you're generating 20 different samples.  But -- nastily -- in the current setup, if you specify a RNG, its value is set only _after_ the RandomSample object has been constructed, meaning this RNG does not affect the generation of the first value.  Hence the output you see.

The bug description and cause makes sense, thanks for the explanation.

But the problem is that this kind of bug inside a module which is supposed
to generate pseudo-random data makes it very hard to trust _any_ result
given back by the code...
Other than the constant first element, does the output of the above program
look correct? It did not for me, so i tried this:

   import std.stdio;
   import std.random;

   void main() {
      long cnts[10];

      foreach (i; 0..20000) {
         auto r = randomSample([0,1,2,4,5,6,7,8,9], 3, Random(unpredictableSeed));
         foreach (e; r)
            cnts[e]++;
      }
      writeln(cnts);
   }

which results in:

   [0, 20000, 5749, 0, 5781, 5702, 5692, 5618, 5674, 5784]

So let's fix the already discovered bug:

@@ -1330,7 +1330,8 @@ struct RandomSample(R, Random = void)
     // choice but to store a copy.
     static if(!is(Random == void))
     {
-        Random gen;
+        Random _gen;
+        @property gen(Random g) { _gen = g; prime(); return _gen; }
     }

 /**
@@ -1411,7 +1412,7 @@ Returns the index of the visited record.
             }
             else
             {
-                auto r = uniform(0, _available, gen);
+                auto r = uniform(0, _available, _gen);
             }

             if (r < _toSelect)

and rerun the program. Now the result is:

   [0, 7568, 7476, 0, 7494, 7500, 7461, 7504, 7527, 7470]

ie still not quite what you'd expect...


Unfortunately this makes it impossible to use std.random in practice, because you can't tell how many more such problems exist, if my first two attempts to use the module already uncovered two serious bugs...

> Jerro's fix corrected the copying of RNG.  Mine instead delayed the calculation of the first value so that it was always part of the lazy evaluation.  Which is the better option depends on how you prefer to resolve the inconsistency identified in this discussion thread, i.e. whether the lazy evaluation should always come out to the same set of values, or always different ones.

This is *not* a RNG, it's a PRNG - the results must always be completely
repeatable, just like you say in your first message. Of course having
a mode that improves the randomness is ok and should probably even be the
default. But if a PRNG is seeded with a known value then it must behave
completely predictable.

artur
June 17, 2012
On 17/06/12 17:08, Artur Skawina wrote:
> The bug description and cause makes sense, thanks for the explanation.
>
> But the problem is that this kind of bug inside a module which is supposed
> to generate pseudo-random data makes it very hard to trust _any_ result
> given back by the code...

Sure.  When I was working on this I did spend a fair amount of time scratching my head over how you could create _really_ effective unittests for random-number functionality, without stretching out the time required too greatly.  I don't think sufficient tests are in place, though probably really rigorous tests of pseudo-random number generation would take longer than unittests are supposed to.

> So let's fix the already discovered bug:

I'm feeling a bit braindead today so I may have misunderstood your code, but I'm not sure your fix actually does fix the problem identified.  You could check out Jerro's pull request for an alternative:
https://github.com/D-Programming-Language/phobos/pull/542

> Now the result is:
>
>     [0, 7568, 7476, 0, 7494, 7500, 7461, 7504, 7527, 7470]
>
> ie still not quite what you'd expect...

If you've got time, you might like to pull from my master branch:
https://github.com/WebDrake/phobos

... and check if the same bug arises.  I made exactly this kind of test.

> This is *not* a RNG, it's a PRNG - the results must always be completely
> repeatable, just like you say in your first message. Of course having
> a mode that improves the randomness is ok and should probably even be the
> default. But if a PRNG is seeded with a known value then it must behave
> completely predictable.

I think you've slightly misunderstood what I meant.  Let's say we create a random sample range:

    auto sample = randomSample(/* whatever input */);

... then there are two perfectly logical and acceptable ways to handle its lazy evaluation.

The first is that each time you evaluate it produces the exact same result, i.e.

    writeln(sample);
    writeln(sample);
    writeln(sample);

... will produce identical output 3 times.  The alternative is that each time a new random sample is generated, i.e. each time we

    writeln(sample);

... we get a different sample.  This is still predictable, because the samples will derive from the same sequence of pseudo-random numbers, each new sample picking up the pseudo-random sequence where the last one left.  Assuming the sequence's approximation of randomness to be good enough, you'll get properly independent samples each time.

To me either of these possibilities is acceptable -- they're both logical and predictable -- but the behaviour should be the same whether or not randomSample is called with a specific RNG.
June 17, 2012
On 17/06/12 17:51, Joseph Rushton Wakeling wrote:
> On 17/06/12 17:08, Artur Skawina wrote:
>> Now the result is:
>>
>> [0, 7568, 7476, 0, 7494, 7500, 7461, 7504, 7527, 7470]
>>
>> ie still not quite what you'd expect...
>
> If you've got time, you might like to pull from my master branch:
> https://github.com/WebDrake/phobos
>
> ... and check if the same bug arises. I made exactly this kind of test.

I've compiled your test against my code and get out, on 4 different runs:

[5984, 5950, 6046, 6020, 5977, 5839, 5936, 6032, 6137, 6079]
[6075, 5904, 5993, 6027, 6076, 5987, 6012, 6013, 5884, 6029]
[6089, 5964, 6006, 5920, 5937, 5997, 6034, 6019, 6038, 5996]
[5975, 5916, 6087, 6026, 6068, 5938, 5972, 5937, 5966, 6115]

... which seems in the ballpark.

However, a disturbing fact: if instead of compiling, I run with rdmd, I get out:

[0, 0, 20000, 5643, 5800, 5749, 5699, 5655, 5664, 5790]

I can't work out why the compiled result would be different from the output of rdmd, as I recompiled not only Phobos but also rdmd itself.  Perhaps that it's looking in /usr/include/ when building instead of /usr/local/include/ ... ?
June 17, 2012
On 17/06/12 18:31, Joseph Rushton Wakeling wrote:
> I can't work out why the compiled result would be different from the output of
> rdmd

... and one restart later, the correct output is coming out.  Probably should have run rehash. :-\

June 17, 2012
On 06/17/12 18:51, Joseph Rushton Wakeling wrote:
> I'm feeling a bit braindead today so I may have misunderstood your code, but I'm not sure your fix actually does fix the problem identified.  You could check out Jerro's pull request for an alternative:
> https://github.com/D-Programming-Language/phobos/pull/542

The result should be the same; I just wanted a fix and did that one-liner, so that it would be clear the bugs aren't related.

>> Now the result is:
>>
>>     [0, 7568, 7476, 0, 7494, 7500, 7461, 7504, 7527, 7470]
>>
>> ie still not quite what you'd expect...
> 
> If you've got time, you might like to pull from my master branch: https://github.com/WebDrake/phobos
> 
> ... and check if the same bug arises.  I made exactly this kind of test.

I took your random.d and ran with that:

   [6612, 6650, 6704, 0, 6629, 6834, 6634, 6756, 6590, 6591]
   [6589, 6587, 6636, 0, 6673, 6704, 6647, 6704, 6643, 6817]
   [6744, 6552, 6602, 0, 6641, 6722, 6598, 6676, 6749, 6716]
   [6641, 6583, 6710, 0, 6618, 6684, 6789, 6673, 6683, 6619]
   [6667, 6692, 6600, 0, 6717, 6588, 6660, 6678, 6673, 6725]

so the problem is still there, just a little different.

I'm using an old GDC build, in case that makes any difference:

gcc version 4.6.3 20120106 (prerelease gdc 0.31 - r748:ab99d67f04c2, using dmd 2.057)

artur
June 17, 2012
On 17/06/12 19:50, Artur Skawina wrote:
> I took your random.d and ran with that:
>
>     [6612, 6650, 6704, 0, 6629, 6834, 6634, 6756, 6590, 6591]
>     [6589, 6587, 6636, 0, 6673, 6704, 6647, 6704, 6643, 6817]
>     [6744, 6552, 6602, 0, 6641, 6722, 6598, 6676, 6749, 6716]
>     [6641, 6583, 6710, 0, 6618, 6684, 6789, 6673, 6683, 6619]
>     [6667, 6692, 6600, 0, 6717, 6588, 6660, 6678, 6673, 6725]
>
> so the problem is still there, just a little different.

That's odd, and not what I'm seeing.  I don't know if it's related to GDC -- I'm using the latest DMD and druntime from GitHub.