Chris Cain
Posted in reply to Chris Cain
| https://d.puremagic.com/issues/show_bug.cgi?id=11598
--- Comment #5 from Chris Cain <zshazz@gmail.com> 2013-12-01 08:55:54 PST ---
It's a fair request to have documented rigor for such an important function in std.random. I'll document my entire process here.
For tests doing what you describe, refer to this gist:
https://gist.github.com/Zshazz/1bffebf9cb7cb97cfabc
One thing to note, knowing the implementation for this change (a bit of information on the idea behind the implementation is at the end of this post), is that this wouldn't be sufficient for detecting all of the potential problems. For instance, if you used a bad uniform implementation with _just_ `rnum % upperDist` (using the notation of this implementation), then the distribution error over `uniform(0, 100)` would be indetectable (even using a 32-bit rnum).
In that case, there would be `((2^32) / 100) + 1` buckets (around 43 million)
and the last bucket would only cover up to `((2^32) % 100) - 1`, which is 95.
So, 96-99 each would be off by about 1 in 43 million rolls. Not really
detectable visually. That would essentially appear to be background noise at
that point and explainable just by the nature of random rolls.
So, let's define `test1`, a test that is divised to guard against that particular type of error. That's easy enough: just pick your numbers in such a way to minimize the number of buckets while still having some elements missing from the last bucket. So, picking `((2^32) / 4) * 3` would result in two buckets, with about 2/3 of the numbers in that range not being represented in the second bucket. Consequently, this makes it _very_ easy to count and _very_ easy to tell that something is off. Roll 100_000 times or so and count how many of those rolls fall into the first third of the range, and then count how many of those rolls fall into the last third of the range. If the counts differ (and they will in the case of `badUniform`), then you know there's an issue.
Now, let's define `test2` to answer the following question: what if the reroll condition was chosen in such a way that it was off-by-one? Well, we'd have a very, very tough time figuring that out since, even with 32-bit numbers, a count array holding _all_ numbers in that space would be too massive. Thoroughly analyzing that data would be even more of a challenge. So, to be fully rigorous about this, `test2` has to _slightly_ modify the types internal to the new uniform implementation to ensure that the rnum is generated using ubytes to make it so that it's easy to analyze (it uses `typeof(unsigned(b-lower))` by default which is minimally a uint, which prevents excessive and expensive rerolls when smaller types are used). This does _not_ change overall algorithm, so it still applies to the bigger version of the new uniform as well. Now `test2` just tries out several `i` in `uniform(0, i)` s.t. `i` is a number around `128` (which is the exact middle of the range of 8-bit integers) to make sure everything behaves properly around that critical point.
Here is the code and results from running the code for these two tests using the new uniform and a bad uniform:
https://gist.github.com/Zshazz/450f57a5374bf0b80fb5
---
So all that said, these tests aren't necessarily going to cover every possible question about statistical issues that this new uniform could bring up, either. Hence a general argument for correctness is probably a better approach. For a complete description, I refer to my document on my dropbox:
https://dl.dropboxusercontent.com/u/2206555/uniformUpgrade.pdf
Because it's possible (and likely eventually) that this document will no longer appear on my dropbox, I'll reproduce the most important information here (not exactly what is in the document, but close enough):
The modulus operator maps an integer to a small, finite space. For instance, `x % 3` will map whatever x is into the range [0 .. 3). 0 maps to 0, 1 maps to 1, 2 maps to 2, 3 maps to 0, and so on infinitely. As long as the integer is uniformly chosen from the infinite space of all non-negative integers then `x % 3` will uniformly fall into that range.
(Non-negative is important in this case because some definitions of modulus, namely the one used in computers generally, map negative numbers differently to (-3 .. 0]. `uniform` does not use negative number modulus, thus we can safely ignore that fact.)
The issue with computers is that integers have a finite space they must fit in, and our uniformly chosen random number is picked in that finite space. So, that method is not sufficient. You can look at it as the integer space being divided into "buckets" and every bucket after the first bucket maps directly into that first bucket. `[0, 1, 2]`, `[3, 4, 5]`, ... When integers are finite, then the last bucket has the chance to be "incomplete": `[uint.max - 3, uint.max - 2, uint.max - 1]`, `[uint.max]` ... (Uh oh, the last bucket only has 1!). The issue here is that _every_ bucket maps _completely_ to the first bucket except for that last one. The last one doesn't have corresponding mappings to 1 or 2, in this case, which makes it unfair.
So, the answer is to simply "reroll" if you're in that last bucket, since it's the only unfair one. Eventually you'll roll into a fair bucket. Simply, instead of the meaning of the last bucket being "maps to `[0]`", it changes to "maps to `[0, 1, 2]`", which is precisely what we want.
To generalize, `upperDist` represents the size of our buckets (and, thus, the exclusive upper bound for our desired uniform number). `rnum` is a uniformly random number picked from the space of integers that a computer can hold (we'll say `UpperType` represents that type).
We'll first try to do the mapping into the first bucket by doing `offset = rnum % upperDist`. We can figure out the position of the front of the bucket we're in by `bucketFront = rnum - offset`.
If we start at `UpperType.max` and walk backwards `upperDist - 1` spaces, then the space we land on is the last acceptable position where a full bucket can fit:
```
bucketFront UpperType.max
v v
[..., 0, 1, 2, ..., upperDist - 1]
^~~ upperDist - 1 ~~^
```
If the bucket starts any later, then it must have lost at least one number and at least that number won't be represented fairly.
```
bucketFront UpperType.max
v v
[..., upperDist - 1, 0, 1, 2, ..., upperDist - 2]
^~~~~~~~ upperDist - 1 ~~~~~~~^
```
Hence, our condition to reroll is
`bucketFront > (UpperType.max - (upperDist - 1))`
---
OK, I think that about covers everything I've done on this. I hope that wasn't too much, but I did want to document the work put into developing this.
--
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
|