Re: [OT] Generating distribution of N dice rolls
November 10, 2022
On Thu, Nov 10, 2022 at 11:45:50AM +0100, Timon Gehr via Digitalmars-d wrote: [...]
> The distribution of the output is multinomial(n,k,[1/k,1/k,...,1/k]), i.e., n trials, k events, and the events follow a uniform distribution.  https://en.wikipedia.org/wiki/Multinomial_distribution

I have to admit, the linked article is way above my head.  I just don't grok probability. :-(


> It is at least as hard to sample as binomial(n,1/k), because that is the marginal distribution of each component of the result.
[...]

Hmm.  I'm thinking one way to approximate this, if there's a fast way to compute the standard deviation of the multinomial, is to use the Box-Muller transform to do the initial population of the result array, then distribute the remaining (N - sum(array)) elements by running individual trials and adjusting the array accordingly.  The initial population should bring us reasonably close to the mean, so (N - sum(array)) should be small enough that running individual trials won't be excessively expensive.

Or if (N - sum(array)) is still large, maybe do another estimate using
the standard deviation of the remaining trials to reduce it, and then
run the remaining trials manually?
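
(For reference, the slow baseline this is meant to speed up is presumably just the obvious per-roll loop; a minimal sketch of my own, with the name naiveDiceDistrib made up for illustration:)

import std.random : uniform;

// The obvious O(N) baseline: one RNG call per die roll.
uint[] naiveDiceDistrib(uint k, uint N)
{
    auto result = new uint[k];
    foreach (i; 0 .. N)
        result[uniform(0, k)]++;    // tally one roll of a k-sided die
    return result;
}

For N in the hundreds of millions, those N calls to uniform() are exactly the cost the approximation is trying to avoid.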


T

-- 
Doubt is a self-fulfilling prophecy.
November 10, 2022
On 11/10/22 08:57, H. S. Teoh wrote:

> if there's a fast way to
> compute the standard deviation of the multinomial

I am happy with my solution which includes scientific terms like normalDistributianness. :p

import std;

uint[k] diceDistrib(uint k)(uint N)
in(k > 0)
in(N > 0)
out(r; r[].sum == N)
{
    uint[k] result;
    const average = N / k;

    // foreach (i; 0 .. N) {
    //     result[uniform(0, k)]++;
    // }

    // The larger this value, the closer the results are to a normal distribution
    enum normalDistributianness = 1000;

    // Half-width of the window around the average to pick random values within
    const halfWindow = average / 2;

    foreach (ref r; result) {
        r = iota(normalDistributianness)
            .map!(_ => uniform(average - halfWindow,
                               average + halfWindow + 1))
            .mean
            .to!uint;
    }

    uint total = result[].sum;

    while (true) {
        if (total == N) {
            break;
        }

        auto which = &result[uniform(0, k)];

        if (total < N) {
            ++(*which);
            ++total;

        } else if (total > N) {
            --(*which);
            --total;
        }
    }

    return result;
}

void main() {
    writeln(diceDistrib!6(100_000_000));
}

Error checking etc. are missing, but a hundred million dice throws in a split second... Now that's fast! :p

Ali

November 10, 2022

On Thursday, 10 November 2022 at 17:46:59 UTC, Ali Çehreli wrote:
> On 11/10/22 08:57, H. S. Teoh wrote:
> > if there's a fast way to
> > compute the standard deviation of the multinomial
>
> I am happy with my solution which includes scientific terms like normalDistributianness. :p
>
> [...]
>
> Error checking etc. are missing, but a hundred million dice throws in a split second... Now that's fast! :p

That's fast, but unfortunately doesn't seem to be correct. Your code outputs something like this:

[16548205, 16741274, 16600777, 16792301, 16585808, 16731635]

While H. S. Teoh's original slow reference implementation outputs:

[16669388, 16665882, 16668019, 16671186, 16664718, 16660807]

It's easy to see even with the naked eye that the statistical properties are obviously not the same (your implementation has a significantly higher variance). And for comparison, a [Python/SciPy implementation](https://github.com/scipy/scipy/issues/17388#issuecomment-1310913940) outputs:

[16664115, 16666144, 16677962, 16660692, 16663181, 16667906]
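
To put a number on that (a quick sketch of my own, not part of the original comparison): the sample standard deviation of the six counts should be close to the theoretical sqrt((N/k)*(1 - 1/k)) ~ 3727 for N = 100_000_000 and k = 6:

import std;

double stdDev(const int[] a)
{
    immutable m = a.mean;                       // sample mean
    return a.map!(x => (x - m) ^^ 2).mean.sqrt; // root-mean-square deviation
}

void main()
{
    // ~92_000 for the output above -- more than 20x too wide
    writeln(stdDev([16548205, 16741274, 16600777, 16792301, 16585808, 16731635]));
    // ~3_400 for the reference output -- right in line with theory
    writeln(stdDev([16669388, 16665882, 16668019, 16671186, 16664718, 16660807]));
}
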
November 10, 2022
On 11/10/22 13:45, Siarhei Siamashka wrote:

> (your implementation has a significantly
> higher variance)

Agreed. I didn't think the following "random" values would magically be correct. :)

    // The larger this value, the closer the results are to a normal distribution
    enum normalDistributianness = 1000;

    // Half-width of the window around the average to pick random values within
    const halfWindow = average / 2;

I was half joking anyway but the values 4_000 and 'average / 20', respectively, produce closer values:

[16663617, 16663620, 16662052, 16671587, 16663818, 16675306]

> . And for comparison, a [Python/SciPy
> implementation](https://github.com/scipy/scipy/issues/17388#issuecomment-1310913940) outputs:
>
>      [16664115, 16666144, 16677962, 16660692, 16663181, 16667906]

But I am still half joking. :)

Ali

November 10, 2022
On Thu, Nov 10, 2022 at 09:45:56PM +0000, Siarhei Siamashka via Digitalmars-d wrote:
> On Thursday, 10 November 2022 at 17:46:59 UTC, Ali Çehreli wrote:
> > On 11/10/22 08:57, H. S. Teoh wrote:
> > 
> > > if there's a fast way to compute the standard deviation of the multinomial

According to the Wikipedia page on the multinomial distribution (linked by
Timon), the variance of X_i for n rolls of a k-sided die (with probability
p_i), where i is a specific outcome, is:

	Var(X_i) = n*p_i*(1 - p_i)

I don't really understand where this formula comes from (as I said, that page is way above my head), but we can make use of it.  In this case, p_i = 1/k since we're assuming unbiased dice rolls.  So this reduces to:

	Var(X_i) = (n/k)*(1 - 1/k)

Now, since the standard deviation is just the square root of the variance, we have:

	sigma = sqrt((n/k)*(1 - 1/k))

which is easy to compute.

Furthermore, the mean is just:

	E(X_i) = n*p_i = n/k
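
To put concrete numbers on these (my own back-of-the-envelope, using the n and k that come up later in this thread):

	import std.math : sqrt;
	double n = 100_000_000, k = 6;
	auto mean  = n / k;                        // ~16_666_667
	auto sigma = sqrt(mean * (1.0 - 1.0/k));   // ~3726.8

so each of the k counts should land within a few thousand of the mean.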

We can then plug this into the Box-Muller transform (which I found on
Wikipedia :-P):

	import std.random, std.math;
	const mean = cast(double)n / k;  // cast avoids integer division
	const sigma = sqrt(mean*(1.0 - 1.0/k));
	auto u1 = uniform01();
	auto u2 = uniform01();
	auto mag = sigma*sqrt(-2*log(u1));
	auto z0 = mag * cos(2*PI*u2) + mean;
	auto z1 = mag * sin(2*PI*u2) + mean;

which sets z0 and z1 to be two random variables having mean n/k and variance Var(X_i).  So they can be used to populate two elements of the resulting array, and should have the correct distribution.

Repeat this procedure until all array elements are populated. (If the output array length is odd, i.e., k is odd, we can just discard the last z1.)  Then compute the difference d between the sum of the array and n (the expected sum), which should be relatively small, and run d iterations of individual dice rolls and adjust the array (++ if the sum is too small, or -- if the sum is too large) to make it sum exactly to n.


> > I am happy with my solution which includes scientific terms like normalDistributianness. :p
> > 
> > [...]
> > 
> > Error checking etc. are missing, but a hundred million dice throws in a split second... Now that's fast! :p
> 
> That's fast, but unfortunately doesn't seem to be correct. Your code outputs something like this:
> 
>     [16548205, 16741274, 16600777, 16792301, 16585808, 16731635]
> 
> While H. S. Teoh's original slow reference implementation outputs:
> 
>     [16669388, 16665882, 16668019, 16671186, 16664718, 16660807]
> 
> It's easy to see even with the naked eye that the statistical properties
> are obviously not the same (your implementation has a significantly
> higher variance). And for comparison, a [Python/SciPy
> implementation](https://github.com/scipy/scipy/issues/17388#issuecomment-1310913940)
> outputs:
> 
>     [16664115, 16666144, 16677962, 16660692, 16663181, 16667906]
[...]

I implemented what I described above, and got very good-looking results.

Output:
-----------
1 modules passed unittests
[16666837, 16666500, 16664692, 16666306, 16663568, 16672097] sum=100000000
[16661247, 16666674, 16668562, 16673476, 16665867, 16664174] sum=100000000
[16663070, 16665221, 16664850, 16668355, 16668246, 16670258] sum=100000000
[16671423, 16665663, 16667209, 16669164, 16663916, 16662625] sum=100000000
[16666194, 16669936, 16662697, 16666535, 16666671, 16667967] sum=100000000
[16665977, 16667094, 16662753, 16670737, 16666762, 16666677] sum=100000000
[16665518, 16666940, 16671829, 16664487, 16665344, 16665882] sum=100000000
[16658486, 16666671, 16667778, 16665901, 16667471, 16673693] sum=100000000
[16668149, 16669431, 16661462, 16668241, 16668286, 16664431] sum=100000000
[16660879, 16671401, 16659154, 16671245, 16667720, 16669601] sum=100000000
[16665213, 16670794, 16665401, 16667548, 16663484, 16667560] sum=100000000
[16663091, 16671196, 16669946, 16670791, 16661162, 16663814] sum=100000000
[16664305, 16670070, 16662123, 16663467, 16673867, 16666168] sum=100000000
[16664395, 16667406, 16657623, 16673838, 16670227, 16666511] sum=100000000
[16671401, 16662181, 16672923, 16657766, 16674160, 16661569] sum=100000000
[16663059, 16669574, 16662338, 16668319, 16670541, 16666169] sum=100000000
[16672366, 16662296, 16666526, 16664284, 16669201, 16665327] sum=100000000
[16665787, 16669038, 16666956, 16668728, 16660096, 16669395] sum=100000000
[16669162, 16669034, 16662788, 16663672, 16669127, 16666217] sum=100000000
[16667016, 16669649, 16661551, 16670882, 16668304, 16662598] sum=100000000
-----------

(I've no idea how accurate it is, but visual inspection certainly indicates that it's doing the right thing. And it's lightning-fast. :-P)


Code (compile with `dmd -unittest -main`):
------------------------------------------
import std;

/// Box-Muller transform.
double[2] gaussianPoint(double[2] mean, double deviation)
{
    import std.math : cos, log, sin, sqrt, PI, round;
    import std.random : uniform01;

    auto u = uniform01();
    auto v = uniform01();
    auto x0 = sqrt(-2.0*log(u)) * cos(2.0*PI*v);
    auto y0 = sqrt(-2.0*log(u)) * sin(2.0*PI*v);

    double[2] result;
    result[0] = mean[0] + cast(int)round(deviation * x0);
    result[1] = mean[1] + cast(int)round(deviation * y0);
    return result;
}

/// Roll k-sided dice N times and tally each output.
int[] diceDistrib(int k, int N)
    in (k > 0)
    in (N > 0)
    out (r; r.sum == N)
{
    import std.math : sqrt, round;
    import std.range : chunks;
    import std.random : uniform;

    const mean = cast(double)N / k;
    const deviation = sqrt(mean * (1.0 - 1.0 / k));
    double[2] means = [ mean, mean ];

    // Populate output array with initial values with the correct mean and
    // standard deviation.
    auto result = new int[k];
    foreach (chunk; result.chunks(2))
    {
        auto z = gaussianPoint(means, deviation);
        chunk[0] = cast(int) round(z[0]);
        if (chunk.length > 1)
            chunk[1] = cast(int) round(z[1]);
    }

    // Tweak resulting array until it sums exactly to N.
    auto total = result.sum;
    while (total != N)
    {
        auto i = uniform(0, k);
        if (total < N)
        {
            result[i]++;
            total++;
        }
        else
        {
            result[i]--;
            total--;
        }
    }

    return result;
}

unittest
{
    foreach (i; 0 .. 20)
    {
        import std.stdio;
        auto N = 100_000_000;
        auto dist = diceDistrib(6, N);
        writefln("%s sum=%d", dist, dist[].sum);
        assert(dist.sum == N);
    }
}
------------------------------------------


Now, just for fun, I added a writeln before the `while (total != N)` to print out just how big of a discrepancy to N the array sum is.  Turns out, quite a bit: for N = 100_000_000 as above, the discrepancies range from approximately -25000 to 25000, with the typical discrepancy being about 3-4 digits long, which means that the code is spending quite a bit of time in that final loop.  Which suggests that a possible improvement is to recursively run the initial approximation step, setting sub_N = (N - array.sum).

I'll leave that to a future iteration, though.  Being able to compute a hundred million dice rolls in a split second is already good enough for what I need. :-D


T

-- 
Answer: Because it breaks the logical sequence of discussion. / Question: Why is top posting bad?
November 11, 2022

On Thursday, 10 November 2022 at 22:18:47 UTC, Ali Çehreli wrote:
> I was half joking anyway but the values 4_000 and 'average / 20', respectively, produce closer values:
>
> [16663617, 16663620, 16662052, 16671587, 16663818, 16675306]

Yes, this particular result looks okish (when tested by the XNomial R package):

P value (LLR) = 0.11807 ± 0.00102

You can modify your code this way:

void main() {
    auto results = diceDistrib!6(100_000_000);
    auto expected_probabilities = [1.0 / 6].replicate(6);
    writefln("# Paste this to https://rdrr.io/cran/XNomial/");
    writefln("library(XNomial)");
    writefln("experimental_results <- c(%(%s, %))", results);
    writefln("expected_probabilities <- c(%(%.18f, %))", expected_probabilities);
    writefln("xmonte(experimental_results, expected_probabilities)");
}

And then paste its output to https://rdrr.io/cran/XNomial/ to calculate p-value. If the reported p-value is smaller than 0.05, then we can be 95% confident that the generator is working incorrectly. A p-value higher than 0.05 means that the test can't see anything wrong, but this doesn't guarantee anything and isn't a proof of correctness.

Even a correct generator can sporadically have p-values smaller than 0.05, but this happens only in roughly 1 out of 20 runs (in other words, there's a 5% false positive chance). This situation is explained by https://xkcd.com/882/
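
For a rougher but self-contained check that stays in D, one can also compute Pearson's chi-squared statistic by hand (a minimal sketch of my own; note this is a simpler test than the exact log-likelihood-ratio test that xmonte performs):

double chiSquared(const uint[] counts)
{
    import std.algorithm : map, sum;
    const n = counts.sum;
    const expected = cast(double) n / counts.length;
    // Sum of (observed - expected)^2 / expected over all k outcomes.
    return counts.map!(c => (c - expected) ^^ 2 / expected).sum;
}

For k = 6 the statistic approximately follows a chi-squared distribution with 5 degrees of freedom, so values far above ~11.07 (its 95th percentile) would be suspicious.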

November 10, 2022
On Thu, Nov 10, 2022 at 03:15:24PM -0800, H. S. Teoh via Digitalmars-d wrote: [...]
> Now, just for fun, I added a writeln before the `while (total != N)` to print out just how big of a discrepancy to N the array sum is. Turns out, quite a bit: for N = 100_000_000 as above, the discrepancies range from approximately -25000 to 25000, with the typical discrepancy being about 3-4 digits long, which means that the code is spending quite a bit of time in that final loop.  Which suggests that a possible improvement is to recursively run the initial approximation step, setting sub_N = (N - array.sum).
> 
> I'll leave that to a future iteration, though.  Being able to compute a hundred million dice rolls in a split second is already good enough for what I need. :-D
[...]

OK, I couldn't resist, the idea is so tempting.  So I made a new implementation that iteratively uses the Box-Muller transform to close the gap between array.sum and N, resorting to individual rolls only when the difference is < k.  For N = 100_000_000, it typically takes about 3-5 iterations to bring the difference down to < k, so the entire algorithm takes about k+5 iterations to compute the result.

Now, in theory, I could just run the Box-Muller estimate until the difference is 0, but I found that once the difference is small, it tends to bounce around 0 for several iterations before converging on 0.  So I arbitrarily decided to stop when the difference is < k, and use individual rolls to do the rest.

One wrinkle that came up is that for small values of N, sometimes the result array can end up with negative elements, either due to overcompensation during the iterative Box-Muller stage (the discrepancy is negative and the selected z values happen to be larger than the current array element), or during the final adjustment loop (it picks an array element that's already 0 and tries to decrement it). So I had to insert extra checks to discard generated z values if they would cause the result array to have negative counts.  Generally, this happens only for small values of N; for large N the initial estimate is large enough that it's extremely unlikely for a later adjustment to overshoot into negative values.  To ensure the output is never negative, I added another out-contract to check this.


Code:
------------
import std.algorithm;
import std.conv : text;

double[2] gaussianPoint(double[2] mean, double deviation)
{
    import std.math : cos, log, sin, sqrt, PI, round;
    import std.random : uniform01;

    auto u = uniform01();
    auto v = uniform01();
    auto x0 = sqrt(-2.0*log(u)) * cos(2.0*PI*v);
    auto y0 = sqrt(-2.0*log(u)) * sin(2.0*PI*v);

    double[2] result;
    result[0] = mean[0] + cast(int)round(deviation * x0);
    result[1] = mean[1] + cast(int)round(deviation * y0);
    return result;
}

/**
 * Simulates rolling N k-sided dice.
 *
 * Returns: A static array representing how many of each of 1..(k+1) were
 * obtained by the rolls.
 *
 * Bugs: The current implementation uses a naïve algorithm that's rather
 * inefficient for large N.  For our purposes, though, which involve only
 * relatively small N, this is Good Enough(tm) for the time being. We can look
 * into improving this if it becomes a performance bottleneck.
 */
int[] diceDistrib(int k, int N)
    in (k > 0)
    in (N > 0)
    out (r; r.sum == N)
    out (r; r.all!(c => c >= 0), r.text)
{
    import std.math : abs, sgn, sqrt, round;
    import std.range : chunks;
    import std.random : uniform;
    debug import std.stdio;

    // Populate output array with initial values with the correct mean and
    // standard deviation.
    auto result = new int[k];
    auto discrepancy = N - result.sum;
    while (abs(discrepancy) > k)
    {
        debug writefln("discrepancy=%d", discrepancy);
        const mean = cast(double)abs(discrepancy) / k;
        const deviation = sqrt(mean * (1.0 - 1.0 / k));
        double[2] means = [ mean, mean ];
        double sign = sgn(discrepancy);

        foreach (chunk; result.chunks(2))
        {
            // Note: `continue` inside this do..while(false) jumps to the
            // (false) condition, i.e., it discards the sampled pair without
            // retrying; the outer while loop compensates on its next pass.
            do
            {
                auto z = gaussianPoint(means, deviation);
                auto c0 = chunk[0] + cast(int) round(sign*z[0]);
                if (c0 < 0)
                    continue;   // discard nonsensical result

                if (chunk.length > 1)
                {
                    auto c1 = chunk[1] + cast(int) round(sign*z[1]);
                    if (c1 < 0)
                        continue;   // discard nonsensical result
                    chunk[1] = c1;
                }
                chunk[0] = c0;
            } while (false);
        }
        discrepancy = N - result.sum;
    }
    debug writefln("final discrepancy=%d", discrepancy);

    // Tweak resulting array until it sums exactly to N.
    auto total = result.sum;
    while (total != N)
    {
        auto i = uniform(0, k);
        if (total < N)
        {
            result[i]++;
            total++;
        }
        else if (result[i] > 0)
        {
            result[i]--;
            total--;
        }
    }

    return result;
}

unittest
{
    import std.stdio;
    foreach (i; 0 .. 5)
    {
        auto N = 1_000_000_000;
        auto dist = diceDistrib(6, N);
        debug writefln("%s sum=%d", dist, dist[].sum);
        assert(dist.sum == N);
    }

    foreach (i; 0 .. 5)
    {
        auto N = 10;
        auto dist = diceDistrib(6, N);
        debug writefln("%s sum=%d", dist, dist[].sum);
        assert(dist.sum == N);
    }
}
------------

Typical output:
------------
1 modules passed unittests
discrepancy=1000000000
discrepancy=-94251
discrepancy=287
final discrepancy=5
[166665591, 166651866, 166681043, 166654104, 166673559, 166673837] sum=1000000000
discrepancy=1000000000
discrepancy=20314
discrepancy=-88
discrepancy=-9
final discrepancy=6
[166668490, 166665588, 166648866, 166658164, 166684475, 166674417] sum=1000000000
discrepancy=1000000000
discrepancy=14062
discrepancy=18
final discrepancy=5
[166658497, 166674399, 166652111, 166667954, 166672801, 166674238] sum=1000000000
discrepancy=1000000000
discrepancy=15926
discrepancy=-175
discrepancy=-19
final discrepancy=-4
[166664463, 166659610, 166655243, 166678783, 166673507, 166668394] sum=1000000000
discrepancy=1000000000
discrepancy=39916
discrepancy=201
discrepancy=-16
final discrepancy=2
[166666774, 166663400, 166672276, 166686269, 166652312, 166658969] sum=1000000000
discrepancy=10
final discrepancy=-1
[2, 2, 2, 3, 0, 1] sum=10
discrepancy=10
final discrepancy=-5
[2, 1, 3, 2, 1, 1] sum=10
discrepancy=10
final discrepancy=-1
[1, 0, 1, 1, 4, 3] sum=10
discrepancy=10
final discrepancy=-3
[3, 2, 2, 1, 0, 2] sum=10
discrepancy=10
final discrepancy=-2
[1, 1, 2, 2, 2, 2] sum=10
------------

Note that the last 5 outputs are for a different test case (N=10), just to make sure that we don't produce nonsensical outputs when N is small.


T

-- 
Leather is waterproof.  Ever see a cow with an umbrella?
November 10, 2022
On Thu, Nov 10, 2022 at 05:27:07PM -0800, H. S. Teoh via Digitalmars-d wrote: [...]
> /**
>  * Simulates rolling N k-sided dice.
>  *
>  * Returns: A static array representing how many of each of 1..(k+1) were
>  * obtained by the rolls.
>  *
>  * Bugs: The current implementation uses a naïve algorithm that's rather
>  * inefficient for large N.  For our purposes, though, which involve only
>  * relatively small N, this is Good Enough(tm) for the time being. We can look
>  * into improving this if it becomes a performance bottleneck.
>  */
[...]

Argh, that last paragraph should be deleted. :-(

See, this is proof that documentation goes out of sync with the code. We need a new innovation, along the lines of built-in unittests, to revolutionize keeping docs up-to-date with the code.


T

-- 
The best compiler is between your ears. -- Michael Abrash
November 11, 2022

On Friday, 11 November 2022 at 01:47:54 UTC, H. S. Teoh wrote:
> On Thu, Nov 10, 2022 at 05:27:07PM -0800, H. S. Teoh via Digitalmars-d wrote: [...]
> > /**
> >  * Simulates rolling N k-sided dice.
> >  *
> >  * Returns: A static array representing how many of each of 1..(k+1) were
> >  * obtained by the rolls.
> >  *
> >  * Bugs: The current implementation uses a naïve algorithm that's rather
> >  * inefficient for large N.  For our purposes, though, which involve only
> >  * relatively small N, this is Good Enough(tm) for the time being. We can look
> >  * into improving this if it becomes a performance bottleneck.
> >  */
> [...]
>
> Argh, that last paragraph should be deleted. :-(
>
> See, this is proof that documentation goes out of sync with the code. We need a new innovation, along the lines of built-in unittests, to revolutionize keeping docs up-to-date with the code.
>
> T

Built-in AI model? (I joke, but this is probably the future; whoever gets it right will make a lot of money.)

Describing what code *does* is actually already relatively doable using contemporary AI techniques, but that isn't documentation (or at least isn't good documentation; I don't know that `x + y` adds two numbers together).

November 10, 2022
On Fri, Nov 11, 2022 at 03:34:31AM +0000, Max Haughton via Digitalmars-d wrote:
> On Friday, 11 November 2022 at 01:47:54 UTC, H. S. Teoh wrote:
[...]
> > See, this is proof that documentation goes out of sync with the code. We need a new innovation, along the lines of built-in unittests, to revolutionize keeping docs up-to-date with the code.
[...]
> Built-in AI model? (I joke, but this is probably the future; whoever gets it right will make a lot of money.)
> 
> Describing what code *does* is actually already relatively doable using contemporary AI techniques, but that isn't documentation (or at least isn't good documentation; I don't know that `x + y` adds two numbers together).

I don't think AI will ever be able to replace real documentation. It can describe what the code does, but like you said, that isn't good documentation.  I mean, I already know `x + y` adds two numbers (if I didn't, I shouldn't be touching the code in the first place). The AI saying this for me doesn't really add any value.  What makes good documentation is to explain *why* I'm adding two numbers.  I don't think any AI will be able to divine my intention in adding two numbers; that has to be written by the programmer.

One of the factors that make D unittests so awesome is the fact that you can put them right in the same source file you're coding in, and in fact, quite often right next to the function it's documenting.  This proximity is what makes people more likely to write unittests, and more inclined to keep them up-to-date (as opposed to, y'know, opening up an external unittest file and commenting out / disabling a troublesome test just to get things to run). Even if somebody `version(none)`'d out a unittest, it's still sitting right there in the source file, sticking out like a sore thumb and begging silently but persistently "fix me, fix me, you know you should!".

One idea that comes to mind w.r.t. documentation is to let the docs be written not only in the comment header, but also inside the function body, next to the block of code responsible for some particular task. Such comments would carry tags to help ddoc figure out where in the doc block each piece goes, so that there is some control over the order of presentation.  But the comment would be sitting next to the offending code, so when somebody changes the code, the comment would be staring right at them: "update me, update me, c'mon I know you know you should!".
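
For instance (completely made-up syntax, just to illustrate the idea; no such ddoc extension exists):

/// Simulates rolling N k-sided dice.
int[] diceDistrib(int k, int N)
{
    /+ ddoc(Details):
       This comment lives next to the code it describes, but the
       hypothetical tag above would lift it into a "Details" section
       of the generated documentation for diceDistrib. +/
    auto result = new int[k];
    // ... rest of the implementation ...
    return result;
}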


T

-- 
Why do conspiracy theories always come from the same people??