March 23, 2012
On 23 March 2012 17:53, Juan Manuel Cabo <juanmanuel.cabo@gmail.com> wrote:
> But I think the most important change is that I'm now showing
> the 95% and 99% confidence intervals. (For the confidence intervals
> to mean anything, please everyone, remember to control
> your variables (don't defrag and benchmark :-) !!) so that apples
> are still apples and don't become oranges, and make sure N>30).
>
> More info on histogram and confidence intervals in the usage help.

Dude, this is awesome. I tend to just use time, but if I were doing anything more complicated, I'd use this. I would suggest changing the name while you still can: avgtime is not that informative a name, given that it now does more than just "Average" times.

--
James Miller
March 23, 2012
On Friday, 23 March 2012 at 05:16:20 UTC, Andrei Alexandrescu wrote:
[.....]
>> (man, the gaussian curve is everywhere, it never ceases to
>> perplex me).
>
> I'm actually surprised. I'm working on benchmarking lately and the distributions I get are very concentrated around the minimum.
>
> Andrei


Well, the shape of the curve depends a lot on
how the random noise gets into the measurement.

I like  'ls -lR'  because the randomness comes
from everywhere, and it's quite bell shaped.
I guess there is a lot of I/O mess: even if the
I/O is all cached, there are plenty of opportunities
for kernel mutexes to mess everything up.

When testing "/bin/sleep 0.5", the histogram
is pretty boring.

And I guess that when testing something that's purely
CPU bound and doesn't make too many syscalls,
the shape is concentrated in a few values.


On the other hand, I'm sometimes getting weird bimodal
(two-peaked) curves, like the one I put in the README.md.
It's definitely my laptop's CPU throttling, because the
second peak went away when I disabled it. For the curious,
on 64-bit Ubuntu, here is a way to disable throttling
(WARNING: the CPU might get hot until you undo this or reboot):

echo 1600000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq

echo 1600000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq

(Yes, my CPU is 1.6GHz, but it rocks.)


--jm



March 23, 2012
On Thursday, 22 March 2012 at 17:13:58 UTC, Manfred Nowak wrote:
> Juan Manuel Cabo wrote:
>
>> like the unix 'time' command
>
> `version linux' is missing.
>
> -manfred


Linux only for now. I'll make it work on Windows this weekend.

I hope that's what you meant.

--jm


March 23, 2012
On Friday, 23 March 2012 at 06:51:48 UTC, James Miller wrote:

> Dude, this is awesome.

Thanks!! I appreciate your feedback!

> I would suggest changing the name while you still can.

Suggestions welcome!!

--jm

March 23, 2012
On Friday, 23 March 2012 at 05:51:40 UTC, Manfred Nowak wrote:
>
> | For samples, if it is known that they are drawn from a symmetric
> | distribution, the sample mean can be used as an estimate of the
> | population mode.

I'm not printing the population mode, I'm printing the 'sample mode'.
It has a very clear meaning: the most frequent value. To get
frequencies, I group values into 'bins' by precision: 12.345 and
12.3111 will both go into the 12.3 bin.
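
To illustrate, a minimal sketch of that binning in D (the names
and the hardcoded precision are made up, not the actual avgtime
code):

import std.math : floor;

// Map a sample to its bin: with precision 0.1, both 12.345
// and 12.3111 land in the 12.3 bin.
double binOf(double x, double precision = 0.1)
{
    return floor(x / precision) * precision;
}

// Sample mode: the bin with the highest frequency.
double sampleMode(double[] samples)
{
    int[double] freq;
    foreach (x; samples)
        freq[binOf(x)]++;
    double best = double.nan;
    int bestCount = 0;
    foreach (bin, count; freq)
        if (count > bestCount) { best = bin; bestCount = count; }
    return best;
}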

>
> and the program computes the variance as if the values of the sample
> follow a normal distribution, which is symmetric.

This program doesn't compute the variance. Maybe you are talking
about another program. This program computes the standard deviation
of the sample. The sample doesn't need to be of any particular
distribution to have a standard deviation. It is not a distribution
parameter, it is a statistic.
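
For the record, here's the kind of thing I mean, sketched in D
(made-up names, not the actual avgtime code):

import std.algorithm : map, sum;
import std.math : sqrt;

// The standard deviation as a plain statistic of the sample;
// no distribution is assumed. The n-1 variant is the 'sample'
// standard deviation mentioned in the PS below.
double stdDev(double[] xs, bool useN1 = false)
{
    immutable n = xs.length;
    immutable mean = xs.sum / n;
    immutable ss = xs.map!(x => (x - mean) * (x - mean)).sum;
    return sqrt(ss / (useN1 ? n - 1 : n));
}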

> Therefore the mode of the sample is of interest only, when the variance
> is calculated wrongly.

???

The 'sample mode', 'median' and 'average' can quickly tell you
something about the shape of the histogram, without
looking at it.
If the three coincide, then maybe you are in normal distribution land.

The only place where I assume normal distribution is for the
confidence intervals. And it's in the usage help.

If you want to support estimating the parameters of weird
probability distributions, forks and pull requests are
welcome. Rewrites too. Good luck detecting distribution
shapes!!!!  ;-)



PS: I should use Student's t to make the confidence intervals,
and for computing that I should use the sample standard
deviation (dividing by n-1), but that is a completely different
story. The z normal approximation with n>30 is quite good.
(I would have to embed a table of Student's t tail factors,
pull reqs welcome.)

PS2: I've now fixed the confusion between the confidence interval
of the variable and the confidence interval of the average (mu);
I simply show both now (release 0.4).
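
In case it helps, the two intervals sketched in D under the
normality assumption (z value from the standard normal table;
names are made up):

import std.math : sqrt;

// Two-sided 95% quantile of the standard normal.
enum z95 = 1.959964;

// Where ~95% of individual runs are expected to fall.
void variableInterval(double mean, double sd,
                      out double lo, out double hi)
{
    lo = mean - z95 * sd;
    hi = mean + z95 * sd;
}

// 95% confidence interval for the average (mu) itself;
// it shrinks as n grows.
void meanInterval(double mean, double sd, size_t n,
                  out double lo, out double hi)
{
    immutable halfWidth = z95 * sd / sqrt(cast(double) n);
    lo = mean - halfWidth;
    hi = mean + halfWidth;
}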

PS3: Statistics estimate distribution parameters.

--jm



March 23, 2012
On 23 March 2012 21:37, Juan Manuel Cabo <juanmanuel.cabo@gmail.com> wrote:
> PS: I should use Student's t to make the confidence intervals,
> and for computing that I should use the sample standard
> deviation (dividing by n-1), but that is a completely different
> story. The z normal approximation with n>30 is quite good.
> (I would have to embed a table of Student's t tail factors,
> pull reqs welcome.)

If it's possible to calculate it, then you can generate the table at compile time using CTFE. Less error-prone, and with controllable accuracy.
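
Something like this, perhaps (a sketch: the quantile here is a
rough asymptotic expansion around the normal z value, decent only
for moderate nu; a real table would plug in an exact routine):

// Two-sided 95% Student t critical value, via a rough
// asymptotic expansion around the normal quantile.
double t95(int nu)
{
    enum z = 1.959964;
    immutable z3 = z * z * z;
    immutable z5 = z3 * z * z;
    return z + (z3 + z) / (4.0 * nu)
             + (5 * z5 + 16 * z3 + 3 * z) / (96.0 * nu * nu);
}

double[30] makeT95Table()
{
    double[30] t;
    foreach (nu; 1 .. 31)
        t[nu - 1] = t95(nu);
    return t;
}

// Evaluated entirely at compile time via CTFE.
enum double[30] t95Table = makeT95Table();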

--
James Miller
March 23, 2012
On 23/03/12 09:37, Juan Manuel Cabo wrote:
> [...]
> PS: I should use Student's t to make the confidence intervals,
> and for computing that I should use the sample standard
> deviation (dividing by n-1), but that is a completely different
> story. The z normal approximation with n>30 is quite good.
> (I would have to embed a table of Student's t tail factors,
> pull reqs welcome.)

No, it's easy. Student t is in std.mathspecial.



March 23, 2012
On 23/03/12 11:20, Don Clugston wrote:
> On 23/03/12 09:37, Juan Manuel Cabo wrote:
>> [...]
>
> No, it's easy. Student t is in std.mathspecial.

Aargh, I didn't get around to copying it in. But this should do it.

import std.math : fabs, sqrt;
import std.mathspecial : betaIncompleteInv;

/** Inverse of Student's t distribution
 *
 * Given probability p and degrees of freedom nu,
 * finds the argument t such that the one-sided
 * studentsDistribution(nu,t) is equal to p.
 *
 * Params:
 * nu = degrees of freedom. Must be >= 1
 * p  = probability. 0 <= p <= 1
 */
real studentsTDistributionInv(int nu, real p )
in {
   assert(nu>0);
   assert(p>=0.0L && p<=1.0L);
}
body
{
    if (p==0) return -real.infinity;
    if (p==1) return real.infinity;

    real rk, z;
    rk =  nu;

    if ( p > 0.25L && p < 0.75L ) {
        if ( p == 0.5L ) return 0;
        z = 1.0L - 2.0L * p;
        z = betaIncompleteInv( 0.5L, 0.5L*rk, fabs(z) );
        real t = sqrt( rk*z/(1.0L-z) );
        if( p < 0.5L )
            t = -t;
        return t;
    }
    int rflg = -1; // sign of the result
    if (p >= 0.5L) {
        p = 1.0L - p;
        rflg = 1;
    }
    z = betaIncompleteInv( 0.5L*rk, 0.5L, 2.0L*p );

    if (z<0) return rflg * real.infinity;
    return rflg * sqrt( rk/z - rk );
}
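
With that, the t-based 95% interval from the PS would look roughly
like this (a sketch, using the sample standard deviation, i.e. the
n-1 variant):

import std.math : sqrt;

// 95% two-sided confidence interval for the mean, using
// Student's t with n-1 degrees of freedom.
void tMeanInterval(double mean, double sampleSd, int n,
                   out real lo, out real hi)
{
    // two-sided 95% => one-sided probability 0.975
    immutable real t = studentsTDistributionInv(n - 1, 0.975L);
    immutable real halfWidth = t * sampleSd / sqrt(cast(real) n);
    lo = mean - halfWidth;
    hi = mean + halfWidth;
}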
March 23, 2012
On 3/23/12 12:51 AM, Manfred Nowak wrote:
> Andrei Alexandrescu wrote:
>
>> You may want to also print the mode of the distribution,
>> nontrivial but informative
>
> In case of this implementation and according to the given link: trivial
> and noninformative, because
>
> | For samples, if it is known that they are drawn from a symmetric
> | distribution, the sample mean can be used as an estimate of the
> | population mode.
>
> and the program computes the variance as if the values of the sample
> follow a normal distribution, which is symmetric.
>
> Therefore the mode of the sample is of interest only, when the variance
> is calculated wrongly.

Again, benchmarks I've seen are always asymmetric. Not sure why those shown here are symmetric. The mode should be very close to the minimum (and in fact I think taking the minimum is a pretty good approximation of the sought-after time).

Andrei


March 23, 2012
On 3/23/12 3:02 AM, Juan Manuel Cabo wrote:
> On Friday, 23 March 2012 at 05:16:20 UTC, Andrei Alexandrescu wrote:
> [.....]
>>> (man, the gaussian curve is everywhere, it never ceases to
>>> perplex me).
>>
>> I'm actually surprised. I'm working on benchmarking lately and the
>> distributions I get are very concentrated around the minimum.
>>
>> Andrei
>
>
> Well, the shape of the curve depends a lot on
> how the random noise gets inside the measurement.
[snip]

Hmm, well the way I see it, the observed measurements have the following composition:

X = T + Q + N

where T > 0 (a constant) is the "real" time taken by the processing, Q > 0 is the quantization noise caused by the limited resolution of the clock (can be considered 0 if the resolution is much smaller than the actual time), and N is noise caused by a variety of factors (other processes, throttling, interrupts, networking, memory hierarchy effects, and many more). The challenge is estimating T given a bunch of X samples.

N can probably be approximated by a Gaussian, although for short timings I noticed it's more like bursts that just cause outliers. But note that N is always positive (therefore not 100% Gaussian), i.e. there's no way to insert some noise that makes the code seem artificially faster. It's all additive.

Taking the mode of the distribution will estimate T + mode(N), which is informative because after all there's no way to eliminate noise. However, if the focus is improving T, we want an estimate as close to T as possible. In the limit, taking the minimum over infinitely many measurements of X would yield T.
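
For instance, a quick toy check of that argument (the constants
and the noise model are made up):

import std.random : uniform;
import std.stdio : writefln;

void main()
{
    enum T = 100.0; // the "real" time
    enum n = 10_000;
    double minX = double.max, sumX = 0;
    foreach (i; 0 .. n)
    {
        // X = T + N with N >= 0: noise only ever adds time.
        immutable x = T + uniform(0.0, 20.0);
        if (x < minX) minX = x;
        sumX += x;
    }
    // minX approaches T from above; the mean estimates
    // T + E[N] (about 110 here), not T.
    writefln("min = %s  mean = %s", minX, sumX / n);
}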


Andrei