September 21, 2012
On 9/19/12 4:12 AM, Thiez wrote:
> On Tuesday, 18 September 2012 at 22:01:30 UTC, Andrei Alexandrescu wrote:
>> After extensive tests with a variety of aggregate functions, I can say
>> firmly that taking the minimum time is by far the best when it comes
>> to assessing the speed of a function.
>
> What if one tries to benchmark a nondeterministic function? In such a
> case one might well be interested in the best run, worst run, and the
> average.

I agree. Currently std.benchmark is not geared for measuring non-deterministic functions.

Andrei
September 21, 2012
On 9/19/12 3:54 PM, Jacob Carlborg wrote:
> On 2012-09-19 11:38, Peter Alexander wrote:
>
>> The problem with slowest is that you end up with the occasional OS
>> hiccup or GC collection which throws the entire benchmark off. I see
>> your point, but unless you can prevent the OS from interrupting, the
>> time would be meaningless.
>
> That's why the average is good to have as well.

The occasional hiccup is often orders of magnitude slower than the rest, which means it will ruin the average. You may have meant "median", which has more merit, but then I'd say why bother - just use the minimum.
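
(To illustrate with made-up numbers: ten samples of 75 us plus a single 5 ms hiccup average out to roughly 520 us, several times the typical run, while both the minimum and the median stay at 75 us.)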

Andrei


September 21, 2012
On 9/19/12 3:59 PM, Graham Fawcett wrote:
> For comparison's sake, the Criterion benchmarking package for Haskell is
> worth a look:
>
> http://www.serpentine.com/blog/2009/09/29/criterion-a-new-benchmarking-library-for-haskell/
>
>
> Criterion accounts for clock-call costs, displays various central
> tendencies, reports outliers (and their significance --- whether the
> variance is significantly affected by the outliers), etc., etc. It's a
> very well conceived benchmarking system, and might well be worth
> stealing from.

Will look into it, thanks.

Andrei
September 21, 2012
On 9/19/12 4:11 PM, "Øivind" wrote:
> New question for you :)
>
> To register benchmarks, the 'scheduleForBenchmarking' mixin inserts a
> shared static initializer into the module. If I have a module A and a
> module B, that both depend on each other, then this will probably not
> work? The runtime will detect the init cycle and fail with the
> following error:
>
> "Cycle detected between modules with ctors/dtors"
>
> Or am I wrong now?

I think you have discovered a major issue. Ideas on how to attack this?

Andrei
September 21, 2012
On 9/20/12 3:42 AM, Manu wrote:
> On 19 September 2012 12:38, Peter Alexander
> <peter.alexander.au@gmail.com> wrote:
>
>         The fastest execution time is rarely useful to me; I'm almost
>         always much more interested in the slowest execution time.
>         In realtime software, the slowest time is often the only important
>         factor; everything must be designed to tolerate this possibility.
>         I can also imagine other situations where multiple workloads are
>         competing for time; the average time may be more useful in that case.
>
>
>     The problem with slowest is that you end up with the occasional OS
>     hiccup or GC collection which throws the entire benchmark off. I see
>     your point, but unless you can prevent the OS from interrupting, the
>     time would be meaningless.
>
>
> So then we need to start getting tricky, and choose the slowest one that
> is not beyond an order of magnitude or so outside the average?

That's exactly where it all starts getting unprincipled. Just use the minimum.

Just. Use. The. Minimum.

Andrei
September 21, 2012
On 21-Sep-12 22:49, David Piepgrass wrote:
>> After extensive tests with a variety of aggregate functions, I can say
>> firmly that taking the minimum time is by far the best when it comes
>> to assessing the speed of a function.

>
> As far as I know, D doesn't offer a sampling profiler, so one might
> indeed use a benchmarking library as a (poor) substitute. So I'd want to
> be able to set up some benchmarks that operate on realistic data, with
> perhaps different data in different runs in order to learn about how the
> speed varies with different inputs (if it varies a lot then I might
> create more benchmarks to investigate which inputs are processed
> quickly, and which slowly.)

Really good profilers are the ones provided by the CPU vendors; see AMD's CodeAnalyst or Intel's VTune. They can even count branch mispredictions, cache misses, etc.
That is certainly outside the charter of this module or, for that matter, of any standard library code.

>
> Some random comments about std.benchmark based on its documentation:
>
> - It is very strange that the documentation of printBenchmarks uses
> neither of the words "average" or "minimum", and doesn't say how many
> trials are done.... I suppose the obvious interpretation is that it only
> does one trial, but then we wouldn't be having this discussion about
> averages and minimums right?

See the algorithm in action here:
https://github.com/D-Programming-Language/phobos/pull/794/files#L2R381

In other words, a function is run 10^n times, with n picked so that the total time is large enough to be a trustworthy measurement. The reported run time is then (total time)/10^n.
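
Roughly, in code (a hand-rolled sketch using std.datetime's StopWatch rather than the actual pull-request code; the 10 ms threshold is made up):

import std.datetime : AutoStart, StopWatch;

// Grow the repetition count by 10x until the total time is long enough to
// trust, then report the average time per call in nanoseconds.
long nsecsPerCall(void delegate() fun, long thresholdNsecs = 10_000_000)
{
    foreach (n; 0 .. 10)                   // try 1, 10, 100, ... repetitions
    {
        immutable reps = 10L ^^ n;
        auto sw = StopWatch(AutoStart.yes);
        foreach (i; 0 .. reps)
            fun();
        sw.stop();
        if (sw.peek().nsecs >= thresholdNsecs)
            return sw.peek().nsecs / reps; // total time / 10^n
    }
    assert(0, "function appears too fast to measure");
}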

> Øivind says tests are run 1000 times...

The above is repeated 1000 times, picking the minimum as the best. Obviously it'd be good to have this configurable.

> but it needs to be configurable per-test (my idea: support a _x1000 suffix
> in function names, or _for1000ms to run the test for at least 1000
> milliseconds; and allow a multiplier when running a group of
> benchmarks, e.g. a multiplier argument of 0.5 means to only run half as
> many trials as usual.) Also, it is not clear from the documentation what
> the single parameter to each benchmark is (define "iterations count".)
>

> - The "benchmark_relative_" feature looks quite useful. I'm also happy
> to see benchmarkSuspend() and benchmarkResume(), though
> benchmarkSuspend() seems redundant in most cases: I'd like to just call
> one function, say, benchmarkStart() to indicate "setup complete, please
> start measuring time now."
>
> - I'm glad that StopWatch can auto-start; but the documentation should
> be clearer: does reset() stop the timer or just reset the time to zero?
> does stop() followed by start() start from zero or does it keep the time
> on the clock? I also think there should be a method that returns the
> value of peek() and restarts the timer at the same time (perhaps stop()
> and reset() should just return peek()?)

It's the same as the usual stopwatch (as in the real hardware thingy). Thus:
- reset just resets the elapsed time to zero
- stop just stops counting
- start just starts counting again (keeping whatever is already on the clock)
- peek imitates taking a look at the numbers on the device ;)
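
A minimal sketch of those semantics, assuming the current std.datetime StopWatch:

import std.datetime : AutoStart, StopWatch;
import std.stdio : writeln;

void main()
{
    auto sw = StopWatch(AutoStart.yes); // starts counting right away
    // ... setup, work, whatever ...
    sw.stop();                          // pauses; the elapsed time is kept
    sw.start();                         // resumes from the kept time
    writeln(sw.peek().usecs, " us");    // read the clock without stopping it
    sw.reset();                         // back to zero
}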

>
> - After reading the documentation of comparingBenchmark and measureTime,
> I have almost no idea what they do.

I think that comparingBenchmark was present in std.datetime and is carried over as is.

-- 
Dmitry Olshansky
September 21, 2012
I'd throw in a request to address the following.

Suppose we have a function F and a set of inputs S that represent different scenarios we optimize for.
What would be interesting is to benchmark all of F(S[i]) as |S| separate functions, greatly saving on boilerplate (and helping readability).

One way would be to allow passing an input range of argument tuples to F.

Say, via a benchmark_data_ prefix convention:

void benchmark_f(int a, double b, string s){ ... }

enum benchmark_data_f = [ tuple(1, 2.0, "hi"), tuple(2, 3.0, "bye") ];

Then the results would look like:
f(1, 2.0, "hi")                <ns/iter>   <iter/s>
f(2, 3.0, "bye")               <ns/iter>   <iter/s>

Allowing any input range here is interestingly flexible, e.g.:

enum benchmark_data_x = cartesianProduct(iota(1, 3), iota(1, 3));
//we should probably have it in std.range somewhere

void benchmark_x(int a, int b){ ... }
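
Until something like that lands, the same effect can be hacked up by hand. A sketch only (f, the data set and the fixed repetition count are all made up; a real harness would pick the count adaptively):

import std.datetime : AutoStart, StopWatch;
import std.stdio : writefln;
import std.typecons : tuple;

void f(int a, double b, string s) { /* code under test */ }

void main()
{
    auto data = [ tuple(1, 2.0, "hi"), tuple(2, 3.0, "bye") ];
    enum reps = 10_000;                 // fixed repetition count for the sketch
    foreach (args; data)
    {
        auto sw = StopWatch(AutoStart.yes);
        foreach (i; 0 .. reps)
            f(args.expand);             // expand the tuple into f's arguments
        sw.stop();
        writefln("f(%s, %s, %s)   %s ns/iter", args.expand,
                 sw.peek().nsecs / reps);
    }
}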

That being said, I don't really get the benefit of passing the iteration count to the function being benchmarked. Is it to allow it to do an initialization step once, then call benchmarkResume() and run some inner loop n times?

-- 
Dmitry Olshansky
September 21, 2012
On Friday, September 21, 2012 15:59:31 Andrei Alexandrescu wrote:
> On 9/19/12 4:11 PM, "Øivind" wrote:
> > New question for you :)
> > 
> > To register benchmarks, the 'scheduleForBenchmarking' mixin inserts a shared static initializer into the module. If I have a module A and a module B, that both depend on each other, then this will probably not work? The runtime will detect the init cycle and fail with the following error:
> > 
> > "Cycle detected between modules with ctors/dtors"
> > 
> > Or am I wrong now?
> 
> I think you have discovered a major issue. Ideas on how to attack this?

Some of us have been asking for ages for the ability to mark a static constructor as not depending on anything so that the runtime _doesn't_ think that there's a circular dependency, but Walter has been against the idea when it's been brought up. That would _really_ help here.

Without redesigning std.benchmark so that it doesn't use static constructors, I don't know how you can fix that. Normally, if you really need a static constructor, you go through the pain of creating a separate module which does the initialization for you (like std.stdio does). But that won't work in this case, because you're mixing it in. So, unless you can redesign it so that std.benchmark doesn't require static constructors, it may have to be a limitation of std.benchmark that it can't be used where it would create a circular dependency.
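
For reference, that separate-module pattern looks roughly like this (module and function names are made up):

// foo.d -- sits in the import cycle, so it must not hold the static ctor
module foo;
extern(C) void foo_static_this()    // an ordinary function, not a static ctor
{
    // do the one-time registration/initialization here
}

// foobase.d -- imports nothing from the cycle, so no cycle is detected
module foobase;
extern(C) void foo_static_this();   // declared here, not imported
shared static this() { foo_static_this(); }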

Unfortunately, the circular dependency issue makes static constructors almost useless outside of isolated cases, even though they rarely have actual circular dependencies. It's one of the few places in D where I'd say there's a major design flaw.

- Jonathan M Davis
September 21, 2012
On 9/21/12 5:39 AM, Jacob Carlborg wrote:
> On 2012-09-21 06:23, Andrei Alexandrescu wrote:
>
>> For a very simple reason: unless the algorithm under benchmark is very
>> long-running, max is completely useless, and it ruins average as well.
>
> I may have completely misunderstood this but aren't we talking about
> what to include in the output of the benchmark? In that case, if you
> don't like max and average just don't look at it.

I disagree. I won't include something in my design just so that people can ignore it most of the time. Max and average are, most of the time, an awful thing to include, and will throw people off with bizarre results.

If it's there, it's worth looking at. Note how all columns are directly comparable (I might add, unlike other approaches to benchmarking).

>> For virtually all benchmarks I've run, the distribution of timings is a
>> half-Gaussian very concentrated around the minimum. Say you have a
>> minimum of e.g. 73 us. Then there would be a lot of results close to
>> that; the mode of the distribution would be very close, e.g. 75 us, and
>> the more measurements you take, the closer the mode is to the minimum.
>> Then you have a few timings up to e.g. 90 us. And finally you will
>> inevitably have a few outliers at some milliseconds. Those are orders of
>> magnitude larger than anything of interest and are caused by system
>> interrupts that happened to fall in the middle of the measurement.
>>
>> Taking those into consideration and computing the average with those
>> outliers simply brings useless noise into the measurement process.
>
> After your reply to one of Manu's posts, I think I misunderstood the
> std.benchmark module. I was thinking more of profiling. But aren't these
> quite similar tasks? Couldn't std.benchmark work for both?

This is an interesting idea. It would delay release quite a bit because I'd need to design and implement things like performance counters and such.


Andrei
September 21, 2012
On 9/21/12 11:12 AM, Manu wrote:
> On 21 September 2012 07:45, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org>
> wrote:
>
>         As such, you're going to need a far more
>         convincing argument than "It worked well for me."
>
>
>     Sure. I have just detailed the choices made by std.benchmark in a
>     couple of posts.
>
>     At Facebook we measure using the minimum, and it's working for us.
>
>
> Facebook isn't exactly 'realtime' software. Obviously, faster is always
> better, but it's not in a situation where if you slip a sync point by
> 1ms in an off case, it's all over. You can lose 1ms here, and make it up
> at a later time, and the result is the same. But again, this feeds back
> to your distinction between benchmarking and profiling.

You'd be surprised at how much we care about e.g. the 90th percentile time to interaction.

>         Otherwise, I think we'll need richer results. At the very least
>         there should be an easy way to get at the raw results
>         programmatically so we can run whatever
>         stats/plots/visualizations/output-formats we want. I didn't see
>         anything like that browsing through the docs, but it's possible I
>         may have missed it.
>
>
>     Currently std.benchmark does not expose raw results for the sake of
>     simplicity. It's easy to expose such, but I'd need a bit more
>     convincing about their utility.
>
>
> Custom visualisation, realtime charting/plotting, user supplied reduce
> function?

Hrm, that sounds like an entirely new project.


Andrei