September 21, 2012
On 21-Sep-12 23:59, Andrei Alexandrescu wrote:
> On 9/19/12 4:11 PM, "Øivind" wrote:
>> New question for you :)
>>
>> To register benchmarks, the 'scheduleForBenchmarking' mixin inserts a
>> shared static initializer into the module. If I have a module A and a
>> module B that both depend on each other, then this will probably not
>> work..? The runtime will detect the init cycle and fail with the
>> following error:
>>
>> "Cycle detected between modules with ctors/dtors"
>>
>> Or am I wrong now?
>
> I think you have discovered a major issue. Ideas on how to attack this?
>
Not ideal but...

Make scheduleForBenchmarking mix in something other than code - say, a global templated struct with a certain name.

Then it should be possible to do:

benchmarkModules!(module1, module2, ...);

That would search for this specific anchor at the top scope of modules and collect all the info. I'm not sure we can pass module names as alias parameters, but I think our meta-programming tricksters have certainly done something along these lines.
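
Roughly what I have in mind (the marker name and the reflection details below are purely illustrative, not a real API):

// Sketch only: mix in a passive marker instead of a shared static this(),
// so no module-constructor cycle is created.
mixin template scheduleForBenchmarking()
{
    struct BenchmarkAnchor {}
}

// Explicitly collect benchmarks from a known list of modules.
void benchmarkModules(Modules...)()
{
    foreach (mod; Modules)
    {
        static if (__traits(hasMember, mod, "BenchmarkAnchor"))
        {
            foreach (member; __traits(allMembers, mod))
            {
                // filter member names starting with "benchmark_"
                // and register those functions for running
            }
        }
    }
}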

-- 
Dmitry Olshansky
September 21, 2012
On 9/21/12 11:14 AM, Manu wrote:
> On 21 September 2012 07:23, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org <mailto:SeeWebsiteForEmail@erdani.org>>
> wrote:
>
>     For a very simple reason: unless the algorithm under benchmark is
>     very long-running, max is completely useless, and it ruins average
>     as well.
>
>
> This is only true for systems with a comprehensive pre-emptive OS
> running on the same core. Most embedded systems will only be affected by
> cache misses and bus contention; in that situation, max is perfectly
> acceptable.

I think embedded systems that run e.g. Linux will be affected by task switching.

Andrei
September 21, 2012
On 9/21/12 2:49 PM, David Piepgrass wrote:
>> After extensive tests with a variety of aggregate functions, I can say
>> firmly that taking the minimum time is by far the best when it comes
>> to assessing the speed of a function.
>
> Like others, I must also disagree in principle. The minimum sounds like a
> useful metric for functions that (1) do the same amount of work in every
> test and (2) are microbenchmarks, i.e. they measure a small and simple
> task.

That is correct.

> If the benchmark being measured either (1) varies the amount of
> work each time (e.g. according to some approximation of real-world
> input, which obviously may vary)* or (2) measures a large system, then
> the average and standard deviation and even a histogram may be useful
> (or perhaps some indicator whether the runtimes are consistent with a
> normal distribution or not). If the running-time is long then the max
> might be useful (because things like task-switching overhead probably do
> not contribute that much to the total).
>
> * I anticipate that you might respond "so, only test a single input per
> benchmark", but if I've got 1000 inputs that I want to try, I really
> don't want to write 1000 functions nor do I want 1000 lines of output
> from the benchmark. An average, standard deviation, min and max may be
> all I need, and if I need more detail, then I might break it up into 10
> groups of 100 inputs. In any case, the minimum runtime is not the
> desired output when the input varies.

I understand. What we currently do at Facebook is support benchmark functions with two parameters (see https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md). One is the number of iterations, the second is "problem size", akin to what you're discussing.
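
To illustrate, such a benchmark function would look roughly like this (a sketch only; the sorting task and the parameter names are made up, not a committed API):

import std.algorithm : sort;
import std.random : uniform;

// First parameter: iteration count chosen by the framework.
// Second parameter: "problem size" for this run.
void benchmark_sortRandomArray(uint n, uint problemSize)
{
    foreach (i; 0 .. n)
    {
        auto data = new int[problemSize];
        foreach (ref x; data)
            x = uniform(0, 1000);
        sort(data);
    }
}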

I chose not to support that in this version of std.benchmark because it can easily be tackled later, but I probably need to add it now, sigh.

> It's a little surprising to hear "The purpose of std.benchmark is not to
> estimate real-world time. (That is the purpose of profiling)"...
> Firstly, of COURSE I would want to estimate real-world time with some of
> my benchmarks. For some benchmarks I just want to know which of two or
> three approaches is faster, or to get a coarse ball-park sense of
> performance, but for others I really want to know the wall-clock time
> used for realistic inputs.

I would contend that a benchmark without a baseline is very often misguided. I've seen tons and tons and TONS of nonsensical benchmarks lacking a baseline. "I created one million smart pointers, it took me only one millisecond!" Well how long did it take you to create one million dumb pointers?

Choosing good baselines and committing to sound comparisons instead of baseless absolutes is what makes the difference between a professional and a well-intentioned dilettante.
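
To make that concrete with the smart pointer example above (just a sketch: RefCounted stands in for "smart pointer", and the relative-naming convention is the benchmark_relative_ one mentioned later in this message):

import std.typecons : RefCounted;

// Baseline: n plain GC allocations of an int.
void benchmark_allocPlainPointer(uint n)
{
    foreach (i; 0 .. n)
    {
        auto p = new int;
    }
}

// The smart pointer case, reported relative to the baseline above.
void benchmark_relative_allocRefCounted(uint n)
{
    foreach (i; 0 .. n)
    {
        auto p = RefCounted!int(0);
    }
}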

> Secondly, what D profiler actually helps you answer the question "where
> does the time go in the real-world?"? The D -profile switch creates an
> instrumented executable, which in my experience (admittedly not
> experience with DMD) severely distorts running times. I usually prefer
> sampling-based profiling, where the executable is left unchanged and a
> sampling program interrupts the program at random and grabs the call
> stack, to avoid the distortion effect of instrumentation. Of course,
> instrumentation is useful to find out what functions are called the most
> and whether call frequencies are in line with expectations, but I
> wouldn't trust the time measurements that much.
>
> As far as I know, D doesn't offer a sampling profiler, so one might
> indeed use a benchmarking library as a (poor) substitute. So I'd want to
> be able to set up some benchmarks that operate on realistic data, with
> perhaps different data in different runs in order to learn about how the
> speed varies with different inputs (if it varies a lot then I might
> create more benchmarks to investigate which inputs are processed
> quickly, and which slowly.)

I understand there's a good case to be made for profiling. If this turns out to be an acceptance condition for std.benchmark (which I think it shouldn't be), I'll define one.

> Some random comments about std.benchmark based on its documentation:
>
> - It is very strange that the documentation of printBenchmarks uses
> neither of the words "average" or "minimum", and doesn't say how many
> trials are done....

Because all of those are irrelevant and confusing. We had an older framework at Facebook that reported those numbers, and they were utterly and completely meaningless. Besides, the trials column contained numbers that were not even comparable. Everybody was happy when I replaced them with today's simple and elegant numbers.

> I suppose the obvious interpretation is that it only
> does one trial, but then we wouldn't be having this discussion about
> averages and minimums right? Øivind says tests are run 1000 times... but
> it needs to be configurable per-test (my idea: support a _x1000 suffix
> in function names, or _for1000ms to run the test for at least 1000
> milliseconds; and allow a multiplier when when running a group of
> benchmarks, e.g. a multiplier argument of 0.5 means to only run half as
> many trials as usual.)

I don't think that's a good idea.

> Also, it is not clear from the documentation what
> the single parameter to each benchmark is (define "iterations count".)

The documentation could include that, but I don't want to overspecify.

> - The "benchmark_relative_" feature looks quite useful. I'm also happy
> to see benchmarkSuspend() and benchmarkResume(), though
> benchmarkSuspend() seems redundant in most cases: I'd like to just call
> one function, say, benchmarkStart() to indicate "setup complete, please
> start measuring time now."

Good point. I think this is a minor encumbrance, so it's good to keep generality.
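
For reference, the intended pattern is roughly the following (a sketch; the workload is just a placeholder):

import std.algorithm : reduce;

void benchmark_sumOfSquares(uint n)
{
    foreach (i; 0 .. n)
    {
        benchmarkSuspend();               // setup is excluded from timing
        auto input = new double[1000];
        foreach (j, ref x; input)
            x = j;
        benchmarkResume();                // measure only the work below

        auto sum = reduce!"a + b * b"(0.0, input);
    }
}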

> - I'm glad that StopWatch can auto-start; but the documentation should
> be clearer: does reset() stop the timer or just reset the time to zero?
> does stop() followed by start() start from zero or does it keep the time
> on the clock? I also think there should be a method that returns the
> value of peek() and restarts the timer at the same time (perhaps stop()
> and reset() should just return peek()?)
>
> - After reading the documentation of comparingBenchmark and measureTime,
> I have almost no idea what they do.

Yah, these are moved over from std.datetime. I'll need to make a couple more passes through the dox.
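
For context, the StopWatch questions above are about usage along these lines (a sketch; the comments restate the questions the docs should answer rather than answering them):

import std.datetime : StopWatch, AutoStart;
import std.stdio : writeln;

void main()
{
    auto sw = StopWatch(AutoStart.yes);   // auto-started
    // ... code being timed ...
    writeln(sw.peek().msecs, " ms");      // peek while still running

    sw.stop();
    sw.start();    // keep the time already on the clock, or restart from zero?

    sw.reset();    // does this stop a running watch, or just zero it?
}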


Andrei
September 21, 2012
Andrei Alexandrescu wrote:
> On 9/21/12 5:39 AM, Jacob Carlborg wrote:
> >On 2012-09-21 06:23, Andrei Alexandrescu wrote:
> >
> >>For a very simple reason: unless the algorithm under benchmark is very long-running, max is completely useless, and it ruins average as well.
> >
> >I may have completely misunderstood this but aren't we talking about what do include in the output of the benchmark? In that case, if you don't like max and average just don't look at it.
> 
> I disagree. I won't include something in my design just so people don't look at it most of the time. Max and average are most of the time an awful thing to include, and will throw off people with bizarre results.
> 
> If it's there, it's worth looking at. Note how all columns are directly comparable (I might add, unlike other approaches to benchmarking).
> 
> >>For virtually all benchmarks I've run, the distribution of timings is a half-Gaussian very concentrated around the minimum. Say you have a minimum of e.g. 73 us. Then there would be a lot of results close to that; the mode of the distribution would be very close, e.g. 75 us, and the more measurements you take, the closer the mode is to the minimum. Then you have a few timings up to e.g. 90 us. And finally you will inevitably have a few outliers at some milliseconds. Those are orders of magnitude larger than anything of interest and are caused by system interrupts that happened to fall in the middle of the measurement.
> >>
> >>Taking those into consideration and computing the average with those outliers simply brings useless noise into the measurement process.
> >
> >After your reply to one of Manu's posts, I think I misunderstood the std.benchmark module. I was thinking more of profiling. But are these quite similar tasks? Couldn't std.benchmark work for both?
> 
> This is an interesting idea. It would delay release quite a bit because I'd need to design and implement things like performance counters and such.

You mean like extending StopWatch and allowing the user to provide the measuring code, i.e. counting the number of instructions? That would be very useful. Is it possible to make sure that these changes can be introduced later without breaking the API?
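
Something along these lines, perhaps (just a sketch; none of these names exist today):

import core.time : TickDuration;

// A stopwatch parameterized on a measurement policy, so that counting
// instructions, cache misses, etc. could be plugged in later.
struct GenericStopWatch(Counter)
{
    private ulong started;
    private ulong accumulated;

    void start() { started = Counter.read(); }
    void stop()  { accumulated += Counter.read() - started; }
    ulong peek() const { return accumulated; }
}

// Default policy: monotonic clock ticks.
struct TickCounter
{
    static ulong read()
    {
        return cast(ulong) TickDuration.currSystemTick.length;
    }
}

// A hardware performance counter policy would expose the same read()
// interface but query e.g. the CPU's retired-instruction counter.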

Jens
September 21, 2012
>> Some random comments about std.benchmark based on its documentation:
>>
>> - It is very strange that the documentation of printBenchmarks uses neither of the words "average" or "minimum", and doesn't say how many trials are done....
>
> Because all of those are irrelevant and confusing.

Huh? It's not nearly as confusing as reading the documentation and not having the faintest idea what it will do. The way the benchmarker works is somehow 'irrelevant'? The documentation doesn't even indicate that the functions are to be run more than once!!

> I don't think that's a good idea.

I have never seen you make such vague arguments, Andrei.
September 21, 2012
On 9/21/12 5:36 PM, David Piepgrass wrote:
>>> Some random comments about std.benchmark based on its
>>> documentation:
>>>
>>> - It is very strange that the documentation of printBenchmarks
>>> uses neither of the words "average" or "minimum", and doesn't say
>>> how many trials are done....
>>
>> Because all of those are irrelevant and confusing.
>
> Huh? It's not nearly as confusing as reading the documentation and
> not having the faintest idea what it will do. The way the benchmarker
> works is somehow 'irrelevant'? The documentation doesn't even
> indicate that the functions are to be run more than once!!

I misunderstood. I agree that it's a good thing to specify how
benchmarking proceeds.

>> I don't think that's a good idea.
>
> I have never seen you make such vague arguments, Andrei.

I had expanded my point elsewhere. Your suggestion was:

> - It is very strange that the documentation of printBenchmarks uses
> neither of the words "average" or "minimum", and doesn't say how many
> trials are done.... I suppose the obvious interpretation is that it
> only does one trial, but then we wouldn't be having this discussion
> about averages and minimums right? Øivind says tests are run 1000
> times... but it needs to be configurable per-test (my idea: support a
> _x1000 suffix in function names, or _for1000ms to run the test for at
> least 1000 milliseconds; and allow a multiplier when running a
> group of benchmarks, e.g. a multiplier argument of 0.5 means to only
> run half as many trials as usual.) Also, it is not clear from the
> documentation what the single parameter to each benchmark is (define
> "iterations count".)

I don't think it's a good idea because the "for 1000 ms" doesn't say anything except how good the clock resolution was on the system. I'm as strongly convinced we shouldn't print useless information as I am we should print useful information.


Andrei
September 21, 2012
On Friday, 21 September 2012 at 19:54:12 UTC, Andrei Alexandrescu wrote:
> On 9/19/12 4:06 AM, Peter Alexander wrote:
>> I don't see why `benchmark` takes (almost) all of its parameters as
>> template parameters. It looks quite odd, seems unnecessary, and (if I'm
>> not mistaken) makes certain use cases quite difficult.
>
> That is intentional - indirect calls would add undue overhead to the measurements.

I accept that it adds undue overhead. I just think that the function would be more usable with non-template parameters (as per my example). I also think the overhead would be negligible.
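
To illustrate the trade-off I mean (signatures here are illustrative, not the actual std.benchmark ones):

import std.datetime : StopWatch;

// Compile-time variant: the function is an alias template parameter,
// so the call in the timing loop can be inlined (no indirect call).
void benchmarkCT(alias fun)(uint n)
{
    StopWatch sw;
    sw.start();
    foreach (i; 0 .. n)
        fun();
    sw.stop();
}

// Run-time variant: easier to drive from data (arrays of delegates,
// names read from a config file, and so on), at the cost of one
// indirect call per iteration.
void benchmarkRT(void delegate() fun, uint n)
{
    StopWatch sw;
    sw.start();
    foreach (i; 0 .. n)
        fun();
    sw.stop();
}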
September 22, 2012
On Fri, 21 Sep 2012 17:00:29 -0400
Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 9/21/12 11:12 AM, Manu wrote:
> > On 21 September 2012 07:45, Andrei Alexandrescu
> >
> >     Currently std.benchmark does not expose raw results for the
> > sake of simplicity. It's easy to expose such, but I'd need a bit
> > ***more convincing about their utility***.
> >

(Emphasis added for proper context.)

> >
> > Custom visualisation, realtime charting/plotting, user supplied reduce function?
> 
> Hrm, that sounds like an entire new project.
> 

That doesn't diminish their utility.

Keep in mind, nobody's suggesting putting all of that into std.benchmark (certainly not initially anyway), but the idea is to at least have the door open for them.
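
Even something as small as this would leave that door open (purely hypothetical shape, not a proposal for the actual API):

// Raw per-trial timings exposed alongside the aggregated report.
struct BenchmarkResult
{
    string name;
    ulong[] trialTimes;   // one entry per trial, in clock ticks
}

// A user-supplied reduce function, e.g. a (simplified) median:
ulong median(ulong[] times)
{
    import std.algorithm : sort;

    auto copy = times.dup;
    sort(copy);
    return copy[$ / 2];   // simplified: no averaging for even lengths
}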

September 22, 2012
Stepping back for a moment, I think we're facing two key issues here:

The first key issue is that the docs for std.benchmark don't adequately
explain Andrei's intended charter/scope for it, its methodology, or the
rationale for that methodology. So people see "benchmark" and think
"oh, ok, for timing stuff", but it appears to be intended for very
specific use-cases. I think this entire discussion serves as
evidence that, at the very least, it needs to communicate that
scope/methodology/rationale better than it currently does. If all of
us are having trouble "getting it", then others certainly will too.

Aside from that, there's the second key issue: whether the
current intended scope is sufficient. Should it be more general in
scope and not so specialized? Personally, I would tend to think so, and
that seems to be the popular notion. But I don't know for sure.
If it should be more generalized, then does it need to be so for the
first iteration, or can it be done later after being added to Phobos?
That, I have no idea.

September 22, 2012
On 2012-21-09 22:58:36, Andrei Alexandrescu wrote:

> On 9/21/12 5:39 AM, Jacob Carlborg wrote:

>> After your reply to one of Manu's posts, I think I misunderstood the
>> std.benchmark module. I was thinking more of profiling. But are these
>> quite similar tasks? Couldn't std.benchmark work for both?
>
> This is an interesting idea. It would delay release quite a bit because I'd need to design and implement things like performance counters and such.

I certainly think the two use cases are similar enough to warrant their
inclusion in a common module. That does not preclude std.benchmark being
included as is now, and extended with profiling features at some later
point.

-- 
Simen