September 21, 2012
On 2012-09-21 18:21, Andrei Alexandrescu wrote:

> That's a good angle. Profiling is currently done by the -profile switch,
> and there are a couple of library functions associated with it. To my
> surprise, that documentation page has not been ported to the dlang.org
> style: http://digitalmars.com/ctg/trace.html
>
> I haven't yet thought whether std.benchmark should add more
> profiling-related primitives. I'd opine for releasing it without such
> for the time being.

If the API is fairly open and provides more of the raw results, one can build a more profiling-like solution on top of it. That could later be used to create a dedicated profiling module, if we choose to do so.
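
Something along these lines, say (all names here are purely hypothetical, just to illustrate the kind of open API I mean, not the actual std.benchmark interface):

import core.time : Duration;

// Hypothetical raw-result type: one raw measurement per trial is kept.
struct BenchmarkResult
{
    string name;        // benchmarked function
    Duration[] trials;  // raw per-trial timings
}

// A profiling-like report built on top of the raw data:
void report(BenchmarkResult[] results)
{
    import std.algorithm : sort;
    import std.stdio : writefln;

    foreach (r; results)
    {
        sort(r.trials);
        writefln("%s: min=%s  median=%s  max=%s",
                 r.name, r.trials[0], r.trials[$ / 2], r.trials[$ - 1]);
    }
}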

-- 
/Jacob Carlborg
September 21, 2012
Am Fri, 21 Sep 2012 00:45:44 -0400
schrieb Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org>:
> 
> The issue here is automating the benchmark of a module, which would require some naming convention anyway.

A perfect use case for user defined attributes ;-)

@benchmark void foo(){}
@benchmark("File read test") void foo(){}
September 21, 2012
On 2012-09-21 19:45, Johannes Pfau wrote:

> A perfect use case for user defined attributes ;-)
>
> @benchmark void foo(){}
> @benchmark("File read test") void foo(){}

Yes, we need user defined attributes and AST macros ASAP :)

-- 
/Jacob Carlborg
September 21, 2012
> After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.

Like others, I must also disagree in principle. The minimum sounds like a useful metric for functions that (1) do the same amount of work in every test and (2) are microbenchmarks, i.e. they measure a small and simple task. If the benchmark being measured either (1) varies the amount of work each time (e.g. according to some approximation of real-world input, which obviously may vary)* or (2) measures a large system, then the average, standard deviation and even a histogram may be useful (or perhaps some indicator of whether the run times are consistent with a normal distribution or not). If the running time is long, then the max might be useful too (because things like task-switching overhead probably do not contribute that much to the total).

* I anticipate that you might respond "so, only test a single input per benchmark", but if I've got 1000 inputs that I want to try, I really don't want to write 1000 functions nor do I want 1000 lines of output from the benchmark. An average, standard deviation, min and max may be all I need, and if I need more detail, then I might break it up into 10 groups of 100 inputs. In any case, the minimum runtime is not the desired output when the input varies.
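
None of these aggregates is expensive once the raw per-trial timings are kept; a quick sketch (Summary and summarize are my own invented names, not from any proposal):

import std.algorithm : map, maxElement, minElement, sum;
import std.math : sqrt;

// Sketch: summary statistics over raw per-trial timings (in seconds).
struct Summary { double min, max, mean, stdDev; }

Summary summarize(const double[] timings)
{
    immutable n    = timings.length;
    immutable mean = timings.sum / n;
    immutable var  = timings.map!(t => (t - mean) ^^ 2).sum / n;
    return Summary(timings.minElement, timings.maxElement, mean, sqrt(var));
}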

It's a little surprising to hear "The purpose of std.benchmark is not to estimate real-world time. (That is the purpose of profiling)"... Firstly, of COURSE I would want to estimate real-world time with some of my benchmarks. For some benchmarks I just want to know which of two or three approaches is faster, or to get a coarse ball-park sense of performance, but for others I really want to know the wall-clock time used for realistic inputs.

Secondly, what D profiler actually helps you answer the question "where does the time go in the real-world?"? The D -profile switch creates an instrumented executable, which in my experience (admittedly not experience with DMD) severely distorts running times. I usually prefer sampling-based profiling, where the executable is left unchanged and a sampling program interrupts the program at random and grabs the call stack, to avoid the distortion effect of instrumentation. Of course, instrumentation is useful to find out what functions are called the most and whether call frequencies are in line with expectations, but I wouldn't trust the time measurements that much.

As far as I know, D doesn't offer a sampling profiler, so one might indeed use a benchmarking library as a (poor) substitute. So I'd want to be able to set up some benchmarks that operate on realistic data, with perhaps different data in different runs in order to learn about how the speed varies with different inputs (if it varies a lot then I might create more benchmarks to investigate which inputs are processed quickly, and which slowly.)

Some random comments about std.benchmark based on its documentation:

- It is very strange that the documentation of printBenchmarks uses neither of the words "average" nor "minimum", and doesn't say how many trials are done... I suppose the obvious interpretation is that it only does one trial, but then we wouldn't be having this discussion about averages and minimums, right? Øivind says tests are run 1000 times... but it needs to be configurable per test (my idea: support a _x1000 suffix in function names, or _for1000ms to run the test for at least 1000 milliseconds, and allow a multiplier when running a group of benchmarks, e.g. a multiplier argument of 0.5 means to only run half as many trials as usual; see the sketch after this list). Also, it is not clear from the documentation what the single parameter to each benchmark is (define "iterations count").

- The "benchmark_relative_" feature looks quite useful. I'm also happy to see benchmarkSuspend() and benchmarkResume(), though benchmarkSuspend() seems redundant in most cases: I'd like to just call one function, say, benchmarkStart() to indicate "setup complete, please start measuring time now."

- I'm glad that StopWatch can auto-start; but the documentation should be clearer: does reset() stop the timer or just reset the time to zero? does stop() followed by start() start from zero or does it keep the time on the clock? I also think there should be a method that returns the value of peek() and restarts the timer at the same time (perhaps stop() and reset() should just return peek()?)

- After reading the documentation of comparingBenchmark and measureTime, I have almost no idea what they do.
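
Here is the sketch promised above for the _x1000 suffix idea; explicitTrialCount is an invented helper, not part of std.benchmark:

import std.algorithm.searching : findSplit;
import std.conv : to;

// Sketch: pull an explicit trial count out of a benchmark's name,
// e.g. "benchmark_parseFile_x1000" yields 1000. Returns 0 when no _xN
// suffix is present, so the caller can fall back to the default count.
// A real implementation would validate that digits follow the "_x".
uint explicitTrialCount(string benchmarkName)
{
    if (auto parts = benchmarkName.findSplit("_x"))
        return parts[2].to!uint;   // throws if the suffix is not a number
    return 0;
}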

September 21, 2012
On 21 September 2012 07:30, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> I don't quite agree. This is a domain in which intuition is having a hard time, and at least some of the responses come from an intuitive standpoint, as opposed from hard data.
>
> For example, there's this opinion that taking the min, max, and average is the "fair" thing to do and the most informative.


I don't think this is a 'fair' claim; the situation is that different people are looking for different statistical information, and you can distinguish them with whatever terminology you prefer. You are only addressing a single use case: 'benchmarking', by your definition. I'm more frequently interested in profiling than in benchmarking, and I think both are useful to have.

The thing is, the distinction between 'benchmarking' and 'profiling' is effectively implemented via nothing more than the sampling algorithm (min vs. average), so is it sensible to expose the distinction in the API in this way?


September 21, 2012
On 21 September 2012 07:45, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

>> As such, you're going to need a far more
>> convincing argument than "It worked well for me."
>>
>
> Sure. I have just detailed the choices made by std.benchmark in a couple of posts.
>
> At Facebook we measure using the minimum, and it's working for us.


Facebook isn't exactly 'realtime' software. Obviously, faster is always better, but it's not in a situation where, if you slip a sync point by 1ms in an off case, it's all over. You can lose 1ms here and make it up later, and the result is the same. But again, this feeds back to your distinction between benchmarking and profiling.

>> Otherwise, I think we'll need richer results. At the very least there
>> should be an easy way to get at the raw results programmatically
>> so we can run whatever stats/plots/visualizations/output-formats we
>> want. I didn't see anything like that browsing through the docs, but
>> it's possible I may have missed it.
>>
>
> Currently std.benchmark does not expose raw results for the sake of simplicity. It's easy to expose such, but I'd need a bit more convincing about their utility.


Custom visualisation, realtime charting/plotting, a user-supplied reduce function?


September 21, 2012
> As far as I know, D doesn't offer a sampling profiler,

It is possible to use a sampling profiler on D executables though. I usually use perf on Linux and AMD CodeAnalyst on Windows.
September 21, 2012
On 21 September 2012 07:23, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> For a very simple reason: unless the algorithm under benchmark is very long-running, max is completely useless, and it ruins average as well.
>

This is only true for systems with a comprehensive pre-emptive OS running on the same core. Most embedded systems will only be affected by cache misses and bus contention; in that situation, max is perfectly acceptable.


September 21, 2012
On Friday, September 21, 2012 17:58:05 Manu wrote:
> Okay, I can buy this distinction in terminology.
> What I'm typically more interested in is profiling. I do occasionally need
> to do some benchmarking by your definition, so I'll find this useful, but
> should there then be another module to provide a 'profiling' API? Or could
> that be worked into this API?

dmd has the -profile flag.

- Jonathan M Davis
September 21, 2012
On 9/19/12 4:06 AM, Peter Alexander wrote:
> I don't see why `benchmark` takes (almost) all of its parameters as
> template parameters. It looks quite odd, seems unnecessary, and (if I'm
> not mistaken) makes certain use cases quite difficult.

That is intentional - indirect calls would add undue overhead to the measurements.
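
Roughly the trade-off, as a sketch (the names are invented): with the alias form the call in the loop is direct and can be inlined, so the harness adds little to what is being timed.

// An alias template parameter gives a direct, potentially inlined call in
// the timing loop; a runtime delegate forces an indirect call per iteration.
void timeDirect(alias fun)(uint n)
{
    foreach (i; 0 .. n)
        fun();              // direct call, candidate for inlining
}

void timeIndirect(void delegate() fun, uint n)
{
    foreach (i; 0 .. n)
        fun();              // indirect call through the delegate's pointer
}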

Andrei