September 21, 2012
Re: Review of Andrei's std.benchmark
On Tue, 18 Sep 2012 18:02:10 -0400
Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 9/18/12 5:07 PM, "Øivind" wrote:
> > * For all tests, the best run is selected, but would it not be
> > reasonable in some cases to get the average value? Maybe excluding
> > the runs that are more than a couple std. deviations away from the
> > mean value..
> 
> After extensive tests with a variety of aggregate functions, I can
> say firmly that taking the minimum time is by far the best when it
> comes to assessing the speed of a function.
> 

*Ahem*: http://zedshaw.com/essays/programmer_stats.html

Your claim that the minimum time is sufficient is...ummm...extremely
unorthodox, to say the least. As such, you're going to need a far more
convincing argument than "It worked well for me."

I assume I don't need to preach that "Extraordinary claims require
extraordinary evidence". But, condensing benchmarks and statistics down
to "take the minimum" and saying that's sufficient is one heck of an
extraordinary claim. If you feel that you can provide sufficiently
extraordinary justification, then please do.

Otherwise, I think we'll need richer results. At the very least there
should be an easy way to get at the raw results programmatically
so we can run whatever stats/plots/visualizations/output-formats we
want. I didn't see anything like that browsing through the docs, but
it's possible I may have missed it.

That brings up another question too: I like the idea of a
one-stop-benchmarking-shop, much like we have for unittests, but maybe
reporting shouldn't be so tightly integrated and left more open for
integration with a proper statistics lib and more generalized
output abilities? But of course, that doesn't preclude having a nice
built-in, but optional, default report. (Again though, maybe I'm
overlooking something already in the module?)

One other nitpick: My initial impression is that the
"benchmark_relative_file read" stuff seems a bit kludgey (and
confusing to visually parse). Is there maybe a better way to handle
that? For example, inspired by getopt:

printBenchmarks!(
   "file write", { std.file.write("/tmp/deleteme", "hello, world!"); },
   BenchmarkOption.relative,
   "file read", { std.file.read("/tmp/deleteme"); },
   "array creation", { new char[32]; })
   ();
September 21, 2012
Re: Review of Andrei's std.benchmark
On 9/20/12 10:05 AM, Manu wrote:
> On 20 September 2012 15:36, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org <mailto:SeeWebsiteForEmail@erdani.org>>
> wrote:
>     Let's use the minimum. It is understood it's not what you'll see in
>     production, but it is an excellent proxy for indicative and
>     reproducible performance numbers.
>
>
> If you do more than a single iteration, the minimum will virtually
> always be influenced by ideal cache pre-population, which is
> unrealistic.

To measure performance against cold cache, you could always clear the 
cache using one of the available methods, see 
http://stackoverflow.com/questions/1756825/cpu-cache-flush. Probably 
std.benchmark could include a routine that does that. But performance on 
a cold cache would actually be most unrealistic and uninformative, as loading 
the memory into cache will dominate the work that the algorithm is 
doing, so essentially the benchmark would evaluate the memory bandwidth 
against the working set of the algorithm. That may be occasionally 
useful, but I'd argue that most often the interest in benchmarking is to 
measure repeated application of a function, not occasional use of it.
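
One way to approximate the cache-clearing methods from that StackOverflow thread is to stream through a buffer larger than the last-level cache, which evicts the benchmark's working set. A sketch (in Python for illustration rather than D; the 32 MB LLC size is an assumed figure, not a measured one):

```python
import array

def evict_caches(llc_bytes=32 * 2**20):
    """Read through a buffer larger than the last-level cache so that
    previously cached benchmark data is very likely evicted.
    The 32 MB default is an assumed LLC size; adjust per machine."""
    buf = array.array('b', bytes(llc_bytes))
    total = 0
    for i in range(0, llc_bytes, 64):  # touch one byte per 64-byte cache line
        total += buf[i]
    return total
```

Calling something like this between iterations turns a warm-cache benchmark into a cold-cache one, at the cost of the run being dominated by memory traffic, as noted above.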

> Memory locality is often the biggest contributing
> performance hazard in many algorithms, and usually the most
> unpredictable. I want to know about that in my measurements.
> Reproducibility is not as important to me as accuracy. And I'd rather be
> conservative(/pessimistic) with the error.
>
> What guideline would you apply to estimate 'real-world' time spent when
> always working with hyper-optimistic measurements?

The purpose of std.benchmark is not to estimate real-world time. (That 
is the purpose of profiling.) Instead, benchmarking measures and 
provides a good proxy of that time for purposes of optimizing the 
algorithm. If work is done on improving the minimum time given by the 
benchmark framework, it is reasonable to expect that performance in-situ 
will also improve.


Andrei
September 21, 2012
Re: Review of Andrei's std.benchmark
On 9/20/12 2:37 PM, Jacob Carlborg wrote:
> On 2012-09-20 14:36, Andrei Alexandrescu wrote:
>
>> Let's use the minimum. It is understood it's not what you'll see in
>> production, but it is an excellent proxy for indicative and reproducible
>> performance numbers.
>
> Why not min, max and average?

For a very simple reason: unless the algorithm under benchmark is very 
long-running, max is completely useless, and it ruins average as well.

For virtually all benchmarks I've run, the distribution of timings is a 
half-Gaussian very concentrated around the minimum. Say you have a 
minimum of e.g. 73 us. Then there would be a lot of results close to 
that; the mode of the distribution would be very close, e.g. 75 us, and 
the more measurements you take, the closer the mode is to the minimum. 
Then you have a few timings up to e.g. 90 us. And finally you will 
inevitably have a few outliers at some milliseconds. Those are orders of 
magnitude larger than anything of interest and are caused by system 
interrupts that happened to fall in the middle of the measurement.

Taking those into consideration and computing the average with those 
outliers simply brings useless noise into the measurement process.
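
That effect is easy to reproduce with simulated data. The sketch below (Python for illustration; the 73 us floor and the 1% interrupt rate are made-up numbers matching the example above) shows how a handful of millisecond-scale outliers leave the minimum untouched but drag the mean well away from the mode:

```python
import random

random.seed(42)

def simulate_timings(n=10_000, base_us=73.0):
    """Simulated timings: half-Gaussian noise above a 73 us floor,
    plus rare multi-millisecond spikes from system interrupts."""
    timings = []
    for _ in range(n):
        t = base_us + abs(random.gauss(0, 2.0))  # concentrated near the minimum
        if random.random() < 0.01:               # ~1% of runs hit an interrupt
            t += random.uniform(2000, 5000)      # add a few milliseconds
        timings.append(t)
    return timings

ts = simulate_timings()
# The minimum stays at ~73 us; the maximum is pure interrupt noise;
# the mean is dragged tens of microseconds upward by 1% of the samples.
print(f"min={min(ts):.1f}  mean={sum(ts) / len(ts):.1f}  max={max(ts):.1f}")
```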


Andrei
September 21, 2012
Re: Review of Andrei's std.benchmark
On 9/20/12 3:01 PM, foobar wrote:
> On Thursday, 20 September 2012 at 12:35:15 UTC, Andrei Alexandrescu wrote:
>> Let's use the minimum. It is understood it's not what you'll see in
>> production, but it is an excellent proxy for indicative and
>> reproducible performance numbers.
>>
>>
>> Andrei
>
> From the responses on the thread, clearly there isn't a "best way".

I don't quite agree. This is a domain in which intuition has a hard 
time, and at least some of the responses come from an intuitive 
standpoint, as opposed to hard data.

For example, there's this opinion that taking the min, max, and average 
is the "fair" thing to do and the most informative. However, all noise 
in measuring timing is additive. Unless you talk about performance of 
entire large systems with networking, I/O, and such, algorithms 
running in memory are inevitably spending time doing work, to which 
various sources of noise (system interrupts, clock quantization, 
benchmarking framework) just _add_ some time. Clearly these components 
do affect the visible duration of the algorithm, but if you want to 
improve it you need to remove the noise.
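
Concretely: if each measurement is true_cost + noise with noise >= 0, then the minimum over n runs converges to true_cost from above, while the mean stays offset by the expected noise. A small simulation (Python for illustration; the 73 us true cost and the noise distribution are assumptions):

```python
import random

random.seed(1)

TRUE_COST = 73.0  # hypothetical true running time of the algorithm, in us

def measure():
    # Additive, non-negative noise: interrupts, clock quantization and
    # framework overhead can only ever add time to a measurement.
    return TRUE_COST + abs(random.gauss(0, 2.0))

for n in (10, 100, 1000):
    samples = [measure() for _ in range(n)]
    print(f"n={n:5}  min={min(samples):.2f}  mean={sum(samples) / n:.2f}")
# The minimum approaches 73.00 from above as n grows; the mean remains
# offset by the average noise and never converges to the true cost.
```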

> There are different use-cases with different tradeoffs so why not allow
> the user to choose the policy best suited for their use-case?
> I'd suggest to provide a few reasonable common choices to choose from,
> as well as a way to provide a user defined calculation (function
> pointer/delegate?)

Reasonable choices are great, but in this case it's a bit difficult to 
figure what's reasonable.


Andrei
September 21, 2012
Re: Review of Andrei's std.benchmark
On 9/20/12 11:03 PM, Nick Sabalausky wrote:
> On Tue, 18 Sep 2012 18:02:10 -0400
> Andrei Alexandrescu<SeeWebsiteForEmail@erdani.org>  wrote:
>
>> On 9/18/12 5:07 PM, "Øivind" wrote:
>>> * For all tests, the best run is selected, but would it not be
>>> reasonable in some cases to get the average value? Maybe excluding
>>> the runs that are more than a couple std. deviations away from the
>>> mean value..
>>
>> After extensive tests with a variety of aggregate functions, I can
>> say firmly that taking the minimum time is by far the best when it
>> comes to assessing the speed of a function.
>>
>
> *Ahem*: http://zedshaw.com/essays/programmer_stats.html

I'm not sure I figure how this applies to the discussion at hand.

> Your claim that the minimum time is sufficient is...ummm...extremely
> unorthodox, to say the least.

What would be the orthodoxy? If orthodoxy is what google finds, it's 
good we're not orthodox.

> As such, you're going to need a far more
> convincing argument than "It worked well for me."

Sure. I have just detailed the choices made by std.benchmark in a couple 
of posts.

At Facebook we measure using the minimum, and it's working for us. We've 
tried other approaches (such as taking the mode of the distribution). 
Turns out the minimum is better every time. Take a look at the early 
return in estimateTime():

https://github.com/facebook/folly/blob/master/folly/Benchmark.cpp#L136

> I assume I don't need to preach that "Extraordinary claims require
> extraordinary evidence". But, condensing benchmarks and statistics down
> to "take the minimum" and saying that's sufficient is one heck of an
> extraordinary claim. If you feel that you can provide sufficiently
> extraordinary justification, then please do.

My claim is unremarkable. All I'm saying is the minimum running time of 
an algorithm on a given input is a stable and indicative proxy for the 
behavior of the algorithm in general. So it's a good target for 
optimization.

There might be some confusion that std.benchmark does profiling. That's 
not part of its charter.

> Otherwise, I think we'll need richer results. At the very least there
> should be an easy way to get at the raw results programmatically
> so we can run whatever stats/plots/visualizations/output-formats we
> want. I didn't see anything like that browsing through the docs, but
> it's possible I may have missed it.

Currently std.benchmark does not expose raw results for the sake of 
simplicity. It's easy to expose such, but I'd need a bit more convincing 
about their utility.

> That brings up another question too: I like the idea of a
> one-stop-benchmarking-shop, much like we have for unittests, but maybe
> reporting shouldn't be so tightly integrated and left more open for
> integration with a proper statistics lib and more generalized
> output abilities? But of course, that doesn't preclude having a nice
> built-in, but optional, default report. (Again though, maybe I'm
> overlooking something already in the module?)

That's pretty much what's happening. There's an API for collecting 
timings, and then there's an API for printing those with a default format.

> One other nitpick: My initial impression is that the
> "benchmark_relative_file read" stuff seems a bit kludgey (and
> confusing to visually parse). Is there maybe a better way to handle
> that? For example, inspired by getopt:
>
> printBenchmarks!(
>      "file write", { std.file.write("/tmp/deleteme", "hello, world!"); },
>      BenchmarkOption.relative,
>      "file read", { std.file.read("/tmp/deleteme"); },
>      "array creation", { new char[32]; })
>      ();

The issue here is automating the benchmark of a module, which would 
require some naming convention anyway.


Andrei
September 21, 2012
Re: Review of Andrei's std.benchmark
On 2012-09-21 06:23, Andrei Alexandrescu wrote:

> For a very simple reason: unless the algorithm under benchmark is very
> long-running, max is completely useless, and it ruins average as well.

I may have completely misunderstood this, but aren't we talking about 
what to include in the output of the benchmark? In that case, if you 
don't like max and average, just don't look at them.

> For virtually all benchmarks I've run, the distribution of timings is a
> half-Gaussian very concentrated around the minimum. Say you have a
> minimum of e.g. 73 us. Then there would be a lot of results close to
> that; the mode of the distribution would be very close, e.g. 75 us, and
> the more measurements you take, the closer the mode is to the minimum.
> Then you have a few timings up to e.g. 90 us. And finally you will
> inevitably have a few outliers at some milliseconds. Those are orders of
> magnitude larger than anything of interest and are caused by system
> interrupts that happened to fall in the middle of the measurement.
>
> Taking those into consideration and computing the average with those
> outliers simply brings useless noise into the measurement process.

After your reply to one of Manu's posts, I think I misunderstood the 
std.benchmark module. I was thinking more of profiling. But aren't 
these quite similar tasks? Couldn't std.benchmark work for both?

-- 
/Jacob Carlborg
September 21, 2012
Re: Review of Andrei's std.benchmark
On Friday, 21 September 2012 at 04:44:58 UTC, Andrei Alexandrescu 
wrote:
>> Andrei Alexandrescu<SeeWebsiteForEmail@erdani.org>  wrote:
> My claim is unremarkable. All I'm saying is the minimum running 
> time of an algorithm on a given input is a stable and 
> indicative proxy for the behavior of the algorithm in general. 
> So it's a good target for optimization.

I reached the same conclusion and use the same method at work.

Considering min will converge towards a stable value quite 
quickly... would it not be a reasonable default to auto detect 
when the min is stable with some degree of statistical 
certainty...?
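
A sketch of such a stopping rule (Python for illustration, not the std.benchmark API; measure_once is a hypothetical stand-in for one timed run, and the batch size and tolerance are arbitrary): keep taking batches of measurements and stop once the running minimum has not improved for several consecutive batches.

```python
import random

random.seed(7)

def measure_once():
    # Hypothetical stand-in for one timed run of the benchmarked
    # function: a 73 us floor plus non-negative noise.
    return 73.0 + abs(random.gauss(0, 2.0))

def benchmark_until_stable(batch=50, stable_batches=5, tol=0.01,
                           max_runs=100_000):
    """Take batches of measurements and stop once the running minimum
    has not improved by more than `tol` microseconds for
    `stable_batches` consecutive batches."""
    best = float("inf")
    calm = 0
    runs = 0
    while runs < max_runs and calm < stable_batches:
        m = min(measure_once() for _ in range(batch))
        runs += batch
        if best - m > tol:
            best, calm = m, 0  # the minimum improved; keep measuring
        else:
            best = min(best, m)
            calm += 1          # no meaningful improvement this batch
    return best, runs

best, runs = benchmark_until_stable()
print(f"converged min ~= {best:.2f} us after {runs} runs")
```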
September 21, 2012
Re: Review of Andrei's std.benchmark
On 9/21/12 5:46 AM, Tove wrote:
> Considering min will converge towards a stable value quite quickly...
> would it not be a reasonable default to auto detect when the min is
> stable with some degree of statistical certainty...?

I think that's a great idea!

Andrei
September 21, 2012
Re: Review of Andrei's std.benchmark
On 21 September 2012 07:17, Andrei Alexandrescu
<SeeWebsiteForEmail@erdani.org> wrote:

> On 9/20/12 10:05 AM, Manu wrote:
>
>> Memory locality is often the biggest contributing
>> performance hazard in many algorithms, and usually the most
>> unpredictable. I want to know about that in my measurements.
>> Reproducibility is not as important to me as accuracy. And I'd rather be
>> conservative(/pessimistic) with the error.
>>
>> What guideline would you apply to estimate 'real-world' time spent when
>> always working with hyper-optimistic measurements?
>
> The purpose of std.benchmark is not to estimate real-world time. (That is
> the purpose of profiling.) Instead, benchmarking measures and provides a
> good proxy of that time for purposes of optimizing the algorithm. If work
> is done on improving the minimum time given by the benchmark framework, it
> is reasonable to expect that performance in-situ will also improve.


Okay, I can buy this distinction in terminology.
What I'm typically more interested in is profiling. I do occasionally need
to do some benchmarking by your definition, so I'll find this useful, but
should there then be another module to provide a 'profiling' API? Or
could profiling be worked into this same API?
September 21, 2012
Re: Review of Andrei's std.benchmark
On 9/21/12 10:58 AM, Manu wrote:
> What I'm typically more interested in is profiling. I do occasionally
> need to do some benchmarking by your definition, so I'll find this
> useful, but should there then be another module to provide a 'profiling'
> API? Or could profiling be worked into this same API?

That's a good angle. Profiling is currently done by the -profile switch, 
and there are a couple of library functions associated with it. To my 
surprise, that documentation page has not been ported to the dlang.org 
style: http://digitalmars.com/ctg/trace.html

I haven't yet thought whether std.benchmark should add more 
profiling-related primitives. I'd opine for releasing it without such 
for the time being.


Thanks,

Andrei