April 08, 2012
On 08/04/2012 17:48, Dmitry Olshansky wrote:
> On 08.04.2012 12:16, Somedude wrote:
> [snip]
>>
>> Like it. Would it be a good idea to add a column with the average memory used?
> 
> In general it's next to impossible and/or entirely OS-specific.
> What can be done, I think, is adding a query function to the GC interface that
> returns the amount of RAM currently allocated on its heap.
> 
> So adding GC heap usage can be done, although it's not really a stable metric. This way one gets a nice hint on why something is slow ;)
> 

That's what I thought, thank you.
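
(For illustration, here is a minimal sketch of such a query using today's core.memory.GC.stats; availability and field names depend on the druntime version, and this reports GC-heap usage only, not the process's total memory.)

import core.memory : GC;
import std.stdio : writefln;

// Report how much of the GC heap is currently in use.
void reportGcHeap(string label)
{
    const s = GC.stats();   // bytes used / free on the GC heap
    writefln("%s: used = %s bytes, free = %s bytes", label, s.usedSize, s.freeSize);
}

void main()
{
    reportGcHeap("before");
    auto data = new ubyte[](1_000_000);   // allocate ~1 MB on the GC heap
    reportGcHeap("after");
}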
April 08, 2012
On 4/8/12 3:03 PM, Manfred Nowak wrote:
> Andrei Alexandrescu wrote:
>
>> Clearly there is noise during normal use as well, but
>> incorporating it in benchmarks as a matter of course reduces the
>> usefulness of benchmarks
>
> On the contrary:
> 1) The "noise during normal use" has to be measured in order to detect
> the sensitivity of the benchmarked program to that noise.

That sounds quite tenuous to me. How do you measure it, and what conclusions do you draw other than that there's more or less other stuff going on on the machine, and the machine itself has complex interactions?

Far as I can tell a time measurement result is:

T = A + Q + N

where:

A > 0 is the actual benchmark time

Q > 0 is quantization noise (uniform distribution)

N > 0 is various other noise (interrupts, task switching, networking, the CPU dynamically changing frequency, etc.). Many people jump on Gaussian as an approximation, but my tests suggest it's hardly so because it has a lot of jerky outliers.

How do we estimate A given T?
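
(For concreteness, a sketch of the estimator the rest of this thread converges on: since Q and N only ever add, every measured T overestimates A, so the minimum over many runs approaches A from above. The helper below uses std.datetime.stopwatch and is purely illustrative, not the proposed std.benchmark API.)

import core.time : Duration;
import std.algorithm.comparison : min;
import std.datetime.stopwatch : AutoStart, StopWatch;

// Every measured time is A plus non-negative noise, so the smallest
// observation over many trials is the best available estimate of A.
Duration estimateA(void delegate() run, size_t trials = 1_000)
{
    auto best = Duration.max;
    foreach (i; 0 .. trials)
    {
        auto sw = StopWatch(AutoStart.yes);
        run();
        sw.stop();
        best = min(best, sw.peek());
    }
    return best;
}

// usage (someFunctionUnderTest is a placeholder):
// auto a = estimateA({ someFunctionUnderTest(); });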

> 2) The noise the benchmarked program produces has to be measured too,
> because the running benchmarked program probably increases the noise
> for all other running programs.

How to measure that? Also, that noise does not need to be measured as much as eliminated to the extent possible. This is because the benchmark app noise is a poor model of the application-induced noise.

> In addition: the noise produced by a machine under heavy load might
> bring the performance of the benchmarked program down to zero.

Of course. That's why the documentation emphasizes the necessity of baselines. A measurement without baselines is irrelevant.
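
(As a generic illustration of why a baseline matters, not of how std.benchmark defines its baselines: measure an empty baseline with the same harness under the same conditions and subtract it, so that harness and environment overhead is not charged to the code under test.)

import core.time : Duration, usecs;

// Net cost = measured time minus the baseline measured alongside it.
Duration net(Duration measured, Duration baseline)
{
    return measured > baseline ? measured - baseline : Duration.zero;
}

unittest
{
    assert(net(120.usecs, 20.usecs) == 100.usecs);
}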


Andrei
April 09, 2012
On 08.04.2012 21:31, Andrei Alexandrescu wrote:
> On 4/8/12 11:59 AM, Denis Shelomovskij wrote:
>> Very good, but the minimum isn't the best guess. Personally, I (and there will
>> be a lot of such maniacs, I suppose) will think that this (minimum) time
>> can be significantly smaller than the average time.
>
> I've analyzed this quite a bit at work and the average and median are
> not very informative. You need the mode, but in most benchmarks the mode
> is very close to the minimum, so using the minimum is even better.
>
> In speed measurements, all noise is additive (there's no noise that may
> make a benchmark appear to run faster). There are also quite a few
> outliers. Recording the average will include a fair amount of noise.

Yes, of course. I mean that an algorithm itself should follow some restrictions for such a measurement: it should run at the same speed every time. Many algorithms follow this convention, but not all.

Why will recording the average produce so much noise? As I see, floating point arithmetic is now used without a strong reason so it looks like a time of this part isn't valuable. Or is it just a temporary solution?

Anyway, it should be configurable using a compile-time parameter, so it will not burden anyone who doesn't need it.

>
> Clearly there is noise during normal use as well, but incorporating it
> in benchmarks as a matter of course reduces the usefulness of benchmarks
> as a means to improve the performance of the benchmarked code.

A graph is needed exactly because of that. Without a graph, it really tells very little.

>
>> So a parameter (probably with a default value) should be added.
>> Something like enum of flags telling what we want to know. At least
>> these looks usable: minTime, <some mean time>, maxTime,
>> standardDeviation, graph (yes, good old ASCII art).
>
> Standard deviation is also not very useful because it includes all
> outliers (some of which are very far away from the mode).

So a graph is needed.

>
>> Yes, graph is needed.
>
> I am not sure about that. We may provide the raw measurement data for
> programs that want to plot things, but plotting is beyond the charter of
> std.benchmark.
>

Sorry, I meant a histogram, not a curve. A histogram can be shown in a console very well. And it is needed because it's the easiest way to show the benchmarked program's behaviour (and the noise behaviour). It also requires only about 80 integers to store the information and shouldn't itself add much noise.

IMHO, a histogram gives lots of information and will be a good addition.
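
(A rough sketch of what such a console histogram could look like, assuming the timings are collected as doubles. The bucket count matches the ~80 integers mentioned above; the layout and the '#' bars are arbitrary choices, not anything std.benchmark specifies.)

import std.algorithm.searching : maxElement, minElement;
import std.array : replicate;
import std.stdio : writefln;

// One line per bucket: the bucket's lower bound followed by a bar whose
// length is proportional to the number of samples that fell into it.
void printHistogram(const double[] samples, size_t buckets = 80, size_t barWidth = 40)
{
    if (samples.length == 0) return;
    immutable lo = samples.minElement;
    immutable hi = samples.maxElement;
    immutable span = hi > lo ? hi - lo : 1.0;
    auto counts = new size_t[](buckets);
    foreach (s; samples)
    {
        auto i = cast(size_t)((s - lo) / span * buckets);
        counts[i >= buckets ? buckets - 1 : i]++;
    }
    immutable peak = counts.maxElement;
    foreach (b, c; counts)
        writefln("%10.3f | %s", lo + span * b / buckets,
                 replicate("#", c ? (c * barWidth + peak - 1) / peak : 0));
}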


-- 
Денис В. Шеломовский
Denis V. Shelomovskij
April 09, 2012
On 4/9/12 2:06 AM, Denis Shelomovskij wrote:
> Why will recording the average produce so much noise?

As I explained, the average takes noise and outliers (some very large, e.g. milliseconds in a benchmark that takes microseconds) into account. The minimum is shielded from this issue. In the limit, the minimum for infinitely many measurements is the sought-after result.
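
(A toy simulation of that point, with a made-up noise model: a fixed true cost, a little additive noise on every run, and rare large outliers. The mean gets dragged well above the true cost by the outliers, the median drifts with the ever-present noise, and the minimum sits right next to the true cost.)

import std.algorithm.iteration : sum;
import std.algorithm.searching : minElement;
import std.algorithm.sorting : sort;
import std.random : Random, uniform;
import std.stdio : writefln;

void main()
{
    enum trueTime = 100.0;                          // "A": the real cost, in microseconds
    auto rng = Random(42);
    double[] samples;
    foreach (i; 0 .. 1_000)
    {
        auto noise = uniform(0.0, 2.0, rng);        // small ever-present noise
        if (uniform(0, 100, rng) < 2)               // occasional large outlier,
            noise += uniform(500.0, 5_000.0, rng);  // e.g. a task switch
        samples ~= trueTime + noise;                // noise only ever adds
    }
    sort(samples);
    writefln("min    = %.1f", samples.minElement);
    writefln("median = %.1f", samples[$ / 2]);
    writefln("mean   = %.1f", samples.sum / samples.length);
}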

> As I see, floating
> point arithmetic is now used without a strong reason so it looks like a
> time of this part isn't valuable. Or is it just a temporary solution?

I don't understand "time of this part".

> Anyway, it should be configurable using a compile-time parameter, so it will
> not burden anyone who doesn't need it.

We'd like the framework to do the right thing, and the average does not seem to be it.

> IMHO, a histogram gives lots of information and will be a good addition.

I disagree.


Andrei


April 09, 2012
Andrei Alexandrescu wrote:


> all noise is additive (there's no noise that may make a benchmark
> appear to run faster)

This is in doubt, because you yourself wrote "the machine itself has complex interactions". These complex interactions might lower the time needed for an operation of the benchmarked program.

Examples that come to mind:
a) needed data is already in a (faster) cache because it belongs to a
memory block, from which some data is needed by some program not
belonging to the benchmarked set---and that block isn't replaced yet.
b) needed data is stored in a hdd whose I/O scheduler uses the elevator
algorithm and serves the request by pure chance instantly, because the
position of the needed data is between two positions accessed by some
programs not belonging to the benchmarked set.

Especially a hdd, if used, will be responsible for a lot of noise you
define as "quantization noise (uniform distribution)" even if the head
stays at the same cylinder. Not recognizing this noise would only mean
that the data is cached and interpreting the only true read from the
hdd as a jerky outlier seems quite wrong.


>> 1) The "noise during normal use" has to be measured in order to detect the sensitivity of the benchmarked program to that noise.
> How do you measure it, and what
> conclusions do you draw other than that there's more or less other
> stuff going on on the machine, and the machine itself has complex
> interactions?
> 
> Far as I can tell a time measurement result is:
> 
> T = A + Q + N

For example, by running more than one instance of the benchmarked program in parallel and use the thereby gathered statistical routines to spread T into the additive components A, Q and N.


>> 2) The noise the benchmarked program produces has to be measured too, because the running benchmarked program probably increases the noise for all other running programs.
> 
> How to measure that?

Similar to the above note.


> Also, that noise does not need to be measured
> as much as eliminated to the extent possible.

I wouldn't define two programs to be equivalent based only on the time to completion. That time might be identical for both programs, but if only one of them increases the response time of the machine to unacceptable levels, I would choose the other.

-manfred
April 09, 2012
Added to trello.

-Steve
April 09, 2012
Why is there so much emphasis on printBenchmarks()?

benchmark() and runBenchmarks() are clearly the core of this library, and yet they are treated as second-class citizens: "Oh, I guess you can use this". Normally, I wouldn't be so picky, but this is a standard library. Focus should be on functionality.

Providing formatted output is a nice bonus, but to me, it's just a bonus. Any benchmarking part of a large project is bound to format the output itself (to log benchmark results against revisions in a database or something like that).

Also, benchmark() and runBenchmarks() are kind of confusing at first glance. Something along the lines of benchmarkModule() and benchmarkAllModules() would be more sensible.
April 09, 2012
On 4/9/12 10:23 AM, Francois Chabot wrote:
> Why is there so much emphasis on printBenchmarks()?
>
> benchmark() and runBenchmarks() are clearly the core of this library,
> and yet they are treated as second-class citizens: "Oh, I guess you can
> use this". Normally, I wouldn't be so picky, but this is a standard
> library. Focus should be on functionality.

The functionality is to make benchmarking easy to use, meaningful, and easy to interpret. I don't want to add a complicated library for postprocessing benchmarks because almost nobody will use it.

The first function in the documentation is what most people will want to use. The functions that provide the data are eminently available, so I disagree with the "second-class citizen" characterization. You want to use them, use them. They don't need to be given rockstar billing.


Andrei
April 09, 2012
On 4/9/12 9:25 AM, Manfred Nowak wrote:
> Andrei Alexandrescu wrote:
>> all noise is additive (there's no noise that may make a benchmark
>> appear to run faster)
>
> This is in doubt, because you yourself wrote "the machine itself has
> complex interactions". These complex interactions might lower the time
> needed for an operation of the benchmarked program.
>
> Examples that come to mind:
> a) needed data is already in a (faster) cache because it belongs to a
> memory block, from which some data is needed by some program not
> belonging to the benchmarked set---and that block isn't replaced yet.

Which is great, unless the program wants to measure the cache memory itself, in which case it would use special assembler instructions or large memset()s. (We do such at Facebook.)

> b) needed data is stored in a hdd whose I/O scheduler uses the elevator
> algorithm and serves the request by pure chance instantly, because the
> position of the needed data is between two positions accessed by some
> programs not belonging to the benchmarked set.
>
> Especially a hdd, if used, will be responsible for a lot of noise you
> define as "quantization noise (uniform distribution)" even if the head
> stays at the same cylinder. Not recognizing this noise would only mean
> that the data is cached and interpreting the only true read from the
> hdd as a jerky outlier seems quite wrong.

If the goal is to measure the seek time of the HDD, the benchmark itself should make sure the HDD cache is cleared. (What I recall they do on Linux is unmounting and remounting the drive.) Otherwise, it adds a useless component to the timing.
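
(A hedged sketch of automating that remount trick from a harness; the mount point is hypothetical, the command needs the appropriate privileges, and whether it is the right way to drop the cache depends on the setup.)

import std.process : executeShell;
import std.stdio : writeln;

// Unmount and remount the filesystem under test so the next read actually
// hits the disk instead of the page cache.
void dropDiskCache(string mountPoint = "/mnt/benchdisk")
{
    auto r = executeShell("umount " ~ mountPoint ~ " && mount " ~ mountPoint);
    if (r.status != 0)
        writeln("remount failed: ", r.output);
}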

>>> 1) The "noise during normal use" has to be measured in order to
>>> detect the sensitivity of the benchmarked program to that noise.
>> How do you measure it, and what
>> conclusions do you draw other than that there's more or less other
>> stuff going on on the machine, and the machine itself has complex
>> interactions?
>>
>> Far as I can tell a time measurement result is:
>>
>> T = A + Q + N
>
> For example, by running more than one instance of the benchmarked
> program in parallel and use the thereby gathered statistical routines
> to spread T into the additive components A, Q and N.

I disagree with running two benchmarks in parallel because that exposes them to even more noise (scheduling, CPU count, current machine load, etc.). I don't understand the part of the sentence starting with "...use the thereby..."; I'd be grateful if you elaborated.


Andrei
April 09, 2012
> Which is great, unless the program wants to measure the cache memory itself, in which case it would use special assembler instructions or large memset()s. (We do such at Facebook.)

I disagree. If a regression suddenly causes a function to become
heavily cache-bound, it should show up in benchmarks somehow,
regardless of the previous expected behavior of the function.

--
Francois