April 09, 2012
On 09/04/2012 17:23, Francois Chabot wrote:
> Why is there so much emphasis on printBenchmarks()?
> 
> benchmark() and runBenchmarks() are clearly the core of this library, and yet they are relegated to second-class citizens: "Oh, I guess you can use this". Normally, I wouldn't be so picky, but this is a standard library. The focus should be on functionality.
> 
> Providing formatted output is a nice bonus, but to me, it's just a bonus. Any benchmarking part of a large project is bound to format the output itself (to log benchmark results against revisions in a database or something like that).
> 
> Also, benchmark() and runBenchmarks() are kind of confusing at first
> glance. Something along the lines of benchmarkModule() and
> benchmarkAllModules() would be more sensible.

The printBenchmarks facility is cool and should be included, imho.
It helps make benchmarking as standard as unit testing.
We don't want to write the same boilerplate code again and again
for such trivial uses.
April 09, 2012
On 4/9/12 11:29 AM, Francois Chabot wrote:
>> Which is great, unless the program wants to measure the cache memory
>> itself, in which case it would use special assembler instructions or
>> large memset()s. (We do such at Facebook.)
>
> I disagree. If a regression suddenly causes a function to become
> heavily cache-bound, it should show up in benchmarks somehow,
> regardless of the previous expected behavior of the function.

But cache binding depends on a variety of cache characteristics, i.e. the machine. The question is whether we accept a heavy dependence of the benchmark on the machine.

Andrei


April 09, 2012
On 4/9/12 11:44 AM, Somedude wrote:
> It helps make benchmarking as standard as unit testing.
> We don't want to write the same boilerplate code again and again
> for such trivial uses.

Yes, I had unittest in mind when writing the library. If one needs more than one statement to get an informative benchmark off the ground, we failed.

Andrei
April 09, 2012
Andrei Alexandrescu wrote:

>> For example by running more than one instance of the benchmarked program in parallel and use the thereby gathered statistical routines to spread T into the additive components A, Q and N.

> I disagree with running two benchmarks in parallel because that exposes them to even more noise (scheduling, CPU count, current machine load etc).

I did not mean running two or more benchmarks in parallel, but only running more instances of the benchmarked program _without_ the environment supplied by the benchmark framework. Of course there may be more noise. But if so, that noise depends on the additional running instances and on nothing else.

Now let T(n) be the time needed to run n instances in parallel on a single processor. According to your definition and your remark above:

§   T(n) = n*A + Q + N + P(n)

where P(n)>=0 and P(1)==0 is the additional noise with which the benchmarked program disturbs itself.

Please observe that Q and N cancel out on subtraction:
§  T(2) - T(1) = A + P(2) - P(1)
§  T(3) - T(2) = A + P(3) - P(2)
...

P(n+1) - P(n) measures the sensitivity of the benchmarked program to its own noise. As you wrote yourself, its value should be close to zero until the machine is close to falling flat.
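
To make this concrete, here is a small sketch (with purely hypothetical
timings, not real measurements) of how A could be estimated from such
differences:

import std.algorithm : map, sum;
import std.range : zip;
import std.stdio : writefln;

void main()
{
    // Hypothetical measurements T(1) .. T(4), in milliseconds.
    double[] T = [130.0, 232.0, 335.0, 437.0];

    // Successive differences T(n+1) - T(n) = A + P(n+1) - P(n).
    auto diffs = zip(T[1 .. $], T[0 .. $ - 1]).map!(p => p[0] - p[1]);

    // If the self-induced noise P stays small, the differences cluster
    // around A; averaging them estimates A with Q and N already removed.
    writefln("estimated A: %.1f ms", diffs.sum / (T.length - 1));
}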


> I don't understand the part of the sentence starting with "...use the thereby...", I'd be grateful if you elaborated.

Ohh... an accidental deletion that I did not notice. I had written:
| use the thereby gathered _data to feed_ statistical routines to
| spread T
as elaborated above.

-manfred
April 10, 2012
Andrei Alexandrescu wrote:
> Hello,
> 
> 
> I finally found the time to complete std.benchmark. I got to a very simple API design, starting where I like it: one line of code.
> 
> Code is in the form of a pull request at https://github.com/D-Programming-Language/phobos/pull/529. (There's some noise in there caused by my git n00biness). Documentation is at http://erdani.com/d/web/phobos-prerelease/std_benchmark.html.
> 
> If reasonable and at all possible, I'd like to bump the priority of this proposal. Clearly D's user base is highly interested in efficiency, and many of the upcoming libraries have efficiency as a vital part of their design. So we should get std.benchmark in soon and require that new additions come with benchmarks for their essential functionality.
> 
> My vision is that in the future Phobos will have a significant battery of benchmarks, which will help improve Phobos and port it to new platforms.

I come to think the same.

Why are the times-based relative report and the percentage-based relative report mixed in one result? And how do I choose which one I'd like to see in the output?

When benchmarking you can measure different things at the same time. In
this regard the current proposal is limited: it just measures wall clock
time. I believe extending StopWatch to measure e.g. user CPU time would
be a useful addition.
In general, allowing user-defined measurements would be great,
e.g. to measure the time spent in user mode:
(){
    import core.sys.posix.sys.times : times, tms;
    tms t;
    times(&t);
    return t.tms_utime; // user CPU time, in clock ticks
}

Note that this code does not need to be portable. You can also use a
version() block with a static assert in the else branch.
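
For instance, the non-portable part can be guarded like this (just a
sketch of the idiom, not part of the proposal):

version (Posix)
{
    // use times() as above
}
else
{
    static assert(false, "user CPU time measurement not implemented on this platform");
}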

Things that come to mind that I'd like to measure.
Time measurements:
  * User CPU time
  * System CPU time
  * Time spent in memory allocations
Count measurements:
  * Memory usage
  * L1/L2/L3 cache misses
  * Number of executed instructions
  * Number of memory allocations

Of course wall clock time is the ultimate measure when benchmarking. But often you need to investigate further (doing more measurements).
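
As another POSIX-only example (just a sketch, not part of the proposed
API), user and system CPU time can both be read via getrusage:

import core.sys.posix.sys.resource : getrusage, rusage, RUSAGE_SELF;

// Returns the current process's user and system CPU time, in microseconds.
long[2] cpuTimes()
{
    rusage usage;
    getrusage(RUSAGE_SELF, &usage);
    long user = usage.ru_utime.tv_sec * 1_000_000L + usage.ru_utime.tv_usec;
    long sys  = usage.ru_stime.tv_sec * 1_000_000L + usage.ru_stime.tv_usec;
    return [user, sys];
}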

Do you think adding this is worthwhile?

All in all the user interface has been greatly simplified.

Jens
April 10, 2012
> I've analyzed this quite a bit at work and the average and median are not very informative. You need the mode, but in most benchmarks the mode is very close to the minimum, so using the minimum is even better.
>
How is it better?

> In speed measurements, all noise is additive (there's no noise that may make a benchmark appear to run faster). There are also quite a few outliers. Recording the average will include a fair amount of noise.
>
The benchmarked function itself could have a relevant variance, e.g.
when using randomized algorithms. You already get that from the LRU table
in the array append cache.
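
To make the concern concrete, here is a toy simulation with made-up
numbers (not a real benchmark): with purely additive noise the minimum
tracks the true cost, but with genuine run-to-run variance it hides the
typical cost.

import std.algorithm : minElement, sum;
import std.random : Random, uniform;
import std.stdio : writefln;

void main()
{
    auto rng = Random(42);
    double[] additive, variance;

    foreach (i; 0 .. 1000)
    {
        // (a) fixed true cost of 100us plus purely additive noise in [0, 20)us
        additive ~= 100.0 + uniform(0.0, 20.0, rng);
        // (b) randomized algorithm whose cost genuinely varies in [100, 200)us
        variance ~= uniform(100.0, 200.0, rng);
    }

    writefln("additive noise: min = %.1f  avg = %.1f",
             additive.minElement, additive.sum / additive.length);
    writefln("real variance:  min = %.1f  avg = %.1f",
             variance.minElement, variance.sum / variance.length);
}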
April 10, 2012
On 09.04.2012 17:26, Andrei Alexandrescu wrote:
> On 4/9/12 2:06 AM, Denis Shelomovskij wrote:
>> Why will recording the average produce so much noise?
>
> As I explained, the average takes noise and outliers (some very large,
> e.g. milliseconds in a benchmark that takes microseconds) into account.
> The minimum is shielded from this issue. In the limit, the minimum for
> infinitely many measurements is the sought-after result.
>
>> As I see, floating
>> point arithmetic is now used without a strong reason so it looks like a
>> time of this part isn't valuable. Or is it just a temporary solution?
>
> I don't understand "time of this part".

Looks like a misunderstanding because of my bad English: I read "Recording the average will include a fair amount of noise" as saying that the process of "recording the average" itself produces "a fair amount of noise", not that the result includes noise.


-- 
Денис В. Шеломовский
Denis V. Shelomovskij
April 14, 2012
(Resuming a few past discussions about std.benchmark.)

On 4/9/12 1:26 PM, Manfred Nowak wrote:
> Andrei Alexandrescu wrote:
>> I disagree with running two benchmarks in parallel because that
>> exposes them to even more noise (scheduling, CPU count, current
>> machine load etc).
>
> I did not mean running two or more benchmarks in parallel, but only
> running more instances of the benchmarked program _without_ the
> environment supplied by the benchmark framework. Of course there may
> be more noise. But if so, that noise depends on the additional running
> instances and on nothing else.

I doubt that's the case. Once we discuss running several instances in parallel, I predict the vagaries of scheduling become much more prevalent.

> Now let T(n) be the time needed to run n instances in parallel on a
> single processor. According to your definition and your remark above:
>
> §   T(n) = n*A + Q + N + P(n)
>
> where P(n)>=0 and P(1)==0 is the additional noise with which the
> benchmarked program disturbs itself.
>
> Please observe that Q and N cancel out on subtraction:
> §  T(2) - T(1) = A + P(2) - P(1)
> §  T(3) - T(2) = A + P(3) - P(2)
> ...
>
> P(n+1) - P(n) measures the sensitivity of the benchmarked program to
> its own noise. As you wrote yourself, its value should be close to
> zero until the machine is close to falling flat.

It would be great if that applied, but I'm afraid things get significantly more complex. For example, the noise will have a per-processor component, and the processes will compete for cache among themselves.


Andrei
April 14, 2012
On 4/10/12 5:40 AM, Jens Mueller wrote:
> Why are the times-based relative report and the percentage-based
> relative report mixed in one result? And how do I choose which one
> I'd like to see in the output?

It's in the names. If the name of a benchmark starts with benchmark_relative_, then that benchmark is considered relative to the last non-relative benchmark. Using a naming convention allows complete automation in benchmarking a module.

I figure it's fine that all results appear together because the absence of data in the relative column clarifies which is which.
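
To illustrate the mechanism only (this is not the actual std.benchmark
code, and the benchmark_* functions below are made-up examples),
compile-time reflection can pick up such functions by name:

module bench_sketch;

import std.algorithm : startsWith;
import std.stdio : writeln;

void benchmark_append(uint n) { /* baseline: run the operation n times */ }
void benchmark_relative_appendFast(uint n) { /* reported relative to the baseline above */ }

void main()
{
    // Walk the module's members and pick out the benchmark functions by name.
    foreach (name; __traits(allMembers, bench_sketch))
    {
        static if (name.startsWith("benchmark_"))
        {
            writeln(name, name.startsWith("benchmark_relative_")
                    ? "  (relative to the last non-relative benchmark)"
                    : "  (absolute)");
        }
    }
}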

> When benchmarking you can measure different things at the same time. In
> this regard the current proposal is limited: it just measures wall clock
> time. I believe extending StopWatch to measure e.g. user CPU time would
> be a useful addition.

Generally I fear piling too much on StopWatch because every feature adds its own noise. But there's value in collecting the result of times(). What would be the Windows equivalent?

> In general, allowing user-defined measurements would be great,
> e.g. to measure the time spent in user mode:
> (){
>     import core.sys.posix.sys.times : times, tms;
>     tms t;
>     times(&t);
>     return t.tms_utime; // user CPU time, in clock ticks
> }
>
> Note that this code does not need to be portable. You can also use a
> version() block with a static assert in the else branch.
>
> Things that come to mind that I'd like to measure.
> Time measurements:
>    * User CPU time
>    * System CPU time
>    * Time spent in memory allocations
> Count measurements:
>    * Memory usage
>    * L1/L2/L3 cache misses
>    * Number of executed instructions
>    * Number of memory allocations
>
> Of course wall clock time is the ultimate measure when benchmarking.
> But often you need to investigate further (doing more measurements).
>
> Do you think adding this is worthwhile?

Absolutely. I just fear expanding the charter of the framework too much. Let's see:

* Memory usage is, I think, difficult to obtain on Windows.
* Estimating cache misses and executed instructions is a significant research effort.
* Counting the number of memory allocations requires instrumenting druntime.


Andrei
April 15, 2012
There have been quite a few good comments, but no one has offered to act as review manager. Could someone please take this role?

Again, it would be great to get std.benchmark in sooner rather than later because it can be used by subsequent submissions (many of which allege efficiency as a major advantage) to show they improve over the state of the art.

Thanks,

Andrei