November 30, 2006
Hi.

I'm currently working on a research project as part of my Honours, and one of the requirements is speed--the code I'm writing has to be very efficient.

Before I started, my supervisor warned me about object-oriented programming and that it seems to be much slower than just flat function calls.

Anyway, I was wondering what the difference between three kinds of function calls would be:

1. Foo x; x.call(); // Where Foo is a struct
2. Foo_call(x); // C-style API
3. auto y = new FooClass; y.call(); // Where call is final

I hooked up a test app which used a loop with 100,000 iterations for each call type, and ran that program 100 times, and averaged the outputs.

#1 was 2.84 times slower than #2, and #3 was 3.15 times slower than #2.
  Are those numbers right??  Is it really that much slower?  I would
have thought that they should have been about the same since each one
needs to pass only one thing: a pointer.  I've attached the test
programs I used; if anyone can offer any insight or even corrections,
I'd be very grateful.

Incidentally, any other insights regarding performance differences between OO-style and flat procedural-style would be very welcome.

	-- Daniel


November 30, 2006
Daniel Keep wrote:
> Hi.
> 
> I'm currently working on a research project as part of my Honours, and one of the requirements is speed--the code I'm writing has to be very efficient.
> 
> Before I started, my supervisor warned me about object-oriented programming and that it seems to be much slower than just flat function calls.
> 
> Anyway, I was wondering what the difference between three kinds of function calls would be:
> 
> 1. Foo x; x.call(); // Where Foo is a struct
> 2. Foo_call(x); // C-style API
> 3. auto y = new FooClass; y.call(); // Where call is final
> 
> I hooked up a test app which used a loop with 100,000 iterations for each call type, and ran that program 100 times, and averaged the outputs.
> 
> #1 was 2.84 times slower than #2, and #3 was 3.15 times slower than #2. 
>  Are those numbers right??  Is it really that much slower?  I would have thought that they should have been about the same since each one needs to pass only one thing: a pointer.  I've attached the test programs I used; if anyone can offer any insight or even corrections, I'd be very grateful.
> 
> Incidentally, any other insights regarding performance differences between OO-style and flat procedural-style would be very welcome.
> 
>     -- Daniel
> 
> 
> ------------------------------------------------------------------------
> 
> #!/bin/bash
> 
> for I in {1..100}; do
>     ./struct_calls
> done | awk '
>     BEGIN {sum1 = 0; sum2 = 0; count = 0;}
>     {sum1 += $5; sum2 += $11; count += 1;
>      print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11;}
>     END {print "Averages: " sum1/count ", " sum2/count}
> ' | tee struct_calls.log
> 
> 
> 
> ------------------------------------------------------------------------
> 
> module struct_calls;
> 
> import std.perf;
> import std.stdio;
> 
> const COUNT = 100_000u;
> 
> struct Foo
> {
>     uint dummy;
> 
>     void call()
>     {
>         //this.dummy = arg;
>     }
> }
> 
> class FooClass
> {
>     uint dummy;
> 
>     final void call()
>     {
>         //this.dummy = arg;
>     }
> }
> 
> void Foo_call(Foo* _this)
> {
>     //_this.dummy = arg;
> }
> 
> void main()
> {
>     Foo x;
> 
>     scope perf = new HighPerformanceCounter();
> 
>     perf.start();
>     for( uint i=0; i<COUNT; i++ )
>         x.call();
>     perf.stop();
> 
>     // count1: x.call()
>     auto count1 = perf.periodCount();
> 
>     perf.start();
>     for( uint i=0; i<COUNT; i++ )
>         Foo_call(&x);
>     perf.stop();
> 
>     // count2: Foo_call(&x)
>     auto count2 = perf.periodCount();
> 
>     scope y = new FooClass();
> 
>     perf.start();
>     for( uint i=0; i<COUNT; i++ )
>         y.call();
>     perf.stop();
> 
>     // count3: y.call()
>     auto count3 = perf.periodCount();
> 
>     writefln("%d / %d = %f -- %d / %d = %f",
>             count1, count2, cast(real)count1 / cast(real)count2,
>             count3, count2, cast(real)count3 / cast(real)count2);
> }
> 


A call to a final method shouldn't be any slower than straight funciton call.  But I have heard grumblings that final in D doesn't actually work as spec'ed.

But anyway the main point of this reply is that even if method call is 3x slower than function call, it really means nothing if function call overhead is 0.01% of your run time to begin with.  There's a popular thing to teach in computer architecture classes called "Amdahl's Law", and basically to paraphrase it says don't fixate on optimizing things that represent trivial fractions of the overall runtime to begin with. For example if method calls represents 0.03% of your overall runtime the BEST speedup you could achieve by eliminating all calls entirely would be 0.03%.  Pretty trivial.  It's like trying to figure out how to improve driving times from home to work, and focusing on increasing your speed in your driveway.  Even if you achieve a 100x speedup in how long you spend driving on your driveway, you've still only sped up the hour long drive by a second or two.  Not worth the effort.

In other words, for all but the most rare of cases, the benifits reaped in terms of code reability, maintainablity, and speed of development when using virtual functions vastly outweighs the tiny cost.

What you should do is put something representative of the work you actually plan to do inside those functions in your test program and see if the 3x difference in call speed actually makes any significant difference.

--bb
November 30, 2006
Daniel Keep wrote:
> I hooked up a test app which used a loop with 100,000 iterations for each call type, and ran that program 100 times, and averaged the outputs.

What flags did you use to compile it?
November 30, 2006
Bill Baxter wrote:
> Daniel Keep wrote:
> 
>> Hi.
>>
>> I'm currently working on a research project as part of my Honours, and one of the requirements is speed--the code I'm writing has to be very efficient.
>>
>> Before I started, my supervisor warned me about object-oriented programming and that it seems to be much slower than just flat function calls.
>>
>> Anyway, I was wondering what the difference between three kinds of function calls would be:
>>
>> 1. Foo x; x.call(); // Where Foo is a struct
>> 2. Foo_call(x); // C-style API
>> 3. auto y = new FooClass; y.call(); // Where call is final
>>
>> I hooked up a test app which used a loop with 100,000 iterations for each call type, and ran that program 100 times, and averaged the outputs.
>>
>> #1 was 2.84 times slower than #2, and #3 was 3.15 times slower than #2.  Are those numbers right??  Is it really that much slower?  I would have thought that they should have been about the same since each one needs to pass only one thing: a pointer.  I've attached the test programs I used; if anyone can offer any insight or even corrections, I'd be very grateful.
>>
>> Incidentally, any other insights regarding performance differences between OO-style and flat procedural-style would be very welcome.
>>
>>     -- Daniel
>>
>>
>> ------------------------------------------------------------------------
>>
>> #!/bin/bash
>>
>> for I in {1..100}; do
>>     ./struct_calls
>> done | awk '
>>     BEGIN {sum1 = 0; sum2 = 0; count = 0;}
>>     {sum1 += $5; sum2 += $11; count += 1;
>>      print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11;}
>>     END {print "Averages: " sum1/count ", " sum2/count}
>> ' | tee struct_calls.log
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>> module struct_calls;
>>
>> import std.perf;
>> import std.stdio;
>>
>> const COUNT = 100_000u;
>>
>> struct Foo
>> {
>>     uint dummy;
>>
>>     void call()
>>     {
>>         //this.dummy = arg;
>>     }
>> }
>>
>> class FooClass
>> {
>>     uint dummy;
>>
>>     final void call()
>>     {
>>         //this.dummy = arg;
>>     }
>> }
>>
>> void Foo_call(Foo* _this)
>> {
>>     //_this.dummy = arg;
>> }
>>
>> void main()
>> {
>>     Foo x;
>>
>>     scope perf = new HighPerformanceCounter();
>>
>>     perf.start();
>>     for( uint i=0; i<COUNT; i++ )
>>         x.call();
>>     perf.stop();
>>
>>     // count1: x.call()
>>     auto count1 = perf.periodCount();
>>
>>     perf.start();
>>     for( uint i=0; i<COUNT; i++ )
>>         Foo_call(&x);
>>     perf.stop();
>>
>>     // count2: Foo_call(&x)
>>     auto count2 = perf.periodCount();
>>
>>     scope y = new FooClass();
>>
>>     perf.start();
>>     for( uint i=0; i<COUNT; i++ )
>>         y.call();
>>     perf.stop();
>>
>>     // count3: y.call()
>>     auto count3 = perf.periodCount();
>>
>>     writefln("%d / %d = %f -- %d / %d = %f",
>>             count1, count2, cast(real)count1 / cast(real)count2,
>>             count3, count2, cast(real)count3 / cast(real)count2);
>> }
>>
> 
> 
> A call to a final method shouldn't be any slower than straight funciton call.  But I have heard grumblings that final in D doesn't actually work as spec'ed.
> 
> But anyway the main point of this reply is that even if method call is 3x slower than function call, it really means nothing if function call overhead is 0.01% of your run time to begin with.  There's a popular thing to teach in computer architecture classes called "Amdahl's Law", and basically to paraphrase it says don't fixate on optimizing things that represent trivial fractions of the overall runtime to begin with. For example if method calls represents 0.03% of your overall runtime the BEST speedup you could achieve by eliminating all calls entirely would be 0.03%.  Pretty trivial.  It's like trying to figure out how to improve driving times from home to work, and focusing on increasing your speed in your driveway.  Even if you achieve a 100x speedup in how long you spend driving on your driveway, you've still only sped up the hour long drive by a second or two.  Not worth the effort.
> 
> In other words, for all but the most rare of cases, the benifits reaped in terms of code reability, maintainablity, and speed of development when using virtual functions vastly outweighs the tiny cost.
> 
> What you should do is put something representative of the work you actually plan to do inside those functions in your test program and see if the 3x difference in call speed actually makes any significant difference.
> 
> --bb

I'm well aware that the time it takes to call into the function and pass back out may be a very small amount of the overall time, but when there's a 2-3x difference between supposedly identical code, that puts up red warning flags in my head.  If this is so much slower, what other things are slower?

Also, normally I would agree that the benefits of "elegant" code outweigh the performance drop.  But not in this case.  I'm not sure if I can go into specifics, but the system will potentially be simulating millions of individual agents across a networked cluster, with the eventual aim to become faster than realtime.

As for putting code representative of what it'll be doing... I can't at this stage since I'm still building the utility libraries.  That representative code doesn't exist yet.

I don't want to get bogged down in premature optimisation, but I also want to eliminate any potential bottlenecks early if it isn't too much effort.

In any case, I just want to understand where the difference is coming from.  If I understand the difference, I can hopefully make better decisions.  My supervisor originally wanted the code written in C, and I want to prove that D is more than up to the task.

	-- Daniel

P.S.  To give you an idea of how badly I want to use D: I ported CWEB over to support D :)
November 30, 2006
Walter Bright wrote:
> Daniel Keep wrote:
> 
>> I hooked up a test app which used a loop with 100,000 iterations for each call type, and ran that program 100 times, and averaged the outputs.
> 
> 
> What flags did you use to compile it?

I didn't use any flags; I was worried about the compiler being "smart" and inlining the functions calls which would... well, eliminate them entirely :P

If there are any flags you can recommend that would give more realistic results, I'd love to know.

	-- Daniel

November 30, 2006
Daniel Keep wrote:
> Walter Bright wrote:
> 
>> Daniel Keep wrote:
>>
>>> I hooked up a test app which used a loop with 100,000 iterations for each call type, and ran that program 100 times, and averaged the outputs.
>>
>>
>>
>> What flags did you use to compile it?
> 
> 
> I didn't use any flags; I was worried about the compiler being "smart" and inlining the functions calls which would... well, eliminate them entirely :P
> 
> If there are any flags you can recommend that would give more realistic results, I'd love to know.
> 
>     -- Daniel
> 

I just recompiled and re-ran with some different flags.  The results are... interesting.

(Numbers are #1/#2 and #3/#2)

-O: 1.39892 and 3.61384
-O -release: 1.00703 and 1.0163
-O -inline: 1.89054 and 13.4001
-O -release -inline: 1.8257 and 1.00007

Now I've got *no* idea what to think :P

	-- Daniel
November 30, 2006
Daniel Keep wrote:
> Bill Baxter wrote:
> 
> As for putting code representative of what it'll be doing... I can't at this stage since I'm still building the utility libraries.  That representative code doesn't exist yet.

You can at least put in a loop that adds a bunch of numbers up.
Or writes to elements in an array passed in or something like that.

I think D won't inline anything with a loop, so that would also help to rule out the inlining affecting results.

Another thing the D optimizer does, though, (reportedly) is to turn virtual calls into non-virtual ones when the actual class is known at compile time, so it that could affect your results too.

--bb
November 30, 2006
Daniel Keep wrote:
> I just recompiled and re-ran with some different flags.  The results are... interesting.
> 
> (Numbers are #1/#2 and #3/#2)
> 
> -O: 1.39892 and 3.61384
> -O -release: 1.00703 and 1.0163
> -O -inline: 1.89054 and 13.4001
> -O -release -inline: 1.8257 and 1.00007
> 
> Now I've got *no* idea what to think :P

You really need to use obj2asm on the results. It'll make things a lot more sensible!
November 30, 2006
Walter Bright wrote:
> Daniel Keep wrote:
> 
>> I just recompiled and re-ran with some different flags.  The results are... interesting.
>>
>> (Numbers are #1/#2 and #3/#2)
>>
>> -O: 1.39892 and 3.61384
>> -O -release: 1.00703 and 1.0163
>> -O -inline: 1.89054 and 13.4001
>> -O -release -inline: 1.8257 and 1.00007
>>
>> Now I've got *no* idea what to think :P
> 
> 
> You really need to use obj2asm on the results. It'll make things a lot more sensible!

Isn't obj2asm linux only?  I used ndisasm on the object files, but my x86 assembler isn't quite good enough to work out what's going on :P

	-- Daniel
November 30, 2006
Daniel Keep wrote:
> 
> I just recompiled and re-ran with some different flags.  The results are... interesting.
> 
> (Numbers are #1/#2 and #3/#2)
> 
> -O: 1.39892 and 3.61384
> -O -release: 1.00703 and 1.0163
> -O -inline: 1.89054 and 13.4001
> -O -release -inline: 1.8257 and 1.00007
> 
> Now I've got *no* idea what to think :P

For what it's worth, I've noticed that inlining of struct methods is inconsistent (the last case).  I ran some tests a while back and in some cases simply changing the method to/from a static method was enough to affect whether it was inlined.  And I think how the method is used is a factor as well.  I am using structs with opCall as default comparators in my algorithm code because they have the potential to be inlined, while delegates currently do not, but even:

struct IsEqual(T) { bool opCall(T a, T b) { return a == b; } }

only inlines in some of the algorithms.  And this may be true of inlining in general--I simply haven't looked into the problem enough.


Sean
« First   ‹ Prev
1 2
Top | Discussion index | About this forum | D home