Thread overview
Method Call Overhead Stats for GDC (function vs virtual vs final)
Jul 25, 2008
eris
Jul 25, 2008
Jason House
Jul 25, 2008
superdan
July 25, 2008
Hello D programmers.

I'm implementing a simulation in D that will likely make sustained and intense use of method calls.  Naturally I wanted to determine the overhead for various types of call invocations.  Throwing all advice against "premature optimization" aside I wrote a short program to test this.

One of the primary questions I wanted answered was "Would marking methods with the 'final' attribute help the compiler optimize those calls into something very close to a simple function call?"

Here are the class definitions to test the idea:

// ***********************
static void fnmeth() {
	j++;
}

class A
{
	void Ameth() {
		j++;
	}
}

class B : A
{
	final void Bmeth() {
		j++;
	}
}

final class C : A
{
	void Cmeth() {
		j++;
	}
}

// *****************

The first definition is a simple static function call.  I'll use this as a baseline.
The second one is a simple virtual method.
The third is a method explicitly marked as 'final'.
The fourth is a class marked as 'final' which would imply that all methods are final.

 The j++ bump is simply something to populate the method and possibly keep it from being optimized into non-existence early on.

I ran each of these calls 4 million times to look at their respective performance.  My program uses the tango StopWatch object to handle the performance timing.  Unfortunately I'm running under VMWare and as I found out later the timers are screwed up when running under a VM.

The upside of all this is that I was forced to disassemble the executable to determined what sort of optimizations gdc performed.  This is actually a good thing since it tells you more about the code generator than you would otherwise be able to see.

Here are the results:

draco% gdc --version gdc (GCC) 4.2.3 20080225 (prerelease gdc 0.25 20071215, using dmd 1.022) (Ubuntu 0.25-4.2.3-2ubuntu2)

(no optimization)

Call Type                   Call Overhead  (x86 instructions)        Internal Method Overhead (x86 ins)
------------                    ---------------------------------------------         ------------------------------------------------
static function call            1 (direct call)                                         4 (push,mov,pop,ret)
virtual method call           7 (indirect call)                                      8 (push,mov,sub,mov,mov,call,leave,ret)
final method call               3 (setup & direct call)                         8 (same as above)
final class                           7 (indirect call)                                     8 (same as above)

(optimization level 1)

Call Type                   Call Overhead  (x86 instructions)        Internal Method Overhead (x86 ins)
------------                    ---------------------------------------------         ------------------------------------------------
static function call            1 (direct call)                                         4 (push,mov,pop,ret)
virtual method call           3 (indirect call)                                      8 (push,mov,sub,mov,mov,call,leave,ret)
final method call               3 (setup & direct call)                         8 (same as above)
final class                           3 (indirect call)                                     8 (same as above)

(optimization level 2 yielded no improvements in call setup or overhead, but inlined the function call)

(optimization level 3 yielded no significant improvements over that)

(Different instructions will of course have different number of cycles, dispatching etc so the function call count is only a general guide).

Interesting conclusion

(1) At -O1, both final and fully virtual methods have the same call overhead and the same internal method overhead.  The only difference is the final method is a direct call.

(2) With no optimization, the final method call setup is 3 instructions while the final class call setup is 7 instructions.  At -O1 the compiler figures out that they're the same.

(2) Internal method overhead for any class method invocation looks to be around 2.5x the call setup overhead at -O1.  This is something I hadn't considered.

Lesson learned:

If you're looking towards efficiency for frequently invoked methods, consider other ways of getting to the data.  In other words avoid over using get()/set() type calls and don't force everything into extreme levels of encapsulation just to be "politically correct".  You will pay the price.

BTW: Does anyone know why gdc is doing this " mov    %esi,(%esp)".  Thats two instructions before a direct call.  Why would you want to store the contents of the string index register into the top of the stack?  I'm not doing any string or byte functions.

Hmm.

Eris

July 25, 2008
A few days ago, I added "final" to many of my speed critical function calls and got a 20% speed gain to my simulation-based application.  Maybe it was from extra inlining?

eris Wrote:

> Hello D programmers.
> 
> I'm implementing a simulation in D that will likely make sustained and intense use of method calls.  Naturally I wanted to determine the overhead for various types of call invocations.  Throwing all advice against "premature optimization" aside I wrote a short program to test this.
> 
> One of the primary questions I wanted answered was "Would marking methods with the 'final' attribute help the compiler optimize those calls into something very close to a simple function call?"
July 25, 2008
Jason House Wrote:

> A few days ago, I added "final" to many of my speed critical function calls and got a 20% speed gain to my simulation-based application.  Maybe it was from extra inlining?

it all connects together. final disables virtual. that in turn allows inlining. after inlining about a brazillion optimizations apply. after those yet another brazillion optimizations apply. some compilers even do some fixed point iteration shit to figure out the best order and sequence of applying optimizations.

this is all an entangled mess making predictions unreliable like shit. measuring on any concrete case (like the op has done) does little in the way of predicting performance improvements for other code.