Thread overview
Faster Virtual Method Dispatch
Apr 23, 2006
Craig Black
Apr 23, 2006
kris
Apr 23, 2006
BCS
Apr 23, 2006
BCS
Apr 23, 2006
Craig Black
Apr 23, 2006
Hasan Aljudy
Apr 23, 2006
Craig Black
Apr 23, 2006
Dave
Apr 23, 2006
kris
Apr 24, 2006
Dan
Apr 24, 2006
Craig Black
Apr 24, 2006
BCS
Apr 24, 2006
Craig Black
Apr 24, 2006
kris
Apr 25, 2006
Craig Black
Apr 25, 2006
kris
Apr 25, 2006
Sean Kelly
Apr 25, 2006
Walter Bright
Apr 25, 2006
Craig Black
Apr 25, 2006
Walter Bright
Apr 25, 2006
Craig Black
Apr 25, 2006
Sean Kelly
Apr 25, 2006
Craig Black
Apr 25, 2006
Sean Kelly
Apr 25, 2006
Mike Capp
Apr 25, 2006
Craig Black
Apr 25, 2006
Sean Kelly
Apr 25, 2006
Craig Black
Apr 25, 2006
Sean Kelly
Apr 25, 2006
Craig Black
C++ vs. D memory management performance (Was: Faster Virtual Method Dispatch)
Apr 27, 2006
Sean Kelly
Apr 27, 2006
Sean Kelly
Apr 27, 2006
Craig Black
Apr 28, 2006
Sean Kelly
Apr 28, 2006
Sean Kelly
Apr 29, 2006
Craig Black
Apr 28, 2006
Sean Kelly
Apr 29, 2006
Craig Black
Apr 29, 2006
Sean Kelly
Apr 26, 2006
Walter Bright
Apr 26, 2006
Mike Capp
Apr 26, 2006
Walter Bright
Apr 24, 2006
Sean Kelly
Apr 26, 2006
James Dunne
Apr 26, 2006
xs0
Apr 26, 2006
James Dunne
Apr 26, 2006
Walter Bright
Apr 26, 2006
James Dunne
Apr 26, 2006
James Dunne
Apr 26, 2006
Dave
Apr 27, 2006
James Dunne
Apr 26, 2006
Walter Bright
Apr 27, 2006
James Dunne
Apr 27, 2006
Walter Bright
Apr 27, 2006
James Dunne
Apr 27, 2006
Walter Bright
Apr 27, 2006
Craig Black
Apr 27, 2006
James Dunne
Apr 27, 2006
Craig Black
Apr 27, 2006
Craig Black
Apr 27, 2006
James Dunne
Apr 27, 2006
Craig Black
Apr 28, 2006
James Dunne
April 23, 2006
Most programmers do not realize the overhead involved with calling a method virtually.  A virtual method call is an O(1) operation.  Just a vtable lookup and invoking a function pointer.  However, we must also consider that the CPU has no way of performing branch prediction with vtable lookups, so the pipeline is trashed.  Thus, virtual method calls incur a large performance penalty.

Interestingly, I found that one of the MSVC++ 2005 compiler's profile guided optimizations is replacing the vtable lookup with if statements for the most commonly called virtual methods followed by the vtable lookup, so that branch prediction can be performed in the majority of cases.

I do not consider myself an expert in the area of compiler optimization, but knowing the little that I do know I have come up with a simple approach that may improve performance.  This approach is described below.

Performing a binary search using if statements, and then invoking the methods directly rather than using function pointers would allow the method dispatch to take advantage of pipelining.  Algorithmically, however, this is no longer O(1), but O(log n), which is bad.  So I would assume that it would only be faster if n (or the number of methods in the vtable) is lower than a given threshold.  This threshold would be determined by testing and would perhaps be different on different processors.  Thus, in order to ensure that this approach is always faster, a relatively low number should be used.

Like I said I'm not an expert when it comes to low-level optimization strategies.  I don't know how well such an algorithm would work in practice. Comments are obviously welcome.

-Craig


April 23, 2006
Craig Black wrote:
> Most programmers do not realize the overhead involved with calling a method virtually.  A virtual method call is an O(1) operation.  Just a vtable lookup and invoking a function pointer.  However, we must also consider that the CPU has no way of performing branch prediction with vtable lookups, so the pipeline is trashed.  Thus, virtual method calls incur a large performance penalty.
> 
> Interestingly, I found that one of the MSVC++ 2005 compiler's profile guided optimizations is replacing the vtable lookup with if statements for the most commonly called virtual methods followed by the vtable lookup, so that branch prediction can be performed in the majority of cases.
> 
> I do not consider myself an expert in the area of compiler optimization, but knowing the little that I do know I have come up with a simple approach that may improve performance.  This approach is described below.
> 
> Performing a binary search using if statements, and then invoking the methods directly rather than using function pointers would allow the method dispatch to take advantage of pipelining.  Algorithmically, however, this is no longer O(1), but O(log n), which is bad.  So I would assume that it would only be faster if n (or the number of methods in the vtable) is lower than a given threshold.  This threshold would be determined by testing and would perhaps be different on different processors.  Thus, in order to ensure that this approach is always faster, a relatively low number should be used.
> 
> Like I said I'm not an expert when it comes to low-level optimization strategies.  I don't know how well such an algorithm would work in practice. Comments are obviously welcome.
> 
> -Craig 
> 
> 


I'm no expert either, Craig, yet it seems that there's more to it? An indirect address lookup (via a vtable) doesn't trash the pipeline per se; I forget the cycles involved but there's very little overhead at all if the vtable is a "popular" one, or otherwise happens to be in the cache lines. None of the other in-flight instructions have to be flushed at that time. If the vtable is not cached, then a bubble will appear in the pipeline while main-memory is addressed. But, again (as I understand it), this doesn't invalidate any in-flight instructions; just feeds the core less efficiently.

On the other hand, a branch misprediction must invalidate a large number of in-flight instructions; usually all of them (unless multi-path speculative execution is occurring). The P4 is particularly maimed by this, as compared to its peers, because of its particularly long pipeline (which, in turn, was needed to reach the frequencies Intel marketeers felt they needed to drive AMD off the map).

Obviously I don't know enough about this, but it might be interesting to research how one might introduce some locality-of-reference bias on the more common calls made. For instance, some fancy runtime might rewrite "common" calls to a dynamically constructed "global vtable". That would perhaps avoid the misprediction concerns, whilst managing to retain the globally popular set of lookups within cache?

Clearly MS compiler-engineers have noticed something beneficial. On the other hand, it seems a bit like flogging a dead-horse to introduce something that will likely cause at least one branch misprediction? I suppose the idea is that a direct-call (versus an indirect one) will enable speculative execution within the impending function-call itself?

Then again, heavyweight speculative execution is starting to lose favour against a grouping of simple symmetrical cores on one die (a la Cell, Kilocore, Niagara, etc). The idea there is that it doesn't matter if a bubble appears, or a core stalls; if you partition an app to take advantage of more than one execution thread, then it will (overall) execute faster anyway. The silicon saved via simplicity is applied to other areas such as massive on-die bandwidth. Of course, that doesn't help single-threaded applications, but the industry is changing, albeit slowly :)

Wither the Transputer? <g>

2 Cents
April 23, 2006
In article <e2gkms$rur$1@digitaldaemon.com>, kris says...
>
>Craig Black wrote:
>> Most programmers do not realize the overhead involved with calling a method virtually.  A virtual method call is an O(1) operation.  Just a vtable lookup and invoking a function pointer.  However, we must also consider that the CPU has no way of performing branch prediction with vtable lookups, so the pipeline is trashed.  Thus, virtual method calls incur a large performance penalty.
>> 
>> Interestingly, I found that one of the MSVC++ 2005 compiler's profile guided optimizations is replacing the vtable lookup with if statements for the most commonly called virtual methods followed by the vtable lookup, so that branch prediction can be performed in the majority of cases.
>> 
>> I do not consider myself an expert in the area of compiler optimization, but knowing the little that I do know I have come up with a simple approach that may improve performance.  This approach is described below.
>> 
>> Performing a binary search using if statements, and then invoking the methods directly rather than using function pointers would allow the method dispatch to take advantage of pipelining.  Algorithmically, however, this is no longer O(1), but O(log n), which is bad.  So I would assume that it would only be faster if n (or the number of methods in the vtable) is lower than a given threshold.  This threshold would be determined by testing and would perhaps be different on different processors.  Thus, in order to ensure that this approach is always faster, a relatively low number should be used.
>> 
>> Like I said I'm not an expert when it comes to low-level optimization strategies.  I don't know how well such an algorithm would work in practice. Comments are obviously welcome.
>> 
>> -Craig
>> 
>> 
>
>
>I'm no expert either, Craig, yet it seems that there's more to it? An indirect address lookup (via a vtable) doesn't trash the pipeline per se; I forget the cycles involved but there's very little overhead at all if the vtable is a "popular" one, or otherwise happens to be in the

I'm not following your train of thought completely but...
I don't think that the cache hit is the issue; I think the fact that the
function call is to an indirect address is the issue. The CPU might not be able
to safely predict where it's supposed to go until it gets all the way to the
function call. As a result the instruction queue gets completely drained.

>cache lines. None of the other in-flight instructions have to be flushed at that time. If the vtable is not cached, then a bubble will appear in the pipeline while main-memory is addressed. But, again (as I understand it), this doesn't invalidate any in-flight instructions; just feeds the core less efficiently.
>
>On the other hand, a branch misprediction must invalidate a large number of in-flight instructions; usually all of them (unless multi-path speculative execution is occurring). The P4 is particularly maimed by this, as compared to its peers, because of its particularly long pipeline (which, in turn, was needed to reach the frequencies Intel marketeers felt they needed to drive AMD off the map).
>
>Obviously I don't know enough about this, but it might be interesting to research how one might introduce some locality-of-reference bias on the more common calls made. For instance, some fancy runtime might rewrite "common" calls to a dynamically constructed "global vtable". That would perhaps avoid the misprediction concerns, whilst managing to retain the globally popular set of lookups within cache?
>
>Clearly MS compiler-engineers have noticed something beneficial. On the other hand, it seems a bit like flogging a dead-horse to introduce something that will likely cause at least one branch misprediction? I suppose the idea is that a direct-call (versus an indirect one) will enable speculative execution within the impending function-call itself?
>
>Then again, heavyweight speculative execution is starting to lose favour against a grouping of simple symmetrical cores on one die (a la Cell, Kilocore, Niagara, etc). The idea there is that it doesn't matter if a bubble appears, or a core stalls; if you partition an app to take advantage of more than one execution thread, then it will (overall) execute faster anyway. The silicon saved via simplicity is applied to other areas such as massive on-die bandwidth. Of course, that doesn't help single-threaded applications, but the industry is changing, albeit slowly :)
>
>Wither the Transputer? <g>
>
>2 Cents


April 23, 2006
In article <e2ggdq$lat$1@digitaldaemon.com>, Craig Black says...
>
>Most programmers do not realize the overhead involved with calling a method virtually.  A virtual method call is an O(1) operation.  Just a vtable lookup and invoking a function pointer.  However, we must also consider that the CPU has no way of performing branch prediction with vtable lookups, so the pipeline is trashed.  Thus, virtual method calls incur a large performance penalty.
>
>Interestingly, I found that one of the MSVC++ 2005 compiler's profile guided optimizations is replacing the vtable lookup with if statements for the most commonly called virtual methods followed by the vtable lookup, so that branch prediction can be performed in the majority of cases.
>
>I do not consider myself an expert in the area of compiler optimization, but knowing the little that I do know I have come up with a simple approach that may improve performance.  This approach is described below.
>
>Performing a binary search using if statements, and then invoking the methods directly rather than using function pointers would allow the method dispatch to take advantage of pipelining.  Algorithmically, however, this is no longer O(1), but O(log n), which is bad.  So I would assume that it would only be faster if n (or the number of methods in the vtable) is lower than a

IIRC what is being searched for isn't which entry in this vtbl, but which vtbl to use. Or maybe I am misunderstanding you.

>given threshold.  This threshold would be determined by testing and would perhaps be different on different processors.  Thus, in order to ensure that this approach is always faster, a relatively low number should be used.
>
>Like I said I'm not an expert when it comes to low-level optimization strategies.  I don't know how well such an algorithm would work in practice. Comments are obviously welcome.
>
>-Craig


Maybe some sort of run-and-review optimization could be done. This would entail having the compiler build a heavily instrumented version of the program and then run it a bunch to find out which classes are most commonly used at a given call and then on the next build let the compiler optimize each call based on the measured data.

If this is class ABC
call ABC.foo
else if XYZ
call XYZ.foo
else
use the vtbl


April 23, 2006
"BCS" <BCS_member@pathlink.com> wrote in message news:e2go6m$13u7$1@digitaldaemon.com...
> In article <e2ggdq$lat$1@digitaldaemon.com>, Craig Black says...
>>
>>Most programmers do not realize the overhead involved with calling a
>>method
>>virtually.  A virtual method call is an O(1) operation.  Just a vtable
>>lookup and invoking a function pointer.  However, we must also consider
>>that the CPU has no way of performing branch prediction with vtable
>>lookups,
>>so the pipeline is trashed.  Thus, virtual method calls incur a large
>>performance penalty.
>>
>>Interestingly, I found that one of the MSVC++ 2005 compiler's profile guided
>>optimizations is replacing the vtable lookup with if statements for the
>>most
>>commonly called virtual methods followed by the vtable lookup, so that
>>branch prediction can be performed in the majority of cases.
>>
>>I do not consider myself an expert in the area of compiler optimization,
>>but
>>knowing the little that I do know I have come up with a simple approach
>>that
>>may improve performance.  This approach is described below.
>>
>>Performing a binary search using if statements, and then invoking the
>>methods directly rather than using function pointers would allow the method
>>dispatch to take advantage of pipelining.  Algorithmically, however, this
>>is
>>no longer O(1), but O(log n), which is bad.  So I would assume that it
>>would
>>only be faster if n (or the number of methods in the vtable) is lower than
>>a
>
> IIRC what is being searched for isn't which entry in this vtbl, but which
> vtbl
> to use. Or maybe I am miss understanding you.

Nope you are right.  Which vtable.  My bad.

>>given threshold.  This threshold would be determined by testing and would
>>perhaps be different on different processors.  Thus, in order to ensure
>>that
>>this approach is always faster, a relatively low number should be used.
>>
>>Like I said I'm not an expert when it comes to low-level optimization
>>strategies.  I don't know how well such an algorithm would work in
>>practice.
>>Comments are obviously welcome.
>>
>>-Craig
>
>
> Maybe some sort of run-and-review optimization could be done. This would
> entail
> having the compiler build a heavily instrumented version of the program
> and then
> run it a bunch to find out which classes are most commonly used at a given
> call
> and then on the next build let the compiler optimize each call based on
> the
> measured data.
>
> If this is class ABC
> call ABC.foo
> else if XYZ
> call XYZ.foo
> else
> use the vtbl

Yes, like MSVC's profile guided optimization.  A more general optimization would be nice too, but you could probably be even faster and more efficient with a profile-guided one.  One real good thing about profile guided optimizations is that the compiler can spend more time on optimizing performance-critical code sections and less time on code that is not pertinent to performance.

-Craig


April 23, 2006
I'm far from being an expert... but I thought vtable lookups involve two or three jump/call instructions at the assembly level; at least that's the impression I got from stepping through code in the VC++ debugger.

Craig Black wrote:
> Most programmers do not realize the overhead involved with calling a method virtually.  A virtual method call is an O(1) operation.  Just a vtable lookup and invoking a function pointer.  However, we must also consider that the CPU has no way of performing branch prediction with vtable lookups, so the pipeline is trashed.  Thus, virtual method calls incur a large performance penalty.
> 
> Interestingly, I found that one of the MSVC++ 2005 compiler's profile guided optimizations is replacing the vtable lookup with if statements for the most commonly called virtual methods followed by the vtable lookup, so that branch prediction can be performed in the majority of cases.
> 
> I do not consider myself an expert in the area of compiler optimization, but knowing the little that I do know I have come up with a simple approach that may improve performance.  This approach is described below.
> 
> Performing a binary search using if statements, and then invoking the methods directly rather than using function pointers would allow the method dispatch to take advantage of pipelining.  Algorithmically, however, this is no longer O(1), but O(log n), which is bad.  So I would assume that it would only be faster if n (or the number of methods in the vtable) is lower than a given threshold.  This threshold would be determined by testing and would perhaps be different on different processors.  Thus, in order to ensure that this approach is always faster, a relatively low number should be used.
> 
> Like I said I'm not an expert when it comes to low-level optimization strategies.  I don't know how well such an algorithm would work in practice. Comments are obviously welcome.
> 
> -Craig 
> 
> 
April 23, 2006
Craig Black wrote:
> Most programmers do not realize the overhead involved with calling a method virtually.  A virtual method call is an O(1) operation.  Just a vtable lookup and invoking a function pointer.  However, we must also consider that the CPU has no way of performing branch prediction with vtable lookups, so the pipeline is trashed.  Thus, virtual method calls incur a large performance penalty.
> 
> Interestingly, I found that one of the MSVC++ 2005 compiler's profile guided optimizations is replacing the vtable lookup with if statements for the most commonly called virtual methods followed by the vtable lookup, so that branch prediction can be performed in the majority of cases.
> 
> I do not consider myself an expert in the area of compiler optimization, but knowing the little that I do know I have come up with a simple approach that may improve performance.  This approach is described below.
> 
> Performing a binary search using if statements, and then invoking the methods directly rather than using function pointers would allow the method dispatch to take advantage of pipelining.  Algorithmically, however, this is no longer O(1), but O(log n), which is bad.  So I would assume that it would only be faster if n (or the number of methods in the vtable) is lower than a given threshold.  This threshold would be determined by testing and would perhaps be different on different processors.  Thus, in order to ensure that this approach is always faster, a relatively low number should be used.
> 
> Like I said I'm not an expert when it comes to low-level optimization strategies.  I don't know how well such an algorithm would work in practice. Comments are obviously welcome.
> 
> -Craig 
> 

This is a little OT because, rather than being a way to optimize VMD, it involves removing the need for a lot of VMD altogether before the assembly code is even generated.

But, I think it would be good to take a look at 'auto-finalization' where the compiler front-end determines that a virtual method is not needed so just does a direct call.

Not only can more direct calls be made, but potentially the vtable gets smaller, the lookups will be less frequent and there will be no need for any branch-prediction for direct calls, getting rid of much of the need for VMD optimizations for many programs in the first place.

Maybe the semantic processing differences of 'import' over #include would make this more straightforward to do for D compilers than for C++.

- Dave
April 23, 2006
"Hasan Aljudy" <hasan.aljudy@gmail.com> wrote in message news:e2gr8j$18r2$1@digitaldaemon.com...
> I'm far from being an expert .. but I thought vtable lookups involve two or three jump/call instructions at the assembly level; at least that's the impression I got from stepping through code in the VC++ debugger.

From one non-expert to another, I have no idea how many assembly language instructions are involved.  However, I have been informed by some very experienced programmers that virtual methods don't mesh well with pipelining.  This is the main performance issue with calling a virtual method.

-Craig


April 23, 2006
Dave wrote:
> Craig Black wrote:
> 
>> Most programmers do not realize the overhead involved with calling a method virtually.  A virtual method call is an O(1) operation.  Just a vtable lookup and invoking a function pointer.  However, we must also consider that the CPU has no way of performing branch prediction with vtable lookups, so the pipeline is trashed.  Thus, virtual method calls incur a large performance penalty.
>>
>> Interestingly, I found that one of the MSVC++ 2005 compiler's profile guided optimizations is replacing the vtable lookup with if statements for the most commonly called virtual methods followed by the vtable lookup, so that branch prediction can be performed in the majority of cases.
>>
>> I do not consider myself an expert in the area of compiler optimization, but knowing the little that I do know I have come up with a simple approach that may improve performance.  This approach is described below.
>>
>> Performing a binary search using if statements, and then invoking the methods directly rather than using function pointers would allow the method dispatch to take advantage of pipelining.  Algorithmically, however, this is no longer O(1), but O(log n), which is bad.  So I would assume that it would only be faster if n (or the number of methods in the vtable) is lower than a given threshold.  This threshold would be determined by testing and would perhaps be different on different processors.  Thus, in order to ensure that this approach is always faster, a relatively low number should be used.
>>
>> Like I said I'm not an expert when it comes to low-level optimization strategies.  I don't know how well such an algorithm would work in practice. Comments are obviously welcome.
>>
>> -Craig
> 
> 
> This is a little OT because it involves removing the need for a lot of VMD altogether before the assembly code is even generated instead of a way to optimize the VMD.
> 
> But, I think it would be good to take a look at 'auto-finalization' where the compiler front-end determines that a virtual method is not needed so just does a direct call.
> 
> Not only can more direct calls be made, but potentially the vtable gets smaller, the lookups will be less frequent and there will be no need for any branch-prediction for direct calls, getting rid of much of the need for VMD optimizations for many programs in the first place.
> 
> Maybe the semantic processing differences of 'import' over #include would make this more straight-forward to do for D compilers over C++.
> 
> - Dave

Yes, I recall seeing Walter write something along these lines somewhere in the D documentation: being able to compile the entire program, versus linking things in from a library, permits much more aggressive finalization of methods as you describe. That's definitely a benefit.
April 24, 2006
Actually, having "if statements" is absolutely TERRIBLE for trying to improve branch prediction.  By their very nature, if statements are branches, so you are causing at least one branch misprediction anyway.

Furthermore, a call through a pointer is not free just because it is O(1).  What the computer is doing is performing a:

LEA (reg) [(v_pointer)]; J(cc)|JMP [(reg)];

My understanding is that the v_pointer address is known at compile time; however, it implies:

- that the v_table's cache line has to be loaded every time a function is called
- that the extra lea buggers instruction pairing, tacking on several cycles
- the potential that the v_table access is a cache miss (highly unlikely a page miss)

Interestingly however, if one were to prepend some Self Modifying Code to the program that instead loaded the vtable into a set of addresses attached to the vtable, then one could achieve the benefits of static functions and most of the benefits of virtual functions at the cost of having an extra void*.sizeof for every function call, and the SMC code overhead.

The problem is that *some* functions would thereafter be virtual so that a single pointer could be used to point to any of a set of functions - a case not easily rendered with static functions (but still possible remembering that all code is just data)

That said, I think the overhead involved in using virtual functions is quite reasonable, and comparable to the overhead of using 'calling conventions' versus an internally-optimized system with primarily register arguments.  It's 7-8 clocks every function call with a little bit of cache overhead and the potential for a cache miss.

Ultimately though, I think the GC causes the highest level of performance troubles on the benchmarks.  I love it to death though and wish someone would get their hands dirty making it smaller and sexier.

Thanks!

