On 31 May 2013 20:58, Joseph Rushton Wakeling <joseph.wakeling@webdrake.net> wrote:
On 05/31/2013 08:34 AM, Manu wrote:
> What's taking the most time?
> The lighting loop is so template-tastic, I can't get a feel for how fast that
> loop would be.

Hah, I found this out the hard way recently -- have been doing some experimental
reworking of code where some key inner functions were templatized, and it had a
nasty effect on performance.  I'm guessing it made it impossible for the
compilers to inline these functions :-(

I find that using templates actually makes it more likely for the compiler to inline properly. But I think totally generic expressions produce cases where the compiler has to consider too many possibilities, and that inhibits many optimisations.
It might also be that the optimisations get a lot more complex when the code fragments span a complex call tree, with optimisation dependencies on non-deterministic inlining.

One of the most important jobs for the optimiser is code re-ordering. Generic code is often written in a way that makes it hard or impossible for the optimiser to reorder the flattened code properly.
Hand written code can have branches and memory accesses carefully placed at the appropriate locations.
Generic code will usually package those sorts of operations behind little templates that often flatten out in a different order.
The optimiser is rarely able to re-order code across if statements or pointer accesses. __restrict is very important in generic code because it allows the optimiser to reorder across indirections. Without it, compilers typically have to be conservative, presume that something somewhere may have changed the memory a pointer refers to, and leave the order as the template expanded. Sadly, D doesn't even support __restrict, and nobody ever uses it in C++ anyway.
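
To make the aliasing point concrete, here is a minimal C++ sketch. The function names are made up for illustration, and it assumes a compiler that accepts the non-standard __restrict keyword (GCC, Clang and MSVC all do). What the optimiser actually does with each version depends on the compiler and flags, so treat the comments as the typical case rather than a guarantee.

```cpp
#include <cstddef>

// Hypothetical example: scale src by *scale into dst.
// dst and scale are both float pointers, so as far as the compiler knows a
// store to dst[i] might overwrite *scale. It typically has to reload *scale
// on every iteration and keep loads and stores in program order.
void scale_may_alias(float* dst, const float* src, const float* scale, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * *scale;
}

// Same loop, but the caller promises the three pointers never overlap.
// The compiler is now free to load *scale once and reorder or vectorise
// the loads and stores. Breaking the promise is undefined behaviour.
void scale_no_alias(float* __restrict dst,
                    const float* __restrict src,
                    const float* __restrict scale,
                    std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * *scale;
}
```

Both versions give the same results when the buffers really don't overlap; the difference only shows up in the generated code.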

I've always had better results from writing precisely what I intend the compiler to do, and using __forceinline where it needs a little extra encouragement.
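
For what it's worth, a rough C++ sketch of the kind of thing I mean. __forceinline is the MSVC spelling; the FORCE_INLINE macro, and the mix and blend functions, are hypothetical names for this example, and the GCC/Clang branch assumes their always_inline attribute.

```cpp
#include <cstddef>

// Hypothetical portability macro: pick the strongest inlining request
// the compiler offers, falling back to plain inline elsewhere.
#if defined(_MSC_VER)
  #define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
  #define FORCE_INLINE inline __attribute__((always_inline))
#else
  #define FORCE_INLINE inline
#endif

// Small helper written out explicitly rather than through a generic
// expression template, with a strong request to inline it into its callers.
FORCE_INLINE float mix(float a, float b, float t)
{
    return a + (b - a) * t;
}

// Inner loop that relies on mix() being flattened into the loop body.
void blend(float* out, const float* a, const float* b, float t, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = mix(a[i], b[i], t);
}
```

Even __forceinline is only a strong hint on MSVC, so it's still worth checking the disassembly of the hot loop.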