May 31, 2013
On 31 May 2013 21:05, Timon Gehr <timon.gehr@gmx.ch> wrote:

> On 05/31/2013 12:58 PM, Joseph Rushton Wakeling wrote:
>
>> On 05/31/2013 08:34 AM, Manu wrote:
>>
>>> What's taking the most time?
>>> The lighting loop is so template-tastic, I can't get a feel for how fast
>>> that
>>> loop would be.
>>>
>>
>> Hah, I found this out the hard way recently -- have been doing some
>> experimental
>> reworking of code where some key inner functions were templatized, and it
>> had a
>> nasty effect on performance.  I'm guessing it made it impossible for the
>> compilers to inline these functions :-(
>>
>>
> That wouldn't make any sense though, since after template expansion there is no difference between the generated version and a particular handwritten version.
>

Assuming that you would hand-write exactly the same code as the template
expansion...
Typically template expansion leads to countless temporary redundancies,
which you expect the compiler to try and optimise away, but it's not always
able to do so, especially if there is an if() nearby, or worse, a pointer
dereference.
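A minimal sketch of the kind of situation meant here (the names are made up,
and whether the redundant reload actually survives depends on the compiler
and flags):

// The generic helper re-reads through the pointer on every call, because
// after the store to output[i] the compiler can't prove *scale is unchanged.
float scaled(T)(T* scale, float v) { return *scale * v; }

void genericLoop(float* scale, const(float)* input, float* output, size_t n)
{
    foreach (i; 0 .. n)
        output[i] = scaled(scale, input[i]); // *scale may be reloaded each time
}

void handWritten(float* scale, const(float)* input, float* output, size_t n)
{
    const s = *scale;                        // the load hoisted by hand
    foreach (i; 0 .. n)
        output[i] = s * input[i];
}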


May 31, 2013
On 05/31/2013 01:48 PM, Manu wrote:
> I find that using templates actually makes it more likely for the compiler to properly inline. But I think the totally generic expressions produce cases where the compiler is considering too many possibilities that inhibit many optimisations. It might also be that the optimisations get a lot more complex when the code fragments span across a complex call tree with optimisation dependencies on non-deterministic inlining.

Thanks for the detailed advice. :-)

There are two particular things I noted about my own code.  One is that whereas in the original the template variables were very simple (just a floating-point type), in the new version they are more complex structures that are indeed more generic (the idea was to enable the code to handle both mutable and immutable forms of one particular data structure).

The second is that the templatization gets moved from the mixin to the functions themselves.  I guess that the mixin has the effect of copy-pasting the code _as if_ I were just writing precisely what I intended.
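A minimal sketch of the distinction (the names here are hypothetical, not the real code):

// With a mixin template, the body is pasted into the aggregate, so the
// compiler sees it as if it had been written there by hand:
mixin template Rescale()
{
    void rescale(double s) { foreach (ref w; weights) w *= s; }
}

struct Graph
{
    double[] weights;
    mixin Rescale;   // no call boundary left to inline away
}

// Moving the genericity onto a free template function instead introduces a
// real call per use, which the optimiser must inline to reach the same code:
void rescale(G)(ref G graph, double s)
{
    foreach (ref w; graph.weights) w *= s;
}
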
May 31, 2013
On Friday, 31 May 2013 at 11:49:05 UTC, Manu wrote:
> I find that using templates actually makes it more likely for the compiler
> to properly inline. But I think the totally generic expressions produce
> cases where the compiler is considering too many possibilities that inhibit
> many optimisations.
> It might also be that the optimisations get a lot more complex when the
> code fragments span across a complex call tree with optimisation
> dependencies on non-deterministic inlining.
>
> One of the most important jobs for the optimiser is code re-ordering.
> Generic code is often written in such a way that makes it hard/impossible
> for the optimiser to reorder the flattened code properly.
> Hand written code can have branches and memory accesses carefully placed at
> the appropriate locations.
> Generic code will usually package those sorts of operations behind little
> templates that often flatten out in a different order.
> The optimiser is rarely able to re-order code across if statements, or
> pointer accesses. __restrict is very important in generic code to allow the
> optimiser to reorder across any indirection, otherwise compilers typically
> have to be conservative and presume that something somewhere may have
> changed the destination of a pointer, and leave the order as the template
> expanded it. Sadly, D doesn't even support __restrict, and nobody ever uses it
> in C++ anyway.
>
> I've always had better results with writing precisely what I intend the
> compiler to do, and using __forceinline where it needs a little extra
> encouragement.

Thanks for the valuable input. I've never had the pleasure of actually trying templates in performance-critical code, and this is good stuff to remember. I've added it to my notes.
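
For reference, a small D sketch of the aliasing problem described in the
quote (hypothetical names; D has no __restrict, so hoisting the load by hand
is the usual workaround):

// Without a restrict-style guarantee the compiler must assume dst and src
// may overlap, so it cannot hoist the load of src[0] out of the loop:
void blend(float* dst, const(float)* src, size_t n)
{
    foreach (i; 0 .. n)
        dst[i] += src[0] * dst[i];   // a write to dst[i] might change src[0]
}

void blendHoisted(float* dst, const(float)* src, size_t n)
{
    const s = src[0];                // hoisted by hand
    foreach (i; 0 .. n)
        dst[i] += s * dst[i];
}
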
May 31, 2013
I actually have some experience with C++ template
meta-programming in HD video codecs. My experience is that it is
possible for generic code written through TMP to match or even
beat hand-written code. Modern C++ compilers are very good, able
to optimize away most of the temporary variables and produce
very compact object code, provided you can avoid branches and
keep the arguments const refs as much as possible. A real
example: my generic TMP codec beat the original hand-optimized
C/asm version (both use SSE intrinsics) by as much as 30%, with
only a fraction of the lines of code. Another example is the
Eigen linear algebra library, which through template
meta-programming is able to match the speed of Intel MKL.

D is very strong at TMP; it provides a lot more tools
specifically designed for TMP, which makes it vastly superior
to C++, which relies on abusing templates. This is actually
the main reason drawing me to D: TMP in a more pleasant way.
IMO, one thing D needs to address is fewer surprises, e.g.
innocent-looking code like v[] = [x,x,x] shouldn't cause a
major performance hit. In C++, memory allocation is explicit:
either operator new or malloc, or indirectly through a method
call; otherwise the language will not do heap allocation for
you.
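For instance, a minimal sketch of the surprise (assuming v is a small
fixed-size float array):

void example(float x)
{
    float[3] v;

    v[] = [x, x, x];   // the array literal on the right-hand side is
                       // heap-allocated before being copied into v

    // Allocation-free ways to get the same result:
    v[0] = x; v[1] = x; v[2] = x;   // spelled out by hand
    v[] = x;                        // or a slice fill, since every element is x
}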

On Friday, 31 May 2013 at 11:51:04 UTC, Manu wrote:
> Assuming that you would hand-write exactly the same code as the template
> expansion...
> Typically template expansion leads to countless temporary redundancies,
> which you expect the compiler to try and optimise away, but it's not always
> able to do so, especially if there is an if() nearby, or worse, a pointer
> dereference.
May 31, 2013
On 31 May 2013 23:07, finalpatch <fengli@gmail.com> wrote:

> I actually have some experience with C++ template meta-programming in HD video codecs. My experience is that it is possible for generic code written through TMP to match or even beat hand-written code. Modern C++ compilers are very good, able to optimize away most of the temporary variables and produce very compact object code, provided you can avoid branches and keep the arguments const refs as much as possible. A real example: my generic TMP codec beat the original hand-optimized C/asm version (both use SSE intrinsics) by as much as 30%, with only a fraction of the lines of code. Another example is the Eigen linear algebra library, which through template meta-programming is able to match the speed of Intel MKL.
>

Just to clarify, I'm not trying to say templates are slow because they're
templates.
There's no reason carefully crafted template code couldn't be identical to
hand-crafted code.
What I am saying is that they introduce the possibility for countless
subtle details to get in the way.
If you want maximum performance from templates, you often need to be really
good at expanding the code in your mind, and visualising it all in expanded
context, so you can then reason about whether anything is likely to get in the
way of the optimiser or not.
A lot of people don't possess this skill, and for good reason: it's hard!
It usually takes considerable time to optimise template code, and optimised
template code may often only be optimal in the context you tested against.
At some point, depending on the complexity of your code, it might just be
easier/less time-consuming to write the code directly.
It's a fine line, but I've seen so much code that takes it WAAAAY too far.

There's always the unpredictable element too. Imagine a large-ish template
function where one very small detail inside is customised, producing
otherwise identical instantiations.
Let's say two routines are generated, one for int and one for long; the cost
of casting int -> long and always calling the long function would be
insignificant, but with the templates your exe just got bigger, branches got
less predictable, the icache got noisier, and there's no way to profile for
the loss of performance introduced this way. In fact, the profiler will
typically, and erroneously, lead you to believe your code is FASTER, but the
net result may be slower code.
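A rough sketch of that trade-off (hypothetical names):

// Instantiated once for int and once for long: two near-identical copies in
// the exe, two icache footprints, separate branch history.
T clampGeneric(T)(T value, T lo, T hi)
{
    return value < lo ? lo : (value > hi ? hi : value);
}

// Hand-written alternative: a single long routine; int callers pay only a
// cheap int -> long conversion at the call site.
long clampLong(long value, long lo, long hi)
{
    return value < lo ? lo : (value > hi ? hi : value);
}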

I'm attracted to D for the power of its templates too, but that attraction
is all about simplicity and readability.
In D, you can do more with less. The goal is not to use more and more
templates, but to make the few templates I use more readable and maintainable.

> D is very strong at TMP; it provides a lot more tools
> specifically designed for TMP, which makes it vastly superior to C++, which relies on abusing templates. This is actually the main reason drawing me to D: TMP in a more pleasant way. IMO, one thing D needs to address is fewer surprises, e.g. innocent-looking code like v[] = [x,x,x] shouldn't cause a major performance hit. In C++, memory allocation is explicit: either operator new or malloc, or indirectly through a method call; otherwise the language will not do heap allocation for you.


Yeah well... I have a constant inner turmoil with this in D.
I want to believe the GC is the future, but I'm still trying to convince
myself of that (and I think the GC is losing the battle at the moment).
Fortunately you can avoid the GC fairly effectively (if you forgo large
parts of Phobos!).

But things like the array initialisation are inexcusable. Array literals
should NOT allocate; this desperately needs to be fixed.
We also need scope/escape analysis, so local dynamic arrays can be lowered
onto the stack in self-contained situations.
That's the biggest source of difficult-to-control allocations in my
experience.

May 31, 2013
On 5/31/13 9:07 AM, finalpatch wrote:
> D is very strong at TMP; it provides a lot more tools
> specifically designed for TMP, which makes it vastly superior
> to C++, which relies on abusing templates. This is actually the
> main reason drawing me to D: TMP in a more pleasant way. IMO,
> one thing D needs to address is fewer surprises, e.g.
> innocent-looking code like v[] = [x,x,x] shouldn't cause a major
> performance hit. In C++, memory allocation is explicit: either
> operator new or malloc, or indirectly through a method call;
> otherwise the language will not do heap allocation for you.

It would be great if we addressed that in 2.064. I'm sure I've seen the report in bugzilla, but the closest I found were:

http://d.puremagic.com/issues/show_bug.cgi?id=9335
http://d.puremagic.com/issues/show_bug.cgi?id=8449


Andrei

May 31, 2013
Namespace:

> I thought GDC or LDC have something like:
> float[$] v = [x, x, x];
> which is converted to
> float[3] v = [x, x, x];
>
> Am I wrong?
> DMD needs something like this too.

Right. Vote (currently only 6 votes):
http://d.puremagic.com/issues/show_bug.cgi?id=481
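
A minimal sketch of the difference (float[$] is the syntax requested in that
issue, not something that compiles today):

void example(float x)
{
    float[3] a = [x, x, x];    // works now: the length written out by hand
    // float[$] b = [x, x, x]; // the requested form, with the length inferred
}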

Bye,
bearophile
May 31, 2013
On Fri, 31 May 2013 10:49:21 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 5/31/13 9:07 AM, finalpatch wrote:
>> D is very strong at TMP; it provides a lot more tools
>> specifically designed for TMP, which makes it vastly superior
>> to C++, which relies on abusing templates. This is actually the
>> main reason drawing me to D: TMP in a more pleasant way. IMO,
>> one thing D needs to address is fewer surprises, e.g.
>> innocent-looking code like v[] = [x,x,x] shouldn't cause a major
>> performance hit. In C++, memory allocation is explicit: either
>> operator new or malloc, or indirectly through a method call;
>> otherwise the language will not do heap allocation for you.
>
> It would be great if we addressed that in 2.064. I'm sure I've seen the report in bugzilla, but the closest I found were:
>
> http://d.puremagic.com/issues/show_bug.cgi?id=9335
> http://d.puremagic.com/issues/show_bug.cgi?id=8449

There was this:

http://d.puremagic.com/issues/show_bug.cgi?id=2356

I know Don has suggested in the past that all array literals be immutable, like strings, and I agree with that.  But it would be a huge breaking change.

I agree with finalpatch that array literals allocating is not obvious or expected in many cases.

I wonder: if the compiler can prove that an array literal isn't referenced outside the function (or at least the statement), could it allocate it on the stack instead of the heap?  That would be a huge improvement, and a good middle ground.
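
Something like the following, done by hand, is roughly what that lowering could produce (a sketch, not what the compiler currently does):

int sum()
{
    // What you write:  int[] a = [1, 2, 3];  // the literal never escapes sum()
    // What it could be lowered to:
    int[3] storage = [1, 2, 3];   // backing array on the stack
    int[] a = storage[];          // a slice of the stack storage, no GC heap
    int total = 0;
    foreach (v; a) total += v;
    return total;
}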

-Steve
May 31, 2013
Manu:

> Yeah, I've actually noticed this too on a few occasions. It would be nice
> if array operations would unroll for short arrays. Particularly so for static arrays!

Thanks to Kenji, the latest dmd 2.063 solves part of this problem:
http://d.puremagic.com/issues/show_bug.cgi?id=2356

Maybe this improvement is not yet in LDC/GDC.

But avoiding heap allocations for array literals is a change that needs to be discussed.

Bye,
bearophile
May 31, 2013
Manu:

> Frankly, this is a textbook example of why STL is the spawn of satan. For
> some reason people are TAUGHT that it's reasonable to write code like this.

There are many kinds of D code; not everything is a high-performance ray-tracer or 3D game. So I'm sure there are many, many situations where using the C++ STL is more than enough. As with most tools, you need to know where and when to use them. So it's not a Satan-spawn :-)

Bye,
bearophile