On 31 May 2013 11:26, finalpatch <fengli@gmail.com> wrote:
Recently I ported a simple ray tracer I wrote in C++11 to D. Thanks to the similarity between D and C++ it was almost a line-by-line translation; in other words, very, very close. However, the D version runs much slower than the C++11 version. On Windows, with MinGW GCC and GDC, the C++ version is twice as fast as the D version. On OSX, I used Clang++ and LDC, and the C++11 version was 4x faster than the D version. Since the comparisons were between compilers that share the same codegen backends, I suppose that's a relatively fair comparison. (Flags used for GDC: -O3 -fno-bounds-check -frelease; flags used for LDC: -O3 -release)

I really like the features offered by D, but it's the raw performance that's worrying me. From what I have read, D should offer similar performance when doing similar things, but my own test results are not consistent with this claim. I want to know whether this slowness is inherent to the language or whether it's something I was not doing right (very possible, because I have only a few days of experience with D).

Below are the links to the D and C++ code, in case anyone is interested in having a look.

https://dl.dropboxusercontent.com/u/974356/raytracer.d
https://dl.dropboxusercontent.com/u/974356/raytracer.cpp

Can you paste the disassembly of the inner loop (trace()) for each compiler in a pair, either G++/GDC or LDC/Clang++?

That said, I can see almost innumerable red flags (on basically every line).
The fact that it takes 200ms to render a frame in C++ (I would expect <10ms) suggests that your approach is amazingly slow to begin with, at which point I would start looking for much higher level problems.
Once you have an implementation that's approaching optimal, then we can start making comparisons.

Here are some thoughts at first glance:
* The fact that you use STL makes me immediately concerned. Generic code for this sort of work will never run well.
   That said, the STL has had decades more optimisation work behind it, so it stands to reason that the C++ compiler will be able to do more to improve the STL code.
* Your vector classes in both C++ and D are pretty nasty. Use 4D SIMD vectors.
* So many integer divisions!
* There are countless float <-> int casts.
* Innumerable redundant loads/stores.
* I would have raised the virtual-by-default travesty, but Andrei did it for me! ;)
* intersect() should be __forceinline.
* intersect() is full of ifs (it's hard to predict whether the optimiser can work across those ifs; maybe it can...)
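
To make a few of those bullets concrete, here is a rough C++ sketch of what I mean. It is not taken from the linked raytracer.cpp; the names Vec4, FORCE_INLINE and intersectSphere are made up for illustration, and the ray direction is assumed to be normalised. It uses no explicit SIMD intrinsics: the 16-byte aligned, 4-float layout only makes it easier for the compiler to keep a vector in one SIMD register, which it may or may not actually do.

```cpp
#include <cmath>

// Hypothetical portability macro: __forceinline is the MSVC spelling,
// GCC and Clang use the always_inline attribute instead.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#else
#  define FORCE_INLINE inline __attribute__((always_inline))
#endif

// Hypothetical 4-wide vector. The w lane is padding for 3D data, so the
// whole struct is 16 bytes and matches the size of one SIMD register.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

FORCE_INLINE Vec4 sub(const Vec4& a, const Vec4& b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z, a.w - b.w};
}

FORCE_INLINE float dot(const Vec4& a, const Vec4& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// Returns the distance along the ray to the nearest hit in front of the
// origin, or a negative value on a miss. Assumes dir is normalised.
FORCE_INLINE float intersectSphere(const Vec4& origin, const Vec4& dir,
                                   const Vec4& center, float radius) {
    const Vec4 oc = sub(center, origin);
    const float tca = dot(oc, dir);               // projection onto the ray
    const float d2 = dot(oc, oc) - tca * tca;     // squared ray-to-centre distance
    const float disc = radius * radius - d2;
    if (disc < 0.0f) return -1.0f;                // single early-out: ray misses
    const float thc = std::sqrt(disc);
    const float t0 = tca - thc;                   // near root
    const float t1 = tca + thc;                   // far root
    const float t = (t0 > 0.0f) ? t0 : t1;        // simple selects, no nested ifs
    return (t > 0.0f) ? t : -1.0f;
}
```

Everything stays in float, there are no integer divisions or float <-> int casts, and there is one early-out branch instead of a chain of ifs. Whether this is actually faster in your case is something only the disassembly and a timer can tell you.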

What's taking the most time?
The lighting loop is so template-tastic, I can't get a feel for how fast that loop would be.
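
If you don't have a profiler handy, a crude way to answer that question is to bracket each phase with a timer. The following is only a sketch; Stopwatch and elapsedMs are hypothetical names, not something from your code.

```cpp
#include <chrono>

// Hypothetical helper: starts timing on construction and reports the
// elapsed wall-clock time in milliseconds on request.
struct Stopwatch {
    std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();

    double elapsedMs() const {
        const auto now = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(now - start).count();
    }
};
```

Create one Stopwatch before the intersection tests and another before the lighting loop, accumulate the elapsedMs() values over a whole frame, and compare the totals between the C++ and D builds. That should at least tell you which loop deserves the disassembly.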

I believe the reason for the difference is not going to be so easily revealed. It's probably hidden largely in the fact that the STL has had a good decade more optimisation work spent on it than D's equivalents have.
It's also possible that the C++ compiler hooks many of those STL functions as compiler intrinsics with internalised logic.

Frankly, this is a textbook example of why STL is the spawn of satan. For some reason people are TAUGHT that it's reasonable to write code like this.