May 31, 2013
On Friday, 31 May 2013 at 01:26:13 UTC, finalpatch wrote:
> Recently I ported a simple ray tracer I wrote in C++11 to D. Thanks to the similarity between D and C++ it was almost a line-by-line translation; in other words, very, very close. However, the D version runs much slower than the C++11 version. On Windows, with MinGW GCC and GDC, the C++ version is twice as fast as the D version. On OSX, I used Clang++ and LDC, and the C++11 version was 4x faster than the D version. Since the comparisons were between compilers that share the same codegen backends, I suppose that's a relatively fair comparison. (Flags used for GDC: -O3 -fno-bounds-check -frelease; flags used for LDC: -O3 -release)
>
> I really like the features offered by D, but it's the raw performance that's worrying me. From what I read, D should offer similar performance when doing similar things, but my own test results are not consistent with this claim. I want to know whether this slowness is inherent to the language or whether it's something I was not doing right (very possible, because I have only a few days of experience with D).
>
> Below is the link to the D and C++ code, in case anyone is interested to have a look.
>
> https://dl.dropboxusercontent.com/u/974356/raytracer.d
> https://dl.dropboxusercontent.com/u/974356/raytracer.cpp

Greetings.

After a few quick changes I managed to get these results:
[raz@d3 tmp]$ ./a.out
rendering time 276 ms
[raz@d3 tmp]$ ./test
346 ms, 814 μs, and 5 hnsecs


./a.out is the binary compiled with clang++ ./test.cxx -std=c++11 -lSDL -O3
./test is the binary compiled with ldmd2 -O3 -release -inline -noboundscheck ./test.d (Actually I used rdmd with --compiler=ldmd2, but I omitted it because the command line was rather long :p)
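For readers who want to reproduce a timing line in the same format, here is a minimal sketch. It assumes a current Phobos with std.datetime.stopwatch, which is newer than the library available when this thread was written; render() and timeIt() are hypothetical names, not taken from the ray tracer.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

// Hypothetical stand-in for the ray tracer's render loop (not from the thread).
float render()
{
    float acc = 0;
    foreach (i; 0 .. 1_000_000)
        acc += i * 0.5f;
    return acc;
}

// Runs `work` once and returns the elapsed time as a Duration.
Duration timeIt(float function() work)
{
    auto sw = StopWatch(AutoStart.yes); // starts counting immediately
    work();
    sw.stop();
    return sw.peek();
}

void main()
{
    auto elapsed = timeIt(&render);
    // Duration's string form uses the same units as the output above
    // (ms, μs, hnsecs).
    writeln(elapsed);
}
```

A single run is noisy, so repeating the measurement a few times and comparing the smallest values is probably a safer basis for a C++ vs. D comparison.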


Here is source code with changes I applied to D-code (I hope you don't mind repasting it): http://dpaste.dzfl.pl/84bb308d

I am sure there is much more room for improvement, at least enough to match the C++ performance.
May 31, 2013
On 05/31/2013 02:15 AM, nazriel wrote:
> Greetings.
> 
> After a few quick changes I managed to get these results:
> [raz@d3 tmp]$ ./a.out
> rendering time 276 ms
> [raz@d3 tmp]$ ./test
> 346 ms, 814 μs, and 5 hnsecs
> 
> 
> ./a.out being binary compiled with clang++ ./test.cxx -std=c++11 -lSDL -O3
> ./test being binary compiled with ldmd2 -O3 -release -inline -noboundscheck ./test.d (Actually I used rdmd with
> --compiler=ldmd2 but I omitted it because it was rather long cmd line :p)
> 
> 
> Here is source code with changes I applied to D-code (I hope you don't mind repasting it): http://dpaste.dzfl.pl/84bb308d
> 
> I am sure there is much more room for improvement, at least enough to match the C++ performance.


You might also try changing:

            float[3] t = mixin("v[]"~op~"rhs.v[]");
            return Vec3(t[0], t[1], t[2]);

for:
            Vec3 t;
            t.v[0] = mixin("v[0] "~op~" rhs.v[0]");
            t.v[1] = mixin("v[1] "~op~" rhs.v[1]");
            t.v[2] = mixin("v[2] "~op~" rhs.v[2]");
            return t;

and so on, avoiding the float[3] and the v[] operations (which would
loop unless the compiler/optimizer unrolls them; I didn't check).

I tested this change (removing the v[] ops) in Vec3 and in
normalize(), and it made your version slightly faster
with DMD (I didn't check with ldmd2).

--jm
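To make the suggestion above concrete, here is a self-contained sketch of an element-wise opBinary. Only the field name v and the mixin expressions come from the post; the three-float constructor and the rest of this Vec3 are assumptions for illustration, since the original source is only linked.

```d
// Hypothetical, cut-down Vec3; only `v` and the mixin lines mirror the post.
struct Vec3
{
    float[3] v;

    this(float x, float y, float z)
    {
        v[0] = x; v[1] = y; v[2] = z;
    }

    // `op` is a compile-time string ("+", "-", "*", "/"), spliced into each
    // expression by mixin, so there is no temporary array and no loop.
    Vec3 opBinary(string op)(const Vec3 rhs) const
    {
        Vec3 t;
        t.v[0] = mixin("v[0] " ~ op ~ " rhs.v[0]");
        t.v[1] = mixin("v[1] " ~ op ~ " rhs.v[1]");
        t.v[2] = mixin("v[2] " ~ op ~ " rhs.v[2]");
        return t;
    }
}

void main()
{
    const a = Vec3(1, 2, 3);
    const b = Vec3(4, 5, 6);

    const sum = a + b;  // instantiates opBinary!"+"
    const prod = a * b; // instantiates opBinary!"*"

    assert(sum.v[0] == 5 && sum.v[1] == 7 && sum.v[2] == 9);
    assert(prod.v[0] == 4 && prod.v[1] == 10 && prod.v[2] == 18);
}
```

Whether this beats the v[] form depends on the compiler and flags, so it is worth measuring both.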


May 31, 2013
On Friday, 31 May 2013 at 05:35:58 UTC, Juan Manuel Cabo wrote:
> You might also try changing:
>
>             float[3] t = mixin("v[]"~op~"rhs.v[]");
>             return Vec3(t[0], t[1], t[2]);
>
> for:
>             Vec3 t;
>             t.v[0] = mixin("v[0] "~op~" rhs.v[0]");
>             t.v[1] = mixin("v[1] "~op~" rhs.v[1]");
>             t.v[2] = mixin("v[2] "~op~" rhs.v[2]");
>             return t;
>
> and so on, avoiding the float[3] and the v[] operations (which would
> loop, unless the compiler/optimizer unrolls them (didn't check)).
>
> I tested this change (removing v[] ops) in Vec3 and in
> normalize(), and it made your version slightly faster
> with DMD (didn't check with ldmd2).
>
> --jm


Right, I missed that. Thanks!

Now it is:

[raz@d3 tmp]$ ./a.out
rendering time 276 ms
[raz@d3 tmp]$ ./test
238 ms, 35 μs, and 7 hnsecs

So the D version is now faster than the C++ one.

May 31, 2013
Thanks Nazriel,

It is very cool that you were able to narrow the gap to within 1.5x of C++ with a few simple changes.

I checked your version; there are 3 changes (correct me if I missed any):

* Change the (float) constructor from v= [x,x,x] to v[0] = x; v[1] = x; v[2] = x;
* Get rid of the (float[]) constructor and use 3 floats instead
* Change class methods to final

The first change alone shaved 220 ms off the runtime, the second cut 130 ms,
and the third cut 60 ms.

Lesson learned: be very, very careful about dynamic arrays.
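The first two changes in the list can be sketched as follows. This Vec3 is a hypothetical reconstruction, not the code from the linked paste. The @nogc attribute was added to D after this thread; it is used here only so the compiler rejects any GC allocation inside the constructors.

```d
// Hypothetical, cut-down Vec3 showing only the two constructors discussed.
struct Vec3
{
    float[3] v;

    // Broadcast constructor written element by element,
    // instead of assigning the array literal [x, x, x].
    this(float x) @nogc nothrow
    {
        v[0] = x;
        v[1] = x;
        v[2] = x;
    }

    // Three separate floats instead of a float[] parameter, so callers
    // have no reason to build an array literal at the call site.
    this(float x, float y, float z) @nogc nothrow
    {
        v[0] = x;
        v[1] = y;
        v[2] = z;
    }
}

void main()
{
    auto grey = Vec3(0.5f);
    auto p = Vec3(1, 2, 3);

    assert(grey.v[0] == 0.5f && grey.v[1] == 0.5f && grey.v[2] == 0.5f);
    assert(p.v[0] == 1 && p.v[1] == 2 && p.v[2] == 3);
}
```

The millisecond savings quoted above were measured on the original program; this sketch only shows the shape of the change.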

On Friday, 31 May 2013 at 05:15:11 UTC, nazriel wrote:
> After a few quick changes I managed to get these results:
> [raz@d3 tmp]$ ./a.out
> rendering time 276 ms
> [raz@d3 tmp]$ ./test
> 346 ms, 814 μs, and 5 hnsecs
>
>
> ./a.out being binary compiled with clang++ ./test.cxx -std=c++11 -lSDL -O3
> ./test being binary compiled with ldmd2 -O3 -release -inline -noboundscheck ./test.d (Actually I used rdmd with --compiler=ldmd2 but I omitted it because it was rather long cmd line :p)
>
>
> Here is source code with changes I applied to D-code (I hope you don't mind repasting it): http://dpaste.dzfl.pl/84bb308d
>
> I am sure there is much more room for improvement, at least enough to match the C++ performance.

May 31, 2013
I managed to get it even faster.

[raz@d3 tmp]$ ./a.out
rendering time 282 ms
[raz@d3 tmp]$ ./test
202 ms, 481 μs, and 8 hnsecs

So the D version is 1.4x faster than the C++ version.
At least on my computer.

Same compiler flags, etc.

Final code:
http://dpaste.dzfl.pl/61626e88

I guess there is still room for more improvement.

On Friday, 31 May 2013 at 05:49:55 UTC, finalpatch wrote:
> Thanks Nazriel,
>
> It is very cool that you were able to narrow the gap to within 1.5x of C++ with a few simple changes.
>
> I checked your version; there are 3 changes (correct me if I missed any):
>
> * Change the (float) constructor from v= [x,x,x] to v[0] = x; v[1] = x; v[2] = x;
Correct

> * Get rid of the (float[]) constructor and use 3 floats instead
It was just for debugging, so the compiler would yell at me if I used an array literal.

> * Change class methods to final
Correct

>
> The first change alone shaved 220 ms off the runtime, the second cut 130 ms,
> and the third cut 60 ms.
>
> Lesson learned: be very, very careful about dynamic arrays.
>

Yeah, this is currently a problem with array literals: they are always allocated on the heap, even when they shouldn't be.
Putting final before methods is something you have to remember, because class methods in D are virtual by default.
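A small sketch of the final point, using a hypothetical Sphere class (the class and method names are not from the ray tracer). The static asserts use the compiler's own traits to show which method is still virtual.

```d
// Hypothetical scene object; not taken from the linked source.
class Sphere
{
    private float radius;

    this(float radius) { this.radius = radius; }

    // final: cannot be overridden, so the call does not need virtual
    // dispatch and is easier for the optimizer to inline.
    final float diameter() const { return 2.0f * radius; }

    // No attribute: virtual by default in a D class.
    string name() const { return "sphere"; }
}

// Compile-time check of which methods are virtual.
static assert(!__traits(isVirtualMethod, Sphere.diameter));
static assert(__traits(isVirtualMethod, Sphere.name));

void main()
{
    auto s = new Sphere(2.0f);
    assert(s.diameter() == 4.0f);
    assert(s.name() == "sphere");
}
```

Marking the whole class final, or writing a final: label above a group of methods, has the same effect without repeating the keyword on every method.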
May 31, 2013
On 31 May 2013 12:56, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 5/30/13 9:26 PM, finalpatch wrote:
>
>> https://dl.dropboxusercontent.com/u/974356/raytracer.d
>> https://dl.dropboxusercontent.com/u/974356/raytracer.cpp
>>
>
> Manu's gonna love this one: make all methods final.


Hehe, oh yeah! Coming from you, that's like music to my ears ;)
His observed 5% is a massive speedup, almost 1 ms across a 16 ms frame!


May 31, 2013
You guys are awesome! I am happy to know that D can indeed offer comparable speed to C++.

But it also shows there is room for the compiler to improve as the C++ version also makes heavy use of loops (or STL algorithms) but they get inlined or unrolled automatically.

On Friday, 31 May 2013 at 05:35:58 UTC, Juan Manuel Cabo wrote:
> You might also try changing:
>
>             float[3] t = mixin("v[]"~op~"rhs.v[]");
>             return Vec3(t[0], t[1], t[2]);
>
> for:
>             Vec3 t;
>             t.v[0] = mixin("v[0] "~op~" rhs.v[0]");
>             t.v[1] = mixin("v[1] "~op~" rhs.v[1]");
>             t.v[2] = mixin("v[2] "~op~" rhs.v[2]");
>             return t;
>
> and so on, avoiding the float[3] and the v[] operations (which would
> loop, unless the compiler/optimizer unrolls them (didn't check)).
>
> I tested this change (removing v[] ops) in Vec3 and in
> normalize(), and it made your version slightly faster
> with DMD (didn't check with ldmd2).
>
> --jm

May 31, 2013
On Friday, 31 May 2013 at 05:59:00 UTC, finalpatch wrote:
> You guys are awesome! I am happy to know that D can indeed offer comparable speed to C++.
>
> But it also shows there is room for the compiler to improve as the C++ version also makes heavy use of loops (or STL algorithms) but they get inlined or unrolled automatically.
>

Have you tried GDC or LDC? They use optimizers and code generators similar to those of GCC and Clang, so they should be able to do it as well.
May 31, 2013
On Friday, 31 May 2013 at 05:59:00 UTC, finalpatch wrote:
> You guys are awesome! I am happy to know that D can indeed offer comparable speed to C++.
>
> But it also shows there is room for the compiler to improve as the C++ version also makes heavy use of loops (or STL algorithms) but they get inlined or unrolled automatically.
>
Agree.

I can feel a big hammer coming towards my head from Walter and Andrei, but IMHO abandoning DMD in the first place would be the best idea. Focusing on LDC or GDC would bring far more benefit than trying to make anything of DMD. The version compiled with LDC runs in 202 ms and 192 μs; with DMD it takes 1 sec, 891 ms, 571 μs, and 1 hnsec.

May 31, 2013
On 31.05.2013 08:11, deadalnix wrote:
> On Friday, 31 May 2013 at 05:59:00 UTC, finalpatch wrote:
>> You guys are awesome! I am happy to know that D can indeed
>> offer comparable speed to C++.
>>
>> But it also shows there is room for the compiler to improve as
>> the C++ version also makes heavy use of loops (or STL
>> algorithms) but they get inlined or unrolled automatically.
>>
>
> Have you tried GDC or LDC? They use optimizers and code
> generators similar to those of GCC and Clang, so they should be
> able to do it as well.
>

He only used GDC and LDC.