May 31, 2013
Recently I ported a simple ray tracer I wrote in C++11 to D. Thanks to the similarity between D and C++ it was almost a line-by-line translation, so the two programs are very close. However, the D version runs much slower than the C++11 version. On Windows, with MinGW GCC and GDC, the C++ version is twice as fast as the D version. On OSX, I used Clang++ and LDC, and the C++11 version was 4x faster than the D version. Since the comparisons were between compilers that share the same codegen backends, I suppose that's a relatively fair comparison. (Flags used for GDC: -O3 -fno-bounds-check -frelease; flags used for LDC: -O3 -release)

I really like the features offered by D, but it's the raw performance that's worrying me. From what I have read, D should offer similar performance when doing similar things, but my own test results are not consistent with this claim. I want to know whether this slowness is inherent to the language or whether it's something I am not doing right (very possible, because I have only a few days of experience with D).

Below are the links to the D and C++ code, in case anyone is interested in having a look.

https://dl.dropboxusercontent.com/u/974356/raytracer.d
https://dl.dropboxusercontent.com/u/974356/raytracer.cpp
May 31, 2013
finalpatch:

> I really like the features offered by D but it's the raw performance that's worrying me.

In my experience, if you know what you are doing, you can write this kind of numerical D code so that LDC compiles it to performance very close to C++, and sometimes better. But you need to be careful about some things.

Don't do this:
foreach (y; (iota(height)))

Use this, because those abstractions are not for free:
foreach (y;  0 .. height)

Be careful with foreach on arrays of structs, because it performs copies that are slow if the structs aren't very small.
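
A minimal sketch of the difference, assuming a made-up Sphere struct (it is not taken from the linked raytracer.d). Iterating with ref makes the loop variable alias the array element instead of copying it:

```d
import std.stdio : writeln;

// Hypothetical struct, large enough that a per-iteration copy is not free.
struct Sphere
{
    double[3] center;
    double radius;
    double[3] color;
    double reflection;
}

// By value: each element is copied into `s` before the body runs.
double sumRadiiByValue(Sphere[] spheres)
{
    double total = 0;
    foreach (s; spheres)
        total += s.radius;
    return total;
}

// By reference: `s` refers to the element in place, so no copy is made.
double sumRadiiByRef(Sphere[] spheres)
{
    double total = 0;
    foreach (ref s; spheres)
        total += s.radius;
    return total;
}

void main()
{
    auto spheres = new Sphere[](3);
    foreach (i, ref s; spheres)
        s.radius = i + 1.0;

    writeln(sumRadiiByValue(spheres));
    writeln(sumRadiiByRef(spheres));
}
```

Both functions return the same sum; whether the copy shows up in a profile depends on the struct size and on how much the optimizer manages to elide.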

Be careful with classes, because by default their methods are virtual. Sometimes in D you want to use structs for performance reasons.
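
To illustrate, here is a small sketch with hypothetical Shape, Circle and Vec3 types (again, not from the linked code). Marking a class or method final tells the compiler it cannot be overridden further, and struct methods are never virtual:

```d
import std.stdio : writeln;

// Hypothetical hierarchy used only for illustration.
class Shape
{
    // Virtual by default: a call through a Shape reference goes via the vtable.
    double area() const { return 0; }
}

// `final` on the class means nothing can derive from Circle, so the compiler
// is free to call and inline Circle.area directly through a Circle reference.
final class Circle : Shape
{
    double r;
    this(double r) { this.r = r; }
    override double area() const { return 3.14159 * r * r; }
}

// A struct has no vtable; its methods are always statically dispatched.
struct Vec3
{
    double x, y, z;
    double dot(const Vec3 o) const { return x * o.x + y * o.y + z * o.z; }
}

void main()
{
    auto c = new Circle(2.0);
    writeln(c.area());
    writeln(Vec3(1.0, 2.0, 3.0).dot(Vec3(4.0, 5.0, 6.0)));
}
```

Small value types such as vectors and rays are the usual candidates for structs, since they also avoid a heap allocation per object.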

Sometimes in inner loops it's better to use a classic for instead of a foreach.

LDC needs far more flags to compile a raytracer well. LDC even supports link-time optimization, but that needs even more obscure flags.

Also the ending brace of classes and structs doesn't need a semicolon in D.

Bye,
bearophile
May 31, 2013
Hi bearophile,

Thanks for the reply. I changed it to 0..height and it has no measurable effect on the runtime.

The reason I used iota(height) was to test std.parallelism.parallel. On Windows, if I do foreach (y; parallel(iota(height))) I get an almost 4x speedup on a quad-core computer. However, on OSX, parallel() either does nothing (LDC) or makes it slower than single-threaded (DMD).
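
For anyone who wants to try the same thing, this is roughly the shape of the loop. It is only a sketch: renderRows and the per-pixel arithmetic are placeholders I made up, not the actual tracing code.

```d
import std.parallelism : parallel;
import std.range : iota;
import std.stdio : writeln;

// Hypothetical stand-in for the renderer: writes one value per pixel.
float[] renderRows(int width, int height)
{
    auto image = new float[](width * height);

    // Rows are handed out to the default task pool; no two rows touch
    // the same pixel, so the body needs no synchronization.
    foreach (y; parallel(iota(height)))
    {
        foreach (x; 0 .. width)
            image[y * width + x] = cast(float)(x + y); // placeholder for tracing a ray
    }
    return image;
}

void main()
{
    auto image = renderRows(64, 48);
    writeln(image.length);
}
```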

On Friday, 31 May 2013 at 01:42:53 UTC, bearophile wrote:
> Don't do this:
> foreach (y; (iota(height)))
>
> Use this, because those abstractions are not for free:
> foreach (y;  0 .. height)
May 31, 2013
finalpatch:

> Thanks for the reply. I changed it to 0..height and it has no measurable effect on the runtime.

Have you also fixed all the other things? :-) Probably you have to keep fixing potentially slow spots until you find the truly slow ones.

Bye,
bearophile
May 31, 2013
I don't know if this is the case with the code in question (I have not looked at it), but sometimes the garbage collector has a significant effect on performance. This is an area in need of radical improvement.

You have to minimize situations where a lot of allocations happen while the GC is enabled, because that will fire up the GC more often than is required and can slow down your app significantly; a 2x or greater performance penalty is certainly possible. It can also make performance unpredictable, with large delays at inappropriate points in the execution.

BTW, you should post questions like this into d.learn rather than in the general discussion area.

--rt
May 31, 2013
Hi Rob,

I have tried putting GC.disable() and GC.enable() around the rendering call and it made no difference.
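
For reference, the wrapping looks roughly like this. The render function here is a hypothetical stand-in that just allocates in a loop; it is not the real rendering call.

```d
import core.memory : GC;
import std.stdio : writeln;

// Hypothetical stand-in for the render call; it allocates on every row.
size_t render(size_t rows)
{
    size_t total = 0;
    foreach (y; 0 .. rows)
    {
        auto row = new ubyte[](256); // GC allocation inside the hot loop
        total += row.length;
    }
    return total;
}

void main()
{
    GC.disable();             // suppress automatic collection cycles
    scope (exit) GC.enable(); // re-enable even if render throws

    writeln(render(100));
}
```

Disabling collections only removes the pauses; the allocations themselves still cost something, so this mainly tells you whether collection cycles are the bottleneck.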

On Friday, 31 May 2013 at 02:13:36 UTC, Rob T wrote:
> I don't know if this is the case with the code in question (I have not looked at it), but sometimes there will be a significant effect on performance caused by the use of the garbage collector. This is an area in need of radical improvements.
>
> You have to minimize situations where there's a lot of allocations going on while the GC is enabled because that will fire up the GC more often than is required and it can slow down your app significantly; A 2x or more performance penalty is certainly possible. It can also make performance unpredictable with large delays at inappropriate points in the execution.
>
> BTW, you should post questions like this into d.learn rather than in the general discussion area.
>
> --rt

May 31, 2013
On 5/30/2013 6:26 PM, finalpatch wrote:
> Recently I ported a simple ray tracer I wrote in C++11 to D. Thanks to the
> similarity between D and C++ it was almost a line by line translation, in other
> words, very very close. However, the D version runs much slower than the C++11
> version. On Windows, with MinGW GCC and GDC, the C++ version is twice as fast as
> the D version. On OSX, I used Clang++ and LDC, and the C++11 version was 4x
> faster than the D version.  Since the comparisons were between compilers that share
> the same codegen backends I suppose that's a relatively fair comparison.  (flags
> used for GDC: -O3 -fno-bounds-check -frelease,  flags used for LDC: -O3 -release)

For max speed using dmd, use the flags:

   -O -release -inline -noboundscheck

The -inline is especially important.


> I really like the features offered by D but it's the raw performance that's
> worrying me. From what I read D should offer similar performance when doing
> similar things but my own test results are not consistent with this claim. I want
> to know whether this slowness is inherent to the language or it's something I
> was not doing right (very possible because I have only a few days of experience
> with D).
>
> Below is the link to the D and C++ code, in case anyone is interested to have a
> look.
>
> https://dl.dropboxusercontent.com/u/974356/raytracer.d
> https://dl.dropboxusercontent.com/u/974356/raytracer.cpp

May 31, 2013
Hi Walter,

Thanks for the reply. I have already tried these flags. However, DMD's codegen is lagging behind GCC and LLVM at the moment, so even with these flags the runtime is ~10x longer than the C++ version compiled with clang++ (2 sec with DMD, 200 ms with clang++ on a Core2 Mac Pro). I know this is comparing apples to oranges, though, which is why I was comparing GDC vs G++ and LDC vs Clang++.

On Friday, 31 May 2013 at 02:19:40 UTC, Walter Bright wrote:
> For max speed using dmd, use the flags:
>
>    -O -release -inline -noboundscheck
>
> The -inline is especially important.
May 31, 2013
On 05/30/2013 11:31 PM, finalpatch wrote:
> Hi Walter,
> 
> Thanks for the reply. I have already tried these flags. However, DMD's codegen is lagging behind GCC and LLVM at the moment, so even with these flags, the runtime is ~10x longer than the C++ version compiled with clang++ (2sec with DMD, 200ms with clang++ on a Core2 Mac Pro). I know this is comparing apples to oranges though, that's why I was comparing GDC vs G++ and LDC vs Clang++.
> 
> On Friday, 31 May 2013 at 02:19:40 UTC, Walter Bright wrote:
>> For max speed using dmd, use the flags:
>>
>>    -O -release -inline -noboundscheck
>>
>> The -inline is especially important.


Have you tried:

     dmd -profile

It compiles in trace generation, so that when you run the program you get a .log file which tells you the slowest functions and other info.

Please note that the resulting code compiled with -profile is slower because it is instrumented.

--jm


May 31, 2013
On 5/30/13 9:26 PM, finalpatch wrote:
> https://dl.dropboxusercontent.com/u/974356/raytracer.d
> https://dl.dropboxusercontent.com/u/974356/raytracer.cpp

Manu's gonna love this one: make all methods final.

Andrei