Slow performance compared to C++, ideas? (page 2)

Hi Andrei,

Making all methods final helps a little bit, but not by much (from 4x slower to 3.8x slower). It is really strange because I didn't spend any time optimizing either the D version or the C++ version, but it appears C++ is just fast by default.

On Friday, 31 May 2013 at 02:56:25 UTC, Andrei Alexandrescu wrote:
> On 5/30/13 9:26 PM, finalpatch wrote:
>> https://dl.dropboxusercontent.com/u/974356/raytracer.d
>> https://dl.dropboxusercontent.com/u/974356/raytracer.cpp
>
> Manu's gonna love this one: make all methods final.
>
> Andrei

There's some issues involving the use of array literals - they get allocated on the heap for no clear reason. Create a version of your vector constructor that uses four floats, then call that instead in your line 324.
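A minimal sketch of the suggestion above, assuming a Vec3 that stores its components in a four-element static array. The field name `v` matches the code quoted later in the thread; the constructor signatures and the `makeNoGc` helper are hypothetical, and `@nogc` needs a compiler newer than the one used in 2013.

```d
struct Vec3
{
    float[4] v = 0; // assumed storage: x, y, z plus one padding lane

    // Scalar overload: no array literal is built at the call site.
    this(float x, float y, float z, float w = 0) @nogc nothrow
    {
        v[0] = x;
        v[1] = y;
        v[2] = z;
        v[3] = w;
    }

    // Array overload, presumably similar to what the original code had.
    // A literal passed here may be allocated on the GC heap first.
    this(float[] a)
    {
        v[0 .. a.length] = a[];
    }
}

// @nogc makes the compiler reject any GC allocation in this function,
// so it only compiles because the scalar constructor is used.
Vec3 makeNoGc() @nogc nothrow
{
    return Vec3(1.0f, 2.0f, 3.0f);
}

void main()
{
    auto fromLiteral = Vec3([1.0f, 2.0f, 3.0f, 0.0f]); // may allocate
    auto fromScalars = makeNoGc();                      // cannot allocate
    assert(fromLiteral.v == fromScalars.v);
}
```

Both constructors produce the same value; the difference is only in what happens at the call site inside a hot loop.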

On Friday, 31 May 2013 at 03:26:16 UTC, FeepingCreature wrote:
> There's some issues involving the use of array literals - they
> get allocated on the heap for no clear reason. Create a version
> of your vector constructor that uses four floats, then call that
> instead in your line 324.

Addendum: in general, always, **ALWAYS PROFILE BEFORE OPTIMIZING**. I'm sure there's good profilers for OSX, or you can just use -g -pg and gprof. (perf is good under Linux!)

Two quick notes:

a) Profile first, then optimise.

b) This probably wouldn't make a 4x difference, but in the C++ code you're passing most objects around by ref. In the D version you're passing structs by value. They are only small, but there's a tight loop of recursion to consider...

That said, I don't know the details of D optimisation all that well and it may be a non-issue in release builds.

Stewart
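To make note (b) concrete, here is a small sketch of the two parameter-passing styles in D. The `Vec3` layout and both function names are hypothetical, not taken from raytracer.d. Whether the copy costs anything after inlining depends on the compiler, so treat this as something to measure rather than a guaranteed win.

```d
struct Vec3
{
    float x, y, z;
}

// By value: each argument is copied into the callee.
float dotByValue(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// By reference: roughly what `const Vec3&` does in the C++ version.
// Note that `ref` parameters only accept lvalues.
float dotByRef(ref const Vec3 a, ref const Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

void main()
{
    auto a = Vec3(1, 2, 3);
    auto b = Vec3(4, 5, 6);
    assert(dotByValue(a, b) == dotByRef(a, b));
}
```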

Hi FeepingCreature,

Thanks for the tip, getting rid of the array constructor helped a lot, Runtime is down from 800+ms to 583ms (with LDC, still cannot match C++ though). Maybe I should get rid of all arrays and use hardcoded x,y,z member variables instead, or use tuples.

On Friday, 31 May 2013 at 03:26:16 UTC, FeepingCreature wrote:
> There's some issues involving the use of array literals - they
> get allocated on the heap for no clear reason. Create a version
> of your vector constructor that uses four floats, then call that
> instead in your line 324.

On 5/30/2013 7:31 PM, finalpatch wrote:
> Thanks for the reply. I have already tried these flags. However, DMD's codegen
> is lagging behind GCC and LLVM at the moment, so even with these flags, the
> runtime is ~10x longer than the C++ version compiled with clang++ (2sec with
> DMD, 200ms with clang++ on a Core2 Mac Pro). I know this is comparing apples to
> oranges though, that's why I was comparing GDC vs G++ and LDC vs Clang++.

Yes, you're comparing things the right way.

On Friday, 31 May 2013 at 02:56:25 UTC, Andrei Alexandrescu wrote:
> On 5/30/13 9:26 PM, finalpatch wrote:
>> https://dl.dropboxusercontent.com/u/974356/raytracer.d
>> https://dl.dropboxusercontent.com/u/974356/raytracer.cpp
>
> Manu's gonna love this one: make all methods final.
>

I don't think going as far as making things final by default makes sense at this point. But we sure need a way to be able to finalize methods. We had an extensive discussion with Don and Manu at DConf; here are some ideas that came out of it:

- Final by default
This one is really a plus when it comes to performance code. However, virtual by default has proven itself very useful when performance isn't that big of a deal (which is usually the case for 90% of a program's code), and final by default would limit the usage of some patterns like decorator (which have also proven useful). This is also huge breakage.

- Introduce a virtual keyword.
Virtual by default isn't such a big deal if you can write final: and reverse the default behavior. However, once you enter final land, you are trapped there; you can't get out. Introducing a virtual keyword would allow for aggressive final: declarations.

- Require explicit export when you want to create shared objects.
This one is an enabler for an optimizer to finalize virtual methods. With this mechanism in place, the compiler knows all the overrides and can finalize many calls during LTO. I especially like that one as it allows for stripping many symbols at link time and allows for other LTO in general (for instance, the compiler can choose custom calling conventions for some methods, knowing all call sites).

The explicit export one has my preference; however, it requires that symbols used in shared libs are explicitly declared export.

I think we shouldn't break the virtual by default behavior, but we still need to figure out a way to make things more performant on this point.
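For readers unfamiliar with the `final:` label discussed above, a small sketch follows. The `Shape` and `Square` classes are made up for illustration. It shows the "trapped in final land" point: once `final:` appears, every later method in that class body is non-virtual, so virtual methods have to be declared before it.

```d
class Shape
{
    // Declared before the label, so still virtual (D's default).
    float area() const
    {
        return 0;
    }

final:
    // Everything from here to the end of the class is final.
    // There is no attribute to switch back to virtual below this line.
    float twiceArea() const
    {
        return 2 * area(); // still dispatches virtually to area()
    }
}

class Square : Shape
{
    float side;

    this(float side)
    {
        this.side = side;
    }

    override float area() const
    {
        return side * side;
    }

    // Uncommenting this fails to compile: twiceArea is final in Shape.
    // override float twiceArea() const { return 0; }
}

void main()
{
    Shape s = new Square(3);
    assert(s.area() == 9);
    assert(s.twiceArea() == 18);
}
```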

Profile. Don't even think of asking for help before profiling. Those bugs you fixed here would be trivial to detect with a profiler. GC-dependent stuff (like the array literals mentioned here) usually is.

As for profiling of something like this, both gprof and DMD's builtin profiler are probably going to be useless. Use perf (if on Linux), or AMD CodeAnalyst, or some other sampling profiler (these two are free).

May 31, 2013

Re: Slow performance compared to C++, ideas?

Posted by Juan Manuel Cabo
in reply to finalpatch

On 05/31/2013 12:45 AM, finalpatch wrote:
> Hi FeepingCreature,
> 
> Thanks for the tip, getting rid of the array constructor helped a lot, Runtime is down from 800+ms to 583ms (with LDC, still cannot match C++ though). Maybe I should get rid of all arrays and use hardcoded x,y,z member variables instead, or use tuples.
> 
> On Friday, 31 May 2013 at 03:26:16 UTC, FeepingCreature wrote:
>> There's some issues involving the use of array literals - they get allocated on the heap for no clear reason. Create a version of your vector constructor that uses four floats, then call that instead in your line 324.
> 

I just shaved 1.2 seconds trying with dmd by changing the dot function from:

    float dot(in Vec3 v1, in Vec3 v2)
    {
        auto t = v1.v * v2.v;
        auto p = t.ptr;
        return p[0] + p[1] + p[2];
    }

to:

    float dot(in Vec3 v1, in Vec3 v2)
    {
        auto one = v1.v.ptr;
        auto two = v2.v.ptr;
        return one[0] * two[0]
            + one[1] * two[1]
            + one[2] * two[2];
    }

Before:
    2 secs, 895 ms, 891 μs, and 7 hnsecs
After:
    1 sec, 648 ms, 698 μs, and 1 hnsec
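
The timings above are in the format that D's `Duration` type prints. How the original program measured them is not shown in the thread; the following is only a sketch of one way to get comparable numbers with a current Phobos, using `std.datetime.stopwatch` (in 2013 the stopwatch lived in `std.datetime` and had a different API). The `timeIt` helper is hypothetical.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

// Hypothetical helper: runs `work` once and returns the elapsed time.
Duration timeIt(scope void delegate() work)
{
    auto sw = StopWatch(AutoStart.yes);
    work();
    sw.stop();
    return sw.peek();
}

void main()
{
    float sum = 0;
    auto elapsed = timeIt({
        foreach (i; 0 .. 1_000_000)
            sum += i * 0.5f;
    });
    writeln(sum);     // use the result so the loop is not optimized away
    writeln(elapsed); // printed via Duration's toString
}
```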


For others who might want to try, I downloaded the necessary derelict files from:

    http://www.dsource.org/projects/derelict/browser/branches/Derelict2

(DerelictUtil and DerelictSDL directories). And compiled with:

    dmd -O -inline -noboundscheck -release raytracer.d \
        derelict/sdl/sdl.d derelict/sdl/sdlfuncs.d \
        derelict/sdl/sdltypes.d derelict/util/compat.d \
        derelict/util/exception.d derelict/util/loader.d \
        derelict/util/sharedlib.d -L-ldl


I also ran it with the -profile switch. Here are the top functions in trace.log:


======== Timer Is 3579545 Ticks/Sec, Times are in Microsecs ========

     Num        Tree        Func         Per
   Calls        Time        Time        Call

11834377   688307713   688307713          58    const(bool function(raytracer.Ray, float*)) raytracer.Sphere.intersect
 1446294  2922493954   582795433         402    raytracer.Vec3 raytracer.trace(const(raytracer.Ray), raytracer.Scene, int)
       1  1748464181   296122753   296122753    void raytracer.render(raytracer.Scene, derelict.sdl.sdltypes.SDL_Surface*)
  933910   309760738   110563786         118    _D9raytracer5traceFxS9raytracer3RayS .... (lambda)
       1  1829865336    78200113    78200113    _Dmain
  933910    42084879    42084879          45    const(raytracer.Vec3 function(raytracer.Vec3)) raytracer.Sphere.normal
  795095    13423716    13423716          16    const(raytracer.Vec3 function(const(raytracer.Vec3))) raytracer.Vec3.opBinary!("*").opBinary
  933910    11122934    11122934          11    pure nothrow @trusted float std.math.pow!(float, int).pow(float, int)
  933910   313479603     3718864           3    _D9raytracer5traceFxS9raytracer3RayS9raytracer5SceneiZS   ... (lambda)
       1     3014385     2991659     2991659    void derelict.util.sharedlib.SharedLib.load(immutable(char)[][])
       1      152945      152945      152945    void derelict.util.loader.SharedLibLoader.unload()
 1590190       89018       89018           0    const(raytracer.Vec3 function(const(float))) raytracer.Vec3.opBinary!("*").opBinary
 1047016       70383       70383           0    const(float function()) raytracer.Sphere.transparency
     186       66925       66925         359    void derelict.util.loader.SharedLibLoader.bindFunc(void**, immutable(char)[], bool)

Hi,

I think you are using the version(D_SIMD) path, which is my (not very successful) attempt at vectorizing the thing. Change the version(D_SIMD) line to version(none) and it will use the scalar path, which has exactly the same dot() function as yours.

On Friday, 31 May 2013 at 04:29:19 UTC, Juan Manuel Cabo wrote:
> I just shaved 1.2 seconds trying with dmd by changing the dot function from:
>
>     float dot(in Vec3 v1, in Vec3 v2)
>     {
>         auto t = v1.v * v2.v;
>         auto p = t.ptr;
>         return p[0] + p[1] + p[2];
>     }
>
> to:
>
>     float dot(in Vec3 v1, in Vec3 v2)
>     {
>         auto one = v1.v.ptr;
>         auto two = v2.v.ptr;
>         return one[0] * two[0]
>             + one[1] * two[1]
>             + one[2] * two[2];
>     }
>
> Before:
>     2 secs, 895 ms, 891 μs, and 7 hnsecs
> After:
>     1 sec, 648 ms, 698 μs, and 1 hnsec
>
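
A small sketch of the version switch described above. `D_SIMD` is a predefined version identifier that compilers set when vector extensions are available, and `none` is a reserved identifier that is never set, so `version(none)` always falls through to the `else` branch. The `activePath` function is hypothetical and only reports which branch was compiled in; the real file presumably guards its two Vec3 implementations this way.

```d
import std.stdio : writeln;

// Hypothetical probe: returns the name of the branch that was compiled.
string activePath()
{
    version (none) // originally: version (D_SIMD)
    {
        return "simd";
    }
    else
    {
        return "scalar";
    }
}

void main()
{
    writeln(activePath()); // prints "scalar"
}
```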
