May 31, 2013 Re: Slow performance compared to C++, ideas? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Hi Andrei,
Making all methods final helps a little bit, but not by much (4x slower to 3.8x slower). It is really strange because I didn't spend any time optimizing either the D version or the C++ version, but it appears C++ is just fast by default.
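For readers following along, here is a minimal sketch of what "making all methods final" means in D. The `Sphere` class and its members are hypothetical and are not taken from raytracer.d; the only point is the difference between a default (virtual) method and a `final` one.

```d
import std.stdio;

// Hypothetical class for illustration; not the Sphere from raytracer.d.
class Sphere
{
    float radius = 1.0f;

    // Public class methods are virtual by default in D, so this call goes
    // through the vtable unless the compiler can prove the dynamic type.
    float area() const
    {
        return 4.0f * 3.14159265f * radius * radius;
    }

    // `final` forbids overriding, so the call can be bound statically
    // and becomes a candidate for inlining.
    final float diameter() const
    {
        return 2.0f * radius;
    }
}

void main()
{
    auto s = new Sphere;
    writeln("diameter: ", s.diameter());
    writeln("area: ", s.area());
}
```

Whether this matters in practice depends on how hot the calls are, which fits the small gain (4x to 3.8x) reported above.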
On Friday, 31 May 2013 at 02:56:25 UTC, Andrei Alexandrescu wrote:
> On 5/30/13 9:26 PM, finalpatch wrote:
>> https://dl.dropboxusercontent.com/u/974356/raytracer.d
>> https://dl.dropboxusercontent.com/u/974356/raytracer.cpp
>
> Manu's gonna love this one: make all methods final.
>
> Andrei
|
May 31, 2013 Re: Slow performance compared to C++, ideas? | ||||
---|---|---|---|---|
| ||||
Posted in reply to finalpatch | There are some issues involving the use of array literals: they get allocated on the heap for no clear reason. Create a version of your vector constructor that takes four floats, then call that instead in your line 324. |
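A rough sketch of the suggested change, assuming a hypothetical `Vec3` that stores four floats in a static array (the field name `v` and both constructors are assumptions, not the actual code at line 324 of raytracer.d):

```d
// Hypothetical Vec3 for illustration; the layout is an assumption.
struct Vec3
{
    float[4] v;

    // Slice-taking constructor: calling it with an array literal such as
    // Vec3([1, 2, 3, 0]) may allocate that literal on the GC heap.
    this(const(float)[] a)
    {
        v[] = a[0 .. 4];
    }

    // Scalar constructor: the four values are passed directly,
    // so no array literal is involved at the call site.
    this(float x, float y, float z, float w)
    {
        v[0] = x;
        v[1] = y;
        v[2] = z;
        v[3] = w;
    }
}

void main()
{
    auto viaLiteral = Vec3([1.0f, 2.0f, 3.0f, 0.0f]); // goes through an array literal
    auto viaScalars = Vec3(1.0f, 2.0f, 3.0f, 0.0f);   // plain scalar arguments
    assert(viaLiteral.v == viaScalars.v);
}
```

Both calls build the same value; only the first one involves an array literal, and in a tight inner loop that is the call to look at with a profiler.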
May 31, 2013 Re: Slow performance compared to C++, ideas? | ||||
---|---|---|---|---|
| ||||
Posted in reply to FeepingCreature | On Friday, 31 May 2013 at 03:26:16 UTC, FeepingCreature wrote:
> There's some issues involving the use of array literals - they
> get allocated on the heap for no clear reason. Create a version
> of your vector constructor that uses four floats, then call that
> instead in your line 324.
Addendum: in general, always, **ALWAYS PROFILE BEFORE OPTIMIZING**. I'm sure there are good profilers for OS X, or you can just use -g -pg and gprof. (perf is good under Linux!)
|
May 31, 2013 Re: Slow performance compared to C++, ideas? | ||||
---|---|---|---|---|
| ||||
Posted in reply to finalpatch | Two quick notes: a) Profile first, then optimise. b) This probably wouldn't make a 4x difference, but in the C++ code you're passing most objects around by ref. In the D version you're passing structs by value. They are only small, but there's a tight loop of recursion to consider... That said, I don't know the details of D optimisation all that well and it may be a non-issue in release builds. Stewart |
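To make the by-value versus by-ref point concrete, here is a small sketch with a hypothetical three-float `Vec3` and two hypothetical functions (none of these names come from raytracer.d). `ref const` is the closest D spelling of C++'s `const Vec3&`.

```d
// Hypothetical struct and functions for illustration only.
struct Vec3
{
    float x, y, z;
}

// By value: each argument is copied into the callee.
float dotByValue(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// By reference: no copy is made; by default `ref` binds only to lvalues.
float dotByRef(ref const Vec3 a, ref const Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

void main()
{
    auto a = Vec3(1, 2, 3);
    auto b = Vec3(4, 5, 6);
    assert(dotByValue(a, b) == 32);
    assert(dotByRef(a, b) == 32);
}
```

For a 12-byte struct an optimizing compiler may well generate the same code for both once the calls are inlined, so this is something to measure rather than assume.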
May 31, 2013 Re: Slow performance compared to C++, ideas? | ||||
---|---|---|---|---|
| ||||
Posted in reply to FeepingCreature | Hi FeepingCreature,
Thanks for the tip. Getting rid of the array constructor helped a lot: runtime is down from 800+ms to 583ms (with LDC; it still cannot match C++ though). Maybe I should get rid of all arrays and use hardcoded x,y,z member variables instead, or use tuples.
On Friday, 31 May 2013 at 03:26:16 UTC, FeepingCreature wrote:
> There's some issues involving the use of array literals - they
> get allocated on the heap for no clear reason. Create a version
> of your vector constructor that uses four floats, then call that
> instead in your line 324.
|
May 31, 2013 Re: Slow performance compared to C++, ideas? | ||||
---|---|---|---|---|
| ||||
Posted in reply to finalpatch | On 5/30/2013 7:31 PM, finalpatch wrote:
> Thanks for the reply. I have already tried these flags. However, DMD's codegen
> is lagging behind GCC and LLVM at the moment, so even with these flags, the
> runtime is ~10x longer than the C++ version compiled with clang++ (2sec with
> DMD, 200ms with clang++ on a Core2 Mac Pro). I know this is comparing apples to
> oranges though, that's why I was comparing GDC vs G++ and LDC vs Clang++.
Yes, you're comparing things the right way.
|
May 31, 2013 Re: Slow performance compared to C++, ideas? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Friday, 31 May 2013 at 02:56:25 UTC, Andrei Alexandrescu wrote:
> On 5/30/13 9:26 PM, finalpatch wrote:
>> https://dl.dropboxusercontent.com/u/974356/raytracer.d
>> https://dl.dropboxusercontent.com/u/974356/raytracer.cpp
>
> Manu's gonna love this one: make all methods final.
>
I don't think going as far as making things final by default makes sense at this point. But we sure need a way to be able to finalize methods. We had an extensive discussion with Don and Manu at DConf; here are some ideas that came out of it:
- Final by default
This one is really a plus when it comes to performance code. However, virtual by default has proven itself very useful when performance isn't that big of a deal (and that is usually the case for 90% of a program's code), and final by default would limit the use of some patterns like decorator (which have also proven to be useful). This is also a huge breakage.
- Introduce a virtual keyword
Virtual by default isn't such a big deal if you can write final: and reverse the default behavior. However, once you are in final land, you are trapped there; you can't get out. Introducing a virtual keyword would allow for aggressive final: declarations.
- Require explicit export when you want to create shared objects
This one is an enabler for an optimizer to finalize virtual methods. With this mechanism in place, the compiler knows all the overrides and can finalize many calls during LTO. I especially like this one as it allows stripping many symbols at link time and enables other LTO in general (for instance, the compiler can choose custom calling conventions for some methods, knowing all call sites).
The explicit export option has my preference; however, it requires that symbols used in shared libs are explicitly declared export. I think we shouldn't break the virtual by default behavior, but we still need to figure out a way to make things more performant on this point.
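The "trapped in final land" point from the second idea can be sketched like this. `Shape` and `Circle` are hypothetical classes used only to show how the `final:` label behaves in the language as discussed in this thread.

```d
// Hypothetical classes for illustration only.
class Shape
{
    float scale = 1.0f;

    // Declared before the label, so still virtual by default.
    float area() const
    {
        return 0.0f;
    }

final:
    // Everything from here to the end of the class body is final.
    float scaled(float x) const
    {
        return x * scale;
    }

    float unscaled(float x) const
    {
        return x / scale;
    }
    // There is no keyword to switch back to virtual at this point,
    // which is the gap a `virtual` keyword would fill.
}

class Circle : Shape
{
    float r = 1.0f;

    // Allowed: area() was declared before `final:`.
    override float area() const
    {
        return 3.14159265f * r * r;
    }
}

void main()
{
    Shape s = new Circle;
    assert(s.area() > 3.0f);        // dispatched virtually to Circle.area
    assert(s.scaled(2.0f) == 2.0f); // statically bound final method
}
```

Any method that must stay overridable has to be declared above the label, which is awkward in large classes and is why `final:` alone is not a full substitute for a reversed default.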
|
May 31, 2013 Re: Slow performance compared to C++, ideas? | ||||
---|---|---|---|---|
| ||||
Posted in reply to deadalnix | Profile. Don't even think of asking for help before profiling. Those bugs you fixed here would be trivial to detect with a profiler. GC-dependent issues (like the array literals mentioned here) usually are. As for profiling something like this, both gprof and DMD's builtin profiler are probably going to be useless. Use perf (if on Linux), or AMD CodeAnalyst, or some other sampling profiler (these two are free). |
May 31, 2013 Re: Slow performance compared to C++, ideas? | ||||
---|---|---|---|---|
| ||||
Posted in reply to finalpatch | On 05/31/2013 12:45 AM, finalpatch wrote:
> Hi FeepingCreature,
>
> Thanks for the tip, getting rid of the array constructor helped a lot, Runtime is down from 800+ms to 583ms (with LDC, still cannot match C++ though). Maybe I should get rid of all arrays and use hardcoded x,y,z member variables instead, or use tuples.
>
> On Friday, 31 May 2013 at 03:26:16 UTC, FeepingCreature wrote:
>> There's some issues involving the use of array literals - they get allocated on the heap for no clear reason. Create a version of your vector constructor that uses four floats, then call that instead in your line 324.
>
I just shaved 1.2 seconds trying with dmd by changing the dot function from:

    float dot(in Vec3 v1, in Vec3 v2)
    {
        auto t = v1.v * v2.v;
        auto p = t.ptr;
        return p[0] + p[1] + p[2];
    }

to:

    float dot(in Vec3 v1, in Vec3 v2)
    {
        auto one = v1.v.ptr;
        auto two = v2.v.ptr;
        return one[0] * two[0]
             + one[1] * two[1]
             + one[2] * two[2];
    }

Before:
    2 secs, 895 ms, 891 μs, and 7 hnsecs
After:
    1 sec, 648 ms, 698 μs, and 1 hnsec

For others who might want to try, I downloaded the necessary derelict files from:
http://www.dsource.org/projects/derelict/browser/branches/Derelict2
(DerelictUtil and DerelictSDL directories).

And compiled with:

    dmd -O -inline -noboundscheck -release raytracer.d \
        derelict/sdl/sdl.d derelict/sdl/sdlfuncs.d \
        derelict/sdl/sdltypes.d derelict/util/compat.d \
        derelict/util/exception.d derelict/util/loader.d \
        derelict/util/sharedlib.d -L-ldl

I also ran it with the -profile switch. Here are the top functions in trace.log:

    ======== Timer Is 3579545 Ticks/Sec, Times are in Microsecs ========

         Num        Tree        Func        Per
       Calls        Time        Time       Call

    11834377   688307713   688307713         58   const(bool function(raytracer.Ray, float*)) raytracer.Sphere.intersect
     1446294  2922493954   582795433        402   raytracer.Vec3 raytracer.trace(const(raytracer.Ray), raytracer.Scene, int)
           1  1748464181   296122753  296122753   void raytracer.render(raytracer.Scene, derelict.sdl.sdltypes.SDL_Surface*)
      933910   309760738   110563786        118   _D9raytracer5traceFxS9raytracer3RayS .... (lambda)
           1  1829865336    78200113   78200113   _Dmain
      933910    42084879    42084879         45   const(raytracer.Vec3 function(raytracer.Vec3)) raytracer.Sphere.normal
      795095    13423716    13423716         16   const(raytracer.Vec3 function(const(raytracer.Vec3))) raytracer.Vec3.opBinary!("*").opBinary
      933910    11122934    11122934         11   pure nothrow @trusted float std.math.pow!(float, int).pow(float, int)
      933910   313479603     3718864          3   _D9raytracer5traceFxS9raytracer3RayS9raytracer5SceneiZS ... (lambda)
           1     3014385     2991659    2991659   void derelict.util.sharedlib.SharedLib.load(immutable(char)[][])
           1      152945      152945     152945   void derelict.util.loader.SharedLibLoader.unload()
     1590190       89018       89018          0   const(raytracer.Vec3 function(const(float))) raytracer.Vec3.opBinary!("*").opBinary
     1047016       70383       70383          0   const(float function()) raytracer.Sphere.transparency
         186       66925       66925        359   void derelict.util.loader.SharedLibLoader.bindFunc(void**, immutable(char)[], bool)
|
May 31, 2013 Re: Slow performance compared to C++, ideas? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Juan Manuel Cabo | Hi,
I think you are using the version(D_SIMD) path, which is my (not very successful) attempt at vectorizing the thing.
Change the version(D_SIMD) line to version(none) and it will use the scalar path, which has exactly the same dot() function as yours.
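A minimal sketch of how that switch works, using hypothetical names (`path`, `disabled`) rather than the real contents of raytracer.d. `D_SIMD` is a predefined version identifier that the compiler sets when vector extensions are supported, while `none` is never defined, so its branch is never compiled.

```d
import std.stdio;

// Selected at compile time: the first branch is compiled only when the
// compiler defines D_SIMD for the target.
version (D_SIMD)
    enum path = "simd";
else
    enum path = "scalar";

// `none` is never defined, so the else branch is always the one compiled.
// Replacing `version (D_SIMD)` with `version (none)` therefore forces
// whatever sits in the else branch, here the scalar code.
version (none)
    enum disabled = true;
else
    enum disabled = false;

void main()
{
    writeln("dot() path selected at compile time: ", path);
    assert(!disabled);
}
```

Because the choice is made at compile time, the unused branch adds no runtime cost, which makes it a convenient way to compare the SIMD and scalar timings.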
On Friday, 31 May 2013 at 04:29:19 UTC, Juan Manuel Cabo wrote:
> I just shaved 1.2 seconds trying with dmd by changing the dot function from:
>
> float dot(in Vec3 v1, in Vec3 v2)
> {
> auto t = v1.v * v2.v;
> auto p = t.ptr;
> return p[0] + p[1] + p[2];
> }
>
> to:
>
> float dot(in Vec3 v1, in Vec3 v2)
> {
> auto one = v1.v.ptr;
> auto two = v2.v.ptr;
> return one[0] * two[0]
> + one[1] * two[1]
> + one[2] * two[2];
> }
>
> Before:
> 2 secs, 895 ms, 891 μs, and 7 hnsecs
> After:
> 1 sec, 648 ms, 698 μs, and 1 hnsec
>
|