D 50% slower than C++. What I'm doing wrong?
April 14, 2012
I have this simple binary arithmetic coder in C++ by Mahoney, translated to D by Maffi. I added "nothrow", "final", "pure" and "GC.disable" wherever possible, but that didn't make much difference. Adding "const" to Predictor.p() (as in the C++ version) gave 3% higher performance. Here are the two versions:

http://mattmahoney.net/dc/  <-- original zip

http://pastebin.com/55x9dT9C  <-- Original C++ version.
http://pastebin.com/TYT7XdwX  <-- Modified D translation.

The problem is that the D version is 50% slower:

test.fpaq0 (16562521 bytes) -> test.bmp (33159254 bytes)

Lang | Compiler | Binary size | Time (lower is better) | Flags
C++  | g++      |       13 KB | 2.42 s (100%)          | -O3 -s
D    | DMD      |      230 KB | 4.46 s (184%)          | -O -release -inline
D    | GDC      |     1322 KB | 3.69 s (152%)          | -O3 -frelease -s


The only difference I could see between the C++ and D versions is that the C++ code has hints to the compiler about which functions to inline, and I couldn't find anything similar in D. So I manually inlined the encode and decode functions:

http://pastebin.com/N4nuyVMh  - Manual inline

Lang | Compiler | Binary size | Time (lower is better) | Flags
D    | DMD      |      228 KB | 3.70 s (153%)          | -O -release -inline
D    | GDC      |     1318 KB | 3.50 s (144%)          | -O3 -frelease -s

Still, the D version is slower. What causes this speed difference? Is there any way to side-step it?

Note that this simple C++ version can be made more than 2 times faster with algorithmic and I/O optimizations, (ab)using templates, etc. So I'm not asking for generic speed optimizations, but only for things that may make the D code "more equal" to the C++ code.
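
To make the inlining point concrete: in C++, small hot-path member functions can carry the `inline` keyword, which the compiler treats as a hint and is free to ignore. The following is only a minimal sketch of that pattern, not the fpaq0 source; `BitCounter`, `zeros` and `ones` are hypothetical names chosen for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for a small per-bit helper; NOT the fpaq0 source.
struct BitCounter {
    int zeros = 0;  // number of 0 bits seen so far
    int ones = 0;   // number of 1 bits seen so far

    // Member functions defined inside the class are implicitly inline;
    // the keyword is spelled out here only to show where the hint goes.
    inline int p() const {
        // Estimated probability that the next bit is 1, scaled to 12 bits.
        return 4096 * (ones + 1) / (zeros + ones + 2);
    }

    inline void update(int bit) {
        assert(bit == 0 || bit == 1);
        if (bit) {
            ++ones;
        } else {
            ++zeros;
        }
    }
};
```

Whether a call such as `counter.p()` inside a per-bit loop is actually expanded in place remains the optimizer's decision; the keyword only expresses the programmer's preference.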
April 14, 2012
On 14/04/12 21:05, ReneSac wrote:
> Lang | Compiler | Binary size | Time (lower is better) | Flags
> C++  | g++      |       13 KB | 2.42 s (100%)          | -O3 -s
> D    | DMD      |      230 KB | 4.46 s (184%)          | -O -release -inline
> D    | GDC      |     1322 KB | 3.69 s (152%)          | -O3 -frelease -s

Try using extra optimizations for GDC.  Actually, GDC has a "dmd-like" interface, gdmd, and

   gdmd -O -release -inline

corresponds to

   gdc -O3 -fweb -frelease -finline-functions

... so there may be some optimizations you were missing.  (If you call gdmd with the -vdmd flag, it will tell you exactly what gdc statement it's using.)

> The only difference I could see between the C++ and D versions is that C++ has
> hints to the compiler about which functions to inline, and I couldn't find
> anything similar in D. So I manually inlined the encode and decode functions:

GDC has all the regular gcc optimization flags available IIRC.  The ones on the GDC man page are just the ones specific to GDC.

> Still, the D version is slower. What makes this speed difference? Is there any
> way to side-step this?

In my (brief and limited) experience, GDC-produced executables tend to show a noticeable but minor gap compared to equivalent g++-compiled C++ code -- nothing on the order of 150%.

E.g. I have some simulation code which models a reputation system where users rate objects and are then in turn judged on the consistency of their ratings with the general consensus.  A simulation with 1000 users and 1000 objects takes ~22s to run in C++, ~24s in D compiled with gdmd -O -release -inline.

Scale that up to 2000 users and 1000 objects and it's 47s (C++) vs 53s (D).
2000 users and 2000 objects gives 1min 49s (C++) and 2min 4s (D).

So, it's a gap, but not one to write home about really, especially when you count that D is safer and (I think) easier/faster to program in.

It's true that DMD is much slower -- the GCC backend is much better at generating fast code.  If I recall right the DMD backend's encoding of floating point operations is considerably less sophisticated.

> Note that this simple C++ version can be made more than 2 times faster with
> algorithmic and I/O optimizations, (ab)using templates, etc. So I'm not asking
> for generic speed optimizations, but only things that may make the D code "more
> equal" to the C++ code.

I'm sure you can make various improvements to your D code in a similar way, and some things get faster in D when written in idiomatic "D style" rather than in a more C++-ish way (e.g. if you want to copy one vector to another, as happens in my code, write x[] = y[] instead of using any kind of loop).

Best wishes,

    -- Joe
April 14, 2012
On Saturday, 14 April 2012 at 19:05:40 UTC, ReneSac wrote:
> Still, the D version is slower. What makes this speed difference? Is there any way to side-step this?
>
> Note that this simple C++ version can be made more than 2 times faster with algorithmic and I/O optimizations, (ab)using templates, etc. So I'm not asking for generic speed optimizations, but only things that may make the D code "more equal" to the C++ code.

I wrote a version http://codepad.org/phpLP7cx based on the C++ one.

My commands used to compile:

g++46 -O3 -s fpaq0.cpp -o fpaq0cpp
dmd -O -release -inline -noboundscheck fpaq0.d

G++ 4.6, dmd 2.059.

I did 5 tests for each:

test.fpaq0 (34603008 bytes) -> test.bmp (34610367 bytes)

The C++ average result was 9.99 seconds (varying from 9.98 to 10.01)
The D average result was 12.00 seconds (varying from 11.98 to 12.01)

That means there is a 16.8 percent difference in run time, which would be wiped out by using gdc (which I don't have around currently).
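
A note on the arithmetic: the 16.8 percent figure matches the gap measured relative to the slower (D) time. Measured relative to the C++ time, the same two averages say D is about 20 percent slower. A small sketch of both calculations follows; the helper names `percentSlower` and `percentTimeSaved` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// How much longer `slow` takes than `fast`, as a percentage of `fast`.
double percentSlower(double fast, double slow) {
    assert(fast > 0.0);
    return (slow - fast) / fast * 100.0;
}

// How much time `fast` saves compared to `slow`, as a percentage of `slow`.
double percentTimeSaved(double fast, double slow) {
    assert(slow > 0.0);
    return (slow - fast) / slow * 100.0;
}
```

With the averages above, `percentSlower(9.99, 12.00)` comes out at roughly 20.1 and `percentTimeSaved(9.99, 12.00)` at roughly 16.75, so which figure gets quoted depends on the chosen baseline.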
April 14, 2012
Forgot to mention specs: dual-core Athlon II X2 240 (2.8GHz), 4GB RAM, FreeBSD 9 x64; both compilers are 64-bit.
April 14, 2012
On 14/04/2012 21:53, q66 wrote:
> I wrote a version http://codepad.org/phpLP7cx based on the C++ one.
> 
> My commands used to compile:
> 
> g++46 -O3 -s fpaq0.cpp -o fpaq0cpp
> dmd -O -release -inline -noboundscheck fpaq0.d
> 
> G++ 4.6, dmd 2.059.
> 
> I did 5 tests for each:
> 
> test.fpaq0 (34603008 bytes) -> test.bmp (34610367 bytes)
> 
> The C++ average result was 9.99 seconds (varying from 9.98 to 10.01)
> The D average result was 12.00 seconds (varying from 11.98 to 12.01)
> 
> That means there is 16.8 percent difference in performance that would be cleared out by usage of gdc (which I don't have around currently).

The code is nearly identical (there is a slight difference in update(),
where he accesses the array once more than you), but the main difference
I see is the -noboundscheck compilation option on DMD.
April 14, 2012
On Saturday, 14 April 2012 at 20:58:01 UTC, Somedude wrote:
> The code is nearly identical (there is a slight difference in update(),
> where he accesses the array once more than you), but the main difference
> I see is the -noboundscheck compilation option on DMD.

He also uses a class. And -noboundscheck should be automatically implied by -release.
April 15, 2012
I tested q66's version on my computer (Sandy Bridge @ 4.3GHz). The old timings are repeated here, and the new results are marked as "D-v2":

test.fpaq0 (16562521 bytes) -> test.bmp (33159254 bytes)

Lang | Compiler | Binary size | Time (lower is better) | Flags
C++  | g++      |       13 KB | 2.42 s (100%)          | -O3 -s
D    | DMD      |      230 KB | 4.46 s (184%)          | -O -release -inline
D    | GDC      |     1322 KB | 3.69 s (152%)          | -O3 -frelease -s
D-v2 | DMD      |      206 KB | 4.50 s (186%)          | -O -release -inline
D-v2 | GDC      |      852 KB | 3.65 s (151%)          | -O3 -frelease -s

So, basically the same thing... Not using classes seems a little slower on DMD, and makes no difference on GDC. The "if (++ct[cxt][y] > 65534)" made a very small but measurable difference (those .04s in GDC). The "if ((cxt += cxt + y) >= 512)" only made the code more complicated, with no speed benefit.

But the input file is also important. The file you tested seems to be an already compressed one, or something not very compressible. Here is a test with an incompressible file:

pnad9huff.fpaq0 (43443040 bytes) -> test-d.huff (43617049 bytes)

Lang  | Compiler | Binary size | Time (lower is better) | Flags
C++   | g++      |       13 KB | 5.13 s (100%)          | -O3 -s
D-v2  | DMD      |      206 KB | 8.03 s (156%)          | -O -release -inline
D-v2  | GDC      |      852 KB | 7.09 s (138%)          | -O3 -frelease -s
D-inl | DMD      |      228 KB | 6.93 s (135%)          | -O -release -inline
D-inl | GDC      |     1318 KB | 6.86 s (134%)          | -O3 -frelease -s

The C++ advantage becomes smaller on this file. D-inl is my manually inlined version, with your small optimization in "Predictor.Update()".

On Saturday, 14 April 2012 at 19:51:21 UTC, Joseph Rushton Wakeling wrote:
> GDC has all the regular gcc optimization flags available IIRC.  The ones on the GDC man page are just the ones specific to GDC.
I'm not talking about compiler flags, but about the "inline" keyword in the C++ source code. I saw some discussion about "@inline", but it seems not to be implemented (yet?). Well, that is not a priority for D anyway.


About compiler optimizations, -finline-functions and -fweb are part of -O3. I tried to compile with -no-bounds-check, but it made no difference for either DMD or GDC. It is probably part of -release, as q66 said.
April 15, 2012
On Sunday, April 15, 2012 03:51:59 ReneSac wrote:
> About compiler optimizations, -finline-functions and -fweb are part of -O3. I tried to compile with -no-bounds-check, but made no difference for DMD and GDC. It probably is part of -release as q66 said.

Not quite. -noboundscheck turns off _all_ array bounds checking, whereas -release turns it off in @system and @trusted functions, but not in @safe functions. But unless you've marked your code with @safe or it uses templated functions which get inferred as @safe, all of your functions are going to be @system functions anyway, in which case it makes no difference.

- Jonathan M Davis
April 15, 2012
On 14/04/12 23:03, q66 wrote:
> He also uses a class. And -noboundscheck should be automatically induced by
> -release.

Ahh, THAT probably explains why some of my numerical code is so markedly different in speed when compiled using DMD with or without the -release switch.  It's a MAJOR difference -- between code taking say 5min to run, compared to half an hour or more.
April 15, 2012
On 14/04/12 23:03, q66 wrote:
> He also uses a class. And -noboundscheck should be automatically induced by
> -release.

... but the methods are marked as final -- shouldn't that substantially reduce any speed hit from using class instead of struct?
