Thread overview
Request for timings
Nov 25, 2011
bearophile
Nov 25, 2011
Andrea Fontana
Nov 25, 2011
bearophile
Nov 25, 2011
Andrea Fontana
Nov 26, 2011
bearophile
Nov 26, 2011
Andrea Fontana
Nov 26, 2011
bearophile
Nov 26, 2011
Jerry
Nov 26, 2011
bearophile
Nov 28, 2011
bearophile
November 25, 2011
This is the nbody benchmark of the Shootout site: http://shootout.alioth.debian.org/u32/performance.php?test=nbody

The fastest version is a Fortran one, probably thanks to vector operations that allow a better SIMD vectorization.

This is the C++ version: http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gpp&id=1

C++ version compiled with:
g++ -Ofast -fomit-frame-pointer -march=native -mfpmath=sse -msse3 --std=c++0x
The input parameter is n = 3_000_000 (but use a larger number, like 10 million or more, if the run time is too short).

First D2 version (serial):
http://codepad.org/AdRSm2wP

Second D2 version, three times slower thanks to vector ops, more similar to the Fortran version: http://codepad.org/7O3mz9en
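
Roughly, the difference between the two styles is something like this (just a sketch, not the actual code from the links above):

import std.stdio;

void main() {
    double[3] pos = [1.0, 2.0, 3.0];
    double[3] vel = [0.1, 0.2, 0.3];
    enum double dt = 0.01;

    // First version: per-component scalar updates.
    pos[0] += dt * vel[0];
    pos[1] += dt * vel[1];
    pos[2] += dt * vel[2];

    // Second version: D vector (array) operations, closer to the Fortran
    // code, but currently compiled to slower code.
    pos[] += vel[] * dt;

    writeln(pos[]);
}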

Is someone willing to take two (or more) timings using the LDC 2 compiler (I have only LDC 1, besides DMD)? I'd like to know how long the first D version takes to run *compared* to the C++ version :-) If you time the second D2 version too, even better.

Bye and thank you,
bearophile
November 25, 2011
Hmm, reading the code I verified that I can gain about 10% by writing:

                    imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
                    imd mag = dt / (dSquared * dSquared^^2);

Around line 115. Give it a try...

On Thu 24/11/2011 at 22.37 -0500, bearophile wrote:

> This is the nbody benchmark of the Shootout site: http://shootout.alioth.debian.org/u32/performance.php?test=nbody
> 
> The fastest version is a Fortran one, probably thanks to vector operations that allow a better SIMD vectorization.
> 
> This is the C++ version: http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gpp&id=1
> 
> C++ version compiled with:
> g++ -Ofast -fomit-frame-pointer -march=native -mfpmath=sse -msse3 --std=c++0x
> The input parameter is n = 3_000_000 (but use a larger number, like 10 million or more, if the run time is too short).
> 
> First D2 version (serial):
> http://codepad.org/AdRSm2wP
> 
> Second D2 version, three times slower thanks to vector ops, more similar to the Fortran version: http://codepad.org/7O3mz9en
> 
> Is someone willing to take two (or more) timings using the LDC 2 compiler (I have only LDC 1, besides DMD)? I'd like to know how long the first D version takes to run *compared* to the C++ version :-) If you time the second D2 version too, even better.
> 
> Bye and thank you,
> bearophile


November 25, 2011
Andrea Fontana:

> Hmm, reading the code I verified that I can gain about 10% by writing:
> 
>                     imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
>                     imd mag = dt / (dSquared * dSquared^^2);
> 
> Around line 115. Give it a try...

My version performs a sqrt and one multiplication, while your version performs one sqrt and two multiplications. On my PC my version is faster.

Bye,
bearophile
November 25, 2011
Really? DMD version and system? (here: DMD from git, Ubuntu 11.10, 64-bit)

./app 50000000

Your version:
real	0m14.674s

My version:
real	0m13.644s

Your version:
imd dSquared = dx ^^ 2 + dy ^^ 2 + dz ^^ 2;
imd mag = dt / (dSquared * sqrt(dSquared));

My version:
imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
imd mag = dt / (dSquared * dSquared^^2);  // That is: dt / (dSquared^^3);

Probably variable evaluation works better in my example?

Btw:

imd dSquared = sqrt(dx*dx + dy*dy + dz*dz);
imd mag = dt / (dSquared*dSquared*dSquared);

real	0m13.574s
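
All three formulations compute the same quantity, dt / distance^3 (up to rounding), so the differences should come only from the generated code. A tiny standalone check, separate from the nbody program:

import std.stdio, std.math;

void main() {
    double dx = 1.1, dy = 2.2, dz = 3.3, dt = 0.01;

    // bearophile's version: one sqrt, one multiplication (plus the division).
    double d2 = dx ^^ 2 + dy ^^ 2 + dz ^^ 2;
    double magA = dt / (d2 * sqrt(d2));

    // My first rewrite: one sqrt, two multiplications.
    double d = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
    double magB = dt / (d * d ^^ 2);

    // Plain multiplications.
    double magC = dt / (d * d * d);

    writefln("%.17g  %.17g  %.17g", magA, magB, magC);
}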


On Fri 25/11/2011 at 08.14 -0500, bearophile wrote:

> Andrea Fontana:
> 
> > Hmm, reading the code I verified that I can gain about 10% by writing:
> > 
> >                     imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
> >                     imd mag = dt / (dSquared * dSquared^^2);
> > 
> > Around line 115. Give it a try...
> 
> My version performs a sqrt and one multiplication, while your version performs one sqrt and two multiplications. On my PC my version is faster.
> 
> Bye,
> bearophile


November 26, 2011
Andrea Fontana Wrote:

Yes, really :-) Timings taken with DMD 2.056+, 32 bit Vista OS.

Bye,
bearophile
November 26, 2011
bearophile <bearophileHUGS@lycos.com> writes:

> This is the nbody benchmark of the Shootout site: http://shootout.alioth.debian.org/u32/performance.php?test=nbody
>
> The fastest version is a Fortran one, probably thanks to vector operations that allow a better SIMD vectorization.
>
> This is the C++ version: http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gpp&id=1
>
> C++ version compiled with:
> g++ -Ofast -fomit-frame-pointer -march=native -mfpmath=sse -msse3 --std=c++0x
> The input parameter is n = 3_000_000 (but use a larger number, like 10
> million or more, if the run time is too short).

All timings done with gdc 0.30 using dmd 2.055 and gcc 4.6.2.  I built with both D and C++ enabled so the back end would be the same.

jlquinn@wyvern:~/d/tests$ ~/gcc/gdc/bin/g++ -O3 -fomit-frame-pointer -march=native -lm -mfpmath=sse -msse3 --std=c++0x nbody.cc -o nbody_c++
jlquinn@wyvern:~/d/tests$ time ./nbody_c++ 50000000
-0.169075164
-0.169059907

real	0m10.209s
user	0m10.180s
sys	0m0.010s


> First D2 version (serial):
> http://codepad.org/AdRSm2wP

~/gcc/gdc/bin/gdc -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3 -frelease nbody.d -o nbody_d
jlquinn@wyvern:~/d/tests$ time ./nbody_d 50000000
-0.169075164
-0.169059907

real	0m9.830s
user	0m9.820s
sys	0m0.000s

jlquinn@wyvern:~/d/tests$ dmd -O -release nbody.d
jlquinn@wyvern:~/d/tests$ time ./nbody_d 50000000
-0.169075164
-0.169059907

real	0m9.828s
user	0m9.830s
sys	0m0.000s

> Second D2 version, three times slower thanks to vector ops, more similar to the Fortran version: http://codepad.org/7O3mz9en

~/gcc/gdc/bin/gdc -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3 -frelease nbody2.d -o nbody2_d
jlquinn@wyvern:~/d/tests$ time ./nbody2_d 50000000
-0.169075164
-0.169059907

real	0m26.805s
user	0m26.760s
sys	0m0.020s

jlquinn@wyvern:~/d/tests$ dmd -O -release nbody2.d
jlquinn@wyvern:~/d/tests$ time ./nbody2_d 50000000
-0.169075164
-0.169059907

real	0m26.777s
user	0m26.760s
sys	0m0.000s


> Is someone willing to take two (or more) timings using the LDC 2 compiler (I have only LDC 1, besides DMD)? I'd like to know how long the first D version takes to run *compared* to the C++ version :-) If you time the second D2 version too, even better.
>
> Bye and thank you,
> bearophile

So the upshot seems to be that DMD and GDC generate similar code for this test. And both D compilers generate slightly faster code than the C++ version, so either the D front end is doing a slightly better optimization job, or your first version is slightly more efficient code.

Jerry
November 26, 2011
Thank you for your surprising timings, Jerry.

> All timings done with gdc 0.30 using dmd 2.055 and gcc 4.6.2.  I built with both D and C++ enabled so the back end would be the same.

Is your system a 64-bit one?


> So the upshot seems to be that DMD and GDC generate similar code for this test.

This is an uncommon thing, especially on 32-bit systems.


> And both D compilers generate slightly faster code than the C++ version, so either the D front end is doing a slightly better optimization job, or your first version is slightly more efficient code.

D1 code is often a bit slower than similar C++ code, but in this case I think D2 has allowed me to specify more semantics, and that has produced a faster program. The static foreach I have used in that D2 code not only looks cleaner than those C++0x template tricks, it also gives better assembly output.

And the first D2 program is not even the fastest possible: that second D2 program is slow today, but it contains more semantics that will hopefully someday allow it to become faster than the first one, and about as fast as that Fortran version.

This code currently doesn't compile (no vector ^^, no vector sqrt, no good sum function):

immutable double[NPAIR] distance = sqrt(sum(r[] ^^ 2, dim=0));


It is currently implemented like this:

double[NPAIR] distance = void;
foreach (i; Iota!(0, NPAIR))
    distance[i] = sqrt(r[i][0] ^^ 2 + r[i][1] ^^ 2 + r[i][2] ^^ 2);
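
(Iota is not in Phobos; it's a small template that yields a compile-time tuple of integers, so the foreach above is unrolled at compile time. A minimal definition of that kind of template:)

import std.typetuple: TypeTuple;

// Compile-time tuple of the integers in [start, stop).
template Iota(int start, int stop) {
    static if (start >= stop)
        alias TypeTuple!() Iota;
    else
        alias TypeTuple!(start, Iota!(start + 1, stop)) Iota;
}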


Here those square roots are parallelizable: the compiler is allowed to use the SSE2 sqrtpd instruction to perform those 10 sqrt(double) operations with 5 instructions. With the ymm registers of AVX, the VSQRTPD instruction (intrinsic _mm256_sqrt_pd in lesser languages) does 4 double square roots at a time. But maybe the array's starting location needs to be aligned to 16 bytes (not currently supported syntax):

align(16) immutable double[NPAIR] distance = sqrt(sum(r[] ^^ 2, dim=0));

Bye,
bearophile
November 26, 2011
Maybe Jerry could test my edits and check the timings...

About "^^": have you tried to benchmark

1/(x ^^ n) vs x ^^ -n ?  (with n>0)

On my setup the second version is very, very slow.
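
Something like this minimal standalone micro-benchmark (just a sketch; the value of x and the iteration count are arbitrary):

import std.stdio, std.math, std.datetime;

void main() {
    enum int N = 50_000_000;

    double x = 1.000_000_1;
    double sum1 = 0.0;
    StopWatch sw;
    sw.start();
    foreach (i; 0 .. N) {
        sum1 += 1.0 / (x ^^ 3);
        x += 1e-9;  // Keeps the compiler from hoisting the expression.
    }
    sw.stop();
    writeln("1/(x ^^ 3): ", sw.peek().msecs, " ms, sum = ", sum1);

    x = 1.000_000_1;
    double sum2 = 0.0;
    sw.reset();
    sw.start();
    foreach (i; 0 .. N) {
        sum2 += x ^^ -3;
        x += 1e-9;
    }
    sw.stop();
    writeln("x ^^ -3:    ", sw.peek().msecs, " ms, sum = ", sum2);
}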


== Quoted from the article by bearophile (bearophileHUGS@lycos.com)
> Andrea Fontana Wrote:
> Yes, really :-) Timings taken with DMD 2.056+, 32 bit Vista OS.
> Bye,
> bearophile

November 26, 2011
Andrea Fontana:

> About "^^": have you tried to benchmark
> 
> 1/(x ^^ n) vs x ^^ -n ?  (with n>0)
> 
> On my setup the second version is very, very slow.

Take a look at the produced assembly code :-)
Maybe x ^^ n is rewritten as simple code, while x ^^ -n calls the pow function.

Bye,
bearophile
November 28, 2011
> Here those square roots are parallelizable: the compiler is allowed to use the SSE2 sqrtpd instruction to perform those 10 sqrt(double) operations with 5 instructions. With the ymm registers of AVX, the VSQRTPD instruction (intrinsic _mm256_sqrt_pd in lesser languages) does 4 double square roots at a time. But maybe the array's starting location needs to be aligned to 16 bytes (not currently supported syntax):

The 32-bit assembly produced by the Intel Fortran compiler for that code is heavily optimized and fully inlined: http://codepad.org/h1ilZWVu

It uses only serial square roots (sqrtsd), so the performance improvement has other causes that I don't know. This also probably means the Fortran version is not the fastest version possible.

Bye,
bearophile