November 25, 2011 Request for timings
This is the nbody benchmark of the Shootout site: http://shootout.alioth.debian.org/u32/performance.php?test=nbody

The fastest version is a Fortran one, probably thanks to vector operations that allow a better SIMD vectorization.

This is the C++ version: http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gpp&id=1

C++ version compiled with:
g++ -Ofast -fomit-frame-pointer -march=native -mfpmath=sse -msse3 --std=c++0x

An input parameter is n = 3_000_000 (but use a larger number if the run time is too small, like 10 million or more).

First D2 version (serial):
http://codepad.org/AdRSm2wP

Second D2 version, three times slower because of vector ops, more similar to the Fortran version: http://codepad.org/7O3mz9en

Is someone willing to take two (or more) timings using the LDC 2 compiler (I have LDC 1 only, besides DMD)? I'd like to know how much time it takes to run the first D version *compared* to the C++ version :-) If you time the second D2 version too, even better.

Bye and thank you,
bearophile
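For readers without the codepad links handy, the hot loop of this benchmark has roughly the following shape. This is a simplified sketch based on the standard Shootout nbody algorithm, not the exact code of the linked D versions; the `dSquared`/`mag` line is the one the rest of the thread rewrites in different ways:

```d
import std.math : sqrt;

struct Body { double x, y, z, vx, vy, vz, mass; }

// One integration step of the pairwise gravity update,
// the hot loop the rest of the thread optimizes.
void advance(Body[] bodies, double dt) {
    foreach (i, ref a; bodies) {
        foreach (ref b; bodies[i + 1 .. $]) {
            immutable dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
            immutable dSquared = dx * dx + dy * dy + dz * dz;
            // The line the later posts rewrite in different ways:
            immutable mag = dt / (dSquared * sqrt(dSquared));
            a.vx -= dx * b.mass * mag;  a.vy -= dy * b.mass * mag;  a.vz -= dz * b.mass * mag;
            b.vx += dx * a.mass * mag;  b.vy += dy * a.mass * mag;  b.vz += dz * a.mass * mag;
        }
    }
    foreach (ref p; bodies) {
        p.x += dt * p.vx;  p.y += dt * p.vy;  p.z += dt * p.vz;
    }
}
```

With n = 3_000_000 steps and 5 bodies, almost all of the run time is spent in the inner pairwise loop, which is why a small change to the `mag` expression is measurable.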
November 25, 2011 Re: Request for timings

Posted in reply to bearophile
Hmm, reading the code I verified that I can gain 10% by writing:
imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
imd mag = dt / (dSquared * dSquared^^2);
Around line 115. Give it a try...
On Thu, 24/11/2011 at 22:37 -0500, bearophile wrote:
> This is the nbody benchmark of the Shootout site: http://shootout.alioth.debian.org/u32/performance.php?test=nbody
>
> The faster version is a Fortran one, probably thanks to vector operations that allow a better SIMD vectorization.
>
> This is the C++ version: http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gpp&id=1
>
> C++ version compiled with:
> g++ -Ofast -fomit-frame-pointer -march=native -mfpmath=sse -msse3 --std=c++0x
> An input parameter is n = 3_000_000 (but use a larger number if the run time is too small, like 10 million or more).
>
> First D2 version (serial):
> http://codepad.org/AdRSm2wP
>
> Second D2 version, three times slower because of vector ops, more similar to the Fortran version: http://codepad.org/7O3mz9en
>
> Is someone willing to take two (or more) timings using LDC 2 compiler (I have LDC1 only, beside DMD)? I'd like to know how much time it takes to run the first D version *compared* to the C++ version :-) If you time the second D2 version too, then it's better.
>
> Bye and thank you,
> bearophile
November 25, 2011 Re: Request for timings

Posted in reply to Andrea Fontana

Andrea Fontana:
> Hmm, reading the code I verified that I can gain 10% by writing:
>
> imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
> imd mag = dt / (dSquared * dSquared^^2);
>
> Around line 115. Give it a try...
My version performs a sqrt and one multiplication, while your version performs one sqrt and two multiplications. On my PC my version is faster.
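For what it's worth, the two formulations compute the same quantity, dt / distance^3, so any speed difference comes only from how the compiler schedules the operations. A small self-check (the input values here are made up, not from the benchmark):

```d
import std.math : abs, sqrt;

void main() {
    // Made-up values; in the benchmark these are coordinate deltas and the timestep.
    double dx = 1.5, dy = 2.0, dz = 0.5, dt = 0.01;

    // bearophile's form: one sqrt, one extra multiplication inside the division.
    double dSquared = dx * dx + dy * dy + dz * dz;
    double mag1 = dt / (dSquared * sqrt(dSquared));

    // Andrea's form: sqrt first, then cube the resulting distance.
    double d = sqrt(dx * dx + dy * dy + dz * dz);
    double mag2 = dt / (d * d * d);

    assert(abs(mag1 - mag2) < 1e-12); // both are dt / distance^3
}
```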
Bye,
bearophile
November 25, 2011 Re: Request for timings

Posted in reply to bearophile

Really? Which DMD version and system? (Here: DMD from git, Ubuntu 11.10, 64-bit.)
./app 50000000
Your version:
real 0m14.674s
My version:
real 0m13.644s
Your version:
imd dSquared = dx ^^ 2 + dy ^^ 2 + dz ^^ 2;
imd mag = dt / (dSquared * sqrt(dSquared));
My version:
imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
imd mag = dt / (dSquared * dSquared^^2); // That is: dt / (dSquared^^3);
Probably variable evaluation works better with my example?
Btw:
imd dSquared = sqrt(dx*dx + dy*dy + dz*dz);
imd mag = dt / (dSquared*dSquared*dSquared);
real 0m13.574s
On Fri, 25/11/2011 at 08:14 -0500, bearophile wrote:
> Andrea Fontana:
>
> > Hmm, reading the code I verified that I can gain 10% by writing:
> >
> > imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
> > imd mag = dt / (dSquared * dSquared^^2);
> >
> > Around line 115. Give it a try...
>
> My version performs a sqrt and one multiplication, while your version performs one sqrt and two multiplications. On my PC my version is faster.
>
> Bye,
> bearophile
November 26, 2011 Re: Request for timings

Posted in reply to Andrea Fontana

Andrea Fontana wrote:

Yes, really :-) Timings taken with DMD 2.056+, 32-bit Vista OS.

Bye,
bearophile
November 26, 2011 Re: Request for timings
Posted in reply to bearophile

bearophile <bearophileHUGS@lycos.com> writes:

> This is the nbody benchmark of the Shootout site: http://shootout.alioth.debian.org/u32/performance.php?test=nbody
>
> The faster version is a Fortran one, probably thanks to vector operations that allow a better SIMD vectorization.
>
> This is the C++ version: http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gpp&id=1
>
> C++ version compiled with:
> g++ -Ofast -fomit-frame-pointer -march=native -mfpmath=sse -msse3 --std=c++0x
> An input parameter is n = 3_000_000 (but use a larger number if the run time is too small, like 10 million or more).

All timings done with gdc 0.30 using dmd 2.055 and gcc 4.6.2. I built with both D and C++ enabled so the back end would be the same.

jlquinn@wyvern:~/d/tests$ ~/gcc/gdc/bin/g++ -O3 -fomit-frame-pointer -march=native -lm -mfpmath=sse -msse3 --std=c++0x nbody.cc -o nbody_c++
jlquinn@wyvern:~/d/tests$ time ./nbody_c++ 50000000
-0.169075164
-0.169059907

real 0m10.209s
user 0m10.180s
sys 0m0.010s

> First D2 version (serial):
> http://codepad.org/AdRSm2wP

~/gcc/gdc/bin/gdc -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3 -frelease nbody.d -o nbody_d
jlquinn@wyvern:~/d/tests$ time ./nbody_d 50000000
-0.169075164
-0.169059907

real 0m9.830s
user 0m9.820s
sys 0m0.000s

jlquinn@wyvern:~/d/tests$ dmd -O -release nbody.d
jlquinn@wyvern:~/d/tests$ time ./nbody_d 50000000
-0.169075164
-0.169059907

real 0m9.828s
user 0m9.830s
sys 0m0.000s

> Second D2 version, three times slower because of vector ops, more similar to the Fortran version: http://codepad.org/7O3mz9en

~/gcc/gdc/bin/gdc -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3 -frelease nbody2.d -o nbody2_d
jlquinn@wyvern:~/d/tests$ time ./nbody2_d 50000000
-0.169075164
-0.169059907

real 0m26.805s
user 0m26.760s
sys 0m0.020s

jlquinn@wyvern:~/d/tests$ dmd -O -release nbody2.d
jlquinn@wyvern:~/d/tests$ time ./nbody2_d 50000000
-0.169075164
-0.169059907

real 0m26.777s
user 0m26.760s
sys 0m0.000s

> Is someone willing to take two (or more) timings using the LDC 2 compiler (I have LDC 1 only, besides DMD)? I'd like to know how much time it takes to run the first D version *compared* to the C++ version :-) If you time the second D2 version too, then it's better.
>
> Bye and thank you,
> bearophile

So the upshot seems to be that DMD and GDC generate similar code for this test. And both D compilers generate slightly faster code than the C++ version; therefore the D front end is doing a slightly better optimization job, or your first version is slightly more efficient code.

Jerry
November 26, 2011 Re: Request for timings
Posted in reply to Jerry

Thank you for your surprising timings, Jerry.

> All timings done with gdc 0.30 using dmd 2.055 and gcc 4.6.2. I built with both D and C++ enabled so the back end would be the same.

Is your system a 64-bit one?

> So, the upshot seems like DMD and GDC generate similar code for this test.

This is an uncommon thing, especially on 32-bit systems.

> And both D compilers generate slightly faster code than the C++ version, therefore the D front end is doing a slightly better optimization job, or your first version is slightly more efficient code.

D1 code is often a bit slower than similar C++ code, but in this case I think D2 has allowed me to specify more semantics, and that has produced a faster program. The static foreach I have used in that D2 code not only looks cleaner than those C++0x template tricks, but its assembly output is also better.

And the first D2 program is not even the fastest possible: the second D2 program is slow today, but it contains some more semantics that hopefully someday will allow it to become faster than the first one, and about as fast as that Fortran version. This code currently doesn't compile (no vector ^^, no vector sqrt, no good sum function):

immutable double[NPAIR] distance = sqrt(sum(r[] ^^ 2, dim=0));

It is currently implemented like this:

double[NPAIR] distance = void;
foreach (i; Iota!(0, NPAIR))
    distance[i] = sqrt(r[i][0] ^^ 2 + r[i][1] ^^ 2 + r[i][2] ^^ 2);

Here those square roots are parallelizable: the compiler is allowed to use the SSE2 sqrtpd instruction to perform those 10 sqrt(double) with 5 instructions. With the ymm registers of AVX, the VSQRTPD instruction (intrinsic _mm256_sqrt_pd in lesser languages) does 4 double square roots at a time. But maybe its starting location needs to be aligned to 16 bytes (not currently supported syntax):

align(16) immutable double[NPAIR] distance = sqrt(sum(r[] ^^ 2, dim=0));

Bye,
bearophile
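The Iota!(0, NPAIR) above is a compile-time tuple of integers, so the foreach over it is fully unrolled at compile time. A minimal sketch of how such a template can be written (the actual definition in the benchmark code may differ; 2011-era Phobos spelled the building block std.typetuple.TypeTuple rather than std.meta.AliasSeq):

```d
import std.meta : AliasSeq; // 2011-era Phobos: std.typetuple.TypeTuple

// A minimal compile-time integer range [start, stop).
template Iota(int start, int stop) {
    static if (start >= stop)
        alias Iota = AliasSeq!();
    else
        alias Iota = AliasSeq!(start, Iota!(start + 1, stop));
}

void main() {
    double[3] sq;
    // Because Iota!(0, 3) is a compile-time tuple, this foreach is fully
    // unrolled: each i is a constant, and no loop counter survives at run time.
    foreach (i; Iota!(0, 3))
        sq[i] = i * i;
    assert(sq == [0.0, 1.0, 4.0]);
}
```

The unrolling is what lets the compiler see all ten independent sqrt calls at once, which is a precondition for packing them into sqrtpd instructions.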
November 26, 2011 Re: Request for timings

Posted in reply to bearophile

Maybe Jerry could test my edits and check the timings...
About "^^": have you tried to benchmark
1/(x ^^ n) vs x ^^ -n ? (with n>0)
On my setup the second version is very, very slow.
== Quoted from the article by bearophile (bearophileHUGS@lycos.com)
> Andrea Fontana Wrote:
> Yes, really :-) Timings taken with DMD 2.056+, 32 bit Vista OS.
> Bye,
> bearophile
November 26, 2011 Re: Request for timings

Posted in reply to Andrea Fontana

Andrea Fontana:
> About "^^": have you tried to benchmark
>
> 1/(x ^^ n) vs x ^^ -n ? (with n>0)
>
> On my setup the second version is very, very slow.
Take a look at the produced assembly code :-)
Maybe x ^^ n is rewritten as simple inline code, while x ^^ -n calls the pow function.
Bye,
bearophile
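A minimal timing sketch of the comparison Andrea describes (an assumed harness, not code from the thread; absolute numbers depend entirely on machine, compiler, and flags):

```d
import std.datetime.stopwatch : StopWatch; // 2011-era Phobos used std.datetime
import std.math : abs;
import std.stdio : writeln;

void main() {
    enum N = 1_000_000;
    double s1 = 0, s2 = 0;

    StopWatch sw;
    sw.start();
    foreach (i; 1 .. N)
        s1 += 1.0 / ((cast(double)i) ^^ 3); // positive literal exponent: can lower to x*x*x
    sw.stop();
    writeln("1/(x ^^ 3): ", sw.peek.total!"usecs", " us");

    sw.reset();
    sw.start();
    foreach (i; 1 .. N)
        s2 += (cast(double)i) ^^ -3; // negative exponent: typically a std.math.pow call
    sw.stop();
    writeln("x ^^ -3:    ", sw.peek.total!"usecs", " us");

    assert(abs(s1 - s2) < 1e-6); // same mathematical result either way
}
```

If the second loop is much slower, inspecting the assembly (as suggested above) should show a pow call in it and plain multiplications in the first.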
November 28, 2011 Re: Request for timings

Posted in reply to bearophile

> Here those square roots are parallelizable: the compiler is allowed to use the SSE2 sqrtpd instruction to perform those 10 sqrt(double) with 5 instructions. With the ymm registers of AVX, the VSQRTPD instruction (intrinsic _mm256_sqrt_pd in lesser languages) does 4 double square roots at a time. But maybe its starting location needs to be aligned to 16 bytes (not currently supported syntax):

The 32-bit assembly produced by the Intel Fortran compiler on that code is heavily optimized and fully inlined: http://codepad.org/h1ilZWVu

It uses only serial square roots (sqrtsd), so the performance improvement has other causes that I don't know. This also probably means the Fortran version is not the fastest possible.

Bye,
bearophile
Copyright © 1999-2021 by the D Language Foundation