May 30, 2014 Performance
I made the following performance test, which adds 10^9 doubles, on Linux with the latest DMD compiler from the Eclipse IDE and with the GDC compiler, also on Linux. Then the same test was done with C++ on Linux and with Scala in the Java ecosystem on Linux. All the testing was done on the same PC.
The results for one addition are:
D-DMD: 3.1 nanoseconds
D-GDC: 3.8 nanoseconds
C++: 1.0 nanoseconds
Scala: 1.0 nanoseconds
D-Source:
import std.stdio;
import std.datetime;
import std.string;
import core.time;

void main() {
    run!(plus)( 1000*1000*1000 );
}

class C {
}

string plus( int steps ) {
    double sum = 1.346346;
    immutable double p0 = 0.0045;
    immutable double p1 = 1.00045452 - p0;
    auto b = true;
    for( int i = 0; i < steps; i++ ) {
        switch( b ) {
            case true:
                sum += p0;
                break;
            default:
                sum += p1;
                break;
        }
        b = !b;
    }
    return format("%s %f", "plus\nLast: ", sum);
    // return ("plus\nLast: ", sum);
}

void run( alias func )( int steps )
if( is(typeof(func(steps)) == string) ) {
    auto begin = Clock.currStdTime();
    string output = func( steps );
    auto end = Clock.currStdTime();
    double nanotime = toNanos(end - begin) / steps;
    writeln( output );
    writeln( "Time per op: ", nanotime );
    writeln();
}

double toNanos( long hns ) { return hns * 100.0; }
Compiler settings for D:
dmd -c -of.dub/build/application-release-nobounds-linux.posix-x86-dmd-DF74188E055ED2E8ADD9C152107A632F/first.o -release -inline -noboundscheck -O -w -version=Have_first -Isource source/perf/testperf.d
gdc ./source/perf/testperf.d -frelease -o testperf
So what is the problem? Are the compiler switches wrong? Or is D with these compilers really that slow? Can you help me?
Thomas
May 30, 2014 Re: Performance
Posted in reply to Thomas

On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
> return (format("%s %f","plus\nLast: ", sum) );

I haven't actually run this, but my guess is that the format function is the slowish thing here. Did you create a new string in the C++ version too?

> gdc ./source/perf/testperf.d -frelease -o testperf

The -O3 switch might help too, which turns on optimizations.
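A minimal sketch of that suggestion (the plusOnly helper below is hypothetical, not from the thread): the string is only built after the clock is stopped, so the measurement covers the additions alone.

import std.stdio;
import std.string;
import std.datetime;

// Hypothetical variant of plus() that returns the raw sum; formatting is
// deliberately left out so it cannot affect the timing.
double plusOnly(int steps) {
    double sum = 1.346346;
    immutable double p0 = 0.0045;
    immutable double p1 = 1.00045452 - p0;
    auto b = true;
    for (int i = 0; i < steps; i++) {
        sum += b ? p0 : p1;   // same alternating additions as the original switch
        b = !b;
    }
    return sum;
}

void main() {
    enum steps = 1000 * 1000 * 1000;
    auto begin = Clock.currStdTime();          // hectonanosecond ticks
    immutable sum = plusOnly(steps);
    auto end = Clock.currStdTime();
    writeln(format("plus\nLast: %f", sum));    // string building happens outside the timed region
    writeln("Time per op: ", (end - begin) * 100.0 / steps, " ns");
}

Whether this changes the measured numbers noticeably would need checking; the format call only runs once per test, so it may not be the whole story.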
May 30, 2014 Re: Performance
Posted in reply to Thomas

On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
> I made the following performance test, which adds 10^9 doubles, on Linux with the latest DMD compiler from the Eclipse IDE and with the GDC compiler, also on Linux. Then the same test was done with C++ on Linux and with Scala in the Java ecosystem on Linux. All the testing was done on the same PC.
> The results for one addition are:
>
> D-DMD: 3.1 nanoseconds
> D-GDC: 3.8 nanoseconds
> C++: 1.0 nanoseconds
> Scala: 1.0 nanoseconds
>
> D-Source: [...]
>
> Compiler settings for D: [...]
>
> So what is the problem? Are the compiler switches wrong? Or is D with these compilers really that slow? Can you help me?

Sources and command lines for the other languages would be nice for comparison.
May 30, 2014 Re: Performance
Posted in reply to Thomas

On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
> I made the following performance test, which adds 10^9 doubles, on Linux with the latest DMD compiler from the Eclipse IDE and with the GDC compiler, also on Linux. Then the same test was done with C++ on Linux and with Scala in the Java ecosystem on Linux. All the testing was done on the same PC.
> The results for one addition are:
>
> D-DMD: 3.1 nanoseconds
> D-GDC: 3.8 nanoseconds
> C++: 1.0 nanoseconds
> Scala: 1.0 nanoseconds
Your code written in a more idiomatic way (I have commented out new language features):
import std.stdio, std.datetime;

double plus(in uint nSteps) pure nothrow @safe /*@nogc*/ {
    enum double p0 = 0.0045;
    enum double p1 = 1.00045452 - p0;

    double tot = 1.346346;
    auto b = true;

    foreach (immutable i; 0 .. nSteps) {
        final switch (b) {
            case true:
                tot += p0;
                break;
            case false:
                tot += p1;
                break;
        }

        b = !b;
    }

    return tot;
}

void run(alias func, string funcName)(in uint nSteps) {
    StopWatch sw;
    sw.start;
    immutable result = func(nSteps);
    sw.stop;

    writeln(funcName);
    writefln("Last: %f", result);
    //writeln("Time per op: ", sw.peek.nsecs / real(nSteps));
    writeln("Time per op: ", sw.peek.nsecs / cast(real)nSteps);
}

void main() {
    run!(plus, "plus")(1_000_000_000U);
}
(But there is also a benchmark helper available; see the sketch below.)
ldmd2 -O -release -inline -noboundscheck test.d
Using the LDC2 compiler, on my system the output is:
plus
Last: 500227252.496398
Time per op: 9.41424
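As a rough sketch, the measurement could also go through Phobos' benchmark helper (assuming the std.datetime.benchmark API; the plusWrapper function below is hypothetical):

import std.stdio, std.datetime;

double lastResult;  // stored so the optimizer cannot simply drop the loop

// Zero-argument wrapper so the work can be passed to benchmark.
void plusWrapper() {
    enum double p0 = 0.0045;
    enum double p1 = 1.00045452 - p0;
    double tot = 1.346346;
    auto b = true;
    foreach (immutable i; 0 .. 1_000_000U) {
        tot += b ? p0 : p1;
        b = !b;
    }
    lastResult = tot;
}

void main() {
    enum nRuns = 100;
    const times = benchmark!plusWrapper(nRuns);   // one TickDuration per benchmarked function
    writeln("Average ns per call: ", times[0].nsecs / cast(real)nRuns);
}

The figures it reports should line up with the StopWatch version above; it mostly saves the boilerplate.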
Bye,
bearophile
May 30, 2014 Re: Performance
Posted in reply to bearophile

> double plus(in uint nSteps) pure nothrow @safe /*@nogc*/ {
> enum double p0 = 0.0045;
> enum double p1 = 1.00045452-p0;
>
> double tot = 1.346346;
> auto b = true;
>
> foreach (immutable i; 0 .. nSteps) {
> final switch (b) {
> case true:
> tot += p0;
> break;
> case false:
> tot += p1;
> break;
> }
>
> b = !b;
> }
>
> return tot;
> }
And this is the 32-bit x86 asm generated by ldc2 for the plus function:
__D4test4plusFNaNbNfxkZd:
pushl %ebp
movl %esp, %ebp
pushl %esi
andl $-8, %esp
subl $24, %esp
movsd LCPI0_0, %xmm0
testl %eax, %eax
je LBB0_8
xorl %ecx, %ecx
movb $1, %dl
movsd LCPI0_1, %xmm1
movsd LCPI0_2, %xmm2
.align 16, 0x90
LBB0_2:
testb $1, %dl
jne LBB0_3
addsd %xmm1, %xmm0
jmp LBB0_7
.align 16, 0x90
LBB0_3:
movzbl %dl, %esi
andl $1, %esi
je LBB0_5
addsd %xmm2, %xmm0
LBB0_7:
xorb $1, %dl
incl %ecx
cmpl %eax, %ecx
jb LBB0_2
LBB0_8:
movsd %xmm0, 8(%esp)
fldl 8(%esp)
leal -4(%ebp), %esp
popl %esi
popl %ebp
ret
LBB0_5:
movl $11, 4(%esp)
movl $__D4test12__ModuleInfoZ, (%esp)
calll __d_switch_error
Bye,
bearophile
May 30, 2014 Re: Performance
Posted in reply to bearophile

This C++ code:
double plus(const unsigned int nSteps) {
    const double p0 = 0.0045;
    const double p1 = 1.00045452 - p0;
    double tot = 1.346346;
    bool b = true;

    for (unsigned int i = 0; i < nSteps; i++) {
        switch (b) {
            case true:
                tot += p0;
                break;
            case false:
                tot += p1;
                break;
        }
        b = !b;
    }

    return tot;
}
G++ 4.8.0 gives this asm (using -Ofast, which implies unsafe FP optimizations):
__Z4plusj:
movl 4(%esp), %ecx
testl %ecx, %ecx
je L7
fldl LC0
xorl %edx, %edx
movl $1, %eax
fldl LC2
jmp L6
.p2align 4,,7
L11:
fxch %st(1)
addl $1, %edx
xorl $1, %eax
cmpl %ecx, %edx
faddl LC1
je L12
fxch %st(1)
L6:
cmpb $1, %al
je L11
addl $1, %edx
xorl $1, %eax
cmpl %ecx, %edx
fadd %st, %st(1)
jne L6
fstp %st(0)
jmp L10
.p2align 4,,7
L12:
fstp %st(1)
L10:
rep ret
L7:
fldl LC0
ret
Bye,
bearophile
May 30, 2014 Re: Performance
Posted in reply to Thomas

On Fri, 2014-05-30 at 13:35 +0000, Thomas via Digitalmars-d wrote:
> I made the following performance test, which adds 10^9 doubles, on Linux with the latest DMD compiler from the Eclipse IDE and with the GDC compiler, also on Linux. Then the same test was done with C++ on Linux and with Scala in the Java ecosystem on Linux. All the testing was done on the same PC.
> The results for one addition are:
>
> D-DMD: 3.1 nanoseconds
> D-GDC: 3.8 nanoseconds
> C++: 1.0 nanoseconds
> Scala: 1.0 nanoseconds

A priori I would believe there is a problem with these numbers: my experience of CPU-bound D code is that it is generally as fast as C++.

[…]

> Compiler settings for D:
>
> dmd -c -of.dub/build/application-release-nobounds-linux.posix-x86-dmd-DF74188E055ED2E8ADD9C152107A632F/first.o -release -inline -noboundscheck -O -w -version=Have_first -Isource source/perf/testperf.d
>
> gdc ./source/perf/testperf.d -frelease -o testperf
>
> So what is the problem? Are the compiler switches wrong? Or is D with these compilers really that slow? Can you help me?

What is the C++ code you compare against? What is the Scala code you compare against? Did you try Java and static Groovy as well? What command lines did you use for the generation of all the binaries?

Without the data to compare, it is hard to compare and help. One obvious thing though: the gdc command line has no optimization turned on; you probably want -O3 or at least -O2 there.

--
Russel.
=============================================================================
Dr Russel Winder                  t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road                m: +44 7770 465 077   xmpp: russel@winder.org.uk
London SW11 1EN, UK               w: www.russel.org.uk   skype: russel_winder
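For illustration, an optimized GDC invocation along those lines could look roughly like this (the exact flag combination is an assumption based on GCC conventions, not something tested in the thread):

gdc -O3 -frelease ./source/perf/testperf.d -o testperf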
May 30, 2014 Re: Performance
Posted in reply to bearophile

On 5/30/2014 9:30 AM, bearophile wrote:
>> double plus(in uint nSteps) pure nothrow @safe /*@nogc*/ {
>> enum double p0 = 0.0045;
>> enum double p1 = 1.00045452-p0;
>>
>> double tot = 1.346346;
>> auto b = true;
>>
>> foreach (immutable i; 0 .. nSteps) {
>> final switch (b) {
>> case true:
>> tot += p0;
>> break;
>> case false:
>> tot += p1;
>> break;
>> }
>>
>> b = !b;
>> }
>>
>> return tot;
>> }
>
> And this is the 32 bit X86 asm generated by ldc2 for the plus function:
>
> __D4test4plusFNaNbNfxkZd:
> pushl %ebp
> movl %esp, %ebp
> pushl %esi
> andl $-8, %esp
> subl $24, %esp
> movsd LCPI0_0, %xmm0
> testl %eax, %eax
> je LBB0_8
> xorl %ecx, %ecx
> movb $1, %dl
> movsd LCPI0_1, %xmm1
> movsd LCPI0_2, %xmm2
> .align 16, 0x90
> LBB0_2:
> testb $1, %dl
> jne LBB0_3
> addsd %xmm1, %xmm0
> jmp LBB0_7
> .align 16, 0x90
> LBB0_3:
> movzbl %dl, %esi
> andl $1, %esi
> je LBB0_5
> addsd %xmm2, %xmm0
> LBB0_7:
> xorb $1, %dl
> incl %ecx
> cmpl %eax, %ecx
> jb LBB0_2
> LBB0_8:
> movsd %xmm0, 8(%esp)
> fldl 8(%esp)
> leal -4(%ebp), %esp
> popl %esi
> popl %ebp
> ret
> LBB0_5:
> movl $11, 4(%esp)
> movl $__D4test12__ModuleInfoZ, (%esp)
> calll __d_switch_error
>
> Bye,
> bearophile
Well, I'd argue that in fact neither the C++ nor the D compiler generated the fastest possible code here, as this code will result in at least 3, likely more, potentially even every, branch being mispredicted. After checking the throughput numbers for fadd (I only checked Haswell), I would argue that the fastest code here would actually compute both sides of the branch and use a set of 4 cmovs (since this is x86 and we're working with doubles) to determine which result to use going forward.
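As a rough sketch of that idea in D, here is a branchless variant that selects the addend with an array lookup rather than the literal cmov sequence described above (whether the backend then emits cmov/blend instructions is up to the compiler; the plusBranchless name is hypothetical):

double plusBranchless(in uint nSteps) pure nothrow @safe {
    // inc[0] is added on the 'false' iterations, inc[1] on the 'true' ones.
    static immutable double[2] inc = [1.00045452 - 0.0045, 0.0045];
    double tot = 1.346346;
    bool b = true;
    foreach (immutable i; 0 .. nSteps) {
        tot += inc[b];   // bool converts to 0/1, so no data-dependent branch in the source
        b = !b;
    }
    return tot;
}

Whether this actually beats the switch version would need measuring; it merely removes the data-dependent branch from the source.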
May 30, 2014 Re: Performance
Posted in reply to Russel Winder

Russel Winder:
> A priori I would believe there is a problem with these numbers: my
> experience of CPU-bound D code is that it is generally as fast as C++.
The C++ code I've shown above, if compiled with -Ofast, seems faster than the D code compiled with ldc2.
Bye,
bearophile