Performance

May 30, 2014
I made the following performance test, which adds 10^9 doubles, on Linux with the latest DMD compiler (run from the Eclipse IDE) and with the GDC compiler, also on Linux. Then the same test was done with C++ on Linux and with Scala on the JVM on Linux. All the testing was done on the same PC.
The results for one addition are:

D-DMD: 3.1 nanoseconds
D-GDC: 3.8 nanoseconds
C++: 1.0 nanoseconds
Scala: 1.0 nanoseconds


D-Source:

import std.stdio;
import std.datetime;
import std.string;
import core.time;


void main() {
  run!(plus)( 1000*1000*1000 );
}

class C {
}

string plus( int steps ) {
  double sum = 1.346346;
  immutable double p0 = 0.0045;
  immutable double p1 = 1.00045452 - p0;
  auto b = true;
  for( int i = 0; i < steps; i++ ) {
    switch( b ) {
    case true:
      sum += p0;
      break;
    default:
      sum += p1;
      break;
    }
    b = !b;
  }
  return format("%s  %f", "plus\nLast: ", sum);
}


void run( alias func )( int steps )
  if( is(typeof(func(steps)) == string) ) {
  auto begin = Clock.currStdTime();
  string output = func( steps );
  auto end = Clock.currStdTime();
  double nanotime = toNanos(end - begin) / steps;
  writeln( output );
  writeln( "Time per op: ", nanotime );
  writeln();
}

double toNanos( long hns ) { return hns * 100.0; }


Compiler settings for D:

dmd -c -of.dub/build/application-release-nobounds-linux.posix-x86-dmd-DF74188E055ED2E8ADD9C152107A632F/first.o -release -inline -noboundscheck -O -w -version=Have_first -Isource source/perf/testperf.d

gdc ./source/perf/testperf.d -frelease -o testperf

So what is the problem? Are the compiler switches wrong? Or is D that slow with these compilers? Can you help me?


Thomas


May 30, 2014
On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
>   return (format("%s  %f","plus\nLast: ", sum) );

I haven't actually run this but my guess is that the format function is the slowish thing here. Did you create a new string in the C version too?

> gdc ./source/perf/testperf.d -frelease -o testperf

The -O3 switch, which turns on optimizations, might help too.
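One way to check the first guess is to keep the string building outside the timed region entirely. A minimal sketch, with an invented name (`plusOnly`), a reduced step count, and the switch collapsed to a ternary purely for brevity:

```d
import std.datetime : Clock;
import std.stdio : writeln;
import std.string : format;

// Same alternating-add loop as the original, but returning the raw double
// instead of a formatted string, so no string work happens inside the timing.
double plusOnly(int steps) {
    immutable double p0 = 0.0045;
    immutable double p1 = 1.00045452 - p0;
    double sum = 1.346346;
    bool b = true;
    foreach (i; 0 .. steps) {
        sum += b ? p0 : p1;
        b = !b;
    }
    return sum;
}

void main() {
    immutable steps = 1000 * 1000;
    immutable begin = Clock.currStdTime();
    immutable sum = plusOnly(steps);
    immutable end = Clock.currStdTime();
    // Build the output string only after the clock has stopped.
    writeln(format("plus\nLast: %f", sum));
    writeln("Time per op (ns): ", (end - begin) * 100.0 / steps);
}
```

If the numbers stay the same with format() out of the loop's timed region, the string building was not the culprit.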
May 30, 2014
On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
> I made the following performance test, which adds 10^9 Double’s on Linux with the latest dmd compiler in the Eclipse IDE and with the Gdc-Compiler also on Linux. Then the same test was done with C++ on Linux and with Scala in the Java ecosystem on Linux. All the testing was done on the same PC.
> The results for one addition are:
>
> D-DMD: 3.1 nanoseconds
> D-GDC: 3.8 nanoseconds
> C++: 1.0 nanoseconds
> Scala: 1.0 nanoseconds
>
>
> D-Source:
[...]
> Compiler settings for D:
[...]
> So what is the problem ? Are the compiler switches wrong ? Or is D on the used compilers so slow ? Can you help me.

Sources and command lines for the other languages would be nice
for comparison.
May 30, 2014
On Friday, 30 May 2014 at 13:35:59 UTC, Thomas wrote:
> I made the following performance test, which adds 10^9 Double’s on Linux with the latest dmd compiler in the Eclipse IDE and with the Gdc-Compiler also on Linux. Then the same test was done with C++ on Linux and with Scala in the Java ecosystem on Linux. All the testing was done on the same PC.
> The results for one addition are:
>
> D-DMD: 3.1 nanoseconds
> D-GDC: 3.8 nanoseconds
> C++: 1.0 nanoseconds
> Scala: 1.0 nanoseconds

Your code written in a more idiomatic way (I have commented out new language features):


import std.stdio, std.datetime;

double plus(in uint nSteps) pure nothrow @safe /*@nogc*/ {
    enum double p0 = 0.0045;
    enum double p1 = 1.00045452-p0;

    double tot = 1.346346;
    auto b = true;

    foreach (immutable i; 0 .. nSteps) {
        final switch (b) {
            case true:
                tot += p0;
                break;
            case false:
                tot += p1;
                break;
        }

        b = !b;
    }

    return tot;
}

void run(alias func, string funcName)(in uint nSteps) {
    StopWatch sw;
    sw.start;
    immutable result = func(nSteps);
    sw.stop;
    writeln(funcName);
    writefln("Last: %f", result);
    //writeln("Time per op: ", sw.peek.nsecs / real(nSteps));
    writeln("Time per op: ", sw.peek.nsecs / cast(real)nSteps);
}

void main() {
    run!(plus, "plus")(1_000_000_000U);
}

(But there is also a benchmark helper around).

ldmd2 -O -release -inline -noboundscheck test.d

Using LDC2 compiler, on my system the output is:

plus
Last: 500227252.496398
Time per op: 9.41424
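The benchmark helper mentioned above is std.datetime.benchmark (in 2014-era Phobos; it later moved to std.datetime.stopwatch). A rough sketch of how it could wrap the function here, with an arbitrary repetition count, a shrunk stand-in for the plus() above, and a module-level sink so the optimizer cannot drop the call (sink and timed are invented names):

```d
import std.datetime : benchmark;
import std.stdio : writefln;

double sink; // keeps the result observable so the call is not elided

// Stand-in for the plus() defined above, shrunk for illustration;
// b == true exactly on the even iterations, so the index parity suffices.
double plus(in uint nSteps) pure nothrow @safe {
    enum double p0 = 0.0045;
    enum double p1 = 1.00045452 - p0;
    double tot = 1.346346;
    foreach (immutable i; 0 .. nSteps)
        tot += (i % 2 == 0) ? p0 : p1;
    return tot;
}

void timed() { sink = plus(1_000_000); }

void main() {
    enum reps = 10;
    // benchmark returns one TickDuration per function passed to it.
    const results = benchmark!timed(reps);
    writefln("nsecs per call: %s", results[0].nsecs / reps);
}
```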

Bye,
bearophile
May 30, 2014
> double plus(in uint nSteps) pure nothrow @safe /*@nogc*/ {
>     enum double p0 = 0.0045;
>     enum double p1 = 1.00045452-p0;
>
>     double tot = 1.346346;
>     auto b = true;
>
>     foreach (immutable i; 0 .. nSteps) {
>         final switch (b) {
>             case true:
>                 tot += p0;
>                 break;
>             case false:
>                 tot += p1;
>                 break;
>         }
>
>         b = !b;
>     }
>
>     return tot;
> }

And this is the 32 bit X86 asm generated by ldc2 for the plus function:

__D4test4plusFNaNbNfxkZd:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%esi
	andl	$-8, %esp
	subl	$24, %esp
	movsd	LCPI0_0, %xmm0
	testl	%eax, %eax
	je	LBB0_8
	xorl	%ecx, %ecx
	movb	$1, %dl
	movsd	LCPI0_1, %xmm1
	movsd	LCPI0_2, %xmm2
	.align	16, 0x90
LBB0_2:
	testb	$1, %dl
	jne	LBB0_3
	addsd	%xmm1, %xmm0
	jmp	LBB0_7
	.align	16, 0x90
LBB0_3:
	movzbl	%dl, %esi
	andl	$1, %esi
	je	LBB0_5
	addsd	%xmm2, %xmm0
LBB0_7:
	xorb	$1, %dl
	incl	%ecx
	cmpl	%eax, %ecx
	jb	LBB0_2
LBB0_8:
	movsd	%xmm0, 8(%esp)
	fldl	8(%esp)
	leal	-4(%ebp), %esp
	popl	%esi
	popl	%ebp
	ret
LBB0_5:
	movl	$11, 4(%esp)
	movl	$__D4test12__ModuleInfoZ, (%esp)
	calll	__d_switch_error

Bye,
bearophile
May 30, 2014
This C++ code:

double plus(const unsigned int nSteps) {
    const double p0 = 0.0045;
    const double p1 = 1.00045452-p0;

    double tot = 1.346346;
    bool b = true;

    for (unsigned int i = 0; i < nSteps; i++) {
        switch (b) {
            case true:
                tot += p0;
                break;
            case false:
                tot += p1;
                break;
        }

        b = !b;
    }

    return tot;
}


G++ 4.8.0 gives this asm (using -Ofast, which implies unsafe FP optimizations):

__Z4plusj:
	movl	4(%esp), %ecx
	testl	%ecx, %ecx
	je	L7
	fldl	LC0
	xorl	%edx, %edx
	movl	$1, %eax
	fldl	LC2
	jmp	L6
	.p2align 4,,7
L11:
	fxch	%st(1)
	addl	$1, %edx
	xorl	$1, %eax
	cmpl	%ecx, %edx
	faddl	LC1
	je	L12
	fxch	%st(1)
L6:
	cmpb	$1, %al
	je	L11
	addl	$1, %edx
	xorl	$1, %eax
	cmpl	%ecx, %edx
	fadd	%st, %st(1)
	jne	L6
	fstp	%st(0)
	jmp	L10
	.p2align 4,,7
L12:
	fstp	%st(1)
L10:
	rep ret
L7:
	fldl	LC0
	ret

Bye,
bearophile
May 30, 2014
On Fri, 2014-05-30 at 13:35 +0000, Thomas via Digitalmars-d wrote:
> I made the following performance test, which adds 10^9 Double’s
> on Linux with the latest dmd compiler in the Eclipse IDE and with
> the Gdc-Compiler also on Linux. Then the same test was done with
> C++ on Linux and with Scala in the Java ecosystem on Linux. All
> the testing was done on the same PC.
> The results for one addition are:
> 
> D-DMD: 3.1 nanoseconds
> D-GDC: 3.8 nanoseconds
> C++: 1.0 nanoseconds
> Scala: 1.0 nanoseconds

A priori I would believe there is a problem with these numbers: my experience of CPU-bound D code is that it is generally as fast as C++.

[…]
> Compiler settings for D:
> 
> dmd -c -of.dub/build/application-release-nobounds-linux.posix-x86-dmd-DF74188E055ED2E8ADD9C152107A632F/first.o -release -inline -noboundscheck -O -w -version=Have_first -Isource source/perf/testperf.d
> 
> gdc ./source/perf/testperf.d -frelease -o testperf
> 
> So what is the problem ? Are the compiler switches wrong ? Or is D on the used compilers so slow ? Can you help me.

What is the C++ code you compare against?

What is the Scala code you compare against? Did you try Java and static Groovy as well?

What command lines did you use to generate all the binaries?

Without the code and data to compare against, it is hard to compare and help.

One obvious thing though: the gdc command line has no optimization turned on; you probably want -O3, or at least -O2, there.

-- 
Russel.
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

May 30, 2014
On 5/30/2014 9:30 AM, bearophile wrote:
>> double plus(in uint nSteps) pure nothrow @safe /*@nogc*/ {
>> [...]
>> }
>
> And this is the 32 bit X86 asm generated by ldc2 for the plus function:
>
> [...]


Well, I'd argue that in fact neither the C++ nor the D compiler generated the fastest possible code here, as this code will result in at least 3, likely more, potentially even every, branch being mispredicted. After checking the throughput numbers for fadd (I only checked Haswell), I would argue that the fastest code here would actually compute both sides of the branch and use a set of 4 cmov's (since this is x86 and we're working with doubles) to select the one we need going forward.
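Not the cmov rewrite, but a related observation: since b simply alternates, the branch can also be removed entirely by unrolling the loop by two. A hypothetical sketch (plusUnrolled is an invented name; the summation order matches the original, so the FP result is unchanged):

```d
// Branch-free variant of the benchmark loop: b alternates
// true, false, true, ... so each pair of iterations adds p0 then p1.
double plusUnrolled(in uint nSteps) pure nothrow @safe {
    enum double p0 = 0.0045;
    enum double p1 = 1.00045452 - p0;
    double tot = 1.346346;
    uint i = 0;
    for (; i + 2 <= nSteps; i += 2) {
        tot += p0; // the b == true iteration
        tot += p1; // the b == false iteration
    }
    if (i < nSteps) // odd step count: one trailing b == true iteration
        tot += p0;
    return tot;
}
```

This only removes the data-dependent control flow; the adds still form one dependency chain through tot, so whether it beats a cmov version would have to be measured.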
May 30, 2014
Russel Winder:

> A priori I would believe there is a problem with these numbers: my
> experience of CPU-bound D code is that it is generally as fast as C++.

The C++ code I've shown above, if compiled with -Ofast, seems faster than the D code compiled with ldc2.

Bye,
bearophile
May 30, 2014
On 5/30/2014 6:35 AM, Thomas wrote:
> So what is the problem ?

Usually, the problem will be obvious from looking at the generated assembler.
