February 15, 2008
downs wrote:
> The weird thing is: even if I inline the one spot where gdc ignores
> its opportunity to inline a function, so that I have the _same_
> call-counts as G++ (as measured with -g -pg), even then, the D code
> is slower. So it doesn't depend on missing inlining opportunities. Or
> am I missing something?

It's often worthwhile to run obj2asm on the output of each, and compare.
February 15, 2008
Another interesting observation.

If I change all my opFoo's to opFooAssign's, and use those instead, speed goes up from 16s to 13s; indicating that returning large structs (12 bytes/vector) causes a significant speed hit. Still not close to the C++ version though. The weird thing is that all those ops have been inlined (or so says the assembler dump). Weird.
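
For readers following along, here is a minimal sketch of the two operator forms being compared. It is not the actual benchmark code: the `Vec` struct and its three `double` fields are hypothetical stand-ins, and it uses the D1-era overload names from this thread (`opAdd`, `opAddAssign`); current D compilers spell these `opBinary!"+"` and `opOpAssign!"+"` instead.

```d
// Hypothetical 3-component vector used only for illustration.
struct Vec {
  double x, y, z;

  // opAdd builds a new struct and returns it by value.
  Vec opAdd(ref Vec o) {
    Vec r;
    r.x = x + o.x; r.y = y + o.y; r.z = z + o.z;
    return r;
  }

  // opAddAssign updates the receiver in place; nothing is returned.
  void opAddAssign(ref Vec o) {
    x += o.x; y += o.y; z += o.z;
  }
}

void main() {
  Vec a, b;
  a.x = 1; a.y = 2; a.z = 3;
  b.x = 4; b.y = 5; b.z = 6;

  Vec c = a + b; // rewritten by the compiler to a.opAdd(b)
  a += b;        // rewritten by the compiler to a.opAddAssign(b)

  // Both forms compute the same sums; only the way the result
  // is delivered (returned copy vs. in-place update) differs.
  assert(c.x == 5 && c.y == 7 && c.z == 9);
  assert(a.x == 5 && a.y == 7 && a.z == 9);
}
```

Whether the returned copy actually costs anything after inlining depends on the compiler; the timings in this post are the only evidence offered here.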

 --downs
February 15, 2008
downs wrote:
> Another interesting observation.
> 
> If I change all my opFoo's to opFooAssign's, and use those instead, speed goes up from 16s to 13s; indicating that returning large structs (12 bytes/vector) causes a significant speed hit. Still not close to the C++ version though. The weird thing is that all those ops have been inlined (or so says the assembler dump). Weird.
> 
>  --downs
Excuse me. 24 bytes.
February 15, 2008
Yet another observation: GDC's std.math functions still aren't being inlined properly, forcing me to use the intrinsics manually.

That didn't cause the speed difference though.

Still, it would be nice to see it fixed some time soon, seeing as I filed the bug in November :)

 --downs
February 15, 2008
downs wrote:
> Another interesting observation.
> 
> If I change all my opFoo's to opFooAssign's, and use those instead, speed goes up from 16s to 13s; indicating that returning large structs (12 bytes/vector) causes a significant speed hit. Still not close to the C++ version though. The weird thing is that all those ops have been inlined (or so says the assembler dump). Weird.
> 
>  --downs

Yeah, I was about to say the same.  See here:

http://paste.dprogramming.com/dpolmzhw

It's ugly, but no struct returning.

On my machine it's about a second slower than g++ (8.9s vs. 7.8s)
compiled via:

gdc -fversion=Posix -fversion=Tango -O3 -fomit-frame-pointer -fweb -frelease -finline-functions

and

g++ -O3 -fomit-frame-pointer -fweb -finline-functions

There are probably some other optimizations that could be made.  But really I think this comes down to the compiler not being as mature.  The stuff that I did should all be done by an optimizing compiler.  You're basically tricking the compiler into moving fewer bits around.

Tim.
February 15, 2008
Tim Burrell wrote:
> downs wrote:
>> Another interesting observation.
>>
>> If I change all my opFoo's to opFooAssign's, and use those instead, speed goes up from 16s to 13s; indicating that returning large structs (12 bytes/vector) causes a significant speed hit. Still not close to the C++ version though. The weird thing is that all those ops have been inlined (or so says the assembler dump). Weird.
>>
>>  --downs
> 
> Yeah, I was about to say the same.  See here:
> 
> http://paste.dprogramming.com/dpolmzhw
> 
> It's ugly, but no struct returning.
> 
> On my machine it's about a second slower than g++ (8.9s vs. 7.8s)
> compiled via:
> 
> gdc -fversion=Posix -fversion=Tango -O3 -fomit-frame-pointer -fweb -frelease -finline-functions
> 
> and
> 
> g++ -O3 -fomit-frame-pointer -fweb -finline-functions
> 
> There's probably some other optimizations that could be made.  But really I think this comes down to the compiler not being as mature.  The stuff that I did should all be done by an optimizing compiler.  You're basically tricking the compiler into moving less bits around.
> 
> Tim.

But even using your compiler flags, I'm still looking at 12.8s (D) vs 8.1s (C++), or 11.4s (D) vs 7.8s (C++) with -march=nocona.

:ten minutes later:

... Okay, now I'm confused.
Your program is three seconds faster than my op*Assign version.
Is there a generic problem with operator overloading?

I rewrote my version to use freestanding functions .. 9.5s :confused: Why do struct members (which are inlined, I checked) take such a speed hit?

Ah well. Let's hope LLVMDC does a better job .. someday.

 --downs
February 15, 2008
downs:
>If I change all my opFoo's to opFooAssign's, and use those instead, speed goes up from 16s to 13s; indicating that returning large structs (12 bytes/vector) causes a significant speed hit.<

Tim Burrell:
> Yeah, I was about to say the same.  See here:

Yep, see my TinyVector ;-)

Bye,
bearophile
February 15, 2008
"downs" <default_357-line@yahoo.de> wrote in message news:fp4593$1kko$1@digitalmars.com...

> I rewrote my version for freestanding functions .. 9.5s :confused: Why do struct members (which are inlined, I checked) take such a speed hit?

I think other people have come to this bizarre realization as well.  It really doesn't make any sense.

Have you compared the assembly of calling a struct member function and calling a free function?


February 15, 2008
I ran a comparison of struct vector methods vs freestanding, and the GDC generated assembler code is precisely identical.

Here's my test source:

struct foo {
  double x, y, z;
  void opAddAssign(ref foo bar) {
    x += bar.x; y += bar.y; z += bar.z;
  }
}

void foo_add(ref foo bar, ref foo baz) {
  baz.x += bar.x; baz.y += bar.y; baz.z += bar.z;
}

// prevents overzealous optimization
// really just returns 0, 0, 0
extern(C) foo complex_external_function();

import std.stdio;
void main() {
  foo a = complex_external_function(), b = complex_external_function();
  asm { int 3; }
  a += b;
  asm { int 3; }
  foo c = complex_external_function(), d = complex_external_function();
  asm { int 3; }
  foo_add(d, c);
  asm { int 3; }
  writefln(a, b, c, d);
}

And here are the relevant two bits of assembler.


#APP
	int	$3
#NO_APP
	fldl	-120(%ebp)
	faddl	-96(%ebp)
	fstpl	-120(%ebp)
	fldl	-112(%ebp)
	faddl	-88(%ebp)
	fstpl	-112(%ebp)
	fldl	-104(%ebp)
	faddl	-80(%ebp)
	fstpl	-104(%ebp)



#APP
	int	$3
#NO_APP
	fldl	-72(%ebp)
	faddl	-48(%ebp)
	fstpl	-72(%ebp)
	fldl	-64(%ebp)
	faddl	-40(%ebp)
	fstpl	-64(%ebp)
	fldl	-56(%ebp)
	faddl	-32(%ebp)
	fstpl	-56(%ebp)

No difference. But then why the obvious speed difference? Color me confused ._.

 --downs
February 15, 2008
downs wrote:
> I rewrote my version for freestanding functions .. 9.5s :confused: Why do struct members (which are inlined, I checked) take such a speed hit?
> 

My version had a bug. x__X

The correct version takes 11.2s again.

 --downs