August 05, 2011
Trass3r:

> > are you willing and able to show me the asm before it gets assembled? (with gcc you do it with the -S switch). (I also suggest to use only the C standard library, with time() and printf() to produce a smaller asm output: http://codepad.org/12EUo16J ).

You are a person of few words :-) Thank you for the asm.

Apparently the program was not compiled in release mode (or with bounds checks disabled; with DMD that's the same thing, maybe with gdc it isn't). The asm contains calls, but they aren't calls to the next line; they are for the array bounds checks:

    call    _d_assert
    call    _d_array_bounds
    call    _d_array_bounds
    call    _d_assert_msg
    call    _d_array_bounds
    call    _d_array_bounds
    call    _d_array_bounds
    call    _d_array_bounds
    call    _d_array_bounds
    call    _d_array_bounds
    call    _d_assert_msg

But I think this doesn't fully explain the low performance; I have seen too many instructions like:

	movss	DWORD PTR [rsp+32], xmm1
	movss	DWORD PTR [rsp+16], xmm2
	movss	DWORD PTR [rsp+48], xmm3

If you want to go on with this exploration, then I suggest you find a way to disable the bounds checks.
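
To make clear what those _d_array_bounds calls stand for, here is a rough C++ analogy. It is not the benchmark code, and Slice, checked_at, sum_checked and sum_unchecked are names made up for this sketch. The assumption is only that a bounds-checked build tests every index against the array length and jumps to an error routine on failure, while a build with the checks disabled indexes the memory directly:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for a D dynamic array: a pointer plus a length.
struct Slice {
    const float* ptr;
    std::size_t  len;
};

// Checked access: compares the index with the length on every call and
// takes an error path when it is out of range.
inline float checked_at(const Slice& s, std::size_t i) {
    if (i >= s.len) {
        throw std::out_of_range("array index out of bounds");
    }
    return s.ptr[i];
}

// Sums the elements with a bounds check on every access.
float sum_checked(const Slice& s) {
    float total = 0.0f;
    for (std::size_t i = 0; i < s.len; ++i) {
        total += checked_at(s, i);
    }
    return total;
}

// Sums the elements through the raw pointer, with no per-access check.
float sum_unchecked(const Slice& s) {
    assert(s.ptr != nullptr || s.len == 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < s.len; ++i) {
        total += s.ptr[i];
    }
    return total;
}
```

Both functions return the same result for valid input; the difference is the extra compare and branch per access in the checked version. How much that costs in a given loop depends on the compiler and on whether it can prove the index is in range.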

Bye,
bearophile
August 05, 2011
> If you want to go on with this exploration, then I suggest you find a way to disable the bounds checks.

Ok, now I get up to 32930000 skinned vertices per second.
Still a bit worse than LDC.
August 05, 2011
On 04.08.2011, 04:07, Trass3r <un@known.com> wrote:

>> C++:
>> Skinned vertices per second: 48660000
>>
>> C++ no SIMD:
>> Skinned vertices per second: 42420000
>>
>>
>> D dmd:
>> Skinned vertices per second: 159046
>>
>> D gdc:
>> Skinned vertices per second: 23450000
>
>
> D ldc:
> Skinned vertices per second: 37910000
>
> ldc2 -O3 -release -enable-inlining dver.d


D gdc with added -frelease -fno-bounds-check:
Skinned vertices per second: 37710000
August 05, 2011
Trass3r:

> >> C++ no SIMD:
> >> Skinned vertices per second: 42420000
>...
> D gdc with added -frelease -fno-bounds-check:
> Skinned vertices per second: 37710000

I'd like to know why the GCC back-end is able to produce a more efficient binary from the C++ code (compared to the D code), but the problem is no longer as large as before.

It seems I've found a benchmark coming from real-world code that's a worst case for DMD (GDC here produces code about 237 times faster than DMD).
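
For anyone who wants to recheck the ratios, here is a small C++ sketch that recomputes them from the figures quoted in this thread. The constant and function names are made up for the example:

```cpp
#include <cassert>
#include <cmath>

// Figures quoted in this thread, in skinned vertices per second.
const double dmd_rate         = 159046.0;    // D dmd
const double gdc_release_rate = 37710000.0;  // D gdc with -frelease -fno-bounds-check
const double cpp_no_simd_rate = 42420000.0;  // C++ no SIMD

// How many times higher the throughput `fast` is compared to `slow`.
inline double speedup(double fast, double slow) {
    assert(slow > 0.0);
    return fast / slow;
}

// The same ratio rounded to the nearest whole number.
inline long rounded_speedup(double fast, double slow) {
    return std::lround(speedup(fast, slow));
}
```

speedup(gdc_release_rate, dmd_rate) comes out at about 237, and speedup(cpp_no_simd_rate, gdc_release_rate) at about 1.12.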

Bye,
bearophile
August 05, 2011
> I'd like to know why the GCC back-end is able to produce a more efficient binary from the C++ code (compared to the D code), but the problem is no longer as large as before.

I attached both asm versions ;)

August 06, 2011
== Quote from bearophile (bearophileHUGS@lycos.com)'s article
> Trass3r:
> > C++ no SIMD:
> > Skinned vertices per second: 42420000
> >
> ...
> > D gdc:
> > Skinned vertices per second: 23450000
> Are you able and willing to show me the asm produced by gdc? There's a problem there.
> Bye,
> bearophile


Notes from me:

- Options -fno-bounds-check and -frelease can be just as important in GDC as they are in DMD in certain cases.
- You can output asm in Intel dialect using -masm=intel if AT&T is that difficult for you to read. 8-)

I will look into this later from my workstation.
August 06, 2011
Iain Buclaw:

> I will look into this later from my workstation.

The remaining thing to look at is just the small performance difference between the D-GDC version and the C++-G++ version.

Bye,
bearophile
August 06, 2011
== Quote from bearophile (bearophileHUGS@lycos.com)'s article
> Iain Buclaw:
> > I will look into this later from my workstation.
> The remaining thing to look at is just the small performance difference between the D-GDC version and the C++-G++ version.
> Bye,
> bearophile

Three things that helped improve performance in a minor way for me:
1) using pointers over dynamic arrays. (5% speedup)
2) removing the calls to CalVector4's constructor (5.7% speedup)
3) using core.stdc.time over std.datetime. (1.6% speedup)

Point one is a pretty well-known issue in D as far as I'm aware.
Point two is not an issue with inlining (all methods are marked 'inline'), but it did help remove quite a few movss instructions being emitted.
Point three is interesting: it seems that "sw.peek().msecs" reduces the number of iterations the while loop completes.
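
As an aside on point three: when a loop reads the clock on every pass, the cost of the read itself becomes part of what is measured. The C++ sketch below shows one generic way to reduce that effect by checking the clock only every poll_every iterations. This is not the change I made (I only swapped std.datetime for core.stdc.time), and run_for and poll_every are names invented for the illustration:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Runs `work` repeatedly for roughly `budget`, reading the clock only once
// every `poll_every` iterations. Returns the number of iterations completed.
// The loop can overshoot the budget by up to one batch of `poll_every` calls.
template <typename Work>
std::uint64_t run_for(std::chrono::milliseconds budget,
                      std::uint64_t poll_every, Work work) {
    assert(poll_every > 0);
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + budget;
    std::uint64_t iterations = 0;
    for (;;) {
        // Do a whole batch of work without touching the clock.
        for (std::uint64_t i = 0; i < poll_every; ++i) {
            work();
        }
        iterations += poll_every;
        // One clock read per batch instead of one per iteration.
        if (clock::now() >= deadline) {
            break;
        }
    }
    return iterations;
}
```

With poll_every set to 1 this degenerates to reading the clock on every pass, so the same function can be used to compare both variants on a given machine.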


With those changes, the D implementation is still 21% slower than the C++ implementation without SIMD.

http://ideone.com/4PP2D
August 06, 2011
Iain Buclaw:

Are you using GDC2-64 bit on Linux?

> Three things that helped improve performance in a minor way for me:
> 1) using pointers over dynamic arrays. (5% speedup)
> 2) removing the calls to CalVector4's constructor (5.7% speedup)
> 3) using core.stdc.time over std.datetime. (1.6% speedup)
> 
> Point one is a pretty well-known issue in D as far as I'm aware.

Really? I don't remember discussions about it. What is its cause?


> Point two is not an issue with inlining (all methods are marked 'inline'), but it did help remove quite a few movss instructions being emitted.

This too is something worth fixing. Is this issue in Bugzilla already?


> Point three is interesting: it seems that "sw.peek().msecs" reduces the number of iterations the while loop completes.

This needs to be fixed.


> With those changes, the D implementation is still 21% slower than the C++ implementation
> without SIMD.
> http://ideone.com/4PP2D

This is still a lot.

Thank you for your work. I think all three issues are worth fixing, eventually.

Bye,
bearophile
August 06, 2011
== Quote from bearophile (bearophileHUGS@lycos.com)'s article
> Iain Buclaw:
> Are you using GDC2-64 bit on Linux?

GDC2-32 bit on Linux.


> > Three things that helped improve performance in a minor way for me:
> > 1) using pointers over dynamic arrays. (5% speedup)
> > 2) removing the calls to CalVector4's constructor (5.7% speedup)
> > 3) using core.stdc.time over std.datetime. (1.6% speedup)
> >
> > Point one is a pretty well-known issue in D as far as I'm aware.
> Really? I don't remember discussions about it. What is its cause?

I can't remember the exact discussion, but it was something about a benchmark of passing by value vs passing by ref vs passing by pointer.

> > Point two is not an issue with inlining (all methods are marked 'inline'), but it did help remove quite a few movss instructions being emitted.
> This too is something worth fixing. Is this issue in Bugzilla already?

I don't think it's an issue really. But of course, there is a difference between what you say and what you mean with regard to the code here (that being, with the first version, lots of temp vars get created and moved around the place).


Regards
Iain