August 03, 2015
On Monday, 3 August 2015 at 16:47:58 UTC, John Colvin wrote:
> gets me down to 0.182s with ldc on OS X

Yeah, I tried dmd with final and didn't see a difference, but gdc with final (and -frelease, which is very important for max speed here, since without it the method calls are surrounded by various assertions) got similar speed to the hand-written one too.
August 03, 2015
On 8/3/15 12:50 PM, John Colvin wrote:
> On Monday, 3 August 2015 at 16:47:14 UTC, Adam D. Ruppe wrote:
>> You can try a few potential optimizations in the D version yourself
>> and see if it makes a difference.
>>
>> Devirtualization has a very small impact. Test this by making `test`
>> take `SubFoo` and making `bar` final, or making `bar` a stand-alone
>> function.
>>
>> That's not it.
>
> Making SubFoo a final class and test take SubFoo gives a >10x speedup
> for me.

Let's make sure we're all comparing apples to apples here.

FWIW, I suspect the inlining to be the most significant improvement, which is impossible for virtual functions in D.

ALSO, make SURE you are compiling in release mode, so you aren't calling a virtual invariant function before/after every call.

-Steve
August 03, 2015
On Monday, 3 August 2015 at 16:53:30 UTC, Adam D. Ruppe wrote:
> On Monday, 3 August 2015 at 16:47:58 UTC, John Colvin wrote:
>> gets me down to 0.182s with ldc on OS X
>
> Yeah, I tried dmd with the final and didn't get a difference but gdc with final (and -frelease, very important for max speed here since without it the method calls are surrounded by various assertions) and got similar speed to the hand written one too.

ouch, yeah those assertions cause me a 30x slowdown!
August 03, 2015
On 03-Aug-2015 19:54, Steven Schveighoffer wrote:
> On 8/3/15 12:50 PM, John Colvin wrote:
>> On Monday, 3 August 2015 at 16:47:14 UTC, Adam D. Ruppe wrote:
>>> You can try a few potential optimizations in the D version yourself
>>> and see if it makes a difference.
>>>
>>> Devirtualization has a very small impact. Test this by making `test`
>>> take `SubFoo` and making `bar` final, or making `bar` a stand-alone
>>> function.
>>>
>>> That's not it.
>>
>> Making SubFoo a final class and test take SubFoo gives a >10x speedup
>> for me.
>
> Let's make sure we're all comparing apples to apples here.
>
> FWIW, I suspect the inlining to be the most significant improvement,
> which is impossible for virtual functions in D.

Should be trivial in this particular case. You just keep the original virtual call where it cannot be deduced.

>
> ALSO, make SURE you are compiling in release mode, so you aren't calling
> a virtual invariant function before/after every call.

This one is critical. Actually, why do we have an extra call for a trivial null check on any object that doesn't even have an invariant?


-- 
Dmitry Olshansky
August 03, 2015
On Monday, 3 August 2015 at 16:50:42 UTC, John Colvin wrote:
> Making SubFoo a final class and test take SubFoo gives a >10x speedup for me.

Right, gdc and ldc will do the aggressive inlining and local data optimizations automatically once they are able to devirtualize the calls (at least when you use the -O flags).

dmd, however, even with -inline, doesn't make the local copy of the variable - it disassembles to this:

08098740 <_D1l4testFC1l6SubFooiZi>:
 8098740:       55                      push   ebp
 8098741:       8b ec                   mov    ebp,esp
 8098743:       89 c1                   mov    ecx,eax
 8098745:       53                      push   ebx
 8098746:       31 d2                   xor    edx,edx
 8098748:       8b 5d 08                mov    ebx,DWORD PTR [ebp+0x8]
 809874b:       56                      push   esi
 809874c:       85 c9                   test   ecx,ecx
 809874e:       7e 0f                   jle    809875f <_D1l4testFC1l6SubFooiZi+0x1f>
 8098750:       8b 43 08                mov    eax,DWORD PTR [ebx+0x8]
 8098753:       8d 74 40 01             lea    esi,[eax+eax*2+0x1]
 8098757:       42                      inc    edx
 8098758:       89 73 08                mov    DWORD PTR [ebx+0x8],esi
 809875b:       39 ca                   cmp    edx,ecx
 809875d:       7c f1                   jl     8098750 <_D1l4testFC1l6SubFooiZi+0x10>
 809875f:       8b 43 08                mov    eax,DWORD PTR [ebx+0x8]
 8098762:       5e                      pop    esi
 8098763:       5b                      pop    ebx
 8098764:       5d                      pop    ebp
 8098765:       c2 04 00                ret    0x4

There's no call in there, but there is still indirect memory access for the variable, so it doesn't get the caching benefits of the stack.
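
What "making the local copy" means can be sketched by hand. This is an illustration, not the original benchmark: the `SubFoo` below and both function names are assumptions, and the loop body is read off the `lea esi,[eax+eax*2+0x1]` in the listing above (x = x*3 + 1). `testField` mirrors what dmd emits after inlining, with a load and a store through the object on every iteration; `testLocal` is the shape gdc and ldc are said to reach on their own, where the value lives in a local and is written back once.

```d
import std.stdio : writeln;

// Hypothetical stand-in for the benchmark class; only the field matters here.
final class SubFoo
{
    int x;
}

// Field is read and written through the object reference each iteration,
// like the mov/lea/mov sequence on [ebx+0x8] in the dmd output.
int testField(SubFoo obj, int repeat)
{
    foreach (i; 0 .. repeat)
        obj.x = obj.x * 3 + 1;
    return obj.x;
}

// Field is copied into a local, updated there, and stored back once.
int testLocal(SubFoo obj, int repeat)
{
    int x = obj.x;
    foreach (i; 0 .. repeat)
        x = x * 3 + 1;
    obj.x = x;
    return x;
}

void main()
{
    auto a = new SubFoo;
    auto b = new SubFoo;
    // Both versions compute the same result; only the memory traffic differs.
    writeln(testField(a, 5), " ", testLocal(b, 5));
}
```

Timing the two with dmd -O -release -inline would be one way to check how much of the gap comes from this alone.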

It isn't news that dmd's optimizer is pretty bad next to... well, pretty much everyone else nowadays, whether gdc, ldc, or Java, but it is sometimes nice to take a look at why.

The biggest magic of Java IMO here is being CPU cache friendly!
August 03, 2015
On 8/3/15 12:59 PM, Dmitry Olshansky wrote:
> On 03-Aug-2015 19:54, Steven Schveighoffer wrote:

>> ALSO, make SURE you are compiling in release mode, so you aren't calling
>> a virtual invariant function before/after every call.
>
> This one is critical. Actually why do we have an extra call for trivial
> null-check on any object that doesn't even have invariant?

Actually, the call to the invariant should be avoidable if the object doesn't have one. It should be easy to check the vtable pointer to see if it points at the "default" invariant (which does nothing).
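
For anyone who wants to see the invariant overhead directly, here is a minimal sketch. The `Account` class and the `invariantCalls` counter are made up for illustration. In a non-release build D checks a class invariant before and after each public member function call; with -release (dmd, ldc2) or -frelease (gdc) those checks are compiled out, so the counter should stay at zero.

```d
import std.stdio : writeln;

// Hypothetical counter, only here to make the hidden calls visible.
__gshared int invariantCalls;

class Account
{
    int balance;

    // Runs around public member calls unless compiled in release mode.
    invariant()
    {
        ++invariantCalls;
        assert(balance >= 0);
    }

    void deposit(int amount)
    {
        balance += amount;
    }
}

void main()
{
    auto acc = new Account;
    invariantCalls = 0;             // ignore any check done during construction
    foreach (i; 0 .. 1000)
        acc.deposit(1);
    writeln("balance: ", acc.balance);
    writeln("invariant checks: ", invariantCalls);
}
```

Building it once with and once without the release switch shows whether the checks are in play for a given compiler invocation.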

-Steve

August 03, 2015
On Monday, 3 August 2015 at 16:47:58 UTC, John Colvin wrote:
> changing two lines:
> final class SubFoo : Foo {
> int test(F)(F obj, int repeat) {

I tried it. DMD shows no change, while GDC gets an acceptable score.
D(DMD 2.067.1): 2.445
D(GDC 4.9.2/2.066): 0.928
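
For reference, a minimal sketch of where those two changed lines sit. The original benchmark is not quoted in this thread, so the class bodies are assumptions: `bar` is reconstructed from the `lea esi,[eax+eax*2+0x1]` in the dmd disassembly earlier in the thread (x = x*3 + 1), and `testVirtual` is a hypothetical name for the unmodified version that takes the base class.

```d
import std.stdio : writeln;

class Foo
{
    int x;
    void bar() {}                        // virtual by default in D
}

// `final` tells the compiler nothing can override SubFoo's methods.
final class SubFoo : Foo
{
    override void bar() { x = x * 3 + 1; }
}

// Assumed shape of the original: static type is Foo, so bar() is a virtual call.
int testVirtual(Foo obj, int repeat)
{
    foreach (i; 0 .. repeat)
        obj.bar();
    return obj.x;
}

// The templated form from the quoted change: F is deduced as SubFoo at the
// call site, which lets the compiler bind bar() statically and inline it.
int test(F)(F obj, int repeat)
{
    foreach (i; 0 .. repeat)
        obj.bar();
    return obj.x;
}

void main()
{
    writeln(testVirtual(new SubFoo, 5)); // 121
    writeln(test(new SubFoo, 5));        // 121
}
```

Both calls compute the same value; only the dispatch differs, which is why the result depends so much on whether the backend actually inlines the call.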

Now I have a hint about how to improve the code by hand.
Thanks, John.
But the original Java code that I'm porting is
about 10,000 lines of code,
and the performance differs by about a factor of 3.
Yes! Java is 3 times faster than D in my app.
I hope future DMD/GDC compilers will do
similar optimizations automatically, not by hand.

Aki.

August 03, 2015
On 03-Aug-2015 20:05, Steven Schveighoffer wrote:
> On 8/3/15 12:59 PM, Dmitry Olshansky wrote:
>> On 03-Aug-2015 19:54, Steven Schveighoffer wrote:
>
>>> ALSO, make SURE you are compiling in release mode, so you aren't calling
>>> a virtual invariant function before/after every call.
>>
>> This one is critical. Actually why do we have an extra call for trivial
>> null-check on any object that doesn't even have invariant?
>
> Actually, that the call to the invariant should be avoidable if the
> object doesn't have one. It should be easy to check the vtable pointer
> to see if it points at the "default" invariant (which does nothing).
>
https://issues.dlang.org/show_bug.cgi?id=14865


-- 
Dmitry Olshansky
August 03, 2015
On Monday, 3 August 2015 at 17:33:30 UTC, aki wrote:
> On Monday, 3 August 2015 at 16:47:58 UTC, John Colvin wrote:
>> changing two lines:
>> final class SubFoo : Foo {
>> int test(F)(F obj, int repeat) {
>
> I tried it. DMD is no change, while GDC gets acceptable score.
> D(DMD 2.067.1): 2.445
> D(GDC 4.9.2/2.066): 0.928
>
> Now I got a hint how to improve the code by hand.
> Thanks, John.
> But the original Java code that I'm porting is
> about 10,000 lines of code.
> And the performance is about 3 times different.
> Yes! Java is 3 times faster than D in my app.
> I hope the future DMD/GDC compiler will do the
> similar optimization automatically, not by hand.
>
> Aki.

LLVM might be able to achieve Java's optimization for your use case using profile-guided optimization. In principle, it's hard to choose which function to inline without the function call counts, but LLVM has a back-end with sampling support.

http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization

Whether or not this is or will be available soon for D in LDC is a different matter.