October 08, 2014 Disappointing math performance compared to GDC
Hello, I have a machine learning library and I'm porting it from C++ to D right now. It contains a number crunching benchmark that does simple gradient descent learning on a small multilayer perceptron neural network. The core of the benchmark is a few loops doing basic computations (add, mul, exp, abs) on numbers in float[] arrays.

The reference is the C++ version compiled with Clang: 0.044 secs.

D results:

DMD 2.066 -O -release -inline -boundscheck=off : 0.06 secs
LDC2 0.14 -O3 -release : 0.051 secs
GDC 4.9 -O3 -release : 0.031 secs

I think my benchmark code would hugely benefit from auto-vectorization, so that might be the cause of the above results. I've found some vectorization compiler options for ldc2, but they seem to have no effect on performance whatsoever. Any suggestions?
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Gabor Mezo

Try with '-O3 -release -vectorize-slp-aggressive -g -pass-remarks-analysis="loop-vectorize|loop-unroll" -pass-remarks=loop-unroll'

Note that the D situation is a mess in general (correct me if I'm wrong):

* Never ever use std.math as you will get the insane 80-bit functions.
* core.math has some hacks to use llvm builtins but also mostly uses type real.
* core.stdc.math supports all types but uses suffixes and maps to C functions.
* core.stdc.tgmath gets rid of the suffixes at least. Best way imo to write code if you disregard auto-vectorization.
* You can also use ldc.intrinsics to kill portability. Hello C++.

And there's no fast-math yet:
https://github.com/ldc-developers/ldc/issues/722
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Trass3r

On Wednesday, 8 October 2014 at 11:29:30 UTC, Trass3r wrote:
> Try with '-O3 -release -vectorize-slp-aggressive -g -pass-remarks-analysis="loop-vectorize|loop-unroll" -pass-remarks=loop-unroll'
>
> Note that the D situation is a mess in general (correct me if I'm wrong):
> * Never ever use std.math as you will get the insane 80-bit functions.
> * core.math has some hacks to use llvm builtins but also mostly using type real.
> * core.stdc.math supports all types but uses suffixes and maps to C functions.
> * core.stdc.tgmath gets rid of the suffixes at least. Best way imo to write code if you disregard auto-vectorization.
> * you can also use ldc.intrinsics to kill portability. Hello C++.
>
> And there's no fast-math yet:
> https://github.com/ldc-developers/ldc/issues/722

I get:

Unknown command line argument '-pass-remarks-analysis=loop-vectorize|loop-unroll'
Unknown command line argument '-pass-remarks=loop-unroll'
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Trass3r

On 08/10/14 12:29, Trass3r via digitalmars-d-ldc wrote:
[…]
> * Never ever use std.math as you will get the insane 80-bit functions.

What can one use to avoid this, and use the 64-bit numbers?

> * core.math has some hacks to use llvm builtins but also mostly using type real.
> * core.stdc.math supports all types but uses suffixes and maps to C functions.
> * core.stdc.tgmath gets rid of the suffixes at least. Best way imo to write code if you disregard auto-vectorization.
> * you can also use ldc.intrinsics to kill portability. Hello C++.
>
> And there's no fast-math yet: https://github.com/ldc-developers/ldc/issues/722

Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Gabor Mezo

> Unknown command line argument '-pass-remarks-analysis=loop-vectorize|loop-unroll'
> Unknown command line argument '-pass-remarks=loop-unroll'

They were added to llvm in April/May.
-help-hidden lists all available options.
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Trass3r

On Wednesday, 8 October 2014 at 15:04:10 UTC, Trass3r wrote:
>> Unknown command line argument '-pass-remarks-analysis=loop-vectorize|loop-unroll'
>> Unknown command line argument '-pass-remarks=loop-unroll'
>
> They were added to llvm in April/May.
> -help-hidden lists all available options.

I can confirm that no pass-remarks options are found. I'm using 0.14.0 from here:
https://github.com/ldc-developers/ldc/releases/tag/v0.14.0
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Russel Winder

>> * Never ever use std.math as you will get the insane 80-bit
>> functions.
>
> What can one use to avoid this, and use the 64-bit numbers.

Note that this is less of an issue for x86 code using x87 by default. On x64, though, this results in really bad code switching between SSE and x87 registers. But vectorization is usually killed in any case.

I personally use core.stdc.tgmath atm.

>> * [...]
>> Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?

I think there have been threads debating the unreasonable 'real by default' attitude. No clue if there's any result.

And core.math is a big question mark to me.

I don't know about gdc. Its runtime doesn't look much different.
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Trass3r

On Wednesday, 8 October 2014 at 15:32:46 UTC, Trass3r wrote:
>>> * Never ever use std.math as you will get the insane 80-bit
>>> functions.
>>
>> What can one use to avoid this, and use the 64-bit numbers.
>
> Note that this is less of an issue for x86 code using x87 by default.
> On x64 though this results in really bad code switching between SSE and x87 registers.
> But vectorization is usually killed in any case.
>
> I personally use core.stdc.tgmath atm.
>
>>> * [...]
>> Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?
>
> I think there have been threads debating the unreasonable 'real by default' attitude.
> No clue if there's any result.
>
> And core.math is a big question mark to me.
>
> I don't know about gdc. Its runtime doesn't look much different.
Just for the record, my benchmark code doesn't use math libraries; I'm using logistic function approximations. That's why I thought the cause of my results has to be the lack of auto-vectorization.
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Gabor Mezo

Just check it with '-output-ll' or '-output-s -x86-asm-syntax=intel' ;)
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Trass3r

On Wednesday, 8 October 2014 at 16:02:17 UTC, Trass3r wrote:
> Just check it with '-output-ll' or '-output-s -x86-asm-syntax=intel' ;)

I'm not an ASM expert, but as far as I can see it indeed uses some SIMD registers and instructions. For example:

.LBB0_16:
        mov     rcx, qword ptr [rax]
        mov     rdi, rax
        call    qword ptr [rcx + 56]
        test    rax, rax
        jne     .LBB0_18
        movss   xmm1, dword ptr [rsp + 116]
        jmp     .LBB0_20
        .align  16, 0x90
.LBB0_18:
        mov     rcx, rbx
        imul    rcx, rax
        add     r12, rcx
        movss   xmm1, dword ptr [rsp + 116]
        .align  16, 0x90
.LBB0_19:
        movss   xmm0, dword ptr [rdx]
        mulss   xmm0, dword ptr [r12]
        addss   xmm1, xmm0
        add     rdx, 4
        add     r12, 4
        dec     rax
        jne     .LBB0_19
.LBB0_20:
        movss   dword ptr [rsp + 116], xmm1
        inc     r14
        cmp     r14, r15
        jne     .LBB0_12
.LBB0_21:
        mov     rax, qword ptr [rsp + 80]
        mov     rdi, qword ptr [rax]
        mov     rax, qword ptr [rdi]
        call    qword ptr [rax + 40]
        test    eax, eax
        mov     rbp, qword ptr [rsp + 104]
        jne     .LBB0_24
        movss   xmm0, dword ptr [rsp + 92]
        movss   xmm1, dword ptr [rsp + 116]
        call    _D8nhelpers7sigmoidFNbffZf
        mov     rax, qword ptr [rsp + 64]
        movss   dword ptr [rax + 4*rbp], xmm0
        xor     edx, edx
        xor     ecx, ecx
        mov     r8d, _D11TypeInfo_Af6__initZ
        mov     rdi, qword ptr [rsp + 48]
        mov     rsi, qword ptr [rsp + 96]
        call    _adEq2
        test    eax, eax
        jne     .LBB0_27
        movss   xmm0, dword ptr [rsp + 92]
        movss   xmm1, dword ptr [rsp + 116]
        call    _D8nhelpers12sigmoidDerivFNbffZf
        mov     rax, qword ptr [rsp + 96]
        jmp     .LBB0_26
        .align  16, 0x90
.LBB0_24:
        movss   xmm0, dword ptr [rsp + 92]
        movss   xmm1, dword ptr [rsp + 116]
        call    _D8nhelpers6linearFNbffZf
        mov     rax, qword ptr [rsp + 64]
        movss   dword ptr [rax + 4*rbp], xmm0
        xor     edx, edx
        xor     ecx, ecx
        mov     r8d, _D11TypeInfo_Af6__initZ
        mov     rdi, qword ptr [rsp + 48]
        mov     rsi, qword ptr [rsp + 96]
        call    _adEq2
        test    eax, eax
        jne     .LBB0_27
        mov     rax, qword ptr [rsp + 96]
        movss   xmm0, dword ptr [rsp + 92]
.LBB0_26:
        movss   dword ptr [rax + 4*rbp], xmm0
.LBB0_27:
        inc     rbp
        add     rbx, 4
        cmp     rbp, qword ptr [rsp + 72]
        jne     .LBB0_9
.LBB0_28:
        mov     rax, qword ptr [rsp + 24]
        inc     rax
        cmp     rax, qword ptr [rsp + 8]
        mov     rbp, qword ptr [rsp + 16]
        jne     .LBB0_1
.LBB0_29:
        add     rsp, 120
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        ret
Copyright © 1999-2021 by the D Language Foundation