October 08, 2014 Disappointing math performance compared to GDC
Hello, I have a machine learning library and I'm porting it from C++ to D right now. It contains a number crunching benchmark that does simple gradient descent learning on a small multilayer perceptron neural network. The core of the benchmark is a few loops doing basic computations (add, mul, exp, abs) on numbers in float[] arrays.

The reference is the C++ version compiled with Clang: 0.044 secs.

D results:

DMD 2.066 -O -release -inline -boundscheck=off : 0.06 secs
LDC2 0.14 -O3 -release : 0.051 secs
GDC 4.9 -O3 -release : 0.031 secs

I think my benchmark code would hugely benefit from auto-vectorization, so that might be the cause of the above results. I've found some vectorization compiler options for ldc2, but they seem to have no effect on performance whatsoever. Any suggestions?
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Gabor Mezo

Try with '-O3 -release -vectorize-slp-aggressive -g -pass-remarks-analysis="loop-vectorize|loop-unroll" -pass-remarks=loop-unroll'

Note that the D situation is a mess in general (correct me if I'm wrong):

* Never ever use std.math as you will get the insane 80-bit functions.
* core.math has some hacks to use llvm builtins but also mostly uses type real.
* core.stdc.math supports all types but uses suffixes and maps to C functions.
* core.stdc.tgmath gets rid of the suffixes at least. Best way imo to write code if you disregard auto-vectorization.
* You can also use ldc.intrinsics to kill portability. Hello C++.

And there's no fast-math yet:
https://github.com/ldc-developers/ldc/issues/722
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Trass3r

On Wednesday, 8 October 2014 at 11:29:30 UTC, Trass3r wrote:
> Try with '-O3 -release -vectorize-slp-aggressive -g -pass-remarks-analysis="loop-vectorize|loop-unroll" -pass-remarks=loop-unroll'
>
> Note that the D situation is a mess in general (correct me if I'm wrong):
> * Never ever use std.math as you will get the insane 80-bit functions.
> * core.math has some hacks to use llvm builtins but also mostly using type real.
> * core.stdc.math supports all types but uses suffixes and maps to C functions.
> * core.stdc.tgmath gets rid of the suffixes at least. Best way imo to write code if you disregard auto-vectorization.
> * you can also use ldc.intrinsics to kill portability. Hello C++.
>
> And there's no fast-math yet:
> https://github.com/ldc-developers/ldc/issues/722

I get:

Unknown command line argument '-pass-remarks-analysis=loop-vectorize|loop-unroll'
Unknown command line argument '-pass-remarks=loop-unroll'
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Trass3r

On 08/10/14 12:29, Trass3r via digitalmars-d-ldc wrote:
[…]
> * Never ever use std.math as you will get the insane 80-bit functions.

What can one use to avoid this, and use the 64-bit numbers?

> * core.math has some hacks to use llvm builtins but also mostly using type real.
> * core.stdc.math supports all types but uses suffixes and maps to C functions.
> * core.stdc.tgmath gets rid of the suffixes at least. Best way imo to write code if you disregard auto-vectorization.
> * you can also use ldc.intrinsics to kill portability. Hello C++.
>
> And there's no fast-math yet: https://github.com/ldc-developers/ldc/issues/722

Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Gabor Mezo

> Unknown command line argument '-pass-remarks-analysis=loop-vectorize|loop-unroll'
> Unknown command line argument '-pass-remarks=loop-unroll'

They were added to llvm in April/May.
-help-hidden lists all available options.
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Trass3r

On Wednesday, 8 October 2014 at 15:04:10 UTC, Trass3r wrote:
>> Unknown command line argument '-pass-remarks-analysis=loop-vectorize|loop-unroll'
>> Unknown command line argument '-pass-remarks=loop-unroll'
>
> They were added to llvm in April/May.
> -help-hidden lists all available options.

I can confirm that no pass-remarks options are found. I'm using 0.14.0 from here:
https://github.com/ldc-developers/ldc/releases/tag/v0.14.0
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Russel Winder

>> * Never ever use std.math as you will get the insane 80-bit
>> functions.
>
> What can one use to avoid this, and use the 64-bit numbers.

Note that this is less of an issue for x86 code using x87 by default. On x64, though, this results in really bad code switching between SSE and x87 registers. But vectorization is usually killed in any case.

I personally use core.stdc.tgmath atm.

>> * [...]
>> Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?

I think there have been threads debating the unreasonable 'real by default' attitude. No clue if there's any result.

And core.math is a big question mark to me.

I don't know about gdc. Its runtime doesn't look much different.
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Trass3r

On Wednesday, 8 October 2014 at 15:32:46 UTC, Trass3r wrote:
>>> * Never ever use std.math as you will get the insane 80-bit
>>> functions.
>>
>> What can one use to avoid this, and use the 64-bit numbers.
>
> Note that this is less of an issue for x86 code using x87 by default.
> On x64 though this results in really bad code switching between SSE and x87 registers.
> But vectorization is usually killed in any case.
>
> I personally use core.stdc.tgmath atm.
>
>>> * [...]
>> Is there any work to handle the above? Does GDC actually suffer the same (or analogous) issues?
>
> I think there have been threads debating the unreasonable 'real by default' attitude.
> No clue if there's any result.
>
> And core.math is a big question mark to me.
>
> I don't know about gdc. Its runtime doesn't look much different.
Just for the record, my benchmark code doesn't use math libraries; I'm using logistic function approximations. That's why I thought the cause of my results has to be the lack of auto-vectorization.
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Gabor Mezo

Just check it with '-output-ll' or '-output-s -x86-asm-syntax=intel' ;)
October 08, 2014 Re: Disappointing math performance compared to GDC
Posted in reply to Trass3r

On Wednesday, 8 October 2014 at 16:02:17 UTC, Trass3r wrote:
> Just check it with '-output-ll' or '-output-s -x86-asm-syntax=intel' ;)

I'm not an ASM expert, but as far as I can see it indeed uses some SIMD registers and instructions. For example:

.LBB0_16:
        mov     rcx, qword ptr [rax]
        mov     rdi, rax
        call    qword ptr [rcx + 56]
        test    rax, rax
        jne     .LBB0_18
        movss   xmm1, dword ptr [rsp + 116]
        jmp     .LBB0_20
        .align  16, 0x90
.LBB0_18:
        mov     rcx, rbx
        imul    rcx, rax
        add     r12, rcx
        movss   xmm1, dword ptr [rsp + 116]
        .align  16, 0x90
.LBB0_19:
        movss   xmm0, dword ptr [rdx]
        mulss   xmm0, dword ptr [r12]
        addss   xmm1, xmm0
        add     rdx, 4
        add     r12, 4
        dec     rax
        jne     .LBB0_19
.LBB0_20:
        movss   dword ptr [rsp + 116], xmm1
        inc     r14
        cmp     r14, r15
        jne     .LBB0_12
.LBB0_21:
        mov     rax, qword ptr [rsp + 80]
        mov     rdi, qword ptr [rax]
        mov     rax, qword ptr [rdi]
        call    qword ptr [rax + 40]
        test    eax, eax
        mov     rbp, qword ptr [rsp + 104]
        jne     .LBB0_24
        movss   xmm0, dword ptr [rsp + 92]
        movss   xmm1, dword ptr [rsp + 116]
        call    _D8nhelpers7sigmoidFNbffZf
        mov     rax, qword ptr [rsp + 64]
        movss   dword ptr [rax + 4*rbp], xmm0
        xor     edx, edx
        xor     ecx, ecx
        mov     r8d, _D11TypeInfo_Af6__initZ
        mov     rdi, qword ptr [rsp + 48]
        mov     rsi, qword ptr [rsp + 96]
        call    _adEq2
        test    eax, eax
        jne     .LBB0_27
        movss   xmm0, dword ptr [rsp + 92]
        movss   xmm1, dword ptr [rsp + 116]
        call    _D8nhelpers12sigmoidDerivFNbffZf
        mov     rax, qword ptr [rsp + 96]
        jmp     .LBB0_26
        .align  16, 0x90
.LBB0_24:
        movss   xmm0, dword ptr [rsp + 92]
        movss   xmm1, dword ptr [rsp + 116]
        call    _D8nhelpers6linearFNbffZf
        mov     rax, qword ptr [rsp + 64]
        movss   dword ptr [rax + 4*rbp], xmm0
        xor     edx, edx
        xor     ecx, ecx
        mov     r8d, _D11TypeInfo_Af6__initZ
        mov     rdi, qword ptr [rsp + 48]
        mov     rsi, qword ptr [rsp + 96]
        call    _adEq2
        test    eax, eax
        jne     .LBB0_27
        mov     rax, qword ptr [rsp + 96]
        movss   xmm0, dword ptr [rsp + 92]
.LBB0_26:
        movss   dword ptr [rax + 4*rbp], xmm0
.LBB0_27:
        inc     rbp
        add     rbx, 4
        cmp     rbp, qword ptr [rsp + 72]
        jne     .LBB0_9
.LBB0_28:
        mov     rax, qword ptr [rsp + 24]
        inc     rax
        cmp     rax, qword ptr [rsp + 8]
        mov     rbp, qword ptr [rsp + 16]
        jne     .LBB0_1
.LBB0_29:
        add     rsp, 120
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        ret
Copyright © 1999-2021 by the D Language Foundation