February 14, 2020
Hello
I noticed a strange behaviour of the DMD compiler when it has to call a function with float arguments.

I build with the flags "-mcpu=avx2 -O -m64" on 64-bit Windows, using "DMD32 D Compiler v2.090.1-dirty".

I have the following function :
   float mul_add(float a, float b, float c); //Return a * b + c

When I try to call it :
   float f = d_mul_add(1.0, 2.0, 3.0);

the following instructions are generated (I tested other functions with float parameters and they show the same problem) :
   //Loads the values, as expected
   vmovss xmm2,dword [rel 0x64830]
   vmovss xmm1,dword [rel 0x64834]
   vmovss xmm0,dword [rel 0x64838]
   //Why ?
   movq r8,xmm2
   movq rdx,xmm1
   movq rcx,xmm0
   //
   call 0x400   //0x400 is where the mul_add function is located

My questions are :
 - Is there a reason why the values in xmm0/1/2 are copied into rcx/rdx/r8 before the call ? The calling convention specifies that floating-point parameters are passed in xmm registers, not GPRs, unless you are using your own calling convention.
 - Why is the copy done with non-AVX instructions ? Mixing AVX and non-AVX instructions can hurt performance significantly.

Any idea ? Thank you in advance.
February 15, 2020
On Friday, 14 February 2020 at 22:36:20 UTC, PatateVerte wrote:
> Hello
> I noticed a strange behaviour of the DMD compiler when it has to call a function with float arguments.
>
> I build with the flags "-mcpu=avx2 -O -m64" on 64-bit Windows, using "DMD32 D Compiler v2.090.1-dirty".
>
> I have the following function :
>    float mul_add(float a, float b, float c); //Return a * b + c
>
> When I try to call it :
>    float f = d_mul_add(1.0, 2.0, 3.0);
>
> the following instructions are generated (I tested other functions with float parameters and they show the same problem) :
>    //Loads the values, as expected
>    vmovss xmm2,dword [rel 0x64830]
>    vmovss xmm1,dword [rel 0x64834]
>    vmovss xmm0,dword [rel 0x64838]
>    //Why ?
>    movq r8,xmm2
>    movq rdx,xmm1
>    movq rcx,xmm0
>    //
>    call 0x400   //0x400 is where the mul_add function is located
>
> My questions are :
>  - Is there a reason why the values in xmm0/1/2 are copied into rcx/rdx/r8 before the call ? The calling convention specifies that floating-point parameters are passed in xmm registers, not GPRs, unless you are using your own calling convention.
>  - Why is the copy done with non-AVX instructions ? Mixing AVX and non-AVX instructions can hurt performance significantly.
>
> Any idea ? Thank you in advance.

It's simply bad codegen (or rather a missed optimization opportunity) in DMD: its backend doesn't see that the parameters are already in the right order and in the right registers, so it copies them and puts them in the registers for the inner function call.

I have observed this in the past too, i.e. unexplained round-tripping between GP and SSE registers. For good FP codegen, use LDC2 or GDC, or write inline asm (but lose inlining).

For other people who'd like to observe the problem: https://godbolt.org/z/gvqEqz.
By the way, I had to disable AVX2 targeting because otherwise the result is even weirder (https://godbolt.org/z/T9NwMc).
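
For anyone who prefers to reproduce this locally, a minimal sketch could look like the one below. It is an assumption about the test case, not the code behind the godbolt links: the body of mul_add is inferred from the "Return a * b + c" comment in the original post, and caller is a hypothetical wrapper added only so that there is a call site to inspect. pragma(inline, false) is there to make sure the call is actually emitted.

```d
// Hypothetical reproduction; only the signature of mul_add comes from
// the original post, the body is inferred from its comment.
pragma(inline, false)
float mul_add(float a, float b, float c)
{
    return a * b + c;
}

// Hypothetical wrapper: the call below is the site whose generated
// code is worth inspecting in the disassembly.
float caller()
{
    // The f suffix keeps the literals as float rather than double.
    return mul_add(1.0f, 2.0f, 3.0f);
}

void main()
{
    // 1 * 2 + 3 is exactly representable as a float, so == is safe here.
    assert(caller() == 5.0f);
}
```

Building this with the flags from the original post ("-mcpu=avx2 -O -m64") and disassembling caller should show whether the extra moves between xmm and general-purpose registers appear with a given compiler version.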