Thread overview
February 07, 2019 poor codegen for abs(), and div by literal?
Given the following code...

```
module ohreally;

import std.math;

float foo(float y, float x)
{
    float ax = fabs(x);
    float ay = fabs(y);
    return ax*ay/3.142f;
}
```

the compiler outputs the following for the function body (excluding prolog and epilogue code)...

```
movss    -010h[RBP],XMM0
movss    -8[RBP],XMM1
fld      float ptr -010h[RBP]
fabs
fstp     qword ptr -020h[RBP]
movsd    XMM0,-020h[RBP]
cvtsd2ss XMM0,XMM0
fld      float ptr -8[RBP]
fabs
fstp     qword ptr -020h[RBP]
movsd    XMM1,-020h[RBP]
cvtsd2ss XMM2,XMM1
mulss    XMM0,XMM2
movss    XMM3,FLAT:.rodata[00h][RIP]
divss    XMM0,XMM3
```

So to do the abs(), it stores to memory from an XMM reg, loads into the x87 FPU regs, does the abs with the old FPU instruction, then for some reason stores the result as a double, loads that back into an XMM, and converts it back to single.

And the div by 3.142f: is there a reason it can't be converted to a multiply? I know I can coax the multiply by doing *(1.0f/3.142f) instead, but I wondered if there's some reasoning in why it's not done automatically.

Is any of this worth adding to the bug tracker?
February 07, 2019 Re: poor codegen for abs(), and div by literal?
Posted in reply to NaN

On Thu, Feb 07, 2019 at 11:15:08PM +0000, NaN via Digitalmars-d wrote:
> Given the following code...
[...]
> the compiler outputs the following for the function body, (excluding
> prolog and epilogue code)...
[...]
> So to do the abs(), it stores to memory from XMM reg, loads into x87
> FPU regs, does the abs with the old FPU instruction, then for some
> reason stores the result as a double, loads that back into an XMM,
> converts it back to single.

Which compiler are you using?

For performance / codegen quality issues, I highly recommend looking at the output of ldc or gdc, rather than dmd. It's well known that dmd codegen tends to lag behind ldc/gdc as far as efficiency / optimization is concerned. These days, I don't even look at dmd output anymore when I'm looking for performance. IME, dmd consistently produces code that's about 20-30% slower than ldc- or gdc-produced code, sometimes even as high as 40%.

> And the div by 3.142f, is there a reason it can't be converted to a
> multiply? I know I can coax the multiply by doing *(1.0f/3.142f)
> instead, but I wondered if there's some reasoning in why it's not
> done automatically?
>
> Is any of this worth adding to the bug tracker?

If this problem is specific to dmd, you can post a bug against dmd, I suppose, but I wouldn't hold my breath for dmd codegen to significantly improve in the near future. Walter is far too overloaded with other language issues to do significant work on the optimizer at the moment.

OTOH, when it comes to floating-point operations, the optimizer's hands may be tied because of IEEE 754-dictated semantics. There may be some corner cases where multiplying rather than dividing produces different results, and therefore the optimizer is not free to simply substitute one for the other, even if in this particular case it works fine. You may need to spell it out yourself if what you want is a multiply rather than a divide.


T

--
IBM = I'll Buy Microsoft!
February 08, 2019 Re: poor codegen for abs(), and div by literal?
Posted in reply to H. S. Teoh

On Thursday, 7 February 2019 at 23:35:36 UTC, H. S. Teoh wrote:
> On Thu, Feb 07, 2019 at 11:15:08PM +0000, NaN via Digitalmars-d
>> So to do the abs(), it stores to memory from XMM reg, loads into x87 FPU regs, does the abs with the old FPU instruction, then for some reason stores the result as a double, loads that back into an XMM, converts it back to single.
>
> Which compiler are you using?
>
> For performance / codegen quality issues, I highly recommend looking at the output of ldc or gdc, rather than dmd.
Or just open std.math to see the simple reason for the old FPU being used:
```
real fabs(real x) @safe pure nothrow @nogc { pragma(inline, true); return core.math.fabs(x); }
//FIXME
///ditto
double fabs(double x) @safe pure nothrow @nogc { return fabs(cast(real) x); }
//FIXME
///ditto
float fabs(float x) @safe pure nothrow @nogc { return fabs(cast(real) x); }
```
Just one of many functions still operating with `real` precision only.
February 08, 2019 Re: poor codegen for abs(), and div by literal?
Posted in reply to H. S. Teoh

On Thursday, 7 February 2019 at 23:35:36 UTC, H. S. Teoh wrote:
> On Thu, Feb 07, 2019 at 11:15:08PM +0000, NaN via Digitalmars-d wrote:
>> Given the following code...
> [...]
>> the compiler outputs the following for the function body, (excluding
>> prolog and epilogue code)...
> [...]
>> So to do the abs(), it stores to memory from XMM reg, loads into x87 FPU regs, does the abs with the old FPU instruction, then for some reason stores the result as a double, loads that back into an XMM, converts it back to single.
>
> Which compiler are you using?
>
> For performance / codegen quality issues, I highly recommend looking at the output of ldc or gdc, rather than dmd. It's well-known that dmd codegen tends to lag behind ldc/gdc as far as efficiency / optimization is concerned. These days, I don't even look at dmd output anymore when I'm looking for performance. IME, dmd consistently produces code that's about 20-30% slower than ldc or gdc produced code, sometimes even as high as 40%.
I use LDC primarily; it just wasn't inlining the fabs calls, so I figured I'd check what DMD was doing, and that was screwy in a different way.

I wasn't sure if it was something worth reporting, but it looks like it's a known issue from what kinke posted.
February 08, 2019 Re: poor codegen for abs(), and div by literal?
Posted in reply to kinke

On Friday, 8 February 2019 at 00:09:55 UTC, kinke wrote:
> On Thursday, 7 February 2019 at 23:35:36 UTC, H. S. Teoh wrote:
>> On Thu, Feb 07, 2019 at 11:15:08PM +0000, NaN via Digitalmars-d
>>> So to do the abs(), it stores to memory from XMM reg, loads into x87 FPU regs, does the abs with the old FPU instruction, then for some reason stores the result as a double, loads that back into an XMM, converts it back to single.
>>
>> Which compiler are you using?
>>
>> For performance / codegen quality issues, I highly recommend looking at the output of ldc or gdc, rather than dmd.
>
> Or just open std.math to see the simple reason for the old FPU being used:
I'm embarrassed to admit I did look at the source and didn't spot that they were all being upcast to real precision.
February 07, 2019 Re: poor codegen for abs(), and div by literal?
Posted in reply to kinke

On Fri, Feb 08, 2019 at 12:09:55AM +0000, kinke via Digitalmars-d wrote:
[...]
> Or just open std.math to see the simple reason for the old FPU being used:
>
> ```
> real fabs(real x) @safe pure nothrow @nogc { pragma(inline, true); return core.math.fabs(x); }
> //FIXME
> ///ditto
> double fabs(double x) @safe pure nothrow @nogc { return fabs(cast(real) x); }
> //FIXME
> ///ditto
> float fabs(float x) @safe pure nothrow @nogc { return fabs(cast(real) x); }
> ```
>
> Just one of many functions still operating with `real` precision only.

Ugh. Not this again. :-(

Didn't somebody clean up std.math recently to add double/float overloads? Or was that limited to only a few functions?

This really needs to be fixed sooner rather than later. It's an embarrassment to D for anyone who cares about floating-point performance.


T

--
This is not a sentence.
February 09, 2019 Re: poor codegen for abs(), and div by literal?
Posted in reply to NaN

On Thursday, 7 February 2019 at 23:15:08 UTC, NaN wrote:
> Given the following code...
>
> module ohreally;
>
> import std.math;
>
> float foo(float y, float x)
> {
>     float ax = fabs(x);
>     float ay = fabs(y);
>     return ax*ay/3.142f;
> }
>
> the compiler outputs the following for the function body, (excluding prolog and epilogue code)...
>
> movss -010h[RBP],XMM0
> movss -8[RBP],XMM1
> fld float ptr -010h[RBP]
> fabs
> fstp qword ptr -020h[RBP]
> movsd XMM0,-020h[RBP]
> cvtsd2ss XMM0,XMM0
> fld float ptr -8[RBP]
> fabs
> fstp qword ptr -020h[RBP]
> movsd XMM1,-020h[RBP]
> cvtsd2ss XMM2,XMM1
> mulss XMM0,XMM2
> movss XMM3,FLAT:.rodata[00h][RIP]
> divss XMM0,XMM3
>
> So to do the abs(), it stores to memory from XMM reg, loads into x87 FPU regs, does the abs with the old FPU instruction, then for some reason stores the result as a double, loads that back into an XMM, converts it back to single.
>
> And the div by 3.142f, is there a reason it can't be converted to a multiply? I know I can coax the multiply by doing *(1.0f/3.142f) instead, but I wondered if there's some reasoning in why it's not done automatically?
>
> Is any of this worth adding to the bug tracker?

Essentially the problem is that fabs() is always done with the FPU, so with DMD the values always trip between the SSE registers, the stack of temporaries, and the FPU registers. But fabs() doesn't have to be done in extended precision; it's not like the trigonometric operations, after all, it's just about a **single bit**...

On amd64, fabs (for single and double) could be done using SSE **only**. I don't know how compiler intrinsics work, but the SSE version is not a single instruction. It's either 3 (generate a mask + logical AND) or 2 (left shift by 1, right shift by 1 to clear the sign).

An SSE-only version would be more something like:

```
pcmpeqd xmm2, xmm2
psrld   xmm2, 01h
andps   xmm0, xmm2
andps   xmm1, xmm2
mulss   xmm0, xmm1
mulss   xmm0, dword ptr [<address of constant>]
ret
```

In iasm (note: sadly the constant cannot be set to a static immutable):

```
extern(C) float foo2(float y, float x, const float z = 1.0f / 3.142f)
{
    asm pure nothrow
    {
        naked;
        pcmpeqd XMM3, XMM3;
        psrld   XMM3, 1;
        andps   XMM0, XMM3;
        andps   XMM1, XMM3;
        mulss   XMM0, XMM1;
        mulss   XMM0, XMM2;
        ret;
    }
}
```

LDC2 does almost that, except that the logical AND is in a subprogram:

```
push rax
movss dword ptr [rsp+04h], xmm1
call 000000000045A020h
movss dword ptr [rsp], xmm0
movss xmm0, dword ptr [rsp+04h]
call 000000000045A020h
mulss xmm0, dword ptr [rsp]
mulss xmm0, dword ptr [<address of constant>]
pop rax
ret

000000000049DA20h:
andps xmm0, dqword ptr [<address of mask>]
ret
```

To come back to the bug of the "tripping values": it's known, and it can even happen when the FPU is not used [1].

[1] https://issues.dlang.org/show_bug.cgi?id=17965
February 09, 2019 Re: poor codegen for abs(), and div by literal?
Posted in reply to Basile B.

On Saturday, 9 February 2019 at 03:28:41 UTC, Basile B. wrote:
> On Thursday, 7 February 2019 at 23:15:08 UTC, NaN wrote:
>> Is any of this worth adding to the bug tracker?

I think so: https://issues.dlang.org/show_bug.cgi?id=19663
February 09, 2019 Re: poor codegen for abs(), and div by literal?
Posted in reply to Basile B.

On Saturday, 9 February 2019 at 03:28:41 UTC, Basile B. wrote:
> On Thursday, 7 February 2019 at 23:15:08 UTC, NaN wrote:
>> Given the following code...
>
> LDC2 does almost that, except that the logical AND is in a subprogram:
>
> push rax
> movss dword ptr [rsp+04h], xmm1
> call 000000000045A020h
> movss dword ptr [rsp], xmm0
> movss xmm0, dword ptr [rsp+04h]
> call 000000000045A020h
> mulss xmm0, dword ptr [rsp]
> mulss xmm0, dword ptr [<address of constant>]
> pop rax
> ret

What flags are you passing LDC? I can't get it to convert the division into a multiply by its inverse unless I specifically change /3.142f to *(1.0f/3.142f).

And FWIW I'm using...

```
float fabs(float x) // needed because math.fabs is not being inlined
{
    uint tmp = *(cast(int*)&x) & 0x7fffffff;
    float f = *(cast(float*)&tmp);
    return f;
}
```

which compiles down to a single "andps" instruction and is inlined if it's in the same module. I tried cross-module inlining as suggested in the LDC forum, but it caused my program to hang.
February 09, 2019 Re: poor codegen for abs(), and div by literal?
Posted in reply to NaN

On Saturday, 9 February 2019 at 15:14:47 UTC, NaN wrote:
> What flags are you passing LDC? I can't get it to convert the division into a multiply by its inverse unless I specifically change /3.142f to *(1.0f/3.142f).
Use `-ffast-math`. If you need more fine-grained control (e.g., `enable-unsafe-fp-math`), dare to invoke `ldc2 --help-hidden`.
Copyright © 1999-2021 by the D Language Foundation