poor codegen for abs(), and div by literal?
February 07, 2019
Given the following code...


module ohreally;

import std.math;

float foo(float y, float x)
{
    float ax = fabs(x);
    float ay = fabs(y);
    return ax*ay/3.142f;
}


the compiler outputs the following for the function body (excluding prologue and epilogue code)...

   movss   -010h[RBP],XMM0
   movss   -8[RBP],XMM1
   fld     float ptr -010h[RBP]
   fabs
   fstp    qword ptr -020h[RBP]
   movsd   XMM0,-020h[RBP]
   cvtsd2ss        XMM0,XMM0
   fld     float ptr -8[RBP]
   fabs
   fstp    qword ptr -020h[RBP]
   movsd   XMM1,-020h[RBP]

   cvtsd2ss        XMM2,XMM1
   mulss   XMM0,XMM2
   movss   XMM3,FLAT:.rodata[00h][RIP]
   divss   XMM0,XMM3

So to do the abs(), it stores to memory from XMM reg, loads into x87 FPU regs, does the abs with the old FPU instruction, then for some reason stores the result as a double, loads that back into an XMM, converts it back to single.

And the div by 3.142f, is there a reason it can't be converted to a multiply? I know I can coax the multiply by doing *(1.0f/3.142f) instead, but I wondered if there's some reasoning why it's not done automatically?
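For reference, a sketch of the coaxed version (foo_mul and inv are illustrative names, not from the original code):

float foo_mul(float y, float x)
{
    import std.math : fabs;
    enum float inv = 1.0f / 3.142f; // folded to a constant at compile time
    return fabs(x) * fabs(y) * inv; // divss becomes mulss
}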

Is any of this worth adding to the bug tracker?

February 07, 2019
On Thu, Feb 07, 2019 at 11:15:08PM +0000, NaN via Digitalmars-d wrote:
> Given the following code...
[...]
> the compiler outputs the following for the function body (excluding
> prologue and epilogue code)...
[...]
> So to do the abs(), it stores to memory from XMM reg, loads into x87 FPU regs, does the abs with the old FPU instruction, then for some reason stores the result as a double, loads that back into an XMM, converts it back to single.

Which compiler are you using?

For performance / codegen quality issues, I highly recommend looking at the output of ldc or gdc, rather than dmd.  It's well-known that dmd codegen tends to lag behind ldc/gdc as far as efficiency / optimization is concerned.  These days, I don't even look at dmd output anymore when I'm looking for performance.  IME, dmd consistently produces code that's about 20-30% slower than ldc or gdc produced code, sometimes even as high as 40%.


> And the div by 3.142f, is there a reason it can't be converted to a multiply?  I know I can coax the multiply by doing *(1.0f/3.142f) instead, but I wondered if there's some reasoning why it's not done automatically?
> 
> Is any of this worth adding to the bug tracker?

If this problem is specific to dmd, you can post a bug against dmd, I suppose, but I wouldn't hold my breath for dmd codegen to significantly improve in the near future.  Walter is far too overloaded with other language issues to do significant work on the optimizer at the moment.

OTOH, when it comes to floating-point operations, the optimizer's hands may be tied by IEEE 754-dictated semantics. There are corner cases where multiplying by the reciprocal rather than dividing produces different results, so the optimizer is not free to simply substitute one for the other, even if in this particular case it would work fine.  You may need to spell it out yourself if what you want is a multiply rather than a divide.
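As a quick illustration, a sketch with values of my own choosing (the two results may well coincide for many inputs; the point is that IEEE 754 doesn't guarantee they always do):

void main()
{
    import std.stdio : writefln;
    enum float inv = 1.0f / 3.142f; // the reciprocal is itself rounded once
    foreach (float x; [1.0f, 2.5f, 100.0f])
        writefln("%a vs %a", x / 3.142f, x * inv); // %a shows the exact bits
}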


T

-- 
IBM = I'll Buy Microsoft!
February 08, 2019
On Thursday, 7 February 2019 at 23:35:36 UTC, H. S. Teoh wrote:
> On Thu, Feb 07, 2019 at 11:15:08PM +0000, NaN via Digitalmars-d
>> So to do the abs(), it stores to memory from XMM reg, loads into x87 FPU regs, does the abs with the old FPU instruction, then for some reason stores the result as a double, loads that back into an XMM, converts it back to single.
>
> Which compiler are you using?
>
> For performance / codegen quality issues, I highly recommend looking at the output of ldc or gdc, rather than dmd.

Or just open std.math to see the simple reason for the old FPU being used:

```
real fabs(real x) @safe pure nothrow @nogc { pragma(inline, true); return core.math.fabs(x); }
//FIXME
///ditto
double fabs(double x) @safe pure nothrow @nogc { return fabs(cast(real) x); }
//FIXME
///ditto
float fabs(float x) @safe pure nothrow @nogc { return fabs(cast(real) x); }
```

Just one of many functions still operating with `real` precision only.
February 08, 2019
On Thursday, 7 February 2019 at 23:35:36 UTC, H. S. Teoh wrote:
> On Thu, Feb 07, 2019 at 11:15:08PM +0000, NaN via Digitalmars-d wrote:
>> Given the following code...
> [...]
>> the compiler outputs the following for the function body (excluding
>> prologue and epilogue code)...
> [...]
>> So to do the abs(), it stores to memory from XMM reg, loads into x87 FPU regs, does the abs with the old FPU instruction, then for some reason stores the result as a double, loads that back into an XMM, converts it back to single.
>
> Which compiler are you using?
>
> For performance / codegen quality issues, I highly recommend looking at the output of ldc or gdc, rather than dmd.  It's well-known that dmd codegen tends to lag behind ldc/gdc as far as efficiency / optimization is concerned.  These days, I don't even look at dmd output anymore when I'm looking for performance.  IME, dmd consistently produces code that's about 20-30% slower than ldc or gdc produced code, sometimes even as high as 40%.

I use LDC primarily; it just wasn't inlining the fabs calls, so I figured I'd check what DMD was doing, and that was screwy in a different way.

Wasn't sure if it was something worth reporting, but it looks like it's a known issue from what kinke posted.




February 08, 2019
On Friday, 8 February 2019 at 00:09:55 UTC, kinke wrote:
> On Thursday, 7 February 2019 at 23:35:36 UTC, H. S. Teoh wrote:
>> On Thu, Feb 07, 2019 at 11:15:08PM +0000, NaN via Digitalmars-d
>>> So to do the abs(), it stores to memory from XMM reg, loads into x87 FPU regs, does the abs with the old FPU instruction, then for some reason stores the result as a double, loads that back into an XMM, converts it back to single.
>>
>> Which compiler are you using?
>>
>> For performance / codegen quality issues, I highly recommend looking at the output of ldc or gdc, rather than dmd.
>
> Or just open std.math to see the simple reason for the old FPU being used:

I'm embarrassed to admit I did look at the source and didn't spot that they were all being upcast to real precision.
February 07, 2019
On Fri, Feb 08, 2019 at 12:09:55AM +0000, kinke via Digitalmars-d wrote: [...]
> Or just open std.math to see the simple reason for the old FPU being used:
> 
> ```
> real fabs(real x) @safe pure nothrow @nogc { pragma(inline, true);
> return core.math.fabs(x); }
> //FIXME
> ///ditto
> double fabs(double x) @safe pure nothrow @nogc { return fabs(cast(real) x);
> }
> //FIXME
> ///ditto
> float fabs(float x) @safe pure nothrow @nogc { return fabs(cast(real) x); }
> ```
> 
> Just one of many functions still operating with `real` precision only.

Ugh.  Not this again. :-(  Didn't somebody clean up std.math recently to add double/float overloads?  Or was that limited to only a few functions?

This really needs to be fixed sooner rather than later.  It's an embarrassment to D for anyone who cares about floating-point performance.


T

-- 
This is not a sentence.
February 09, 2019
On Thursday, 7 February 2019 at 23:15:08 UTC, NaN wrote:
> Given the following code...
>
>
> module ohreally;
>
> import std.math;
>
> float foo(float y, float x)
> {
>     float ax = fabs(x);
>     float ay = fabs(y);
>     return ax*ay/3.142f;
> }
>
>
> the compiler outputs the following for the function body (excluding prologue and epilogue code)...
>
>    movss   -010h[RBP],XMM0
>    movss   -8[RBP],XMM1
>    fld     float ptr -010h[RBP]
>    fabs
>    fstp    qword ptr -020h[RBP]
>    movsd   XMM0,-020h[RBP]
>    cvtsd2ss        XMM0,XMM0
>    fld     float ptr -8[RBP]
>    fabs
>    fstp    qword ptr -020h[RBP]
>    movsd   XMM1,-020h[RBP]
>
>    cvtsd2ss        XMM2,XMM1
>    mulss   XMM0,XMM2
>    movss   XMM3,FLAT:.rodata[00h][RIP]
>    divss   XMM0,XMM3
>
> So to do the abs(), it stores to memory from XMM reg, loads into x87 FPU regs, does the abs with the old FPU instruction, then for some reason stores the result as a double, loads that back into an XMM, converts it back to single.
>
> And the div by 3.142f, is there a reason it can't be converted to a multiply? I know I can coax the multiply by doing *(1.0f/3.142f) instead, but I wondered if there's some reasoning why it's not done automatically?
>
> Is any of this worth adding to the bug tracker?

Essentially the problem is that fabs() is always done on the FPU, so with DMD the values always trip between the SSE registers, the stack of temporaries, and the FPU registers. But fabs() doesn't have to be done in extended precision; it's not like the trig operations, after all, it's just about a **single bit**...

On amd64, fabs (for single and double) could be done using SSE **only**. I don't know how compiler intrinsics work, but the SSE version is not a single instruction: it's either 3 (generate a mask + logical AND) or 2 (left shift by 1, then right shift by 1, to clear the sign); the two-instruction variant is sketched after the mask version below.

An SSE-only version would look more like:

  pcmpeqd xmm2, xmm2
  psrld xmm2, 01h
  andps xmm0, xmm2
  andps xmm1, xmm2
  mulss xmm0, xmm1
  mulss xmm0, dword ptr [<address of constant>]
  ret
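And the two-instruction shift variant mentioned above would just be (a sketch; shifting the 32-bit lanes left then right zeroes the sign bits):

  pslld xmm0, 01h
  psrld xmm0, 01h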

in D inline asm (note: sadly, the constant cannot be a static immutable):

  extern(C) float foo2(float y, float x, const float z = 1.0f / 3.142f)
  {
    asm pure nothrow
    {
      naked;
      pcmpeqd XMM3, XMM3;
      psrld   XMM3, 1;
      andps   XMM0, XMM3;
      andps   XMM1, XMM3;
      mulss   XMM0, XMM1;
      mulss   XMM0, XMM2;
      ret;
    }
  }
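Calling it is then just ordinary D; with the extern(C) ABI the defaulted third argument arrives in XMM2 (a usage sketch):

  float r = foo2(2.0f, -3.0f); // |2| * |-3| * (1.0f / 3.142f)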

LDC2 does almost that, except that the logical AND is in a subroutine:

  push rax
  movss dword ptr [rsp+04h], xmm1
  call 000000000045A020h
  movss dword ptr [rsp], xmm0
  movss xmm0, dword ptr [rsp+04h]
  call 000000000045A020h
  mulss xmm0, dword ptr [rsp]
  mulss xmm0, dword ptr [<address of constant>]
  pop rax
  ret

000000000049DA20h:
  andps xmm0, dqword ptr [<address of mask>]
  ret

To come back to the bug of the values "tripping" between registers: it's known, and it can even happen when the FPU is not used [1]

[1] https://issues.dlang.org/show_bug.cgi?id=17965

February 09, 2019
On Saturday, 9 February 2019 at 03:28:41 UTC, Basile B. wrote:
> On Thursday, 7 February 2019 at 23:15:08 UTC, NaN wrote:
>> Is any of this worth adding to the bug tracker?

I think so: https://issues.dlang.org/show_bug.cgi?id=19663

February 09, 2019
On Saturday, 9 February 2019 at 03:28:41 UTC, Basile B. wrote:
> On Thursday, 7 February 2019 at 23:15:08 UTC, NaN wrote:
>> Given the following code...

> LDC2 does almost that, except that the logical AND is in a subroutine:
>
>   push rax
>   movss dword ptr [rsp+04h], xmm1
>   call 000000000045A020h
>   movss dword ptr [rsp], xmm0
>   movss xmm0, dword ptr [rsp+04h]
>   call 000000000045A020h
>   mulss xmm0, dword ptr [rsp]
>   mulss xmm0, dword ptr [<address of constant>]
>   pop rax
>   ret

What flags are you passing LDC? I can't get it to convert the division into a multiply by its inverse unless I specifically change /3.142f to *(1.0f/3.142f).

And FWIW, I'm using...

float fabs(float x) // needed because std.math.fabs is not being inlined
{
    // clear the IEEE 754 sign bit via the integer representation
    uint tmp = *(cast(uint*) &x) & 0x7fffffff;
    return *(cast(float*) &tmp);
}

That compiles down to a single "andps" instruction and is inlined if it's in the same module. I tried cross-module inlining as suggested on the LDC forum, but it caused my program to hang.
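One thing that might be worth trying for the cross-module case (a sketch, untested here): pragma(inline, true) inside the function asks the compiler to make the body available for inlining even across module boundaries, though whether it actually helps depends on the compiler and flags:

float fabs(float x)
{
    pragma(inline, true); // request inlining, also across modules
    uint tmp = *(cast(uint*) &x) & 0x7fffffff;
    return *(cast(float*) &tmp);
}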


February 09, 2019
On Saturday, 9 February 2019 at 15:14:47 UTC, NaN wrote:
> What flags are you passing LDC? I can't get it to convert the division into a multiply by its inverse unless I specifically change /3.142f to *(1.0f/3.142f).

Use `-ffast-math`. If you need more fine-grained control (e.g., `enable-unsafe-fp-math`), dare to invoke `ldc2 --help-hidden`.
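For example (a sketch; the optimization level is illustrative, and -output-s just dumps the generated assembly for inspection):

  ldc2 -O3 -ffast-math -output-s ohreally.d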