std.math performance (SSE vs. real) (page 2) - D Programming Language Discussion Forum


June 27, 2014

Re: std.math performance (SSE vs. real)

Posted by Manu
in reply to David Nadlinger

On 27 June 2014 11:31, David Nadlinger via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> Hi all,
>
> right now, the use of std.math over core.stdc.math can cause a huge performance problem in typical floating point graphics code. An instance of this has recently been discussed here in the "Perlin noise benchmark speed" thread [1], where even LDC, which already beat DMD by a factor of two, generated code more than twice as slow as that by Clang and GCC. Here, the use of floor() causes trouble. [2]
>
> Besides the somewhat slow pure D implementations in std.math, the biggest problem is the fact that std.math almost exclusively uses reals in its API. When working with single- or double-precision floating point numbers, this is not only more data to shuffle around than necessary, but on x86_64 requires the caller to transfer the arguments from the SSE registers onto the x87 stack and then convert the result back again. Needless to say, this is a serious performance hazard. In fact, this accounts for a 1.9x slowdown in the above benchmark with LDC.
>
> Because of this, I propose to add float and double overloads (at the very
> least the double ones) for all of the commonly used functions in std.math.
> This is unlikely to break much code, but:
>  a) Somebody could rely on the fact that the calls effectively widen the
> calculation to 80 bits on x86 when using type deduction.
>  b) Additional overloads make e.g. "&floor" ambiguous without context, of
> course.
>
> What do you think?
>
> Cheers,
> David
>
>
> [1] http://forum.dlang.org/thread/lo19l7$n2a$1@digitalmars.com
> [2] Fun fact: As the program happens to only deal with positive numbers, the
> author could have just inserted an int-to-float cast, sidestepping the issue
> altogether. All the other language implementations have the floor() call
> too, though, so it doesn't matter for this discussion.

Totally agree.
Maintaining commitment to deprecated hardware which could be removed
from the silicon at any time is a bit of a problem looking forward.
Regardless of the decision about whether overloads are created, at
the very least, I'd suggest x64 should define real as double, since the
x87 is deprecated, and the x64 ABI uses the SSE unit. It makes no sense at
all to use real under any general circumstances in x64 builds.

And aside from that, if you *think* you need real for precision, the
truth is, you probably have bigger problems.
Double already has massive precision. I find it's extremely rare to
have precision problems even with float under most normal usage
circumstances, assuming you are conscious of the relative magnitudes
of your terms.
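
Footnote [2] of the quoted post points out that, for non-negative inputs, a truncating cast through int gives the same result as floor(). A minimal sketch of that idea in D, assuming the values also fit in an int; the helper name `floorNonNegative` is made up for illustration and is not part of std.math:

```d
import std.math : floor;
import std.stdio : writeln;

// Hypothetical helper: only valid for 0 <= x <= int.max.
// Conversion to int truncates toward zero, which matches floor() in that range.
double floorNonNegative(double x)
{
    assert(x >= 0 && x <= int.max);
    return cast(double) cast(int) x;
}

void main()
{
    foreach (x; [0.0, 0.5, 3.7, 41.999])
    {
        // Check each sample against the library floor().
        assert(floorNonNegative(x) == floor(x));
        writeln(x, " -> ", floorNonNegative(x));
    }
}
```

For negative inputs the two differ (truncation rounds toward zero, floor rounds down), so this is not a general replacement.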

June 27, 2014

Re: std.math performance (SSE vs. real)

Posted by John Colvin
in reply to Manu

I think real should stay how it is, as the largest hardware-supported floating point type on a system. What needs to change is dmd and phobos' default usage of real. Double should be the standard. People should be able to reach for real if they really need it, but normal D code should target the sweet spot that is double*.

I understand why the current situation exists. In 2000 x87 was the standard and the 80bit precision came for free.

*The number of algorithms that are both numerically stable/correct and benefit significantly from > 64bit doubles is very small. The same can't be said for 32bit floats.
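
To make the size of the gap between the three types concrete, here is a small sketch that prints the significand width and machine epsilon of each, using the built-in `mant_dig` and `epsilon` properties. The helper `significandBits` is a made-up name for illustration, and the values reported for real depend on the target (64 significand bits where real maps to the x87 format, possibly the same as double elsewhere):

```d
import std.stdio : writefln;

// Hypothetical helper: significand width in bits of a floating point type T.
int significandBits(T)()
{
    return T.mant_dig;
}

void main()
{
    writefln("float:  %s significand bits, epsilon = %s",
             significandBits!float(), float.epsilon);
    writefln("double: %s significand bits, epsilon = %s",
             significandBits!double(), double.epsilon);
    // Platform dependent: real is only guaranteed to be at least as wide as double.
    writefln("real:   %s significand bits, epsilon = %s",
             significandBits!real(), real.epsilon);
}
```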

June 27, 2014

Re: std.math performance (SSE vs. real)

Posted by Remo
in reply to John Colvin

Totally agree!
Please add float and double overloads and make double the default.
Sometimes float is enough, but most of the time double should be used.

If someone needs more precision than double can provide, then 80 bits will probably not be enough anyway.

IMHO intrinsics should be used by default where possible.

June 27, 2014

Re: std.math performance (SSE vs. real)

Posted by Russel Winder
in reply to John Colvin

On Fri, 2014-06-27 at 11:10 +0000, John Colvin via Digitalmars-d wrote: […]
> I understand why the current situation exists. In 2000 x87 was the standard and the 80bit precision came for free.

Real programmers have been using 128-bit floating point for decades. All this namby-pamby 80-bit stuff is just an aberration and should never have happened.

[…]


June 27, 2014

Re: std.math performance (SSE vs. real)

Posted by Iain Buclaw
in reply to David Nadlinger

On 27 June 2014 11:47, David Nadlinger via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On Friday, 27 June 2014 at 09:37:54 UTC, hane wrote:
>>
>> On Friday, 27 June 2014 at 06:48:44 UTC, Iain Buclaw via Digitalmars-d wrote:
>>>
>>> Can you test with this?
>>>
>>> https://github.com/D-Programming-Language/phobos/pull/2274
>>>
>>> Float and Double implementations of floor/ceil are trivial and I can add later.
>>
>>
>> Nice! I tested with the Perlin noise benchmark, and it got faster(in my
>> environment, 1.030s -> 0.848s).
>> But floor still consumes almost half of the execution time.
>
>
> Wait, so DMD and GDC did actually emit a memcpy/… here? LDC doesn't, and the change didn't have much of an impact on performance.
>

Yes, IIRC _d_arraycopy to be exact (so we lose doubly so!)


> What _does_ have a significant impact, however, is that the whole of floor()
> for doubles can be optimized down to
>     roundsd <…>,<…>,0x1
> when targeting SSE 4.1, or
>     vroundsd <…>,<…>,<…>,0x1
> when targeting AVX.
>
> This is why std.math will need to build on top of compiler-recognizable primitives. Iain, Don, how do you think we should handle this?

My opinion is that we should never have pushed a variable-sized type as the baseline for all floating point computations in the first place.

But as we can't backtrack now, overloads will just have to do.  I would welcome a DIP to add new core.math intrinsics that could be proven to be useful for the sake of maintainability (and portability).

Regards
Iain
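
As a rough illustration of the overload approach, here is a sketch with float and double overloads that forward to the C library functions from core.stdc.math, so a double argument never has to be widened to real. The name `fastFloor` is hypothetical and this is not the Phobos implementation; whether the call is lowered to a single roundsd depends on the compiler and target flags.

```d
import core.stdc.math : cFloor = floor, cFloorF = floorf;
import std.stdio : writeln;

// Hypothetical overloads for illustration only.
double fastFloor(double x) { return cFloor(x); }
float  fastFloor(float x)  { return cFloorF(x); }

void main()
{
    double d = -1.5;
    float  f = 2.75f;

    // Overload resolution picks the overload matching the argument type.
    static assert(is(typeof(fastFloor(d)) == double));
    static assert(is(typeof(fastFloor(f)) == float));
    writeln(fastFloor(d));
    writeln(fastFloor(f));

    // With several overloads, the address of fastFloor needs a target type
    // to select one of them (point b in the original proposal).
    double function(double) fp = &fastFloor;
    writeln(fp(3.9));
}
```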

June 27, 2014

Re: std.math performance (SSE vs. real)

Posted by dennis luehring
in reply to Russel Winder

On 27.06.2014 14:20, Russel Winder via Digitalmars-d wrote:
> On Fri, 2014-06-27 at 11:10 +0000, John Colvin via Digitalmars-d wrote:
> […]
>> I understand why the current situation exists. In 2000 x87 was
>> the standard and the 80bit precision came for free.
>
> Real programmers have been using 128-bit floating point for decades. All
> this namby-pamby 80-bit stuff is just an aberration and should never
> have happened.

What consumer hardware and compilers support 128-bit floating point?

June 27, 2014

Re: std.math performance (SSE vs. real)

Posted by John Colvin
in reply to dennis luehring

On Friday, 27 June 2014 at 13:04:31 UTC, dennis luehring wrote:
> On 27.06.2014 14:20, Russel Winder via Digitalmars-d wrote:
>> On Fri, 2014-06-27 at 11:10 +0000, John Colvin via Digitalmars-d wrote:
>> […]
>>> I understand why the current situation exists. In 2000 x87 was
>>> the standard and the 80bit precision came for free.
>>
>> Real programmers have been using 128-bit floating point for decades. All
>> this namby-pamby 80-bit stuff is just an aberration and should never
>> have happened.
>
> what consumer hardware and compiler supports 128-bit floating points?

I think he was joking :)

No consumer hardware supports IEEE binary128 as far as I know. Wikipedia suggests that SPARC used to have some support.

June 27, 2014

Re: std.math performance (SSE vs. real)

Posted by Element 126
in reply to dennis luehring

On 06/27/2014 03:04 PM, dennis luehring wrote:
> On 27.06.2014 14:20, Russel Winder via Digitalmars-d wrote:
>> On Fri, 2014-06-27 at 11:10 +0000, John Colvin via Digitalmars-d wrote:
>> […]
>>> I understand why the current situation exists. In 2000 x87 was
>>> the standard and the 80bit precision came for free.
>>
>> Real programmers have been using 128-bit floating point for decades. All
>> this namby-pamby 80-bit stuff is just an aberration and should never
>> have happened.
>
> what consumer hardware and compiler supports 128-bit floating points?
>

I noticed that std.math mentions partial support for big endian non-IEEE doubledouble. I first thought that it was a software implementation like the QD library [1][2][3], but I could not find how to use it on x86_64.
It looks like it is only available for the PowerPC architecture.
Does anyone know about it?

[1] http://crd-legacy.lbl.gov/~dhbailey/mpdist/
[2] http://web.mit.edu/tabbott/Public/quaddouble-debian/qd-2.3.4-old/docs/qd.pdf
[3] www.davidhbailey.com/dhbpapers/quad-double.pdf

June 27, 2014

Re: std.math performance (SSE vs. real)

Posted by Iain Buclaw
in reply to Element 126

On 27 June 2014 14:24, Element 126 via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On 06/27/2014 03:04 PM, dennis luehring wrote:
>>
>> On 27.06.2014 14:20, Russel Winder via Digitalmars-d wrote:
>>>
>>> On Fri, 2014-06-27 at 11:10 +0000, John Colvin via Digitalmars-d wrote: […]
>>>>
>>>> I understand why the current situation exists. In 2000 x87 was the standard and the 80bit precision came for free.
>>>
>>>
>>> Real programmers have been using 128-bit floating point for decades. All this namby-pamby 80-bit stuff is just an aberration and should never have happened.
>>
>>
>> what consumer hardware and compiler supports 128-bit floating points?
>>
>
> I noticed that std.math mentions partial support for big endian non-IEEE
> doubledouble. I first thought that it was a software implementation like the
> QD library [1][2][3], but I could not find how to use it on x86_64.
> It looks like it is only available for the PowerPC architecture.
> Does anyone know about it?
>

We only support native types in std.math.  And partial support is saying more than what there actually is. :-)

June 27, 2014

Re: std.math performance (SSE vs. real)

Posted by Iain Buclaw

On 27 June 2014 07:48, Iain Buclaw <ibuclaw@gdcproject.org> wrote:
> On 27 June 2014 07:14, Iain Buclaw <ibuclaw@gdcproject.org> wrote:
>> On 27 June 2014 02:31, David Nadlinger via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
>>> Hi all,
>>>
>>> right now, the use of std.math over core.stdc.math can cause a huge performance problem in typical floating point graphics code. An instance of this has recently been discussed here in the "Perlin noise benchmark speed" thread [1], where even LDC, which already beat DMD by a factor of two, generated code more than twice as slow as that by Clang and GCC. Here, the use of floor() causes trouble. [2]
>>>
>>> Besides the somewhat slow pure D implementations in std.math, the biggest problem is the fact that std.math almost exclusively uses reals in its API. When working with single- or double-precision floating point numbers, this is not only more data to shuffle around than necessary, but on x86_64 requires the caller to transfer the arguments from the SSE registers onto the x87 stack and then convert the result back again. Needless to say, this is a serious performance hazard. In fact, this accounts for an 1.9x slowdown in the above benchmark with LDC.
>>>
>>> Because of this, I propose to add float and double overloads (at the very
>>> least the double ones) for all of the commonly used functions in std.math.
>>> This is unlikely to break much code, but:
>>>  a) Somebody could rely on the fact that the calls effectively widen the
>>> calculation to 80 bits on x86 when using type deduction.
>>>  b) Additional overloads make e.g. "&floor" ambiguous without context, of
>>> course.
>>>
>>> What do you think?
>>>
>>> Cheers,
>>> David
>>>
>>
>> This is the reason why floor is slow, it has an array copy operation.
>>
>> ---
>>   auto vu = *cast(ushort[real.sizeof/2]*)(&x);
>> ---
>>
>> I didn't like it at the time I wrote, but at least it prevented the compiler (gdc) from removing all bit operations that followed.
>>
>> If there is an alternative to the above, then I'd imagine that would speed up floor by tenfold.
>>
>
> Can you test with this?
>
> https://github.com/D-Programming-Language/phobos/pull/2274
>
> Float and Double implementations of floor/ceil are trivial and I can add later.


Added float/double implementations.
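
For readers wondering about the array copy mentioned in the quoted post: dereferencing a pointer to a static array yields the array by value, so the result is a copy of the bits. The sketch below contrasts that with indexing through the pointer directly. It uses double rather than real to stay portable, the function names are invented for illustration, and it is not meant to reflect what the linked pull request actually does; whether a given compiler emits a real copy for the first form is an assumption based on the discussion above.

```d
import std.stdio : writefln;

// Index of the 16-bit word that holds the sign and the top exponent bits.
version (LittleEndian)
    enum size_t topWord = double.sizeof / 2 - 1;
else
    enum size_t topWord = 0;

// Copying view: `vu` is a static array holding a copy of the bits of x.
ushort topByCopy(double x)
{
    auto vu = *cast(ushort[double.sizeof / 2]*)(&x);
    return vu[topWord];
}

// Non-copying view: keep the pointer and index through it.
ushort topByPointer(double x)
{
    auto p = cast(const(ushort)*)(&x);
    return p[topWord];
}

void main()
{
    foreach (x; [1.0, -2.0, 0.5])
    {
        // Both views must read the same bits.
        assert(topByCopy(x) == topByPointer(x));
        writefln("%s -> 0x%04X", x, topByPointer(x));
    }
}
```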
