std.math performance (SSE vs. real) (page 15) - D Programming Language Discussion Forum

July 01, 2014

Re: std.math performance (SSE vs. real)

Posted by Ola Fosheim Grøstad
in reply to Walter Bright

On Tuesday, 1 July 2014 at 16:58:55 UTC, Walter Bright wrote:
> Click on "compiling for strict IEEE conformance"

It also states that compliance affects performance.

>> not IEEE754 compliant. VFP is almost compliant, but does not support subnormal
>> numbers and flushes them to zero. Which can be a disaster…
>
> It wouldn't be any different if the D spec says "floating point is, ya know, whatevah". You can't fix stuff like this in the spec.

Well, the difference is that you can make VFP mostly compliant, but that means you trap subnormal numbers and fix them up in software, which affects performance.

If the specs require IEEE754 compliance then you default to software emulation and have to turn it off through compiler switches.

Here is another example: the Parallella coprocessor from Adapteva:

«The CPU is compatible with the IEEE-754 single-precision format, with the following exceptions:
- No support for inexact flags.
- NAN inputs generate an invalid exception and return a quiet NAN. When one or both of the inputs are NANs, the sign bit of the operation is set as an XOR of the signs of the input sign bits.
- Denormal operands are flushed to zero when input to a computation unit and do not generate an underflow exception. Any denormal or underflow result from an arithmetic operation is flushed to zero and an underflow exception is generated.
- Round-to-±infinity is not supported.»
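
Whether a given target really flushes denormals to zero, as in the exceptions quoted above, can be probed at run time. The following is only a minimal sketch: it assumes `std.math.isSubnormal` from Phobos and the default round-to-nearest mode, and the helper name `survivesHalving` is made up for illustration. A compiler may constant-fold the division, so the argument should come from a run-time value if it is the hardware behaviour you care about.

```d
import std.math : isSubnormal;

/// Hypothetical helper (not part of Phobos): halves `x` and reports
/// whether the result is a nonzero subnormal, i.e. gradual underflow
/// happened instead of a flush to zero.
bool survivesHalving(float x)
{
    float half = x / 2;
    return half != 0 && isSubnormal(half);
}
```

On a target with gradual underflow, `survivesHalving(float.min_normal)` should be true; on a flush-to-zero target like the one quoted above it would presumably be false.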

> Besides, Java and Javascript, for example, both require IEEE conformance.

But they aren't system level programming languages… and we probably cannot expect future many-core processors or transputer-like processors to waste die space in order to conform.

So sure, you can specify the IEEE-754 spec and just make D run at full rate on typical CISC CPUs, but the alternative is to constrain the promise of compliance to what is typical for efficient CPUs, and then have a "-strict" flag for the few applications that are scientific in nature. That would be more in line with system level programming.

July 02, 2014

Re: std.math performance (SSE vs. real)

Posted by Iain Buclaw
in reply to Ola Fosheim Grøstad

On 1 July 2014 20:36, via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On Tuesday, 1 July 2014 at 16:58:55 UTC, Walter Bright wrote:
>>
>> Click on "compiling for strict IEEE conformance"
>
>
> It also states that compliance affects performance.
>
>
>>> not IEEE754 compliant. VFP is almost compliant, but does not support
>>> subnormal
>>> numbers and flushes them to zero. Which can be a disaster…
>>
>>
>> It wouldn't be any different if the D spec says "floating point is, ya know, whatevah". You can't fix stuff like this in the spec.
>
>
> Well, the difference is that you can make VFP mostly compliant, but that means you trap subnormal numbers and fix them up in software, which affects performance.
>
> If the specs require IEEE754 compliance then you default to software emulation and have to turn it off through compiler switches.
>
> Here is another example: the parallella Coprocessor from Adapteva:
>
> «The CPU is compatible with the IEEE-754 single-precision format, with the
> following
> exceptions:
> - No support for inexact flags.
> - NAN inputs generate an invalid exception and return a quiet NAN. When one
> or both of the
> inputs are NANs, the sign bit of the operation is set as an XOR of the signs
> of the input sign
> bits.
> - Denormal operands are flushed to zero when input to a computation unit and
> do not generate
> an underflow exception. Any denormal or underflow result from an arithmetic
> operation is
> flushed to zero and an underflow exception is generated.
> - Round-to-±infinity is not supported.»
>

The crucial thing here is that the layouts of the float/double types are IEEE 754 compatible. :)

These behaviours you describe only affect the FP control functions in std.math, which are the only thing platform specific, and can only be written in inline assembly anyway...

Regards
Iain

July 02, 2014

Re: std.math performance (SSE vs. real)

Posted by Don
in reply to Walter Bright

On Tuesday, 1 July 2014 at 17:00:30 UTC, Walter Bright wrote:
> On 7/1/2014 3:26 AM, Don wrote:
>> Yes, it's complicated. The interesting thing is that there are no 128 bit
>> registers. The temporaries exist only while the FMA operation is in progress.
>> You cannot even preserve them between consecutive FMA operations.
>>
>> An important consequence is that allowing intermediate calculations to be
>> performed at higher precision than the operands, is crucial, and applies outside
>> of x86. This is something we've got right.
>>
>> But it's not possible to say that "the intermediate calculations are done at the
>> precision of 'real'". This is the semantics which I think we currently have
>> wrong. Our model is too simplistic.
>>
>> On modern x86, calculations on float operands may have intermediate calculations
>> done at only 32 bits (if using straight SSE), 80 bits (if using x87), or 64 bits
>> (if using float FMA). And for double operands, they may be 64 bits, 80 bits, or
>> 128 bits.
>> Yet, in the FMA case, non-FMA operations will be performed at lower precision.
>> It's entirely possible for all three intermediate precisions to be active at the
>> same time!
>>
>> I'm not sure that we need to change anything WRT code generation. But I think
>> our style recommendations aren't quite right. And we have at least one missing
>> primitive operation (discard all excess precision).
>
> What do you recommend?

It needs some thought. But some things are clear.

Definitely, discarding excess precision is a crucial operation. C and C++ tried to do it implicitly with "sequence points", but that kills optimisation possibilities so much that compilers don't respect it. I think it's actually quite similar to write barriers in multithreaded programming. C got it wrong, but we're currently in an even worse situation because it doesn't necessarily happen at all.

We need a builtin operation -- and not in std.math, this is as crucial as addition, and it's purely a signal to the optimiser. It's very similar to a casting operation. I wonder if we can do it as an attribute?  .exact_float, .restrict_float, .force_float, .spill_float or something similar?

With D's current floating point semantics, it's actually impossible to write correct floating-point code. Everything that works right now, is technically only working by accident.

But if we get this right, we can have very nice semantics for when things like FMA are allowed to happen -- essentially the optimiser would have free rein between these explicit discard_excess_precision sequence points.

After that, I'm a bit less sure. It does seem to me that we're trying to make 'real' do double-duty as meaning both "x87 80 bit floating-point number" and also as something like a storage class that is specific to double: "compiler, don't discard excess precision". Which are both useful concepts, but aren't identical. The two concepts did coincide on x86 32-bit, but they're different on x86-64. I think we need to distinguish the two.

Ideally, I think we'd have a __real80 type. On x86 32 bit this would be the same as 'real', while on x86-64 __real80 would be available but probably 'real' would alias to double. But I'm a lot less certain about this.
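
How wide 'real' actually is on a given target can be read off the built-in type properties. A small sketch using only `.mant_dig` and `std.stdio.writefln`; the function name `printSignificandWidths` is just an illustrative choice:

```d
import std.stdio : writefln;

/// Prints the significand width, in bits, of each built-in
/// floating-point type on the current target.
void printSignificandWidths()
{
    writefln("float:  %s bits", float.mant_dig);   // 24 for IEEE 754 binary32
    writefln("double: %s bits", double.mant_dig);  // 53 for IEEE 754 binary64
    writefln("real:   %s bits", real.mant_dig);    // target-dependent, e.g. 64 for x87
}
```

Where 'real' maps to the x87 80-bit format the last line reports 64; on targets where 'real' is the same as double it reports 53.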

July 02, 2014

Re: std.math performance (SSE vs. real)

Posted by Iain Buclaw
in reply to Don

On 2 July 2014 09:53, Don via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On Tuesday, 1 July 2014 at 17:00:30 UTC, Walter Bright wrote:
>>
>> On 7/1/2014 3:26 AM, Don wrote:
>>>
>>> Yes, it's complicated. The interesting thing is that there are no 128 bit
>>> registers. The temporaries exist only while the FMA operation is in
>>> progress.
>>> You cannot even preserve them between consecutive FMA operations.
>>>
>>> An important consequence is that allowing intermediate calculations to be
>>> performed at higher precision than the operands, is crucial, and applies
>>> outside
>>> of x86. This is something we've got right.
>>>
>>> But it's not possible to say that "the intermediate calculations are done
>>> at the
>>> precision of 'real'". This is the semantics which I think we currently
>>> have
>>> wrong. Our model is too simplistic.
>>>
>>> On modern x86, calculations on float operands may have intermediate
>>> calculations
>>> done at only 32 bits (if using straight SSE), 80 bits (if using x87), or
>>> 64 bits
>>> (if using float FMA). And for double operands, they may be 64 bits, 80
>>> bits, or
>>> 128 bits.
>>> Yet, in the FMA case, non-FMA operations will be performed at lower
>>> precision.
>>> It's entirely possible for all three intermediate precisions to be active
>>> at the
>>> same time!
>>>
>>> I'm not sure that we need to change anything WRT code generation. But I
>>> think
>>> our style recommendations aren't quite right. And we have at least one
>>> missing
>>> primitive operation (discard all excess precision).
>>
>>
>> What do you recommend?
>
>
> It needs some thought. But some things are clear.
>
> Definitely, discarding excess precision is a crucial operation. C and C++ tried to do it implicitly with "sequence points", but that kills optimisation possibilities so much that compilers don't respect it. I think it's actually quite similar to write barriers in multithreaded programming. C got it wrong, but we're currently in an even worse situation because it doesn't necessarily happen at all.
>
> We need a builtin operation -- and not in std.math, this is as crucial as addition, and it's purely a signal to the optimiser. It's very similar to a casting operation. I wonder if we can do it as an attribute?  .exact_float, .restrict_float, .force_float, .spill_float or something similar?
>
> With D's current floating point semantics, it's actually impossible to write correct floating-point code. Everything that works right now, is technically only working by accident.
>
> But if we get this right, we can have very nice semantics for when things like FMA are allowed to happen -- essentially the optimiser would have free rein between these explicit discard_excess_precision sequence points.
>

Fixing this is the goal I assume. :)

---
import std.stdio;

void test(double x, double y)
{
  double y2 = x + 1.0;
  if (y != y2) writeln("error");   // Prints 'error' under -O2
}

void main()
{
  immutable double x = .012;  // Remove 'immutable' and it works.
  double y = x + 1.0;

  test(x, y);
}
---


>
> After that, I'm a bit less sure. It does seem to me that we're trying to make 'real' do double-duty as meaning both "x87 80 bit floating-point number" and also as something like a storage class that is specific to double: "compiler, don't discard excess precision". Which are both useful concepts, but aren't identical. The two concepts did coincide on x86 32-bit, but they're different on x86-64. I think we need to distinguish the two.
>
> Ideally, I think we'd have a __real80 type. On x86 32 bit this would be the same as 'real', while on x86-64 __real80 would be available but probably 'real' would alias to double. But I'm a lot less certain about this.

There are flags for that in gdc:

-mlong-double-64
-mlong-double-80
-mlong-double-128

July 02, 2014

Re: std.math performance (SSE vs. real)

Posted by Ola Fosheim Grøstad
in reply to Iain Buclaw

On Wednesday, 2 July 2014 at 08:52:25 UTC, Iain Buclaw via Digitalmars-d wrote:
> The crucial thing here is that the layouts of the float/double types are
> IEEE 754 compatible. :)

That's nice of course, if you import/export, but hardly the only crucial thing.

Implied correctness and warnings when assumptions break are also important…

> These behaviours you describe only affect the FP control functions in
> std.math, which are the only thing platform specific, and can only be
> written in inline assembly anyway...

It affects the backend.
It affects vectorization. (NEON is not IEEE754 AFAIK)
It affects what a conforming D compiler is allowed to do.
It affects versioning.

E.g. you can have flags such as IEEE754_STRICT or IEEE754_HAS_NAN etc. and use versioning that detects the wrong compiler mode.
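
The versioning idea could be sketched with D's `version` conditions. This is an illustration only: IEEE754_STRICT is a hypothetical, user-defined identifier (with dmd it would be set with `-version=IEEE754_STRICT`), not something the compiler predefines, and the names `strictFloatingPoint` and `requireStrict` are made up here.

```d
// IEEE754_STRICT is a hypothetical user-defined version identifier,
// not a predefined one.
version (IEEE754_STRICT)
    enum bool strictFloatingPoint = true;
else
    enum bool strictFloatingPoint = false;

/// Hypothetical guard: code that depends on strict behaviour
/// instantiates this template, so building in the wrong mode fails
/// at compile time instead of silently misbehaving.
void requireStrict()()
{
    static assert(strictFloatingPoint,
                  "this code needs -version=IEEE754_STRICT");
}
```

Because `requireStrict` is a template, the `static assert` only fires in code that actually calls it, so modules that do not care about strictness still build in either mode.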

July 02, 2014

Re: std.math performance (SSE vs. real)

Posted by Wanderer
in reply to Ola Fosheim Grøstad

On Tuesday, 1 July 2014 at 19:36:26 UTC, Ola Fosheim Grøstad wrote:
>> Besides, Java and Javascript, for example, both require IEEE conformance.
>
> But they aren't system level programming languages… and we probably cannot expect future many-core processors or transputer-like processors to waste die space in order to conform.

So it's okay for "system level programming languages" to produce numeric garbage when it comes to floating points? ;-)

Sorry, couldn't resist. Some people keep claiming that D is a "system level programming language" despite there being not a single OS and not a single hardware driver written in D yet. Which sorta puts D and Java into the same boat. Besides - gasp! - Java *is* a system level programming language after all. There are platforms/devices which support Java bytecode natively. For system programming, it's extra important that the same algorithm produces exactly the same results in different environments. And speed is not always the most important factor.

July 02, 2014

Re: std.math performance (SSE vs. real)

Posted by Ola Fosheim Grøstad
in reply to Wanderer

On Wednesday, 2 July 2014 at 12:16:18 UTC, Wanderer wrote:
> So it's okay for "system level programming languages" to produce numeric garbage when it comes to floating points? ;-)

System level programming languages provide direct easy access to the underlying hardware foundation.

> Sorry, couldn't resist. Some people keep claiming that D is a "system level programming language" despite there being not a single OS and not a single hardware driver written in D yet.

D is not even production ready, so why should there be? Who in their right mind would use a language in limbo for building a serious operating system or do embedded work? You need language stability and compiler maturity first.

July 02, 2014

Re: std.math performance (SSE vs. real)

Posted by Iain Buclaw
in reply to Ola Fosheim Grøstad

On 2 July 2014 12:42, via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On Wednesday, 2 July 2014 at 08:52:25 UTC, Iain Buclaw via Digitalmars-d wrote:
>>
>> The crucial thing here is that the layouts of the float/double types are IEEE 754 compatible. :)
>
>
> That's nice of course, if you import/export, but hardly the only crucial thing.
>
> Implied correctness and warnings when assumptions break are also important…
>
>
>> These behaviours you describe only affect the FP control functions in std.math, which are the only thing platform specific, and can only be written in inline assembly anyway...
>
>
> It affects the backend.

Only matters if you have to implement it in your backend

> It affects vectorization. (NEON is not IEEE754 AFAIK)

Vectors are treated differently from floats

> It affects what a conforming D compiler is allowed to do.

That depends on what you mean by 'conforming'.

> It affects versioning.
>
> E.g. you can have flags such as IEEE754_STRICT or IEEE754_HAS_NAN etc. and use versioning that detects the wrong compiler mode.

Affects only library maintainers, but I haven't looked too much into what gcc offers as a platform regarding this.

The ARM market is terrible, and it will certainly be the case that we *can't* have a one size fits all solution.  But in the standard libraries we can certainly keep strictly in line with the most conforming chips, so if you wish to support X you may do so in a platform-specific fork.

This is certainly not unusual for druntime (eg: minilibd)

Regards
Iain

July 02, 2014

Re: std.math performance (SSE vs. real)

Posted by Ola Fosheim Grøstad
in reply to Iain Buclaw

On Wednesday, 2 July 2014 at 16:03:47 UTC, Iain Buclaw via Digitalmars-d wrote:
>
> Only matters if you have to implement it in your backend

You have to implement it in the backend if D requires strict IEEE754 conformance?

> Vectors are treated differently from floats

How can you then let the compiler vectorize? You can't. Meaning, your code will run very slowly.

>> It affects versioning.
>>
>> E.g. you can have flags such as IEEE754_STRICT or IEEE754_HAS_NAN etc. and use
>> versioning that detects the wrong compiler mode.
>
> Affects only library maintainers, but I haven't looked too much into
> what gcc offers as a platform regarding this.

I don't agree. If your code produces denormal numbers then it matters a lot whether they are forced to zero or not. Just think about division-by-zero traps. Why would this only be a library issue?

If you write portable code you shouldn't have to second guess what the compiler does. It either guarantees that denormal numbers are not flushed to zero, or it does not. If it does not then D does not require IEEE754 and you will have to think about this when you write your code.

> The ARM market is terrible, and it will certainly be the case that we
> *can't* have a one size fits all solution.  But in the standard
> libraries we can certainly keep strictly in line with the most
> conforming chips, so if you wish to support X you may do so in a
> platform-specific fork.

I don't really understand the reasoning here. Is D Intel x86 specific? Meaning, it is a system level programming language on x86 only and something arbitrary on other platforms?

Either D requires IEEE754 strict mode, or it does not. It matters what the default is.

July 02, 2014

Re: std.math performance (SSE vs. real)

Posted by Walter Bright
in reply to Don

On 7/2/2014 1:53 AM, Don wrote:
> Definitely, discarding excess precision is a crucial operation. C and C++ tried
> to do it implicitly with "sequence points", but that kills optimisation
> possibilities so much that compilers don't respect it. I think it's actually
> quite similar to write barriers in multithreaded programming. C got it wrong,
> but we're currently in an even worse situation because it doesn't necessarily
> happen at all.
>
> We need a builtin operation -- and not in std.math, this is as crucial as
> addition, and it's purely a signal to the optimiser. It's very similar to a
> casting operation. I wonder if we can do it as an attribute?  .exact_float,
> .restrict_float, .force_float, .spill_float or something similar?
>
> With D's current floating point semantics, it's actually impossible to write
> correct floating-point code. Everything that works right now, is technically
> only working by accident.
>
> But if we get this right, we can have very nice semantics for when things like
> FMA are allowed to happen -- essentially the optimiser would have free rein
> between these explicit discard_excess_precision sequence points.

This is easily handled without language changes by putting a couple of builtin functions in druntime - roundToFloat() and roundToDouble().
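
A rough sketch of what such helpers might look like, under stated assumptions: the two names come from the paragraph above, but the bodies are purely illustrative and are not the druntime implementation. Marking them non-inlinable is one guess at how to discourage the optimiser from keeping a wider intermediate; it is not a guarantee on every compiler or target.

```d
/// Hypothetical sketch: round a possibly wider intermediate down to
/// float precision.
float roundToFloat(real x)
{
    pragma(inline, false);  // assumption: keeps the narrowing from being optimised away
    return cast(float) x;
}

/// Hypothetical sketch: the same for double precision.
double roundToDouble(real x)
{
    pragma(inline, false);
    return cast(double) x;
}
```

A call site would then read something like `float s = roundToFloat(a * b + c);`, leaving the optimiser free to use FMA or extended precision inside the parentheses while the result is narrowed explicitly.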


> Ideally, I think we'd have a __real80 type. On x86 32 bit this would be the same
> as 'real', while on x86-64 __real80 would be available but probably 'real' would
> alias to double. But I'm a lot less certain about this.

I'm afraid that would not only break most D programs, but also interoperability with C.
