August 05, 2016
Thanks for finding these.

On 8/5/2016 3:22 AM, Ilya Yaroshenko wrote:
> 1. https://www.python.org/ftp/python/3.5.2/Python-3.5.2.tgz
> mathmodule.c, math_fsum has comment:
>    Depends on IEEE 754 arithmetic guarantees and half-even rounding.
> The same algorithm is also available in Mir, and it does not work with 32-bit DMD.
>
> 2. sum_kbn in https://github.com/JuliaLang/julia/blob/master/base/reduce.jl
> requires IEEE arithmetic. The same algorithm is also available in Mir.

I agree that the typical summation algorithm suffers from double rounding. But that's one algorithm. I would appreciate it if you would review http://dlang.org/phobos/std_algorithm_iteration.html#sum to ensure it doesn't have this problem, and if it does, how we can fix it.
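For concreteness, here is a minimal Python sketch (Python floats are IEEE doubles) of the compensated summation step in question; `kahan_sum` is an illustrative name, not the Phobos implementation. The correction term `c` is only meaningful if every operation is rounded exactly once to the working precision, which is the IEEE guarantee under discussion:

```python
# Minimal sketch of compensated (Kahan) summation.  The algorithm's
# correctness depends on each intermediate operation being rounded to the
# same IEEE precision exactly once.
import math

def kahan_sum(values):
    s = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - c          # apply the correction from the previous step
        t = s + y          # low-order bits of y may be lost here...
        c = (t - s) - y    # ...and recovered here, *if* each op rounds once
        s = t
    return s

values = [0.1] * 10
print(sum(values))        # naive: 0.9999999999999999
print(kahan_sum(values))  # compensated: 1.0
print(math.fsum(values))  # exact: 1.0
```

If `(t - s) - y` is instead evaluated in x87 extended precision and then rounded to double, the recovered error term no longer matches the bits actually lost, which is how double rounding breaks the algorithm.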


> 3. http://www.netlib.org/cephes/
> See log2.c for example:
>     z = x - 0.5;
>     z -= 0.5;
>     y = 0.5 * x  + 0.5;
> This code requires IEEE, and you can find it in Phobos std.math.

It'd be great to have a value for x where it fails, then we can add it to the unittests and ensure it is fixed.
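As a sanity check of what the snippet assumes (a sketch, not a failing input): in pure IEEE double arithmetic both subtractions in the cephes form are exact, so it agrees bit-for-bit with `x - 1.0`; the failure mode Ilya describes needs intermediates kept at 80-bit precision, which Python cannot reproduce:

```python
# The cephes log2 code computes z = x - 1 as (x - 0.5) - 0.5 for x in
# [0.5, 1).  Under strict IEEE double precision both subtractions are exact
# (Sterbenz-style reasoning), so the two forms agree bit-for-bit.  The
# hazard appears when a backend keeps one intermediate in 80-bit extended
# precision and rounds another to double, so later steps see operands that
# disagree in the low bits.  Python floats are IEEE doubles, so here the
# identity always holds:
import random

random.seed(42)
for _ in range(100_000):
    x = random.uniform(0.5, 1.0)
    z = x - 0.5
    z -= 0.5
    assert z == x - 1.0
```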


> 4. Mir has 5 types of summation, and 3 of them require IEEE.

See above for summation.


> 5. Tinflex requires IEEE arithmetic because extended precision may force the
> algorithm to choose the wrong computation branch. The most significant part of
> the original code was written in R, and the scientists who created this
> algorithm did not care about non-IEEE compilers at all.
>
> 6. Atmosphere requires IEEE for many functions, such as
> https://github.com/9il/atmosphere/blob/master/source/atmosphere/math.d#L616
> Without proper IEEE rounding there is no guarantee that these functions will
> terminate.
>
> 7. The same is true for real-world implementations of algorithms presented in
> Numerical Recipes, which use various series expansions, such as the one for the
> Modified Bessel Function.

I hear you. I'd like to explore ways of solving it. Got any ideas?

August 05, 2016
On 8/5/2016 4:27 AM, Seb wrote:
> 1) There are some functions (exp, pow, log, round, sqrt) for which using
> llvm_intrinsics significantly increases performance.
>
> It's a simple benchmark and might be flawed, but I hope it shows the point.

Speed is not the only criterion. Accuracy is as well. I've been using C math functions forever, and have constantly discovered that this or that math function on this or that platform produces bad results. This is why D's math functions are re-implemented in D rather than simply forwarding to the C ones.


> 2) As mentioned before they can yield _different_ results
>
> https://dpaste.dzfl.pl/c0ab5131b49d

Ah, but which result is the correct one? I am interested in getting the functions correct to the last bit first, and performance second.
August 05, 2016
On 8/5/2016 2:40 AM, Ilya Yaroshenko wrote:
> No. For example std.math.log requires it! But you don't care about other
> compilers, which do not use yl2x, or about making it a template (the real
> version slows down code for double and float).

I'm interested in correct-to-the-last-bit results first, and performance second. Changes to std.math that hew to that are welcome.

August 06, 2016
On Friday, 5 August 2016 at 20:53:42 UTC, Walter Bright wrote:

> I agree that the typical summation algorithm suffers from double rounding. But that's one algorithm. I would appreciate it if you would review http://dlang.org/phobos/std_algorithm_iteration.html#sum to ensure it doesn't have this problem, and if it does, how we can fix it.
>

Phobos's sum comprises two different algorithms: pairwise summation for random access ranges and Kahan summation for input ranges. Pairwise summation does not require IEEE rounding, but Kahan summation does.
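For reference, a minimal Python sketch of the pairwise scheme (illustrative code, not the Phobos implementation; the base-case threshold of 8 is arbitrary):

```python
# Minimal sketch of pairwise summation: split the range in half, sum the
# halves recursively, and add the partial sums.  The rounding error grows
# like O(log n) rather than O(n), and no step relies on a compensation term
# surviving exact IEEE rounding -- which is why it tolerates extended
# precision where Kahan does not.
import math

def pairwise_sum(a, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    n = hi - lo
    if n <= 8:                      # small base case: naive summation
        s = 0.0
        for i in range(lo, hi):
            s += a[i]
        return s
    mid = lo + n // 2
    return pairwise_sum(a, lo, mid) + pairwise_sum(a, mid, hi)

data = [0.1] * 1000
print(pairwise_sum(data), sum(data), math.fsum(data))
```

Because pairwise summation carries no compensation term, extra precision in intermediates can only help it, whereas Kahan's `(t - s) - y` step is destroyed by it.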

The problem with a real-world example is that it depends on optimisation. For example, if all temporary values are rounded, there is no problem, and if no temporary values are rounded there is no problem either. However, if some of them are rounded and others are not, that will break the Kahan algorithm.

Kahan is the shortest and one of the slowest (compared with KBN, for example) summation algorithms. The truth about Kahan is that we may keep it in Phobos, but we can use pairwise summation for input ranges without random access, and it will be faster than Kahan. So we don't need Kahan for the current API at all.

Mir has both Kahan, which works with 32-bit DMD, and pairwise, which works with input ranges.

The Kahan, KBN, KB2, and Precise summations always use `real` or `Complex!real` internal values for the 32-bit x86 target. The only problem with Precise summation is that if we need a precise result in double and use real for the internal summation, then the last bit will be wrong in 50% of cases.

Another good point about Mir's summation algorithms is that they are output ranges. This means they can be used effectively to sum multidimensional arrays, for example. Also, the Precise summator may be used to compute the exact sum of distributed data.
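A rough Python analogue of that output-range style (names here are illustrative, not Mir's API): the summator is an object you feed values into, so the same instance can consume nested or distributed data without flattening:

```python
# Output-range-style summator sketch: the accumulator is an object you
# `put` values into, so one instance can be fed element-by-element from
# nested (multidimensional) data, rather than requiring a flat input range.
class KahanAccumulator:
    def __init__(self):
        self.s = 0.0   # running sum
        self.c = 0.0   # running compensation

    def put(self, x):
        y = x - self.c
        t = self.s + y
        self.c = (t - self.s) - y
        self.s = t

    @property
    def value(self):
        return self.s

matrix = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
acc = KahanAccumulator()
for row in matrix:        # no flattening or intermediate arrays needed
    for x in row:
        acc.put(x)
print(acc.value)
```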

When we reach a decision and a solution for the rounding problem, I will make a PR for std.experimental.numeric.sum.

> I hear you. I'd like to explore ways of solving it. Got any ideas?

We need to look at the overall picture.

It is very important to recognise that the D core team is small and the D community is not yet large enough to attract many new professionals. This means that the time of existing engineers is very important for D, and the most important engineer for D is you, Walter.

At the same time, we need to move forward quickly with language and druntime changes (GC-less fibers, for example).

So we need to make strategic development choices. The most important option for D in the scientific context is to separate the D Programming Language from DMD in our minds. I am not asking to remove DMD as the reference compiler. Instead, we can introduce changes in D that cannot be optimally implemented in DMD (because you have more important things to do for D than optimisation) but will be awesome for our LLVM-based and GCC-based backends.

We need 2 new pragmas with the same syntax as `pragma(inline, xxx)`:

1. `pragma(fusedMath)` allows fused mul-add, mul-sub, div-add, div-sub operations.
2. `pragma(fastMath)` is equivalent to [1]. This pragma can be used to allow extended precision.

These should be 2 separate pragmas. The second one may imply the first.

A recent LDC beta has the @fastmath attribute for functions, and it is already used in the Phobos ndslice.algorithm PR and its Mir mirror. Attributes are an alternative to pragmas, but their syntax would need to be extended; see [2].

The old approach is separate compilation, but it is weird, low-level for users, and requires significant effort for both small and large projects.

[1] http://llvm.org/docs/LangRef.html#fast-math-flags
[2] https://github.com/ldc-developers/ldc/issues/1669

Best regards,
Ilya

August 06, 2016
On 4 August 2016 at 23:38, Seb via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On Thursday, 4 August 2016 at 21:13:23 UTC, Iain Buclaw wrote:
>>
>> On 4 August 2016 at 01:00, Seb via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
>>>
>>> To make matters worse, std.math yields different results than compiler/assembly intrinsics. Note that in this example importing std.math.pow adds about 1K instructions to the output assembly, whereas llvm_powf boils down to the assembly powf. Of course the performance of powf is a lot better; I measured [3] that e.g. std.math.pow takes ~1.5x as long for both LDC and DMD. If you need to run this very often, that cost isn't acceptable.
>>>
>>
>> This could be something specific to your architecture.  I get the same result from all versions of powf, and from GCC builtins too, regardless of optimization tunings.
>
>
> I can reproduce this on DPaste (also x86_64).
>
> https://dpaste.dzfl.pl/c0ab5131b49d
>
> Behavior with a recent LDC build is similar (as annotated with the
> comments).

When testing the math functions, I chose not to compare results to what C libraries or CPU instructions return, but rather to compare the results to Wolfram, which I hope I'm correct in saying is a more reliable and proven source of scientific mathematics than libm.

As of the time I ported all pure D (not IASM) implementations of math functions, the results returned from all unittests using 80-bit reals were identical to Wolfram's, allowing up to the last 2 digits as an acceptable error for some values.  This was true for all inputs except those just inside the function's domain, in which case only double precision was guaranteed.  Where applicable, they were also found in some cases to be more accurate than the inline assembler or yl2x implementation version paths that are used if you compile with DMD or LDC.
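A sketch of that verification approach in Python, with the `decimal` module (28 significant digits by default) standing in for Wolfram as the higher-precision reference; the 1-ulp tolerance is an assumption about the platform's libm, not a guarantee:

```python
# Compare a double-precision math function against a higher-precision
# reference and measure the difference in ulps.  decimal's 28-digit
# arithmetic serves as the reference oracle here.
import math
from decimal import Decimal

def ulp_error(computed, reference_fn, x):
    ref = float(reference_fn(x))   # reference result rounded to double
    if computed == ref:
        return 0.0
    return abs(computed - ref) / math.ulp(ref)

for x in [0.7, 1.5, 3.14159, 123.456]:
    err = ulp_error(math.log(x), lambda v: Decimal(v).ln(), x)
    assert err <= 1.0   # expect math.log within 1 ulp of the reference
```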
August 06, 2016
On 8/6/2016 1:21 AM, Ilya Yaroshenko wrote:
> On Friday, 5 August 2016 at 20:53:42 UTC, Walter Bright wrote:
>
>> I agree that the typical summation algorithm suffers from double rounding. But
>> that's one algorithm. I would appreciate it if you would review
>> http://dlang.org/phobos/std_algorithm_iteration.html#sum to ensure it doesn't
>> have this problem, and if it does, how we can fix it.
>>
>
> Phobos's sum comprises two different algorithms: pairwise summation for
> random access ranges and Kahan summation for input ranges. Pairwise
> summation does not require IEEE rounding, but Kahan summation does.
>
> The problem with a real-world example is that it depends on optimisation.
> For example, if all temporary values are rounded, there is no problem, and
> if no temporary values are rounded there is no problem either. However, if
> some of them are rounded and others are not, that will break the Kahan
> algorithm.
>
> Kahan is the shortest and one of the slowest (compared with KBN, for
> example) summation algorithms. The truth about Kahan is that we may keep it
> in Phobos, but we can use pairwise summation for input ranges without
> random access, and it will be faster than Kahan. So we don't need Kahan for
> the current API at all.
>
> Mir has both Kahan, which works with 32-bit DMD, and pairwise, which works
> with input ranges.
>
> The Kahan, KBN, KB2, and Precise summations always use `real` or
> `Complex!real` internal values for the 32-bit x86 target. The only problem
> with Precise summation is that if we need a precise result in double and
> use real for the internal summation, then the last bit will be wrong in
> 50% of cases.
>
> Another good point about Mir's summation algorithms is that they are output
> ranges. This means they can be used effectively to sum multidimensional
> arrays, for example. Also, the Precise summator may be used to compute the
> exact sum of distributed data.
>
> When we reach a decision and a solution for the rounding problem, I will
> make a PR for std.experimental.numeric.sum.
>
>> I hear you. I'd like to explore ways of solving it. Got any ideas?
>
> We need to look at the overall picture.
>
> It is very important to recognise that the D core team is small and the D
> community is not yet large enough to attract many new professionals. This
> means that the time of existing engineers is very important for D, and the
> most important engineer for D is you, Walter.
>
> At the same time, we need to move forward quickly with language and
> druntime changes (GC-less fibers, for example).
>
> So we need to make strategic development choices. The most important option
> for D in the scientific context is to separate the D Programming Language
> from DMD in our minds. I am not asking to remove DMD as the reference
> compiler. Instead, we can introduce changes in D that cannot be optimally
> implemented in DMD (because you have more important things to do for D than
> optimisation) but will be awesome for our LLVM-based and GCC-based
> backends.
>
> We need 2 new pragmas with the same syntax as `pragma(inline, xxx)`:
>
> 1. `pragma(fusedMath)` allows fused mul-add, mul-sub, div-add, div-sub
> operations.
> 2. `pragma(fastMath)` is equivalent to [1]. This pragma can be used to
> allow extended precision.
>
> These should be 2 separate pragmas. The second one may imply the first.
>
> A recent LDC beta has the @fastmath attribute for functions, and it is
> already used in the Phobos ndslice.algorithm PR and its Mir mirror.
> Attributes are an alternative to pragmas, but their syntax would need to be
> extended; see [2].
>
> The old approach is separate compilation, but it is weird, low-level for
> users, and requires significant effort for both small and large projects.
>
> [1] http://llvm.org/docs/LangRef.html#fast-math-flags
> [2] https://github.com/ldc-developers/ldc/issues/1669

Thanks for your help with this.

Using attributes for this is a mistake. Attributes affect the interface to a function, not its internal implementation. Pragmas are suitable for internal implementation things. I also oppose using compiler flags, because they tend to be overly global, and the details of an algorithm should not be split between the source code and the makefile.

August 06, 2016
On 8/6/2016 1:21 AM, Ilya Yaroshenko wrote:
> We need 2 new pragmas with the same syntax as `pragma(inline, xxx)`:
>
> 1. `pragma(fusedMath)` allows fused mul-add, mul-sub, div-add, div-sub operations.
> 2. `pragma(fastMath)` is equivalent to [1]. This pragma can be used to allow
> extended precision.


The LDC fastmath bothers me a lot. It throws away proper NaN and infinity handling, and throws away precision by allowing reciprocal and algebraic transformations. As I've said before, correctness should be first, not speed, and fastmath has nothing to do with this thread.


I don't know what the point of fusedMath is.
August 06, 2016
On Saturday, 6 August 2016 at 09:35:32 UTC, Walter Bright wrote:
> On 8/6/2016 1:21 AM, Ilya Yaroshenko wrote:
>> We need 2 new pragmas with the same syntax as `pragma(inline, xxx)`:
>>
>> 1. `pragma(fusedMath)` allows fused mul-add, mul-sub, div-add, div-sub operations.
>> 2. `pragma(fastMath)` is equivalent to [1]. This pragma can be used to allow
>> extended precision.
>
>
> The LDC fastmath bothers me a lot. It throws away proper NaN and infinity handling, and throws away precision by allowing reciprocal and algebraic transformations. As I've said before, correctness should be first, not speed, and fastmath has nothing to do with this thread.

OK, then we need a third pragma, `pragma(ieeeRound)`. But `pragma(fusedMath)` and `pragma(fastMath)` should be provided too.

> I don't know what the point of fusedMath is.

It allows a compiler to replace two arithmetic operations with a single fused one; see the FMA instruction sets (FMA3 for Intel and FMA4 for AMD).
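A small Python illustration of why fusion changes results (exact arithmetic via `Fraction`; no actual FMA instruction is used): the unfused product rounds once before any further use, and a fused multiply-add would keep exactly the bits shown here to be discarded.

```python
# A fused multiply-add computes round(a*b + c) with a single rounding, while
# the unfused form rounds a*b first.  Exact arithmetic via Fraction exposes
# the bits that the intermediate rounding throws away -- the bits an FMA
# instruction preserves, which is why enabling fusion changes results.
from fractions import Fraction

a = 1.0 + 2.0**-30
prod_unfused = a * a                 # already rounded to double once
exact = Fraction(a) * Fraction(a)    # a*a with no rounding at all

# residual is what fma(a, a, -prod_unfused) would return on FMA hardware:
# the rounding error of the unfused product
residual = float(exact - Fraction(prod_unfused))
print(residual)   # 2**-60: nonzero, so fused and unfused paths differ
```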
August 06, 2016
On 6 August 2016 at 11:48, Ilya Yaroshenko via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On Saturday, 6 August 2016 at 09:35:32 UTC, Walter Bright wrote:
>>
>> On 8/6/2016 1:21 AM, Ilya Yaroshenko wrote:
>>>
>>> We need 2 new pragmas with the same syntax as `pragma(inline, xxx)`:
>>>
>>> 1. `pragma(fusedMath)` allows fused mul-add, mul-sub, div-add, div-sub
>>> operations.
>>> 2. `pragma(fastMath)` is equivalent to [1]. This pragma can be used to
>>> allow extended precision.
>>
>>
>>
>> The LDC fastmath bothers me a lot. It throws away proper NaN and infinity handling, and throws away precision by allowing reciprocal and algebraic transformations. As I've said before, correctness should be first, not speed, and fastmath has nothing to do with this thread.
>
>
> OK, then we need a third pragma, `pragma(ieeeRound)`. But `pragma(fusedMath)`
> and `pragma(fastMath)` should be provided too.
>
>> I don't know what the point of fusedMath is.
>
>
> It allows a compiler to replace two arithmetic operations with a single fused one; see the FMA instruction sets (FMA3 for Intel and FMA4 for AMD).

No pragmas tied to a specific architecture should be allowed in the language spec, please.
August 06, 2016
On Saturday, 6 August 2016 at 10:02:25 UTC, Iain Buclaw wrote:
> On 6 August 2016 at 11:48, Ilya Yaroshenko via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
>> On Saturday, 6 August 2016 at 09:35:32 UTC, Walter Bright wrote:
>>> [...]
>>
>>
>> OK, then we need a third pragma, `pragma(ieeeRound)`. But `pragma(fusedMath)`
>> and `pragma(fastMath)` should be provided too.
>>
>>> [...]
>>
>>
>> It allows a compiler to replace two arithmetic operations with a single fused one; see the FMA instruction sets (FMA3 for Intel and FMA4 for AMD).
>
> No pragmas tied to a specific architecture should be allowed in the language spec, please.

Then Mir will probably drop all compilers but LDC.
LLVM is tied to the real world, so we can tie D to the real world too. If a compiler cannot implement an optimisation pragma, then that pragma can simply be ignored by the compiler.