Thread overview
float.max + 1.0 does not overflow
Dec 27, 2017
rumbu
Dec 27, 2017
Benjamin Thaut
Dec 28, 2017
Dave Jones
December 27, 2017
Is that normal?

import std.math;
float f = float.max;
f += 1.0;
assert(ieeeFlags.overflow); // failure
assert(f == float.infinity); // failure, f is in fact float.max

By contrast, float.max + float.max does overflow. The behavior is the same for double and real.


December 27, 2017
On Wednesday, 27 December 2017 at 13:40:28 UTC, rumbu wrote:
> Is that normal?
>
> import std.math;
> float f = float.max;
> f += 1.0;
> assert(ieeeFlags.overflow); // failure
> assert(f == float.infinity); // failure, f is in fact float.max
>
> By contrast, float.max + float.max does overflow. The behavior is the same for double and real.

This is actually correct floating point behavior. Consider the following program:

import std.stdio;

float nextRepresentableToMax = float.max;
// decrement the raw bit pattern to get the next smaller representable float
(*cast(int*)&nextRepresentableToMax)--;
writefln("%f", float.max - nextRepresentableToMax);

It computes the difference between float.max and the next smaller representable floating point number. The difference printed by the program is:
20282409603651670423947251286016.0

As you might notice, this is significantly bigger than 1.0. Floating point operations work like this: they perform the operation as if with unlimited precision and then round to the nearest representable floating point number. So adding 1.0 to float.max and then rounding to the nearest representable number will just give you back float.max. If, however, you add float.max and float.max, the exact result lies far beyond the largest finite value, so it rounds to float.infinity.
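The rounding argument can be checked with a small sketch. It assumes the default round-to-nearest mode and a target that performs float arithmetic in true single precision (for example x86-64 with SSE). The helper name addToMax is made up for illustration; nextDown comes from std.math.

```d
import std.math : nextDown;
import std.stdio : writefln;

// Hypothetical helper (not from the thread): adds x to float.max
// and stores the result back into a float, forcing the rounding step.
float addToMax(float x)
{
    float f = float.max;
    f += x;
    return f;
}

void main()
{
    // Gap between float.max and its lower neighbour.
    writefln("gap below float.max: %g", float.max - nextDown(float.max));

    // 1.0 is tiny compared to that gap, so the sum rounds back to float.max.
    writefln("max + 1.0 == max: %s", addToMax(1.0f) == float.max);

    // The exact sum is far outside the finite range, so it rounds to infinity.
    writefln("max + max == inf: %s", addToMax(float.max) == float.infinity);
}
```

Under those assumptions both comparisons should print true. A compiler that keeps intermediates in a wider format may behave differently.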

When trying to understand how floating point works I would highly recommend that you read these articles (oldest first): https://randomascii.wordpress.com/category/floating-point/

Kind Regards
Benjamin Thaut
December 28, 2017
On Wednesday, 27 December 2017 at 14:14:42 UTC, Benjamin Thaut wrote:
> On Wednesday, 27 December 2017 at 13:40:28 UTC, rumbu wrote:
>> Is that normal?
> It computes the difference between float.max and the next smaller representable floating point number. The difference printed by the program is:
> 20282409603651670423947251286016.0
>
> As you might notice, this is significantly bigger than 1.0. Floating point operations work like this: they perform the operation as if with unlimited precision and then round to the nearest representable floating point number. So adding 1.0 to float.max and then rounding to the nearest representable number will just give you back float.max. If, however, you add float.max and float.max, the exact result lies far beyond the largest finite value, so it rounds to float.infinity.

The float with the lower exponent has to be shifted to match the higher one, which means 1.0 would be shifted 127 bits to the right (the difference between the two exponents) before the addition can be done. If you shift right by more bits than the mantissa holds (24 for float), the value gets rounded to zero. Hence, once the two values are lined up for the actual operation, it becomes float.max + 0.0.
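To see where the cutoff sits, here is a rough sketch that adds increasingly large values to float.max. It assumes round-to-nearest-even and single-precision arithmetic on the target. addToMax is a hypothetical helper, and the hex float literals 0x1p102f and 0x1p103f stand for 2^102 and 2^103, a quarter and a half of the 2^104 gap below float.max.

```d
import std.stdio : writefln;

// Hypothetical helper (not from the thread): float.max + x, stored as a float.
float addToMax(float x)
{
    float f = float.max;
    f += x;
    return f;
}

void main()
{
    // 1.0 is 2^0: shifted out entirely when aligned with float.max.
    writefln("%s", addToMax(1.0f) == float.max);

    // 2^102 is a quarter of the gap: still below the halfway point, absorbed.
    writefln("%s", addToMax(0x1p102f) == float.max);

    // 2^103 is exactly half the gap: the tie rounds away from float.max
    // (its mantissa is odd), which lands beyond the finite range.
    writefln("%s", addToMax(0x1p103f) == float.infinity);
}
```

On such a target all three lines should print true, so the addend only matters once it reaches half the spacing at float.max.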

That said, I suspect the OP was expecting the FPU to catch that, in theory, it should overflow: not that the actual operation would overflow, but that the FPU would check the values on input. Maybe.