May 16, 2016
On Monday, 16 May 2016 at 17:50:22 UTC, Walter Bright wrote:
>
> The needs of numerical analysts have often been neglected by the C/C++ community. The community has been very slow to recognize IEEE, by about a decade. It wasn't until fairly recently that C/C++ compilers even handled NaN properly. It's no surprise that FORTRAN reigned as the choice of numerical analysts.
>
> (On the other hand, the needs of game developers have received strong support.)
>

I have regression scripts that check audio computations against the previous baseline. The goal is to ensure the results are consistent across compilers and bitness, and to check the validity of sound-altering optimizations (a peak difference within 80 dB is usually OK). And, of course, to avoid utterly breaking things during optimization work.
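
A rough D sketch of the kind of check involved (the actual scripts aren't shown; the function name and the reading of the 80 dB figure as "the difference stays 80 dB below full scale" are my assumptions):

import std.math : abs, log10;
import std.algorithm.comparison : max;

// Peak difference between two sample buffers, in dB relative to full scale.
// Identical buffers yield -infinity.
double peakDifferenceDb(const float[] baseline, const float[] current)
{
  assert(baseline.length == current.length);
  double peak = 0;
  foreach (i; 0 .. baseline.length)
    peak = max(peak, abs(cast(double) baseline[i] - cast(double) current[i]));
  return 20 * log10(peak);
}

// e.g. accept the new output if peakDifferenceDb(baseline, current) < -80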

I've been bitten once by the x87 promotion of float to real, which made an algorithm sound better on 32-bit than on 64-bit. On 64-bit, DMD used SSE, which avoided the internal precision boost. If I had tested only on 32-bit, I could have wrongly concluded that float precision was enough.

This is one case where I would have preferred that D never promote float to double or real. I understand that the x87 code would run much slower if restricted precision were enforced. So how about using SSE, as in 64-bit code?

Free bits of precision may lead you to think things are better than they are.
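
To illustrate the effect with a contrived sketch (not the actual audio code): the same accumulation run once with a float accumulator, as strict SSE float code keeps it, and once with a real accumulator, which roughly mimics what x87 extended intermediates give you.

import std.stdio;

void main()
{
  enum N = 1_000_000;
  float f = 0;  // rounded to float on every step
  real  r = 0;  // carried at extended precision, rounded only at the end
  foreach (i; 0 .. N)
  {
    f += 0.1f;
    r += 0.1f;
  }
  writefln("float accumulator: %.8g", f);
  writefln("real  accumulator: %.8g", cast(float) r);
}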


May 17, 2016
I wondered how the precision of typical SSE vs. FPU code would compare. SSE uses floats and doubles as the main means to control precision, while the FPU is configured via the x87 control word.

The attached program calculates 5^-441 iteratively using these methods, then converts to double and prints the mantissa:

  Compiled with Digital Mars D in 64-bit
  float SSE : 00000000000000000000000000000000000000000000000000000
  float x87 : 00100000101010100111001110000000000000000000000000000
  double SSE: 00100000101010100111001101111011011110011011110010001
  double x87: 00100000101010100111001101111011011110011011110010000
  real x87  : 00100000101010100111001101111011011110011011110001111

Take-aways:
- SSE results generally differ from x87 results at the same
  precision
- x87 produces accurate single-precision results, whereas
  the SSE float error is fatal (and that's not flush-to-zero;
  the most negative power that still produces 1 bit of mantissa
  is 5^-64(!))
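
For reference, here is a minimal D sketch of this kind of iterative computation (the attached program itself isn't reproduced; real stands in for the x87 path on x86, while float/double may be evaluated in SSE registers depending on compiler and target, and the exact bit formatting may differ from the output above):

import std.stdio;

// Print the 52 stored mantissa bits of a double.
string mantissaBits(double d)
{
  ulong bits = *cast(ulong*) &d;
  string s;
  foreach_reverse (i; 0 .. 52)
    s ~= (bits >> i) & 1 ? '1' : '0';
  return s;
}

// Compute 5^-441 by repeated division at precision T.
T pow5neg(T)(int n)
{
  T x = 1;
  foreach (i; 0 .. n)
    x /= 5;
  return x;
}

void main()
{
  writeln("float : ", mantissaBits(pow5neg!float(441)));
  writeln("double: ", mantissaBits(pow5neg!double(441)));
  writeln("real  : ", mantissaBits(cast(double) pow5neg!real(441)));
}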

-- 
Marco


May 17, 2016
On 16 May 2016 at 18:10, Walter Bright via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On 5/15/2016 11:00 PM, Ola Fosheim Grøstad wrote:
>>
>> On Sunday, 15 May 2016 at 22:49:27 UTC, Walter Bright wrote:
>>>
>>> On 5/15/2016 2:06 PM, Ola Fosheim Grøstad wrote:
>>>>
>>>> The net result is that adding const/immutable to a type can change the
>>>> semantics
>>>> of the program entirely at the whim of the compiler implementor.
>>>
>>>
>>> C++ Standard allows the same increased precision, at the whim of the
>>> compiler
>>> implementor, as quoted to you earlier.
>>>
>>> What your particular C++ compiler does is not relevant, as its behavior
>>> is not
>>> required by the Standard.
>>
>>
>> This is a crazy attitude to take. C++ provides means to detect that IEEE
>> floats
>> are being used in the standard library.
>
>
> IEEE floats do not specify precision of intermediate results. A C/C++ compiler can be fully IEEE compliant and yet legitimately have increased precision for intermediate results.
>
> I posted several links here pointing out this behavior in VC++ and g++. If your C++ numerics code didn't have a problem with it, it's likely you wrote the code in such a way that more accurate answers were not wrong.
>
> FP behavior has complex trade-offs with speed, accuracy, compatibility, and size. There are no easy, obvious answers.

I think I might have misunderstood Iain's initial post on the matter.
I understand the links you've posted, but the point I was objecting to
was that the moment an assignment is made, truncation must occur.
If you do:
float f()
{
  float x = some_float_expression;
  float y = some_expression_involving_x;
  return some_expression_involving_y;
}

If I run that in CTFE, will the assignments to x and y, and the return statement, all truncate intermediate results to float at those moments? If the answer is yes, I apologise for the confusion.
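
One way to probe it (a sketch under my own assumptions, not a definitive test; whether a difference shows up depends on the compiler and target): evaluate the same chain of float assignments in CTFE and at run time and compare the results in hex float form.

import std.stdio;

// Same shape as the pseudocode above, with concrete (illustrative) expressions.
float f()
{
  float x = 1.0f / 3.0f;
  float y = x * 3.0f - 1.0f;
  return y * 1.0e7f;
}

enum ctfeResult = f();   // forces CTFE evaluation

void main()
{
  writefln("CTFE:    %a", ctfeResult);
  writefln("runtime: %a", f());
}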

May 17, 2016
On 16 May 2016 at 20:27, Andrei Alexandrescu via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On 5/16/16 12:37 AM, Walter Bright wrote:
>>
>> Me, I think of that as "Who cares that you paid $$$ for an 80 bit CPU, we're going to give you only 64 bits."
>
>
> I'm not sure about this. My understanding is that SSE has hardware for 32 and 64 bit floats, and the 80-bit hardware is pretty much cut-and-pasted from the x87 days without anyone really looking into improving it. And that's been the case for more than a decade. Is that correct?

It's not only correct, but it's also deprecated, and may be removed from silicon (i.e. emulated/microcoded, if it's not already...) at some unknown point in the future.
May 17, 2016
On 16 May 2016 at 21:35, Andrei Alexandrescu via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On 5/16/16 7:33 AM, deadalnix wrote:
>>
>> Regardless of the compiler actually doing it or not, the argument that extra precision is a problem is self defeating.
>
>
> I agree that this whole "we need less precision" argument would be difficult to accept.
>
>> I don't think arguments
>> for speed have been raised so far.
>
> This may be the best angle in this discussion. For all I can tell 80 bit is slow as molasses and on the road to getting slower. Isn't that enough of an argument to move away from it?

It really has been raised, though!
At Remedy, the first thing we noticed when compiling code with DMD is
that our float code produced x87 ops, and with the x64 calling
convention (floats in SSE regs), this means pushing every function
argument to the stack, loading them into x87 regs, doing the work
(slowly), pushing them back, and popping them into a return reg. The
codegen is insanely long and inefficient.
The only reason it wasn't a critical blocker was the relatively
small amount of D in use, and the fact that none of it was in a
time-critical path.
If Ethan and Remedy want to expand their use of D, the compiler CAN
NOT emit x87 code. It's just a matter of time before a loop is in a
hot path.
May 17, 2016
On 17 May 2016 at 12:00, Manu <turkeyman@gmail.com> wrote:
> On 16 May 2016 at 21:35, Andrei Alexandrescu via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
>> On 5/16/16 7:33 AM, deadalnix wrote:
>>>
>>> Regardless of the compiler actually doing it or not, the argument that extra precision is a problem is self defeating.
>>
>>
>> I agree that this whole "we need less precision" argument would be difficult to accept.
>>
>>> I don't think arguments
>>> for speed have been raised so far.
>>
>> This may be the best angle in this discussion. For all I can tell 80 bit is slow as molasses and on the road to getting slower. Isn't that enough of an argument to move away from it?
>
> It really has been raised, though!
> At Remedy, the first thing we noticed when compiling code with DMD is
> that our float code produced x87 ops, and with the x64 calling
> convention (floats in SSE regs), this means pushing every function
> argument to the stack, loading them into x87 regs, doing the work
> (slowly), pushing them back, and popping them into a return reg. The
> codegen is insanely long and inefficient.
> The only reason it wasn't a critical blocker was the relatively
> small amount of D in use, and the fact that none of it was in a
> time-critical path.
> If Ethan and Remedy want to expand their use of D, the compiler CAN
> NOT emit x87 code. It's just a matter of time before a loop is in a
> hot path.

That said, I recall a conversation where Walter said this has since been addressed? I haven't worked on this in some time, so I haven't tested where it's at. I'm just saying that complaints about x87 performance have definitely been made.
May 17, 2016
On Monday, 16 May 2016 at 10:33:58 UTC, Andrei Alexandrescu wrote:
> My understanding is also that 80-bit math is on the wrong side of the tradeoff simply because it's disproportionately slow.

 Speed in theory shouldn't be that big of a problem. As I recall, the FPU *was* a separate processor; sending the instructions took about 3 cycles. After that you could do other work before coming back for the result(s), but that assumes you aren't doing *only* FPU work. The issue would come up when you end up waiting for the result (and that's only for the really slow operations; most operations are very fast, though my knowledge/experience is probably more than a decade out of date).

 Today I'm sure the FPU is built directly into the CPU.

 Not to mention there's a whole slew of x86 instructions that haven't improved much because they are barely used and are only there for historical reasons or compatibility: jcxz, xlat, rep, and probably another dozen I can't recall off the top of my head.

 Perhaps processors in general should move to RISC-style instructions.
May 17, 2016
On Monday, 16 May 2016 at 18:44:57 UTC, Jonathan M Davis wrote:
> On Sunday, May 15, 2016 15:49:27 Walter Bright via Digitalmars-d wrote:
>> My proposal removes the "whim" by requiring 128 bit precision for CTFE.
>
> Based on some of the recent discussions, it sounds like having soft floating point in the compiler would also help with cross-compilation. So, completely aside from the precision chosen, it sounds like having a soft floating point implementation in CTFE would definitely help - though maybe I misunderstood. My understanding of the floating point stuff is pretty bad, unfortunately.

 My understanding of floating point is bad too. I understand fixed floating point (a number of bits is considered the fraction), but as I recall, when I tried to break down the raw bits and work out the exponent/significand portion while referring to as much information as I could, I got stuck and the proper answers failed to come out.


 As for soft floating point, I'd hope to see an option to control how much precision you can raise it to (so it would probably be a template struct), as well as having it as a library we can use. The same goes for fixed-width integers, which we could use to provide cent and ucent until those types are commonly available. (After all, 128-bit computers will come sooner or later, maybe in 20 years? I doubt the memory model would need to move up from 64-bit, though.)

 For the implementation it could use BCD math instead (which is easier to print as decimal): just specify the number of digits of precision (40+) and how many digits it can shift larger or smaller. The floating point type would be a relative of the ones on the 8-bit computers of old, but on steroids! Actually, basic floating point like that wouldn't be too hard to implement over a weekend.
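
 A very rough sketch of the kind of template-parameterised soft float suggested above (names and layout are purely illustrative, not an actual proposal or an existing library):

// Decimal soft float: one decimal (BCD-style) digit per byte plus a base-10 exponent.
// Illustrative skeleton only -- arithmetic, rounding and conversions omitted.
struct SoftDecimal(uint digits)
{
  ubyte[digits] mantissa; // decimal digits, most significant first
  int exponent;           // power of ten applied to the mantissa
  bool negative;

  // opBinary!"+", opBinary!"*", rounding, toString, etc. would go here
}

alias Big = SoftDecimal!40; // a 40-digit decimal float, as mentioned above
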
May 17, 2016
On Tue, 17 May 2016 04:12:09 +0000,
Era Scarecrow <rtcvb32@yahoo.com> wrote:

> I understand fixed floating point

Quote of the day :D

-- 
Marco

May 17, 2016
On Monday, 16 May 2016 at 14:32:55 UTC, Andrei Alexandrescu wrote:
> It is rare to need to actually compute the inverse of a matrix.

Unless you're doing game/graphics work ;-) 4x3 or 4x4 matrices are commonly used to represent transforms in 3D space in every polygon-based rendering pipeline I know of. They're even a requirement for fixed-function OpenGL 1.x.

Video games - also known around here as "The Exception To The Rule".

(Side note: My own preference is to represent transforms as a quaternion and vector. Inverting such a transform is a simple matter of negating a few components. Generating a matrix from such a transform for rendering purposes is trivial compared to matrix inversion.)
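
(For concreteness, a minimal D sketch of that kind of inversion; the type and function names are illustrative rather than from any particular engine, and a unit quaternion is assumed:)

struct Quat { float w, x, y, z; }
struct Vec3 { float x, y, z; }

// Inverse of a unit quaternion is its conjugate: negate the vector part.
Quat conjugate(Quat q) { return Quat(q.w, -q.x, -q.y, -q.z); }

// Rotate a vector by a unit quaternion: v' = v + 2w*(q x v) + 2*(q x (q x v)).
Vec3 rotate(Quat q, Vec3 v)
{
  float tx = 2 * (q.y * v.z - q.z * v.y);
  float ty = 2 * (q.z * v.x - q.x * v.z);
  float tz = 2 * (q.x * v.y - q.y * v.x);
  return Vec3(v.x + q.w * tx + (q.y * tz - q.z * ty),
              v.y + q.w * ty + (q.z * tx - q.x * tz),
              v.z + q.w * tz + (q.x * ty - q.y * tx));
}

struct Transform { Quat rotation; Vec3 translation; }

// Inverse of (R, t) is (R*, -(R* applied to t)): conjugate the quaternion,
// then rotate the negated translation by it -- no matrix inversion needed.
Transform inverse(Transform t)
{
  Quat qi = conjugate(t.rotation);
  Vec3 ti = rotate(qi, Vec3(-t.translation.x, -t.translation.y, -t.translation.z));
  return Transform(qi, ti);
}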