DMD floating point performance.

Guaranteed Optimization

Nov 13, 2006

Nov 15, 2006

Ok this is trivial, but it probably fairly represents what many use floating point for day-to-day (simple math on simple data): import std.stdio, std.date; void main() { d_time s = getUTCtime; double sum = 1.0, val = 1.000001; for(size_t i = 0; i < 10_000_000; i++) { sum += val; sum -= val; sum *= val; sum /= val; } d_time e = getUTCtime; writefln(sum,": ",(e-s)/cast(real)TicksPerSecond," secs"); } # gdc -O2 fp.d -o fp ; fp 1: 0.282 secs # dmd -O fp.d ; fp 1: 0.683 secs A difference of 2.5x?! If you look at the DMD asm, the problem is that each operation is wrapped by a load/store.. Why wouldn't val and sum be kept in fp registers inside the loop? Another couple of examples: http://shootout.alioth.debian.org/gp4/benchmark.php?test=mandelbrot&lang=all http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=all I mean I'd like to standardize on DMD; a) it doesn't come with any baggage on either Windows or Linux and b) it is more consistently maintained as of late, but things like this make the decision a lot tougher. Thanks, - Dave

Lionello Lunesu wrote: > I'm not sure if it would change anything, but for a full release build, you'd also want -release and -inline. > > L. > > It wouldn't make a diff. for these tests, so I avoided the clutter.

Dave wrote: > If you look at the DMD asm, the problem is that each operation is wrapped by a load/store.. Why wouldn't val and sum be kept in fp registers inside the loop? The issue isn't with D, it's with the back end code generator. You'll see the same thing with the C and C++ compiler. Because the FPU is a 'stack' machine rather than a register machine, it's hard to write a code generator that enregisters floating point variables. It's also problematical because every function call must empty the FPU stack anyway, and so a lot of register spill logic must be in place for it to be very useful. Not impossible, just a lot of tricky work. How does GDC do with this?

November 12, 2006

Re: DMD floating point performance.

Posted by Dave
in reply to Walter Bright

Permalink

Dave

Posted in reply to Walter Bright

Permalink

Walter Bright wrote:
> Dave wrote:
>> If you look at the DMD asm, the problem is that each operation is wrapped by a load/store.. Why wouldn't val and sum be kept in fp registers inside the loop?
> 
> The issue isn't with D, it's with the back end code generator. You'll see the same thing with the C and C++ compiler.
> 
> Because the FPU is a 'stack' machine rather than a register machine, it's hard to write a code generator that enregisters floating point 

I know this is simplistic, but could something like fxch be used to 'mimick' a register machine as vars are stored into registers? fxch is supposed to be very efficient.

> variables. It's also problematical because every function call must empty the FPU stack anyway, and so a lot of register spill logic must be in place for it to be very useful.

Could ffree greatly simply things here?

> Not impossible, just a lot of tricky work. How does GDC do with this?

The original post was comparing DMD vs. GDC.

Here's some code showing relative asm. output:

dmd = 1: 0.673 secs
dmd asm = 1: 0.682 secs
opt asm = 1: 0.272 secs

;---

import std.stdio, std.date;

void main()
{
    d_time s = getUTCtime;
    double sum = fp;
    d_time e = getUTCtime;
    writefln("dmd = ",sum,": ",(e-s)/cast(real)TicksPerSecond," secs");

    s = getUTCtime;
    sum = fp_dmd_asm;
    e = getUTCtime;
    writefln("dmd asm = ",sum,": ",(e-s)/cast(real)TicksPerSecond," secs");

    s = getUTCtime;
    sum = fp_dmd_opt;
    e = getUTCtime;
    writefln("opt asm = ",sum,": ",(e-s)/cast(real)TicksPerSecond," secs");
}

double fp()
{
    double sum = 1.0, val = 0.000001;
    for(size_t i = 0; i < 10_000_000; i++)
    {
        sum += val;
        sum -= val;
        sum *= val;
        sum /= val;
    }
    return sum;
}

double _sum = 1.0, _val = 0.000001;
double fp_dmd_asm() // more or less
{
    asm
    {
                fld     qword ptr _sum[0];
                fstp    qword ptr -8[EBP];
                xor     EAX,EAX;
L11:            fld     qword ptr _val[0];
                fadd    qword ptr -8[EBP];
                fstp    qword ptr -8[EBP];
                fld     qword ptr _val[0];
                fsubr   qword ptr -8[EBP];
                fstp    qword ptr -8[EBP];
                fld     qword ptr _val[0];
                fmul    qword ptr -8[EBP];
                fstp    qword ptr -8[EBP];
                fld     qword ptr _val[0];
                fdivr   qword ptr -8[EBP];
                fstp    qword ptr -8[EBP];
                inc     EAX;
                cmp     EAX,10_000_000;
                jb      L11;
                fld     qword ptr -8[EBP];
                fstp    _sum[0];
    }
    return _sum;
}

double fp_dmd_opt()
{
    double sum = 1.0, val = 0.000001;
    asm
    {
                fld     qword ptr sum[0];
                fld     qword ptr val[0];
                fxch    ST(1);
                xor     EAX,EAX;
L1:             fadd    ST, ST(1);
                fsubr   ST, ST(1);
                fmul    ST, ST(1);
                fdivr   ST, ST(1);
                inc     EAX;
                cmp     EAX,10_000_000;
                jb      L1;
                fstp    sum[0];
                ffree   ST(1);
    }
    return sum;
}

Dave wrote: > I know this is simplistic, but could something like fxch be used to 'mimick' a register machine as vars are stored into registers? fxch is supposed to be very efficient. Yes, I just haven't done the work to figure it out.

Walter Bright wrote: > Dave wrote: >> I know this is simplistic, but could something like fxch be used to 'mimick' a register machine as vars are stored into registers? fxch is supposed to be very efficient. > > Yes, I just haven't done the work to figure it out. The definitive article on this is Agner Fog's optimisation manual at www.agner.org. It's fantastically difficult to do x87 perfectly (and this is probably the main appeal of SSE), but I'm sure there are massive gains to be had relatively cheaply.

Don Clugston wrote: > The definitive article on this is Agner Fog's optimisation manual at www.agner.org. It's fantastically difficult to do x87 perfectly (and this is probably the main appeal of SSE), but I'm sure there are massive gains to be had relatively cheaply. Thanks, it's a great reference I didn't know existed. But I couldn't find anything specific about enregistering ST register algorithms.

I think it is possible to systematic search for cases where dmd don't make optimal code. How this could be done is described at http://www.oonumerics.org/lunar/ The idea is that one start with a simples program main() { return(0); } Then one deoptimize it through one or more deoptimizing steps, example: main() { int i=0; if (false) { int i=10; } return(i) } At last one run the deoptimized program through the compiler and compere the result to the result of the start program. Now we just need someone to write a deoptimizer for D.

Does anyone know of tools which might be use to hack up a deoptimizer like that ??? It does not have to be a tool written in D, but is should be possible to generate D code. Maybe a good genetic programing tool can be used, I don't know. Any suggestions ?? On Mon, 13 Nov 2006 10:04:00 +0100, Knud Sørensen wrote: > I think it is possible to systematic search for cases where dmd don't make optimal code. > > How this could be done is described at http://www.oonumerics.org/lunar/ > > The idea is that one start with a simples program > main() > { > return(0); > } > > Then one deoptimize it through one or more deoptimizing steps, example: > > main() > { > int i=0; > if (false) > { > int i=10; > } > return(i) > } > > At last one run the deoptimized program through the compiler and compere the result to the result of the start program. > > Now we just need someone to write a deoptimizer for D.

Forums