Re: inline functions
March 26, 2011
On 2011-03-25 19:04, Caligo wrote:
> T[3] data;
> 
> T dot(const ref Vector o){
>     return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
> }
> 
> T LengthSquared_Fast(){ return data[0] * data[0] + data[1] * data[1] +
> data[2] * data[2]; }
> T LengthSquared_Slow(){ return dot(this); }
> 
> 
> The faster LengthSquared() is twice as fast, and I've tested with GDC and DMD.  Is it because the compilers don't inline-expand the dot() function call?  I need the performance, but the faster version is too verbose.

It sure sounds like it didn't inline it. Did you compile with -inline? If you didn't then it definitely won't inline it.
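
For reference, a typical dmd build that turns the inliner on looks like this (the file name is just a placeholder):

dmd -O -release -inline app.d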

- Jonathan M Davis
March 26, 2011
On Fri, Mar 25, 2011 at 10:49 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> On 2011-03-25 19:04, Caligo wrote:
>> T[3] data;
>>
>> T dot(const ref Vector o){
>>     return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
>> }
>>
>> T LengthSquared_Fast(){ return data[0] * data[0] + data[1] * data[1] +
>> data[2] * data[2]; }
>> T LengthSquared_Slow(){ return dot(this); }
>>
>>
>> The faster LengthSquared() is twice as fast, and I've tested with GDC and DMD.  Is it because the compilers don't inline-expand the dot() function call?  I need the performance, but the faster version is too verbose.
>
> It sure sounds like it didn't inline it. Did you compile with -inline? If you didn't then it definitely won't inline it.
>
> - Jonathan M Davis
>

I didn't know I had to supply GDC with -inline, so I did, and it did not help.  In fact, with the -inline option the performance gets worse (for DMD and GDC), even for code that doesn't contain any function calls.  In any case, code compiled with DMD is always behind GDC when it comes to performance.
March 26, 2011
On 2011-03-25 21:21, Caligo wrote:
> On Fri, Mar 25, 2011 at 10:49 PM, Jonathan M Davis <jmdavisProg@gmx.com>
wrote:
> > On 2011-03-25 19:04, Caligo wrote:
> >> T[3] data;
> >> 
> >> T dot(const ref Vector o){
> >>     return data[0] * o.data[0] + data[1] * o.data[1] + data[2] *
> >> o.data[2]; }
> >> 
> >> T LengthSquared_Fast(){ return data[0] * data[0] + data[1] * data[1] +
> >> data[2] * data[2]; }
> >> T LengthSquared_Slow(){ return dot(this); }
> >> 
> >> 
> >> The faster LengthSquared() is twice as fast, and I've tested with GDC and DMD.  Is it because the compilers don't inline-expand the dot() function call?  I need the performance, but the faster version is too verbose.
> > 
> > It sure sounds like it didn't inline it. Did you compile with -inline? If you didn't then it definitely won't inline it.
> > 
> > - Jonathan M Davis
> 
> I didn't know I had to supply GDC with -inline, so I did, and it did not help.  In fact, with the -inline option the performance gets worse (for DMD and GDC), even for code that doesn't contain any function calls.  In any case, code compiled with DMD is always behind GDC when it comes to performance.

I don't know what gdc does, but you have to use -inline with dmd if you want it to inline anything. It also really doesn't make any sense at all that inlining would harm performance. If that's the case, something weird is going on. I don't see how inlining could _ever_ harm performance unless it just makes the program's binary so big that _that_ harms performance. That isn't very likely though. So, if using -inline is harming performance, then something weird is definitely going on.

- Jonathan M Davis
March 26, 2011
On Fri, Mar 25, 2011 at 11:56 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> On 2011-03-25 21:21, Caligo wrote:
>> On Fri, Mar 25, 2011 at 10:49 PM, Jonathan M Davis <jmdavisProg@gmx.com>
> wrote:
>> > On 2011-03-25 19:04, Caligo wrote:
>> >> T[3] data;
>> >>
>> >> T dot(const ref Vector o){
>> >>     return data[0] * o.data[0] + data[1] * o.data[1] + data[2] *
>> >> o.data[2]; }
>> >>
>> >> T LengthSquared_Fast(){ return data[0] * data[0] + data[1] * data[1] +
>> >> data[2] * data[2]; }
>> >> T LengthSquared_Slow(){ return dot(this); }
>> >>
>> >>
>> >> The faster LengthSquared() is twice as fast, and I've tested with GDC and DMD.  Is it because the compilers don't inline-expand the dot() function call?  I need the performance, but the faster version is too verbose.
>> >
>> > It sure sounds like it didn't inline it. Did you compile with -inline? If you didn't then it definitely won't inline it.
>> >
>> > - Jonathan M Davis
>>
>> I didn't know I had to supply GDC with -inline, so I did, and it did not help.  In fact, with the -inline option the performance gets worse (for DMD and GDC), even for code that doesn't contain any function calls.  In any case, code compiled with DMD is always behind GDC when it comes to performance.
>
> I don't know what gdc does, but you have to use -inline with dmd if you want it to inline anything. It also really doesn't make any sense at all that inlining would harm performance. If that's the case, something weird is going on. I don't see how inlining could _ever_ harm performance unless it just makes the program's binary so big that _that_ harms performance. That isn't very likely though. So, if using -inline is harming performance, then something weird is definitely going on.
>
> - Jonathan M Davis
>

The only time that -inline has no effect is when I turn on -O3.  This is also when the code performs the best.  I've never used -O3 in my C++ code, but I guess things are different in D even with the same back-end.
March 26, 2011
On 2011-03-26 01:06, Caligo wrote:
> On Fri, Mar 25, 2011 at 11:56 PM, Jonathan M Davis <jmdavisProg@gmx.com>
wrote:
> > On 2011-03-25 21:21, Caligo wrote:
> >> On Fri, Mar 25, 2011 at 10:49 PM, Jonathan M Davis <jmdavisProg@gmx.com>
> > 
> > wrote:
> >> > On 2011-03-25 19:04, Caligo wrote:
> >> >> T[3] data;
> >> >> 
> >> >> T dot(const ref Vector o){
> >> >>     return data[0] * o.data[0] + data[1] * o.data[1] + data[2] *
> >> >> o.data[2]; }
> >> >> 
> >> >> T LengthSquared_Fast(){ return data[0] * data[0] + data[1] * data[1]
> >> >> + data[2] * data[2]; }
> >> >> T LengthSquared_Slow(){ return dot(this); }
> >> >> 
> >> >> 
> >> >> The faster LengthSquared() is twice as fast, and I've tested with GDC and DMD.  Is it because the compilers don't inline-expand the dot() function call?  I need the performance, but the faster version is too verbose.
> >> > 
> >> > It sure sounds like it didn't inline it. Did you compile with -inline? If you didn't then it definitely won't inline it.
> >> > 
> >> > - Jonathan M Davis
> >> 
> >> I didn't know I had to supply GDC with -inline, so I did, and it did not help.  In fact, with the -inline option the performance gets worse (for DMD and GDC), even for code that doesn't contain any function calls.  In any case, code compiled with DMD is always behind GDC when it comes to performance.
> > 
> > I don't know what gdc does, but you have to use -inline with dmd if you want it to inline anything. It also really doesn't make any sense at all that inlining would harm performance. If that's the case, something weird is going on. I don't see how inlining could _ever_ harm performance unless it just makes the program's binary so big that _that_ harms performance. That isn't very likely though. So, if using -inline is harming performance, then something weird is definitely going on.
> > 
> > - Jonathan M Davis
> 
> The only time that -inline has no effect is when I turn on -O3.  This is also when the code performs the best.  I've never used -O3 in my C++ code, but I guess things are different in D even with the same back-end.

I really don't know what gdc does. With dmd, inlining is not turned on unless -inline is used. Also, -inline with dmd does not force inlining, it merely turns on the optimization. The compiler still chooses where and when it's best to inline.

With gcc, I believe that inlining is normally turned on at a pretty low optimization level (probably -O), and like dmd, it chooses where and when it's best to inline, but unlike dmd, it uses the inline keyword in C++ as a hint as to what it should do. However, -O3 forces inlining on all functions marked with inline. How gdc deals with that given that D doesn't have an inline keyword, I don't know.

Regardless, given what inlining does, I have a _very_ hard time believing that it would ever degrade performance unless it's buggy.

- Jonathan M Davis
March 26, 2011
On Sat, Mar 26, 2011 at 3:47 AM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> On 2011-03-26 01:06, Caligo wrote:
>> On Fri, Mar 25, 2011 at 11:56 PM, Jonathan M Davis <jmdavisProg@gmx.com>
> wrote:
>> > On 2011-03-25 21:21, Caligo wrote:
>> >> On Fri, Mar 25, 2011 at 10:49 PM, Jonathan M Davis <jmdavisProg@gmx.com>
>> >
>> > wrote:
>> >> > On 2011-03-25 19:04, Caligo wrote:
>> >> >> T[3] data;
>> >> >>
>> >> >> T dot(const ref Vector o){
>> >> >>     return data[0] * o.data[0] + data[1] * o.data[1] + data[2] *
>> >> >> o.data[2]; }
>> >> >>
>> >> >> T LengthSquared_Fast(){ return data[0] * data[0] + data[1] * data[1]
>> >> >> + data[2] * data[2]; }
>> >> >> T LengthSquared_Slow(){ return dot(this); }
>> >> >>
>> >> >>
>> >> >> The faster LengthSquared() is twice as fast, and I've tested with GDC and DMD.  Is it because the compilers don't inline-expand the dot() function call?  I need the performance, but the faster version is too verbose.
>> >> >
>> >> > It sure sounds like it didn't inline it. Did you compile with -inline? If you didn't then it definitely won't inline it.
>> >> >
>> >> > - Jonathan M Davis
>> >>
>> >> I didn't know I had to supply GDC with -inline, so I did, and it did not help.  In fact, with the -inline option the performance gets worse (for DMD and GDC), even for code that doesn't contain any function calls.  In any case, code compiled with DMD is always behind GDC when it comes to performance.
>> >
>> > I don't know what gdc does, but you have to use -inline with dmd if you want it to inline anything. It also really doesn't make any sense at all that inlining would harm performance. If that's the case, something weird is going on. I don't see how inlining could _ever_ harm performance unless it just makes the program's binary so big that _that_ harms performance. That isn't very likely though. So, if using -inline is harming performance, then something weird is definitely going on.
>> >
>> > - Jonathan M Davis
>>
>> The only time that -inline has no effect is when I turn on -O3.  This is also when the code performs the best.  I've never used -O3 in my C++ code, but I guess things are different in D even with the same back-end.
>
> I really don't know what gdc does. With dmd, inlining is not turned on unless -inline is used. Also, -inline with dmd does not force inlining, it merely turns on the optimization. The compiler still chooses where and when it's best to inline.
>
> With gcc, I believe that inlining is normally turned on at a pretty low optimization level (probably -O), and like dmd, it chooses where and when it's best to inline, but unlike dmd, it uses the inline keyword in C++ as a hint as to what it should do. However, -O3 forces inlining on all functions marked with inline. How gdc deals with that given that D doesn't have an inline keyword, I don't know.
>
> Regardless, given what inlining does, I have a _very_ hard time believing that it would ever degrade performance unless it's buggy.
>
> - Jonathan M Davis
>


I was going to post my code, but I take back what I said.  What is happening is that there is a lot of fluctuation in performance.  The low performance always occurred when I had -inline enabled, which made me think -inline degrades performance.  The performance should be consistent, but for some reason it's not.

The important thing is that -inline doesn't make any difference with GDC, while -O3 does make a big difference.
March 26, 2011
Answer for Jonathan M Davis and Caligo:

As far as I remember, you need to use -finline-functions on GDC to get it to perform inlining.

-O3 implies inlining on GCC, and I presume on GDC too.
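
So a gdc build that should allow inlining is something like this (the file name is only a placeholder):

gdc -O3 -frelease t1.d
gdc -O2 -finline-functions -frelease t1.d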

Inlining is a complex art: the compilers compute a score for each function and each call site and decide whether to perform the inlining. There are many situations where inlining harms performance, and it's not just a matter of code cache pressure (this list of problems is not complete: http://en.wikipedia.org/wiki/Inlining#Problems ). DMD's inliner is in many ways weak compared to the GCC/LLVM (GDC/LDC) ones. 32 bit GDC/DMD are also able to use SSE+ registers, which sometimes gives performance gains.

To discuss the dot product performance a bit (a dot product is present in Phobos too), I suggest pulling out and showing the assembly code; timings alone don't suffice. I may produce some assembly later, if I create a little test program or if Caligo posts one here.
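
One way to pull out the listing (the file name is just whatever module you compile; obj2asm is the Digital Mars disassembler, objdump works as well):

gdc -O3 -S t1.d                                  # writes t1.s
dmd -c -O -release -inline t1.d && obj2asm t1.o  # or: objdump -d -M intel t1.o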

Bye,
bearophile
March 26, 2011
This little test program:


struct Vector(T) {
    T[3] data;

    T dot(const ref Vector o) {
        return data[0] * o.data[0] +
               data[1] * o.data[1] +
               data[2] * o.data[2];
    }

    T lengthSquaredSlow() {
        return dot(this);
    }

    T lengthSquaredFast() {
        return data[0] * data[0] +
               data[1] * data[1] +
               data[2] * data[2];
    }
}

Vector!double v;
void main() {}

The assembly, DMD 2.052, -O -release -inline:


dot (T=double):
        push    EBX
        mov EDX,EAX
        mov EBX,8[ESP]
        fld qword ptr 010h[EDX]
        fld qword ptr [EDX]
        fxch    ST1
        fmul    qword ptr 010h[EBX]
        fxch    ST1
        fld qword ptr 8[EDX]
        fxch    ST1
        fmul    qword ptr [EBX]
        fxch    ST1
        fmul    qword ptr 8[EBX]
        faddp   ST(1),ST
        faddp   ST(1),ST
        pop EBX
        ret 4

lengthSquaredSlow (T=double):
        mov ECX,EAX
        fld qword ptr 010h[EAX]
        fld qword ptr [ECX]
        fxch    ST1
        fmul    qword ptr 010h[ECX]
        fxch    ST1
        fld qword ptr 8[ECX]
        fxch    ST1
        fmul    qword ptr [ECX]
        fxch    ST1
        fmul    qword ptr 8[ECX]
        faddp   ST(1),ST
        faddp   ST(1),ST
        ret

lengthSquaredFast (T=double):
        mov ECX,EAX
        fld qword ptr 010h[EAX]
        fld qword ptr [ECX]
        fxch    ST1
        fmul    qword ptr 010h[ECX]
        fxch    ST1
        fld qword ptr 8[ECX]
        fxch    ST1
        fmul    qword ptr [ECX]
        fxch    ST1
        fmul    qword ptr 8[ECX]
        faddp   ST(1),ST
        faddp   ST(1),ST
        ret

The fast and slow versions seem to be compiled to the same code. So please, show a D code example where there is some difference.
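
Something along these lines would do as a timing harness (just a sketch: the loop count is arbitrary and the tiny perturbation is only there so the loop isn't folded away; swap the slow/fast call and compare the two runs):

import std.stdio;

struct Vector(T) {
    T[3] data;

    T dot(const ref Vector o) {
        return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
    }

    T lengthSquaredSlow() { return dot(this); }

    T lengthSquaredFast() {
        return data[0] * data[0] + data[1] * data[1] + data[2] * data[2];
    }
}

void main() {
    Vector!double v;
    v.data = [1.0, 2.0, 3.0];
    double sum = 0;
    foreach (i; 0 .. 100_000_000) {
        sum += v.lengthSquaredSlow();  // swap in lengthSquaredFast() for the other run
        v.data[0] += 1e-9;             // keep the loop from being optimized away
    }
    writeln(sum);                      // use the result so the work isn't dead-code eliminated
}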

Bye,
bearophile
March 26, 2011
I've changed my code since I posted this, so here is something different that shows a performance difference:

module t1;

struct Vector{

private:
  double x = void;
  double y = void;
  double z = void;

public:
  this(in double x, in double y, in double z){
    this.x = x;
    this.y = y;
    this.z = z;
  }

  Vector opBinary(string op)(const double rhs) const if(op == "*"){
    return mixin("Vector(x"~op~"rhs, y"~op~"rhs, z"~op~"rhs)");
  }

  Vector opBinaryRight(string op)(const double lhs) const if(op == "*"){
    return opBinary!op(lhs);
  }
}

void main(){

  auto v1 = Vector(4, 5, 6);
  for(int i = 0; i < 60_000_000; i++){
    v1 = v1 * 1.00000012;
    //v1 = 1.00000012 * v1;
  }
}


Calling opBinaryRight:
/*  gdc -O3 -o t1 t1.d

real    0m0.394s
user    0m0.390s
sys     0m0.000s
*/

Calling opBinary:
/* gdc -O3 -o t1 t1.d

real    0m0.321s
user    0m0.310s
sys     0m0.000s
*/

Those results are best of 10.

There shouldn't be a performance difference between the two, but there is.
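
I wonder whether implementing opBinaryRight directly, instead of forwarding through opBinary, would change anything; something like this (untested guess):

  // Variant that builds the result directly rather than forwarding to opBinary.
  Vector opBinaryRight(string op)(const double lhs) const if(op == "*"){
    return mixin("Vector(lhs"~op~"x, lhs"~op~"y, lhs"~op~"z)");
  }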
March 26, 2011
Caligo:

> There shouldn't be a performance difference between the two, but there.

It seems the compiler isn't removing some useless code (the first has 3 groups of movsd, the second has 4 of them):

------------

v = v * 1.00000012;
main:

L45:    mov ESI,offset FLAT:_D4test6Vector6__initZ
        lea EDI,068h[ESP]
        movsd
        movsd
        movsd
        movsd
        movsd
        movsd
        fld qword ptr 010h[ESP]
        fld qword ptr 018h[ESP]
        fxch    ST1
        fmul    qword ptr FLAT:_DATA[018h]
        lea ESI,068h[ESP]
        lea EDI,048h[ESP]
        fxch    ST1
        fmul    qword ptr FLAT:_DATA[018h]
        fld qword ptr 8[ESP]
        fmul    qword ptr FLAT:_DATA[018h]
        fxch    ST2
        fstp    qword ptr 080h[ESP]
        fxch    ST1
        fld qword ptr 080h[ESP]
        fxch    ST2
        fstp    qword ptr 088h[ESP]
        fxch    ST1
        fld qword ptr 088h[ESP]
        fxch    ST2
        fstp    qword ptr 068h[ESP]
        fstp    qword ptr 070h[ESP]
        fstp    qword ptr 078h[ESP]
        movsd
        movsd
        movsd
        movsd
        movsd
        movsd
        lea ESI,048h[ESP]
        lea EDI,8[ESP]
        movsd
        movsd
        movsd
        movsd
        movsd
        movsd
        inc EAX
        cmp EAX,03938700h
        jb  L45

-----------------------------

v = 1.00000012 * v;
main:

L45:    mov ESI,offset FLAT:_D4test6Vector6__initZ
        lea EDI,088h[ESP]
        movsd
        movsd
        movsd
        movsd
        movsd
        movsd
        fld qword ptr 010h[ESP]
        fld qword ptr 018h[ESP]
        fxch    ST1
        fmul    qword ptr FLAT:_DATA[018h]
        lea ESI,088h[ESP]
        fxch    ST1
        fmul    qword ptr FLAT:_DATA[018h]
        fld qword ptr 8[ESP]
        fxch    ST2
        lea EDI,068h[ESP]
        fxch    ST2
        fmul    qword ptr FLAT:_DATA[018h]
        fxch    ST2
        fstp    qword ptr 0A0h[ESP]
        fxch    ST1
        fld qword ptr 0A0h[ESP]
        fxch    ST2
        fstp    qword ptr 0A8h[ESP]
        fxch    ST1
        fld qword ptr 0A8h[ESP]
        fxch    ST2
        fstp    qword ptr 088h[ESP]
        fstp    qword ptr 090h[ESP]
        fstp    qword ptr 098h[ESP]
        movsd
        movsd
        movsd
        movsd
        movsd
        movsd
        lea ESI,068h[ESP]
        lea EDI,048h[ESP]
        movsd
        movsd
        movsd
        movsd
        movsd
        movsd
        lea ESI,048h[ESP]
        lea EDI,8[ESP]
        movsd
        movsd
        movsd
        movsd
        movsd
        movsd
        inc EAX
        cmp EAX,03938700h
        jb  L45

-----------------

v.x *= 1.00000012; v.y *= 1.00000012; v.z *= 1.00000012;

L42:    fld qword ptr FLAT:_DATA[018h]
        inc EAX
        cmp EAX,03938700h
        fmul    qword ptr 8[ESP]
        fstp    qword ptr 8[ESP]
        fld qword ptr FLAT:_DATA[018h]
        fmul    qword ptr 010h[ESP]
        fstp    qword ptr 010h[ESP]
        fld qword ptr FLAT:_DATA[018h]
        fmul    qword ptr 018h[ESP]
        fstp    qword ptr 018h[ESP]
        jb  L42

-----------------

For comparison, GCC on the equivalent C code uses only 5 instructions per loop, so there is room to improve here:

v.x *= 1.00000012; v.y *= 1.00000012; v.z *= 1.00000012;

L2:
    fmul    %st, %st(3)
    subl    $1, %eax
    fmul    %st, %st(2)
    fmul    %st, %st(1)
    jne L2

-----------------

C GCC, -mfpmath=sse -msse3

v.x *= 1.00000012; v.y *= 1.00000012; v.z *= 1.00000012;

L2:
	subl	$1, %eax
	mulsd	%xmm0, %xmm1
	mulsd	%xmm0, %xmm2
	mulsd	%xmm0, %xmm3
	jne	L2

-----------------

C GCC, -mfpmath=sse -msse3 -funroll-loops

L2:
	subl	$8, %eax
	mulsd	%xmm0, %xmm1
	mulsd	%xmm0, %xmm2
	mulsd	%xmm0, %xmm3
	mulsd	%xmm0, %xmm1
	mulsd	%xmm0, %xmm2
	mulsd	%xmm0, %xmm3
	mulsd	%xmm0, %xmm1
	mulsd	%xmm0, %xmm2
	mulsd	%xmm0, %xmm3
	mulsd	%xmm0, %xmm1
	mulsd	%xmm0, %xmm2
	mulsd	%xmm0, %xmm3
	mulsd	%xmm0, %xmm1
	mulsd	%xmm0, %xmm2
	mulsd	%xmm0, %xmm3
	mulsd	%xmm0, %xmm1
	mulsd	%xmm0, %xmm2
	mulsd	%xmm0, %xmm3
	mulsd	%xmm0, %xmm1
	mulsd	%xmm0, %xmm2
	mulsd	%xmm0, %xmm3
	mulsd	%xmm0, %xmm1
	mulsd	%xmm0, %xmm2
	mulsd	%xmm0, %xmm3
	jne	L2

I have not found a quick way to let GCC vectorize this code, performing two multiplications with one SSE instruction; I am not sure GCC is able to do this automatically.
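
At the D level, one idea (only a sketch, I have not measured it) is to keep the three components in a fixed-size array so the scaling becomes a single array operation; whether the backend then emits packed SSE multiplies is exactly the open question above:

struct Vector3 {
    double[3] data;              // x, y, z stored together
    void scale(in double k) {
        data[] *= k;             // D array operation; vectorization is left to the backend
    }
}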

Bye,
bearophile