March 26, 2011 Re: inline functions
On 2011-03-25 19:04, Caligo wrote:
> T[3] data;
>
> T dot(const ref Vector o){
> return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
> }
>
> T LengthSquared_Fast(){ return data[0] * data[0] + data[1] * data[1] +
> data[2] * data[2]; }
> T LengthSquared_Slow(){ return dot(this); }
>
>
> The faster LengthSquared() is twice as fast, and I've tested with GDC and DMD. Is it because the compilers don't inline-expand the dot() function call? I need the performance, but the faster version is too verbose.
It sure sounds like it didn't inline it. Did you compile with -inline? If you didn't then it definitely won't inline it.
- Jonathan M Davis
March 26, 2011 Re: inline functions
On Fri, Mar 25, 2011 at 10:49 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> It sure sounds like it didn't inline it. Did you compile with -inline? If you didn't then it definitely won't inline it.
>
> - Jonathan M Davis
>
I didn't know I had to supply GDC with -inline, so I did, and it did not help. In fact, with the -inline option the performance gets worse (for DMD and GDC), even for code that doesn't contain any function calls. In any case, code compiled with DMD is always behind GDC when it comes to performance.
March 26, 2011 Re: inline functions
On 2011-03-25 21:21, Caligo wrote:
> I didn't know I had to supply GDC with -inline, so I did, and it did not help. In fact, with the -inline option the performance gets worse (for DMD and GDC), even for code that doesn't contain any function calls. In any case, code compiled with DMD is always behind GDC when it comes to performance.
I don't know what gdc does, but you have to use -inline with dmd if you want it to inline anything. It also really doesn't make any sense at all that inlining would harm performance. If that's the case, something weird is going on. I don't see how inlining could _ever_ harm performance unless it just makes the program's binary so big that _that_ harms performance. That isn't very likely though. So, if using -inline is harming performance, then something weird is definitely going on.
- Jonathan M Davis
March 26, 2011 Re: inline functions
On Fri, Mar 25, 2011 at 11:56 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> I don't know what gdc does, but you have to use -inline with dmd if you want it to inline anything. It also really doesn't make any sense at all that inlining would harm performance. If that's the case, something weird is going on. I don't see how inlining could _ever_ harm performance unless it just makes the program's binary so big that _that_ harms performance. That isn't very likely though. So, if using -inline is harming performance, then something weird is definitely going on.
>
> - Jonathan M Davis
>
The only time that -inline has no effect is when I turn on -O3. This is also when the code performs the best. I've never used -O3 in my C++ code, but I guess things are different in D even with the same back-end.
March 26, 2011 Re: inline functions
On 2011-03-26 01:06, Caligo wrote:
> The only time that -inline has no effect is when I turn on -O3. This is also when the code performs the best. I've never used -O3 in my C++ code, but I guess things are different in D even with the same back-end.
I really don't know what gdc does. With dmd, inlining is not turned on unless -inline is used. Also, -inline with dmd does not force inlining, it merely turns on the optimization. The compiler still chooses where and when it's best to inline.

With gcc, I believe that inlining is normally turned on at a pretty low optimization level (probably -O), and like dmd, it chooses where and when it's best to inline, but unlike dmd, it uses the inline keyword in C++ as a hint as to what it should do. However, -O3 forces inlining on all functions marked with inline. How gdc deals with that given that D doesn't have an inline keyword, I don't know.

Regardless, given what inlining does, I have a _very_ hard time believing that it would ever degrade performance unless it's buggy.
- Jonathan M Davis
March 26, 2011 Re: inline functions
On Sat, Mar 26, 2011 at 3:47 AM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> I really don't know what gdc does. With dmd, inlining is not turned on unless -inline is used. Also, -inline with dmd does not force inlining, it merely turns on the optimization. The compiler still chooses where and when it's best to inline.
>
> With gcc, I believe that inlining is normally turned on at a pretty low optimization level (probably -O), and like dmd, it chooses where and when it's best to inline, but unlike dmd, it uses the inline keyword in C++ as a hint as to what it should do. However, -O3 forces inlining on all functions marked with inline. How gdc deals with that given that D doesn't have an inline keyword, I don't know.
>
> Regardless, given what inlining does, I have a _very_ hard time believing that it would ever degrade performance unless it's buggy.
>
> - Jonathan M Davis
>
I was going to post my code, but I take back what I said. What is happening is that there is a lot of fluctuation in performance. The low performance always occurred when I had -inline enabled, which made me think -inline degrades performance. The performance should be consistent, but for some reason it's not.
The important thing is that -inline doesn't make any difference with GDC. The -O3 does make a big difference.
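Since the timings fluctuate from run to run, one way to compare the two methods is to repeat each measurement and keep the best time. The following is only a minimal sketch, assuming a recent Phobos that provides `std.datetime.stopwatch`; the helper name `bestOf`, the loop counts, and the `const` qualifiers on the methods are illustrative choices, not part of the original code.

```d
import std.algorithm.comparison : min;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

struct Vector(T) {
    T[3] data;

    T dot(const ref Vector o) const {
        return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
    }

    T lengthSquaredSlow() const { return dot(this); }

    T lengthSquaredFast() const {
        return data[0] * data[0] + data[1] * data[1] + data[2] * data[2];
    }
}

// Hypothetical helper: run `work` `runs` times and return the shortest
// elapsed time in microseconds, which is less sensitive to noise than a
// single measurement.
long bestOf(int runs, scope void delegate() work) {
    long best = long.max;
    foreach (_; 0 .. runs) {
        auto sw = StopWatch(AutoStart.yes);
        work();
        best = min(best, sw.peek().total!"usecs");
    }
    return best;
}

void main() {
    auto v = Vector!double([1.0, 2.0, 3.0]);
    double sink = 0; // use the results so the calls are not trivially dead code

    auto slow = bestOf(10, { foreach (i; 0 .. 10_000_000) sink += v.lengthSquaredSlow(); });
    auto fast = bestOf(10, { foreach (i; 0 .. 10_000_000) sink += v.lengthSquaredFast(); });

    writefln("slow: %s us, fast: %s us (sink = %s)", slow, fast, sink);
}
```

An optimizing compiler may still simplify such a loop heavily, so the numbers are only a starting point; the generated assembly is the more reliable evidence.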
March 26, 2011 Re: inline functions
Posted in reply to Jonathan M Davis
Answer for Jonathan M Davis and Caligo:

As far as I remember you need to use -finline-functions on GDC to perform inlining. -O3 implies inlining on GCC, and I presume on GDC too.

Inlining is a complex art: the compilers compute a score for each function and each function call and decide whether to perform the inlining. There are many situations where inlining harms performance, and it's not just a matter of code cache pressure (this list of problems is not complete: http://en.wikipedia.org/wiki/Inlining#Problems ).

DMD inlining is in many ways weak compared to GCC/LLVM (GDC/LDC) ones. 32 bit GDC/DMD are also able to use SSE+ registers, that sometimes give performance gains.

To discuss a bit about the dot product performance (that's present in Phobos too) I suggest pulling out and showing the assembly code. Timings alone don't suffice. I may produce some assembly later, if I create a little test program or if Caligo posts one here.

Bye,
bearophile
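For readers on newer compilers: D releases later than the ones discussed in this thread added `pragma(inline, true)`, which requests inlining of a specific function instead of leaving the decision entirely to the compiler's scoring. The sketch below assumes such a compiler; how strictly the request is honored depends on the implementation and on the flags used.

```d
struct Vector(T) {
    T[3] data;

    // Request that calls to dot() be inlined at the call site.
    pragma(inline, true)
    T dot(const ref Vector o) const {
        return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
    }

    // The short version from the thread, now calling an inline-requested dot().
    T lengthSquared() const { return dot(this); }
}

void main() {
    auto v = Vector!double([1.0, 2.0, 3.0]);
    assert(v.lengthSquared() == 14.0); // 1 + 4 + 9
}
```

Whether the call really disappears is still best confirmed by looking at the generated assembly.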
March 26, 2011 Re: inline functions
Posted in reply to bearophile
This little test program:

struct Vector(T) {
    T[3] data;

    T dot(const ref Vector o) {
        return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
    }

    T lengthSquaredSlow() {
        return dot(this);
    }

    T lengthSquaredFast() {
        return data[0] * data[0] + data[1] * data[1] + data[2] * data[2];
    }
}

Vector!double v;

void main() {}

The assembly, DMD 2.052, -O -release -inline:

dot (T=double):
push EBX
mov EDX,EAX
mov EBX,8[ESP]
fld qword ptr 010h[EDX]
fld qword ptr [EDX]
fxch ST1
fmul qword ptr 010h[EBX]
fxch ST1
fld qword ptr 8[EDX]
fxch ST1
fmul qword ptr [EBX]
fxch ST1
fmul qword ptr 8[EBX]
faddp ST(1),ST
faddp ST(1),ST
pop EBX
ret 4

lengthSquaredSlow (T=double):
mov ECX,EAX
fld qword ptr 010h[EAX]
fld qword ptr [ECX]
fxch ST1
fmul qword ptr 010h[ECX]
fxch ST1
fld qword ptr 8[ECX]
fxch ST1
fmul qword ptr [ECX]
fxch ST1
fmul qword ptr 8[ECX]
faddp ST(1),ST
faddp ST(1),ST
ret

lengthSquaredFast (T=double):
mov ECX,EAX
fld qword ptr 010h[EAX]
fld qword ptr [ECX]
fxch ST1
fmul qword ptr 010h[ECX]
fxch ST1
fld qword ptr 8[ECX]
fxch ST1
fmul qword ptr [ECX]
fxch ST1
fmul qword ptr 8[ECX]
faddp ST(1),ST
faddp ST(1),ST
ret

The fast and slow versions seem to be compiled to the same code. So please, show a D code example where there is some difference.

Bye,
bearophile
March 26, 2011 Re: inline functions
Posted in reply to bearophile
I've changed my code since I posted this, so here is something different that shows a performance difference:

module t1;

struct Vector{
private:
    double x = void;
    double y = void;
    double z = void;
public:
    this(in double x, in double y, in double z){
        this.x = x;
        this.y = y;
        this.z = z;
    }

    Vector opBinary(string op)(const double rhs) const if(op == "*"){
        return mixin("Vector(x"~op~"rhs, y"~op~"rhs, z"~op~"rhs)");
    }

    Vector opBinaryRight(string op)(const double lhs) const if(op == "*"){
        return opBinary!op(lhs);
    }
}

void main(){
    auto v1 = Vector(4, 5, 6);
    for(int i = 0; i < 60_000_000; i++){
        v1 = v1 * 1.00000012;
        //v1 = 1.00000012 * v1;
    }
}

Calling opBinaryRight:
/*
gdc -O3 -o t1 t1.d
real 0m0.394s
user 0m0.390s
sys 0m0.000s
*/

Calling opBinary:
/*
gdc -O3 -o t1 t1.d
real 0m0.321s
user 0m0.310s
sys 0m0.000s
*/

Those results are best of 10. There shouldn't be a performance difference between the two, but there is.
March 26, 2011 Re: inline functions
Posted in reply to Caligo
Caligo:
> There shouldn't be a performance difference between the two, but there is.
It seems the compiler isn't removing some useless code (the first listing has 3 groups of movsd, the second has 4 of them):
------------
v = v * 1.00000012;
main:
L45: mov ESI,offset FLAT:_D4test6Vector6__initZ
lea EDI,068h[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
fld qword ptr 010h[ESP]
fld qword ptr 018h[ESP]
fxch ST1
fmul qword ptr FLAT:_DATA[018h]
lea ESI,068h[ESP]
lea EDI,048h[ESP]
fxch ST1
fmul qword ptr FLAT:_DATA[018h]
fld qword ptr 8[ESP]
fmul qword ptr FLAT:_DATA[018h]
fxch ST2
fstp qword ptr 080h[ESP]
fxch ST1
fld qword ptr 080h[ESP]
fxch ST2
fstp qword ptr 088h[ESP]
fxch ST1
fld qword ptr 088h[ESP]
fxch ST2
fstp qword ptr 068h[ESP]
fstp qword ptr 070h[ESP]
fstp qword ptr 078h[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
lea ESI,048h[ESP]
lea EDI,8[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
inc EAX
cmp EAX,03938700h
jb L45
-----------------------------
v = 1.00000012 * v;
main:
L45: mov ESI,offset FLAT:_D4test6Vector6__initZ
lea EDI,088h[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
fld qword ptr 010h[ESP]
fld qword ptr 018h[ESP]
fxch ST1
fmul qword ptr FLAT:_DATA[018h]
lea ESI,088h[ESP]
fxch ST1
fmul qword ptr FLAT:_DATA[018h]
fld qword ptr 8[ESP]
fxch ST2
lea EDI,068h[ESP]
fxch ST2
fmul qword ptr FLAT:_DATA[018h]
fxch ST2
fstp qword ptr 0A0h[ESP]
fxch ST1
fld qword ptr 0A0h[ESP]
fxch ST2
fstp qword ptr 0A8h[ESP]
fxch ST1
fld qword ptr 0A8h[ESP]
fxch ST2
fstp qword ptr 088h[ESP]
fstp qword ptr 090h[ESP]
fstp qword ptr 098h[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
lea ESI,068h[ESP]
lea EDI,048h[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
lea ESI,048h[ESP]
lea EDI,8[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
inc EAX
cmp EAX,03938700h
jb L45
-----------------
v.x *= 1.00000012; v.y *= 1.00000012; v.z *= 1.00000012;
L42: fld qword ptr FLAT:_DATA[018h]
inc EAX
cmp EAX,03938700h
fmul qword ptr 8[ESP]
fstp qword ptr 8[ESP]
fld qword ptr FLAT:_DATA[018h]
fmul qword ptr 010h[ESP]
fstp qword ptr 010h[ESP]
fld qword ptr FLAT:_DATA[018h]
fmul qword ptr 018h[ESP]
fstp qword ptr 018h[ESP]
jb L42
-----------------
The equivalent C code compiled with GCC uses only 5 instructions per loop iteration, improving on this:
v.x *= 1.00000012; v.y *= 1.00000012; v.z *= 1.00000012;
L2:
fmul %st, %st(3)
subl $1, %eax
fmul %st, %st(2)
fmul %st, %st(1)
jne L2
-----------------
C GCC, -mfpmath=sse -msse3
v.x *= 1.00000012; v.y *= 1.00000012; v.z *= 1.00000012;
L2:
subl $1, %eax
mulsd %xmm0, %xmm1
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm3
jne L2
-----------------
C GCC, -mfpmath=sse -msse3 -funroll-loops
L2:
subl $8, %eax
mulsd %xmm0, %xmm1
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm3
mulsd %xmm0, %xmm1
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm3
mulsd %xmm0, %xmm1
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm3
mulsd %xmm0, %xmm1
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm3
mulsd %xmm0, %xmm1
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm3
mulsd %xmm0, %xmm1
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm3
mulsd %xmm0, %xmm1
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm3
mulsd %xmm0, %xmm1
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm3
jne L2
I have not found a quick way to make GCC vectorize this code, performing two multiplications with one SSE instruction. I am not sure GCC is able to do this automatically.
Bye,
bearophile
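The field-by-field update measured in the third listing can also be written behind an operator, so the call site stays short. This is a minimal sketch, not code from the thread: the `opOpAssign` overload is an addition, the fields are left public for brevity, and whether a given compiler turns it into the same loop as the hand-written version would have to be checked in the assembly.

```d
struct Vector {
    double x, y, z;

    // Out-of-place scaling, equivalent in effect to the opBinary posted above.
    Vector opBinary(string op : "*")(const double rhs) const {
        return Vector(x * rhs, y * rhs, z * rhs);
    }

    // In-place scaling: `v *= s` updates the fields directly, so no
    // temporary Vector has to be built and copied back.
    ref Vector opOpAssign(string op : "*")(const double rhs) {
        x *= rhs;
        y *= rhs;
        z *= rhs;
        return this;
    }
}

void main() {
    auto v = Vector(4, 5, 6);
    foreach (i; 0 .. 60_000_000) {
        v *= 1.00000012; // same arithmetic as v.x *= ...; v.y *= ...; v.z *= ...
    }
}
```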