SIMD benchmark

I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.

Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with.

----
import core.simd;

void test1a(float[4] a) { }

void test1()
{
    float[4] a = 1.2;
    a[] = a[] * 3 + 7;
    test1a(a);
}

void test2a(float4 a) { }

void test2()
{
    float4 a = 1.2;
    a = a * 3 + 7;
    test2a(a);
}

import std.stdio;
import std.datetime;

int main()
{
    test1();
    test2();
    auto b = comparingBenchmark!(test1, test2, 100);
    writeln(b.point);
    return 0;
}
----
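For anyone trying this on a newer toolchain: as far as I know, `comparingBenchmark` is no longer available in current Phobos releases, so the harness above may not compile as posted. Below is a minimal sketch of the same measurement using `benchmark` from `std.datetime.stopwatch`. The `timeBoth` wrapper and the run count are my own assumptions, not part of the original post. The test functions are copied from it unchanged, so the scalar-broadcast expressions may need adjusting for your compiler and target.

```d
import core.simd;
import core.time : Duration;
import std.datetime.stopwatch : benchmark;
import std.stdio;

// Array-operation version, copied from the post.
void test1a(float[4] a) { }

void test1()
{
    float[4] a = 1.2;
    a[] = a[] * 3 + 7;
    test1a(a);
}

// Vector-type version, copied from the post.
void test2a(float4 a) { }

void test2()
{
    float4 a = 1.2;
    a = a * 3 + 7;
    test2a(a);
}

// Hypothetical wrapper (not in the post): runs each function `runs` times
// and returns the total time spent in each, in the order given.
Duration[2] timeBoth(uint runs)
{
    return benchmark!(test1, test2)(runs);
}

void main()
{
    enum uint runs = 1_000_000; // arbitrary choice for illustration
    auto times = timeBoth(runs);
    writeln("float[4]: ", times[0]);
    writeln("float4:   ", times[1]);
}
```

Neither test body has an observable effect, so an optimising compiler is free to remove most of the work. Treat the ratio between the two timings as a rough indication only.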

On 1/14/2012 10:56 PM, Walter Bright wrote:
> as there's a serious lack of source code to
> test it with.

Here's what there is at the moment. Needs much more.

https://github.com/D-Programming-Language/dmd/blob/master/test/runnable/testxmm.d

On 15/01/12 6:56 AM, Walter Bright wrote:
> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
> Anyhow, it's good enough now to play around with. Consider it alpha
> quality. Expect bugs - but make bug reports, as there's a serious lack
> of source code to test it with.

You sure you want proper bug reports for this? There still seem to be a lot of issues.

For example, none of these work for me (OSX 64-bit).

----
int4 a = 2; // backend/cod2.c 2630
----
int4 a = void;
int4 b = void;
a = b; // segfault
----
int4 a = void;
a = simd(XMM.PXOR, a, a); // segfault
----

I could go on and on really. Very little seems to work at my end.

Actually, looking at the auto-tester, I'm not alone. Just seems to be OSX though.

http://d.puremagic.com/test-results/index.ghtml

On 15 January 2012 06:56, Walter Bright <newshound2@digitalmars.com> wrote:
> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with.

I get 20+ speedup without optimisations with GDC on that small test. :)

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

On 15 January 2012 16:59, Iain Buclaw <ibuclaw@ubuntu.com> wrote:
> On 15 January 2012 06:56, Walter Bright <newshound2@digitalmars.com> wrote:
>> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with.
>
> I get 20+ speedup without optimisations with GDC on that small test. :)
>

Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above. My oh my...

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

On 1/15/2012 3:49 AM, Peter Alexander wrote:
> Actually, looking at the auto-tester, I'm not alone. Just seems to be OSX though.

Yeah, it's just OSX. I had the test for that platform inadvertently disabled, gak.

On 15 January 2012 20:10, Iain Buclaw <ibuclaw@ubuntu.com> wrote:
> On 15 January 2012 16:59, Iain Buclaw <ibuclaw@ubuntu.com> wrote:
> > On 15 January 2012 06:56, Walter Bright <newshound2@digitalmars.com> wrote:
> >> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with.
> >
> > I get 20+ speedup without optimisations with GDC on that small test. :)
> >
>
> Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above. My oh my...

Oh my indeed. Haha, well I'm sure that's a fairly artificial result, but yes, this is why I've been harping on for months that it's a bare necessity to provide language support :P

Iain Buclaw:

> Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above. My oh my...

Please, show me the assembly code produced, with its relative D source :-)

Bye,
bearophile

January 15, 2012

Re: SIMD benchmark

Posted by Iain Buclaw
in reply to bearophile

On 15 January 2012 19:01, bearophile <bearophileHUGS@lycos.com> wrote:
> Iain Buclaw:
>
>> Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above.  My oh my...
>
> Please, show me the assembly code produced, with its relative D source :-)
>
> Bye,
> bearophile

D code:
----
import core.simd;

void test2a(float4 a) { }

float4 test2()
{
   float4 a = 1.2;
   a = a * 3 + 7;
   test2a(a);
   return a;
}
----

Relevant assembly:
----
.LC5:
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .section        .rodata.cst4,"aM",@progbits,4
        .align 4

_D4test5test2FZNhG4f:
        .cfi_startproc
        movl    $3, %eax
        cvtsi2ss        %eax, %xmm0
        movb    $7, %al
        cvtsi2ss        %eax, %xmm1
        unpcklps        %xmm0, %xmm0
        unpcklps        %xmm1, %xmm1
        movlhps %xmm0, %xmm0
        movlhps %xmm1, %xmm1
        mulps   .LC5(%rip), %xmm0
        addps   %xmm1, %xmm0
        ret
        .cfi_endproc
----

As someone pointed out to me, the only optimisation missing was constant propagation, but that doesn't matter too much for now.
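To make the constant propagation remark concrete, here is a small sketch of what the listing computes for each lane and what a folded version could load directly. The names `fromBits` and `lane` are hypothetical helpers introduced for illustration, and the mapping of source lines to instructions in the comments is my reading of the listing, not something stated in the post.

```d
import std.stdio;

// Hypothetical helper: reinterpret a 32-bit pattern as a float.
float fromBits(uint bits)
{
    return *cast(float*) &bits;
}

// Per-lane equivalent of the run-time work in the listing.
float lane()
{
    float three = 3;  // cvtsi2ss of the integer 3, then broadcast
    float seven = 7;  // cvtsi2ss of the integer 7, then broadcast
    return fromBits(1067030938) * three + seven;  // mulps, then addps
}

void main()
{
    // Each .long at .LC5 is 0x3F99999A, the single-precision encoding of 1.2.
    assert(fromBits(1067030938) == 1.2f);

    // With constant propagation the whole expression could be evaluated at
    // compile time, leaving one constant vector to load.
    enum float folded = 1.2f * 3 + 7;
    writefln("run-time lane: %s, folded constant: %s", lane(), folded);
}
```

In other words, the conversions, shuffles, multiply and add are all computed from values known at compile time, so a fully folded version of `test2` would only need to load one constant vector.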

Regards
-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
