SIMD benchmark (page 2)

On 15 January 2012 19:01, bearophile <bearophileHUGS@lycos.com> wrote:
> Iain Buclaw:
>
>> Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above. My oh my...
>
> Please, show me the assembly code produced, with its relative D source :-)
>
> Bye,
> bearophile

For those who can't read AT&T:

----
.LC5:
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .align 16
_D4test5test2FZNhG4f:
        .cfi_startproc
        mov     eax, 3
        cvtsi2ss        xmm0, eax
        mov     al, 7
        cvtsi2ss        xmm1, eax
        unpcklps        xmm0, xmm0
        unpcklps        xmm1, xmm1
        movlhps xmm0, xmm0
        movlhps xmm1, xmm1
        mulps   xmm0, XMMWORD PTR .LC5[rip]
        addps   xmm0, xmm1
        ret
        .cfi_endproc
----

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

I just built 32 & 64 bit DMD (latest commit on git tree is f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)

Using the 32-bit version, I got this error:
Internal error: backend/cg87.c 1702

The 64-bit version went fine.

Previously, both 32 and 64 bit version had no problem.

On 01/15/2012 01:56 PM, Walter Bright wrote:
> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with.
> -----------------------
> import core.simd;
>
> void test1a(float[4] a) { }
>
> void test1()
> {
>     float[4] a = 1.2;
>     a[] = a[] * 3 + 7;
>     test1a(a);
> }
>
> void test2a(float4 a) { }
>
> void test2()
> {
>     float4 a = 1.2;
>     a = a * 3 + 7;
>     test2a(a);
> }
>
> import std.stdio;
> import std.datetime;
>
> int main()
> {
>     test1();
>     test2();
>     auto b = comparingBenchmark!(test1, test2, 100);
>     writeln(b.point);
>     return 0;
> }

On 1/16/2012 12:59 AM, Andre Tampubolon wrote:
> I just built 32 & 64 bit DMD (latest commit on git tree is
> f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)
>
> Using the 32-bit version, I got this error:
> Internal error: backend/cg87.c 1702
>
> The 64-bit version went fine.
>
> Previously, both 32 and 64 bit version had no problem.

Which machine?

Well I only have 1 machine, a laptop running 64 bit Arch Linux.
Yesterday I did a git pull, built both 32 & 64 bit DMD, and this code compiled fine using those.
But now, the 32 bit version fails.

Walter Bright <newshound2@digitalmars.com> wrote:
> On 1/16/2012 12:59 AM, Andre Tampubolon wrote:
>> I just built 32 & 64 bit DMD (latest commit on git tree is
>> f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)
>>
>> Using the 32-bit version, I got this error:
>> Internal error: backend/cg87.c 1702
>>
>> The 64-bit version went fine.
>>
>> Previously, both 32 and 64 bit version had no problem.
>
> Which machine?

32 bit SIMD for Linux is not implemented. It's all 64 bit platforms, and 32 bit OS X.

On 1/16/2012 2:35 AM, Andre Tampubolon wrote:
> Well I only have 1 machine, a laptop running 64 bit Arch Linux.
> Yesterday I did a git pull, built both 32 & 64 bit DMD, and this code
> compiled fine using those.
> But now, the 32 bit version fails.
>
> Walter Bright <newshound2@digitalmars.com> wrote:
>> On 1/16/2012 12:59 AM, Andre Tampubolon wrote:
>>> I just built 32 & 64 bit DMD (latest commit on git tree is
>>> f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)
>>>
>>> Using the 32-bit version, I got this error:
>>> Internal error: backend/cg87.c 1702
>>>
>>> The 64-bit version went fine.
>>>
>>> Previously, both 32 and 64 bit version had no problem.
>>
>> Which machine?

On 1/15/12 12:56 AM, Walter Bright wrote:
> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
> Anyhow, it's good enough now to play around with. Consider it alpha
> quality. Expect bugs - but make bug reports, as there's a serious lack
> of source code to test it with.
> -----------------------
> import core.simd;
>
> void test1a(float[4] a) { }
>
> void test1()
> {
>     float[4] a = 1.2;
>     a[] = a[] * 3 + 7;
>     test1a(a);
> }
>
> void test2a(float4 a) { }
>
> void test2()
> {
>     float4 a = 1.2;
>     a = a * 3 + 7;
>     test2a(a);
> }

These two functions should have the same speed. The function that ought to be slower is:

void test1()
{
    float[5] a = 1.2;
    float[] b = a[1 .. $];
    b[] = b[] * 3 + 7;
    test1a(a);
}

Andrei

On 16 January 2012 18:17, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> On 1/15/12 12:56 AM, Walter Bright wrote:
>
>> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with.
>> -----------------------
>> import core.simd;
>>
>> void test1a(float[4] a) { }
>>
>> void test1()
>> {
>>     float[4] a = 1.2;
>>     a[] = a[] * 3 + 7;
>>     test1a(a);
>> }
>>
>> void test2a(float4 a) { }
>>
>> void test2()
>> {
>>     float4 a = 1.2;
>>     a = a * 3 + 7;
>>     test2a(a);
>> }
>>
>
> These two functions should have the same speed.

A function using float arrays and a function using hardware vectors should certainly not be the same speed.

On 1/16/12 10:46 AM, Manu wrote:
> A function using float arrays and a function using hardware vectors
> should certainly not be the same speed.

My point was that the version using float arrays should opportunistically use hardware ops whenever possible.

Andrei

On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> On 1/15/12 12:56 AM, Walter Bright wrote:
>> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
>> Anyhow, it's good enough now to play around with. Consider it alpha
>> quality. Expect bugs - but make bug reports, as there's a serious lack
>> of source code to test it with.
>> -----------------------
>> import core.simd;
>>
>> void test1a(float[4] a) { }
>>
>> void test1()
>> {
>>     float[4] a = 1.2;
>>     a[] = a[] * 3 + 7;
>>     test1a(a);
>> }
>>
>> void test2a(float4 a) { }
>>
>> void test2()
>> {
>>     float4 a = 1.2;
>>     a = a * 3 + 7;
>>     test2a(a);
>> }
>
> These two functions should have the same speed. The function that ought to be slower is:
>
> void test1()
> {
>     float[5] a = 1.2;
>     float[] b = a[1 .. $];
>     b[] = b[] * 3 + 7;
>     test1a(a);
> }
>
>
> Andrei

Unfortunately druntime's array ops are a mess and fail to speed up anything below 16 floats. Additionally there is overhead for a function call and they have to check alignment at runtime.

martin
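
For readers unfamiliar with the run-time alignment check mentioned above, here is a minimal sketch of what such a check looks like. It is not druntime's actual code; `isAligned16` and `countAlignedStarts` are hypothetical names invented for this illustration.

```d
import std.stdio;

// Hypothetical helper, NOT druntime's implementation:
// true if p sits on a 16-byte boundary, as aligned SSE loads/stores require.
bool isAligned16(const(void)* p) pure nothrow @nogc
{
    return (cast(size_t) p & 15) == 0;
}

// Hypothetical helper: counts how many of the four consecutive float
// positions starting at base are 16-byte aligned.
size_t countAlignedStarts(const(float)* base) pure nothrow @nogc
{
    size_t n = 0;
    foreach (i; 0 .. 4)
        if (isAligned16(base + i))
            ++n;
    return n;
}

void main()
{
    float[8] a = 1.2f;
    // Whether a.ptr itself is aligned depends on the compiler and stack layout.
    writeln("a.ptr aligned:         ", isAligned16(a.ptr));
    writeln("a[1 .. $].ptr aligned: ", isAligned16(a[1 .. $].ptr));
    writeln("aligned starts in a[0 .. 4]: ", countAlignedStarts(a.ptr));
}
```

A float is 4 bytes, so only one of any four consecutive element positions lies on a 16-byte boundary. A slice such as `a[1 .. $]` therefore cannot be assumed aligned, and a general-purpose array routine has to test the pointer before choosing aligned vector instructions.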

January 16, 2012

Re: SIMD benchmark

Posted by Manu
in reply to Andrei Alexandrescu


On 16 January 2012 18:48, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 1/16/12 10:46 AM, Manu wrote:
>
>> A function using float arrays and a function using hardware vectors should certainly not be the same speed.
>>
>
> My point was that the version using float arrays should opportunistically use hardware ops whenever possible.

I think this is a mistake, because such a piece of code never exists
outside of some context. If the context it exists within is all FPU code
(and it is; it's a float array), then swapping between FPU and SIMD
execution units will probably make the function slower than the
original (the float array is also unaligned). The SIMD version, however,
must exist within a SIMD context: since the API can't implicitly interact
with floats, each function is guaranteed to match the context within
which it lives.
This is fundamental to fast vector performance. Using SIMD is an
all-or-nothing decision; you can't just mix it in here and there.
You don't go casting back and forth between floats and ints on every other
line... obviously it's imprecise, but it's also a major performance hazard.
There is no difference here, except the performance hazard is much worse.
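
The "stay in one context" argument above can be sketched roughly as follows. This assumes a target where `core.simd` provides `float4` (for example 64-bit x86); `scaleAndBias` is a hypothetical function made up for the illustration, not part of any library.

```d
import core.simd;
import std.stdio;

// Hypothetical helper: takes and returns float4, so callers never
// drop back to scalar floats between steps.
float4 scaleAndBias(float4 v)
{
    float4 scale = 3.0f; // scalar broadcast to all four lanes at initialization
    float4 bias = 7.0f;
    return v * scale + bias;
}

void main()
{
    float4 v = 1.0f;

    // Chain the vector operations without touching individual lanes.
    v = scaleAndBias(v);
    v = scaleAndBias(v);

    // Leave the vector domain once, at the very end.
    float[4] result = v.array;
    writeln(result);
}
```

The only conversion to plain floats is the single `.array` read at the end. Reading or writing individual lanes between the two calls would be the kind of mixing the post warns against.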
