SIMD implementation of dot-product. Benchmarks
August 17, 2013 (Ilya Yaroshenko)
http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html

Ilya



August 17, 2013 (John Colvin)
On Saturday, 17 August 2013 at 18:50:15 UTC, Ilya Yaroshenko wrote:
> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html
>
> Ilya

Nice, that's a good speedup.

BTW: -march=native automatically implies -mtune=native
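
For example, the flags might be combined like this (a hypothetical invocation, just to illustrate; not taken from the post):

	gdc -O3 -frelease -march=native dot_product.d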
August 17, 2013 (Ilya Yaroshenko)
> BTW: -march=native automatically implies -mtune=native

Thanks, I'll remove mtune)
August 17, 2013 (John Colvin)
On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko wrote:
>> BTW: -march=native automatically implies -mtune=native
>
> Thanks, I'll remove mtune)

It would be really interesting if you could try writing the same code in C, both a scalar version and a version using GCC's vector intrinsics, to allow us to compare performance and identify areas for D to improve.
August 17, 2013 (bearophile)
Ilya Yaroshenko:
> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html

From the blog post:

> Compile the fast_math code separately from the rest of the program and then link it. This is an easy solution. However, it is a step back to C.<
> Introduce a @fast_math attribute. This is hard to realize, but I hope it will be done in future compilers.<

One solution is to copy a feature of Lisp, that is, to offer an annotation that specifies different compilation switches for individual functions.

It has been present in GNU C for some time too:
http://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html#Function-Specific-Option-Pragmas

Bye,
bearophile
August 18, 2013 (Manu)
It doesn't look like you account for alignment.
This is basically non-portable (I doubt unaligned loads in this context are
faster than performing scalar operations), and possibly inefficient on x86
too.
Making it account for potentially random alignment will be awkward, but it
might be possible to do efficiently.


On 18 August 2013 04:50, Ilya Yaroshenko <ilyayaroshenko@gmail.com> wrote:

> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html
>
> Ilya


August 18, 2013 (Ilya Yaroshenko)
On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:
> It doesn't look like you account for alignment.
> This is basically non-portable (I doubt unaligned loads in this context are
> faster than performing scalar operations), and possibly inefficient on x86
> too.

dotProduct uses unaligned loads (__builtin_ia32_loadups256, __builtin_ia32_loadupd256) and is up to 21 times faster than the trivial scalar version.

Why are unaligned loads non-portable and inefficient?



> Making it account for potentially random alignment will be awkward, but it
> might be possible to do efficiently.

Did you mean using unaligned loads, or preparing the data for aligned loads at the beginning of the function?
August 18, 2013 (Ilya Yaroshenko)
On Saturday, 17 August 2013 at 19:38:52 UTC, John Colvin wrote:
> On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko wrote:
>>> BTW: -march=native automatically implies -mtune=native
>>
>> Thanks, I'll remove mtune)
>
> It would be really interesting if you could try writing the same code in C, both a scalar version and a version using GCC's vector intrinsics, to allow us to compare performance and identify areas for D to improve.

I am lazy )

I have looked at the assembler code:

float, scalar (main loop):
.L191:
	vmovss	xmm1, DWORD PTR [rsi+rax*4]
	vfmadd231ss	xmm0, xmm1, DWORD PTR [rcx+rax*4]
	add	rax, 1
	cmp	rax, rdi
	jne	.L191
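
For reference, a scalar D loop of roughly this shape (a hypothetical sketch, not the exact benchmark source) is what compiles down to the single vfmadd231ss above:

	float dotScalar(in float[] a, in float[] b)
	{
		float s = 0;
		foreach (i; 0 .. a.length)
			s += a[i] * b[i];   // fused into one vfmadd231ss per iteration
		return s;
	}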


float, vector (main loop):
.L2448:
	vmovups	ymm5, YMMWORD PTR [rax]
	sub	rax, -128
	sub	r11, -128
	vmovups	ymm4, YMMWORD PTR [r11-128]
	vmovups	ymm6, YMMWORD PTR [rax-96]
	vmovups	ymm7, YMMWORD PTR [r11-96]
	vfmadd231ps	ymm3, ymm5, ymm4
	vmovups	ymm8, YMMWORD PTR [rax-64]
	vmovups	ymm9, YMMWORD PTR [r11-64]
	vfmadd231ps	ymm0, ymm6, ymm7
	vmovups	ymm10, YMMWORD PTR [rax-32]
	vmovups	ymm11, YMMWORD PTR [r11-32]
	cmp	rdi, rax
	vfmadd231ps	ymm2, ymm8, ymm9
	vfmadd231ps	ymm1, ymm10, ymm11
	ja	.L2448

float, vector (full):
	https://gist.github.com/9il/6258443


It is pretty well optimized)
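
In core.simd terms, the unrolled loop corresponds to roughly the following (an untested sketch; dotVec is a hypothetical name, it assumes GDC with AVX and FMA enabled, and, unlike the vmovups loads in the listing, the pointer-cast loads here require 32-byte-aligned data):

	import core.simd;

	float dotVec(const(float)* a, const(float)* b, size_t n)
	{
		// Four independent accumulators hide the FMA latency,
		// matching ymm0..ymm3 in the listing above.
		float8 s0 = 0, s1 = 0, s2 = 0, s3 = 0;
		size_t i = 0;
		for (; i + 32 <= n; i += 32)
		{
			s0 += *cast(const(float8)*)(a + i     ) * *cast(const(float8)*)(b + i     );
			s1 += *cast(const(float8)*)(a + i +  8) * *cast(const(float8)*)(b + i +  8);
			s2 += *cast(const(float8)*)(a + i + 16) * *cast(const(float8)*)(b + i + 16);
			s3 += *cast(const(float8)*)(a + i + 24) * *cast(const(float8)*)(b + i + 24);
		}
		float8 s = (s0 + s1) + (s2 + s3);
		float r = 0;
		foreach (x; s.array)    // horizontal sum
			r += x;
		for (; i < n; ++i)      // scalar tail
			r += a[i] * b[i];
		return r;
	}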


____
Best regards

Ilya

August 18, 2013 (Manu)
On 18 August 2013 14:39, Ilya Yaroshenko <ilyayaroshenko@gmail.com> wrote:

> On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:
>
>> It doesn't look like you account for alignment.
>> This is basically non-portable (I doubt unaligned loads in this context
>> are
>> faster than performing scalar operations), and possibly inefficient on x86
>> too.
>>
>
> dotProduct uses unaligned loads (__builtin_ia32_loadups256, __builtin_ia32_loadupd256) and is up to 21 times faster than the trivial scalar version.
>
> Why are unaligned loads non-portable and inefficient?


x86 is the only arch that can perform an unaligned load. And even on x86 (many implementations) it's not very efficient.


>> Making it account for potentially random alignment will be awkward, but it
>> might be possible to do efficiently.
>>
>
> Did you mean using unaligned loads, or preparing the data for aligned loads at the beginning of the function?
>

I mean to only use aligned loads, in whatever way that happens to work out. The hard case is when the 2 arrays have different start offsets.

Otherwise you need to wrap your code in a version(x86) block.
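
A minimal sketch of the head-peeling idea (untested; dot is a hypothetical name, and note that both loads are only aligned when a and b share the same misalignment, which is exactly the hard case above):

	import core.simd;

	float dot(const(float)[] a, const(float)[] b)
	{
		assert(a.length == b.length);
		float r = 0;
		size_t i = 0;
		// Peel scalar iterations until a.ptr + i is 16-byte aligned.
		while (i < a.length && (cast(size_t)(a.ptr + i) & 15) != 0)
		{
			r += a[i] * b[i];
			++i;
		}
		// Vector loads, aligned with respect to a; the load from b is
		// only aligned if b started with the same offset as a.
		float4 acc = 0;
		for (; i + 4 <= a.length; i += 4)
			acc += *cast(const(float4)*)(a.ptr + i)
			     * *cast(const(float4)*)(b.ptr + i);
		foreach (x; acc.array)     // horizontal sum
			r += x;
		for (; i < a.length; ++i)  // scalar tail
			r += a[i] * b[i];
		return r;
	}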


August 18, 2013 (Ilya Yaroshenko)
On Sunday, 18 August 2013 at 05:07:12 UTC, Manu wrote:
> On 18 August 2013 14:39, Ilya Yaroshenko <ilyayaroshenko@gmail.com> wrote:
>
>> On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:
>>
>>> It doesn't look like you account for alignment.
>>> This is basically non-portable (I doubt unaligned loads in this context
>>> are
>>> faster than performing scalar operations), and possibly inefficient on x86
>>> too.
>>>
>>
>> dotProduct uses unaligned loads (__builtin_ia32_loadups256,
>> __builtin_ia32_loadupd256) and is up to 21 times faster than the trivial
>> scalar version.
>>
>> Why are unaligned loads non-portable and inefficient?
>
>
> x86 is the only arch that can perform an unaligned load. And even on x86
> (many implementations) it's not very efficient.

:(

>
>
>>> Making it account for potentially random alignment will be awkward, but it
>>> might be possible to do efficiently.
>>>
>>
>> Did you mean using unaligned loads, or preparing the data for aligned
>> loads at the beginning of the function?
>>
>
> I mean to only use aligned loads, in whatever way that happens to work out.
> The hard case is when the 2 arrays have different start offsets.
>
> Otherwise you need to wrap your code in a version(x86) block.

Thanks!
