August 17, 2013 SIMD implementation of dot-product. Benchmarks

http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html

Ilya

August 17, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Ilya Yaroshenko

On Saturday, 17 August 2013 at 18:50:15 UTC, Ilya Yaroshenko wrote:
> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html
>
> Ilya

Nice, that's a good speedup.
BTW: -march=native automatically implies -mtune=native

August 17, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to John Colvin

> BTW: -march=native automatically implies -mtune=native

Thanks, I'll remove mtune)

August 17, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Ilya Yaroshenko

On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko wrote:
>> BTW: -march=native automatically implies -mtune=native
>
> Thanks, I'll remove mtune)

It would be really interesting if you could try writing the same code in C, both a scalar version and a version using GCC's vector intrinsics, to allow us to compare performance and identify areas for D to improve.
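As a rough illustration of the comparison John is asking for, here is a minimal C sketch (hypothetical code, not from the thread or the blog post) of a plain scalar dot product next to a version written with GCC's vector extensions; `memcpy` is used to express an unaligned load portably, and the vector version assumes the length is a multiple of 8:

```c
#include <stddef.h>
#include <string.h>

/* Plain scalar reference loop. */
static float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* GCC vector-extension version: 8 floats per 256-bit vector.
   Assumes n is a multiple of 8. */
typedef float v8sf __attribute__((vector_size(32)));

static float dot_vector(const float *a, const float *b, size_t n)
{
    v8sf acc = {0.0f};
    for (size_t i = 0; i < n; i += 8) {
        v8sf va, vb;
        memcpy(&va, a + i, sizeof va);   /* unaligned load */
        memcpy(&vb, b + i, sizeof vb);
        acc += va * vb;                  /* elementwise multiply-accumulate */
    }
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k)          /* horizontal reduction */
        sum += acc[k];
    return sum;
}
```

With `-O3 -march=native` GCC maps the vector type onto AVX registers where available, which makes the generated code directly comparable with the D version's output.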

August 17, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Ilya Yaroshenko

Ilya Yaroshenko:

> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html

From the blog post:

> Compile fast_math code from other program separately and then link it. This is easy solution. However this is a step back to C.
> To introduce a @fast_math attribute. This is hard to realize. But I hope this will be done for future compilers.

One solution is to copy one of the features of Lisp, that is, to offer an annotation that specifies different compilation switches for individual functions. The feature has been present in GNU C for some time as well:
http://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html#Function-Specific-Option-Pragmas

Bye,
bearophile
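For reference, a sketch of the GNU C feature bearophile points to: a per-function optimization attribute lets one hot function opt in to fast-math while the rest of the program keeps strict floating-point semantics. The function name and the chosen flag here are illustrative, not taken from the post:

```c
#include <stddef.h>

/* GCC-specific: apply -ffast-math-style optimization to this one
   function only (clang ignores the attribute with a warning). */
__attribute__((optimize("fast-math")))
static float dot_fast(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];   /* reassociation now allowed, enabling SIMD */
    return sum;
}
```

This is roughly what a hypothetical @fast_math attribute in D would do: scope the relaxed floating-point rules to a single function instead of a whole compilation unit.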

August 18, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Ilya Yaroshenko

It doesn't look like you account for alignment.
This is basically not-portable (I doubt unaligned loads in this context are
faster than performing scalar operations), and possibly inefficient on x86
too.
To make it account for potentially random alignment will be awkward, but it
might be possible to do efficiently.

On 18 August 2013 04:50, Ilya Yaroshenko <ilyayaroshenko@gmail.com> wrote:

> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html
>
> Ilya

August 18, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Manu

On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:

> It doesn't look like you account for alignment.
> This is basically not-portable (I doubt unaligned loads in this context are faster than performing scalar operations), and possibly inefficient on x86 too.

dotProduct uses unaligned loads (__builtin_ia32_loadups256, __builtin_ia32_loadupd256) and it is up to 21 times faster than the trivial scalar version.

Why are unaligned loads non-portable and inefficient?

> To make it account for potentially random alignment will be awkward, but it might be possible to do efficiently.

Did you mean use unaligned loads, or prepare the data for aligned loads at the beginning of the function?

August 18, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to John Colvin

On Saturday, 17 August 2013 at 19:38:52 UTC, John Colvin wrote:

> On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko wrote:
>>> BTW: -march=native automatically implies -mtune=native
>>
>> Thanks, I'll remove mtune)
>
> It would be really interesting if you could try writing the same code in C, both a scalar version and a version using GCC's vector intrinsics, to allow us to compare performance and identify areas for D to improve.

I am lazy ) I have looked at the assembler code instead:

float, scalar (main loop):

    .L191:
        vmovss      xmm1, DWORD PTR [rsi+rax*4]
        vfmadd231ss xmm0, xmm1, DWORD PTR [rcx+rax*4]
        add         rax, 1
        cmp         rax, rdi
        jne         .L191

float, vector (main loop):

    .L2448:
        vmovups     ymm5, YMMWORD PTR [rax]
        sub         rax, -128
        sub         r11, -128
        vmovups     ymm4, YMMWORD PTR [r11-128]
        vmovups     ymm6, YMMWORD PTR [rax-96]
        vmovups     ymm7, YMMWORD PTR [r11-96]
        vfmadd231ps ymm3, ymm5, ymm4
        vmovups     ymm8, YMMWORD PTR [rax-64]
        vmovups     ymm9, YMMWORD PTR [r11-64]
        vfmadd231ps ymm0, ymm6, ymm7
        vmovups     ymm10, YMMWORD PTR [rax-32]
        vmovups     ymm11, YMMWORD PTR [r11-32]
        cmp         rdi, rax
        vfmadd231ps ymm2, ymm8, ymm9
        vfmadd231ps ymm1, ymm10, ymm11
        ja          .L2448

float, vector (full): https://gist.github.com/9il/6258443

It is pretty optimized)

____
Best regards
Ilya
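The four ymm accumulators (ymm0..ymm3) in the vector loop above form a 4-way unrolled reduction: independent partial sums break the dependency chain on a single accumulator so the FMA units can overlap. A rough portable C analogue of that structure (a sketch, not the actual D source; assumes n is a multiple of 4) would be:

```c
#include <stddef.h>

/* Four independent accumulators mirror the four ymm accumulators
   in the generated assembly. Assumes n is a multiple of 4. */
static float dot_unrolled(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    /* Final horizontal combine, as after the asm loop. */
    return (s0 + s1) + (s2 + s3);
}
```

In the real code each accumulator is additionally a full 256-bit vector (8 floats), so the loop processes 32 floats per iteration.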

August 18, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Ilya Yaroshenko

On 18 August 2013 14:39, Ilya Yaroshenko <ilyayaroshenko@gmail.com> wrote:

> On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:
>
>> It doesn't look like you account for alignment.
>> This is basically not-portable (I doubt unaligned loads in this context are faster than performing scalar operations), and possibly inefficient on x86 too.
>
> dotProduct uses unaligned loads (__builtin_ia32_loadups256, __builtin_ia32_loadupd256) and it is up to 21 times faster than the trivial scalar version.
>
> Why are unaligned loads non-portable and inefficient?

x86 is the only arch that can perform an unaligned load. And even on x86 (many implementations) it's not very efficient.

>> To make it account for potentially random alignment will be awkward, but it might be possible to do efficiently.
>
> Did you mean use unaligned loads, or prepare the data for aligned loads at the beginning of the function?

I mean to only use aligned loads, in whatever way that happens to work out. The hard case is when the 2 arrays have different start offsets.

Otherwise you need to wrap your code in a version(x86) block.

August 18, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Manu

On Sunday, 18 August 2013 at 05:07:12 UTC, Manu wrote:

> x86 is the only arch that can perform an unaligned load. And even on x86 (many implementations) it's not very efficient.

:(

> I mean to only use aligned loads, in whatever way that happens to work out. The hard case is when the 2 arrays have different start offsets.
>
> Otherwise you need to wrap your code in a version(x86) block.

Thanks!
Copyright © 1999-2021 by the D Language Foundation