August 18, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Ilya Yaroshenko

movups is not good. It'll be a lot faster (and portable) if you use movaps.
Process looks something like:
* do the first few from a[0] until a's alignment interval as scalar
* load the left of b's aligned pair
* loop for each aligned vector in a
- load a[n..n+4] aligned
- load the right of b's pair
- combine left~right and shift left to match elements against a
- left = right
* perform stragglers as scalar
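A hedged C sketch of the structure described in the steps above (`dot_sse` and the names in it are illustrative, not code from this thread). For brevity it keeps a's loads aligned but still loads b unaligned, rather than doing the left/right combine-and-shift — so it shows the peel/aligned-loop/stragglers skeleton, not the full trick:

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

float dot_sse(const float *a, const float *b, size_t n)
{
    size_t i = 0;
    float sum = 0.0f;

    /* 1. scalar head: advance until a reaches a 16-byte boundary */
    while (i < n && ((uintptr_t)(a + i) & 15) != 0) {
        sum += a[i] * b[i];
        ++i;
    }

    /* 2. main loop: a's loads are aligned (movaps); b's load is left
       unaligned here for brevity, where the post above would combine
       two aligned loads of b and shift to line elements up with a. */
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(a + i);   /* aligned load */
        __m128 vb = _mm_loadu_ps(b + i);  /* unaligned load */
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    sum += (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    /* 3. scalar stragglers */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```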
Your benchmark is probably misleading too, because I suspect you are
passing directly alloc-ed arrays into the function (which are 16 byte
aligned).
movups will be significantly slower if the pointers supplied are not 16
byte aligned.
Also, results vary significantly between chip manufacturers and revisions.
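One way a benchmark can cover the unaligned case that the caveat above warns about is to deliberately knock the pointer off the 16-byte boundary before passing it in. A small sketch (the helper name and the one-float offset are illustrative):

```c
#include <stdint.h>
#include <stdlib.h>

/* malloc on mainstream x86 platforms hands back 16-byte aligned blocks,
   so a benchmark that passes malloc-ed arrays straight in only ever
   exercises the fast, aligned case.  This helper derives a deliberately
   misaligned float* from such a block so the slow path gets measured. */
static float *misalign(float *buf)
{
    char *p = (char *)buf;
    if (((uintptr_t)p & 15) == 0)
        p += sizeof(float);   /* shift off the 16-byte boundary */
    return (float *)p;
}
```

Allocate one extra vector's worth of floats so the misaligned view still has room for the full benchmark length.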
On 18 August 2013 14:55, Ilya Yaroshenko <ilyayaroshenko@gmail.com> wrote:
> On Saturday, 17 August 2013 at 19:38:52 UTC, John Colvin wrote:
>
>> On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko wrote:
>>
>>> BTW: -march=native automatically implies -mtune=native
>>>>
>>>
>>> Thanks, I`ll remove mtune)
>>>
>>
>> It would be really interesting if you could try writing the same code in c, both a scalar version and a version using gcc's vector instrinsics, to allow us to compare performance and identify areas for D to improve.
>>
>
> I am lazy )
>
> I have looked at assembler code:
>
> float, scalar (main loop):
> .L191:
> vmovss xmm1, DWORD PTR [rsi+rax*4]
> vfmadd231ss xmm0, xmm1, DWORD PTR [rcx+rax*4]
> add rax, 1
> cmp rax, rdi
> jne .L191
>
>
> float, vector (main loop):
> .L2448:
> vmovups ymm5, YMMWORD PTR [rax]
> sub rax, -128
> sub r11, -128
> vmovups ymm4, YMMWORD PTR [r11-128]
> vmovups ymm6, YMMWORD PTR [rax-96]
> vmovups ymm7, YMMWORD PTR [r11-96]
> vfmadd231ps ymm3, ymm5, ymm4
> vmovups ymm8, YMMWORD PTR [rax-64]
> vmovups ymm9, YMMWORD PTR [r11-64]
> vfmadd231ps ymm0, ymm6, ymm7
> vmovups ymm10, YMMWORD PTR [rax-32]
> vmovups ymm11, YMMWORD PTR [r11-32]
> cmp rdi, rax
> vfmadd231ps ymm2, ymm8, ymm9
> vfmadd231ps ymm1, ymm10, ymm11
> ja .L2448
>
> float, vector (full):
> https://gist.github.com/9il/6258443
>
>
> It is pretty optimized)
>
>
> ____
> Best regards
>
> Ilya
>
>
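One reason the vector loop quoted above is fast: it keeps four independent accumulators (ymm0–ymm3), so consecutive vfmadd231ps instructions don't stall on each other's latency. A hedged scalar C sketch of the same idea (function name illustrative):

```c
#include <stddef.h>

/* Four independent partial sums, mirroring the four ymm accumulators in
   the compiled loop: with a single accumulator every add would have to
   wait for the previous one's result. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)   /* stragglers */
        sum += a[i] * b[i];
    return sum;
}
```

Note the reassociation changes floating-point rounding slightly versus a strictly sequential sum, which is why compilers only emit this shape under relaxed FP flags.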
August 18, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Manu

On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:
> movups is not good. It'll be a lot faster (and portable) if you use movaps.
>
> Process looks something like:
> * do the first few from a[0] until a's alignment interval as scalar
> * load the left of b's aligned pair
> * loop for each aligned vector in a
> - load a[n..n+4] aligned
> - load the right of b's pair
> - combine left~right and shift left to match elements against a
> - left = right
> * perform stragglers as scalar
>
> Your benchmark is probably misleading too, because I suspect you are
> passing directly alloc-ed arrays into the function (which are 16 byte
> aligned).
> movups will be significantly slower if the pointers supplied are not 16
> byte aligned.
> Also, results vary significantly between chip manufacturers and revisions.
I'll try =). Thank you very much!
August 18, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Ilya Yaroshenko

On 8/17/13 11:50 AM, Ilya Yaroshenko wrote:
> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html
>
>
> Ilya
The images never load for me, all I see is some "Request timed out" stripes after the text.
Typo: Ununtu
Andrei
August 18, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Andrei Alexandrescu

On Sunday, 18 August 2013 at 16:32:33 UTC, Andrei Alexandrescu wrote:
> On 8/17/13 11:50 AM, Ilya Yaroshenko wrote:
>> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html
>>
>> Ilya
>
> The images never load for me, all I see is some "Request timed out" stripes after the text.

I have changed the interactive charts to PNG images.
Does it work?

> Typo: Ununtu
>
> Andrei
August 18, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Ilya Yaroshenko

On 17 August 2013 19:50, Ilya Yaroshenko <ilyayaroshenko@gmail.com> wrote:
> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html
>
> Ilya

Having a quick flick through the simd.d source, I see LDC's and GDC's implementations couldn't be any more wildly different... (LDC's doesn't even look like D code, thanks to pragma LDC_inline_ir :)

Thumbs up on marking functions as pure - I should really document somewhere how gcc builtins are fleshed out to the gcc.builtins module sometime...

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
August 18, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Ilya Yaroshenko

On 8/18/13 10:24 AM, Ilya Yaroshenko wrote:
> On Sunday, 18 August 2013 at 16:32:33 UTC, Andrei Alexandrescu wrote:
>> On 8/17/13 11:50 AM, Ilya Yaroshenko wrote:
>>> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html
>>>
>>>
>>>
>>> Ilya
>>
>> The images never load for me, all I see is some "Request timed out"
>> stripes after the text.
>
> I have changed the interactive charts to PNG images.
> Does it work?
Yes, thanks.
Andrei
August 21, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Ilya Yaroshenko

> On 8/17/13 11:50 AM, Ilya Yaroshenko wrote:
> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html

http://www.reddit.com/r/programming/comments/1ktue0/benchmarking_a_simd_implementation_of_dot_product/

Andrei
August 24, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Manu

On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:
> movups is not good. It'll be a lot faster (and portable) if you use movaps.
>
> Process looks something like:
> * do the first few from a[0] until a's alignment interval as scalar
> * load the left of b's aligned pair
> * loop for each aligned vector in a
> - load a[n..n+4] aligned
> - load the right of b's pair
> - combine left~right and shift left to match elements against a
> - left = right
> * perform stragglers as scalar
>
> Your benchmark is probably misleading too, because I suspect you are
> passing directly alloc-ed arrays into the function (which are 16 byte
> aligned).
> movups will be significantly slower if the pointers supplied are not 16
> byte aligned.
> Also, results vary significantly between chip manufacturers and revisions.
I have tried to write a fast implementation with aligned loads:
1. I have no idea how to shift (rotate) a 32-byte AVX vector without the XOP instruction set (XOP is available only on AMD).
2. I have tried to use one vmovaps and [one vmovups]/[two vinsertf128] with 16-byte aligned arrays (iterating over a first). It works slower than two vmovups (because of the loop tricks). Now I have 300 lines of slow dotProduct code =)
4. Condition for small arrays works well.
I think it is better to use:
1. vmovups, with a condition for small arrays, if it is available
2. a version like the one from Phobos if vmovups is not available
3. a special version for small static-size arrays
I think a version for static-size arrays can easily be done for Phobos; processors can unroll such code. And a dot product optimized for complex numbers can be done too.
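A minimal C sketch of the static-size idea (`N` and `dot_fixed` are illustrative names, not proposed Phobos API): with the length fixed at compile time, the compiler knows the trip count and can fully unroll and vectorize the loop, with no runtime alignment or remainder handling.

```c
#define N 8

/* Dot product with a compile-time length: the trip count is a constant,
   so an optimizing compiler can fully unroll and vectorize this loop. */
static inline float dot_fixed(const float a[N], const float b[N])
{
    float sum = 0.0f;
    for (int i = 0; i < N; ++i)
        sum += a[i] * b[i];
    return sum;
}
```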
Best regards
Ilya
August 26, 2013 Re: SIMD implementation of dot-product. Benchmarks
Posted in reply to Ilya Yaroshenko

On 25 August 2013 01:01, Ilya Yaroshenko <ilyayaroshenko@gmail.com> wrote:
> On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:
>
>> movups is not good. It'll be a lot faster (and portable) if you use
>> movaps.
>>
>> Process looks something like:
>> * do the first few from a[0] until a's alignment interval as scalar
>> * load the left of b's aligned pair
>> * loop for each aligned vector in a
>>   - load a[n..n+4] aligned
>>   - load the right of b's pair
>>   - combine left~right and shift left to match elements against a
>>   - left = right
>> * perform stragglers as scalar
>>
>> Your benchmark is probably misleading too, because I suspect you are
>> passing directly alloc-ed arrays into the function (which are 16 byte
>> aligned).
>> movups will be significantly slower if the pointers supplied are not 16
>> byte aligned.
>> Also, results vary significantly between chip manufacturers and revisions.
>
> I have tried to write a fast implementation with aligned loads:
> 1. I have no idea how to shift (rotate) a 32-byte AVX vector without the
> XOP instruction set (XOP is available only on AMD).
> 2. I have tried to use one vmovaps and [one vmovups]/[two vinsertf128]
> with 16-byte aligned arrays (iterating over a first). It works slower
> than two vmovups (because of the loop tricks). Now I have 300 lines of
> slow dotProduct code =)

This of course depends largely on your processor too. What processor/revision? There is a massive difference between vendors and revisions.

> 4. Condition for small arrays works well.

Did you try putting this path selection logic in an outer function that the compiler can inline?

Where's your new code with the movaps solution?
Copyright © 1999-2021 by the D Language Foundation