Thread overview | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
October 25, 2015 Playing SIMD | ||||
---|---|---|---|---|
| ||||
Here is my implementatation of SIMD find. Function returns index of ubyte in static 16 byte array with unique values. ------ immutable size_t ArraySize = 16; int simdIndexOf(ubyte niddle, ref const ubyte[ArraySize] haystack) { ubyte16 arr; arr.array = haystack[]; ubyte16 niddles; niddles.array[] = niddle; ubyte16 result; result = __simd_sto(XMM.PCMPEQB, arr, niddles); alias Mask = ulong; static assert(2 * Mask.sizeof == result.sizeof); immutable BitsInByte = 8; if (Mask mask = *cast(Mask*)(result.array.ptr)) { return bsf(mask) / BitsInByte; } else if (Mask mask = *cast(Mask*)(result.array.ptr + Mask.sizeof)) { return bsf(mask) / BitsInByte + cast(int)Mask.sizeof; } else { return -1; } } ------ Is it optimal and how do you implement this stuf? Full exaple with comparation of algorithms (SIMD, naive, binary search): http://dpaste.dzfl.pl/f3b8989841e3 Benchmark result on dpaste.dzfl.pl: SIMD: TickDuration(157000) Binary: TickDuration(472000) Naive: TickDuration(437000) At home with defult dub config "dub run --build=release": SIMD: TickDuration(241566) Binary: TickDuration(450515) Naive: TickDuration(450371) |
October 25, 2015 Re: Playing SIMD | ||||
---|---|---|---|---|
| ||||
Posted in reply to Iakh | On 10/25/2015 03:37 PM, Iakh wrote:
> Here is my implementatation of SIMD find. Function returns index of
> ubyte in static 16 byte array with unique values.
>
> ------
> immutable size_t ArraySize = 16;
> int simdIndexOf(ubyte niddle, ref const ubyte[ArraySize] haystack)
> {
> ubyte16 arr;
> arr.array = haystack[];
> ubyte16 niddles;
> niddles.array[] = niddle;
> ubyte16 result;
> result = __simd_sto(XMM.PCMPEQB, arr, niddles);
> alias Mask = ulong;
> static assert(2 * Mask.sizeof == result.sizeof);
> immutable BitsInByte = 8;
>
> if (Mask mask = *cast(Mask*)(result.array.ptr))
> {
> return bsf(mask) / BitsInByte;
> }
> else if (Mask mask = *cast(Mask*)(result.array.ptr + Mask.sizeof))
> {
> return bsf(mask) / BitsInByte + cast(int)Mask.sizeof;
> }
> else
> {
> return -1;
> }
>
> }
> ------
>
> Is it optimal and how do you implement this stuf?
>
> Full exaple with comparation of algorithms (SIMD, naive, binary search):
> http://dpaste.dzfl.pl/f3b8989841e3
>
> Benchmark result on dpaste.dzfl.pl:
> SIMD: TickDuration(157000)
> Binary: TickDuration(472000)
> Naive: TickDuration(437000)
>
> At home with defult dub config "dub run --build=release":
> SIMD: TickDuration(241566)
> Binary: TickDuration(450515)
> Naive: TickDuration(450371)
This is compelling but needs a bit of work to integrate. Care to work on a PR that makes std.algorithm use it? Thanks! -- Andrei
|
October 25, 2015 Re: Playing SIMD | ||||
---|---|---|---|---|
| ||||
Posted in reply to Iakh | On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
> Is it optimal and how do you implement this stuf?
>
I think it's better to use PMOVMSKB to avoid storing the PCMPEQB result in memory and you need only one BSF instruction.
|
October 25, 2015 Re: Playing SIMD | ||||
---|---|---|---|---|
| ||||
Posted in reply to Matthias Bentrup | On Sunday, 25 October 2015 at 22:17:58 UTC, Matthias Bentrup wrote: > On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote: >> Is it optimal and how do you implement this stuf? >> > > I think it's better to use PMOVMSKB to avoid storing the PCMPEQB result in memory and you need only one BSF instruction. Yeah but PMOVMSKB not implemented in core.simd. Bit more comprehensive here: http://forum.dlang.org/post/20150923115833.054fdb09@marco-toshiba |
October 25, 2015 Re: Playing SIMD | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Sunday, 25 October 2015 at 21:13:56 UTC, Andrei Alexandrescu wrote:
> [...]
> This is compelling but needs a bit of work to integrate. Care to work on a PR that makes std.algorithm use it? Thanks! -- Andrei
First of all I need sort of investigation about PRs and std.algorithm. But in general challenge accepted.
|
October 26, 2015 Re: Playing SIMD | ||||
---|---|---|---|---|
| ||||
Posted in reply to Iakh | On 25.10.2015 20:37, Iakh wrote: > Full exaple with comparation of algorithms (SIMD, naive, binary search): > http://dpaste.dzfl.pl/f3b8989841e3 From there: > void runBinary() > { > static int i = 0; > naiveIndexOf(cast(ubyte)(i++/ArraySize + 1), arr); > } runBinary calls naiveIndexOf. You're not testing binaryIndexOf. |
October 26, 2015 Re: Playing SIMD | ||||
---|---|---|---|---|
| ||||
Posted in reply to Iakh | On 10/25/15 6:57 PM, Iakh wrote:
> On Sunday, 25 October 2015 at 21:13:56 UTC, Andrei Alexandrescu wrote:
>> [...]
>> This is compelling but needs a bit of work to integrate. Care to work
>> on a PR that makes std.algorithm use it? Thanks! -- Andrei
>
> First of all I need sort of investigation about PRs and std.algorithm.
> But in general challenge accepted.
Thanks! -- Andrei
|
October 26, 2015 Re: Playing SIMD | ||||
---|---|---|---|---|
| ||||
Posted in reply to anonymous | On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote: > runBinary calls naiveIndexOf. You're not testing binaryIndexOf. You are right. This is fixed example: http://dpaste.dzfl.pl/f7a54b789a21 and results at dpaste.dzfl.pl: ----- SIMD: TickDuration(151000) Binary: TickDuration(255000) Naive: TickDuration(459000) So SIMD version ~1.68 faster than binary |
October 26, 2015 Re: Playing SIMD | ||||
---|---|---|---|---|
| ||||
Posted in reply to Iakh | On Monday, 26 October 2015 at 09:49:00 UTC, Iakh wrote:
> On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
>> runBinary calls naiveIndexOf. You're not testing binaryIndexOf.
>
> You are right.
> This is fixed example:
> http://dpaste.dzfl.pl/f7a54b789a21
>
> and results at dpaste.dzfl.pl:
> -----
> SIMD: TickDuration(151000)
> Binary: TickDuration(255000)
> Naive: TickDuration(459000)
>
> So SIMD version ~1.68 faster than binary
At home with defult dub config "dub run --build=release":
-----
SIMD: TickDuration(350644)
Binary: TickDuration(434014)
Naive: TickDuration(657548)
~1.24 times faster than binary and
~1.87 times faster than naive
|
October 26, 2015 Re: Playing SIMD | ||||
---|---|---|---|---|
| ||||
Posted in reply to Iakh | On Monday, 26 October 2015 at 11:10:31 UTC, Iakh wrote:
> On Monday, 26 October 2015 at 09:49:00 UTC, Iakh wrote:
>> On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
>>> runBinary calls naiveIndexOf. You're not testing binaryIndexOf.
>>
>> You are right.
>> This is fixed example:
>> http://dpaste.dzfl.pl/f7a54b789a21
>>
>> and results at dpaste.dzfl.pl:
>> -----
>> SIMD: TickDuration(151000)
>> Binary: TickDuration(255000)
>> Naive: TickDuration(459000)
>>
>> So SIMD version ~1.68 faster than binary
>
> At home with defult dub config "dub run --build=release":
> -----
> SIMD: TickDuration(350644)
> Binary: TickDuration(434014)
> Naive: TickDuration(657548)
>
>
> ~1.24 times faster than binary and
> ~1.87 times faster than naive
It's good to see work like this. Have you tried with gdc/ldc (-march=native / -mcpu=native respectively if you want to use the full available instruction set)?
|
Copyright © 1999-2021 by the D Language Foundation