Jump to page: 1 2 3
Thread overview
Playing SIMD
Oct 25, 2015
Iakh
Oct 25, 2015
Iakh
Oct 25, 2015
Matthias Bentrup
Oct 25, 2015
Iakh
Oct 26, 2015
Iain Buclaw
Oct 28, 2015
Marco Leise
Oct 26, 2015
anonymous
Oct 26, 2015
Iakh
Oct 26, 2015
Iakh
Oct 26, 2015
John Colvin
Oct 26, 2015
rumbu
Oct 26, 2015
Kagamin
Oct 26, 2015
Iakh
Oct 26, 2015
Iakh
Oct 26, 2015
Don
Oct 26, 2015
Iain Buclaw
Oct 26, 2015
Iakh
October 25, 2015
Here is my implementatation of SIMD find. Function returns index of ubyte in static 16 byte array with unique values.

------
immutable size_t ArraySize = 16;
int simdIndexOf(ubyte niddle, ref const ubyte[ArraySize] haystack)
{
    ubyte16 arr;
    arr.array = haystack[];
    ubyte16 niddles;
    niddles.array[] = niddle;
    ubyte16 result;
    result = __simd_sto(XMM.PCMPEQB, arr, niddles);
    alias Mask = ulong;
	static assert(2 * Mask.sizeof == result.sizeof);
	immutable BitsInByte = 8;

    if (Mask mask = *cast(Mask*)(result.array.ptr))
    {
        return bsf(mask) / BitsInByte;
    }
    else if (Mask mask = *cast(Mask*)(result.array.ptr + Mask.sizeof))
    {
        return bsf(mask) / BitsInByte + cast(int)Mask.sizeof;
    }
    else
    {
        return -1;
    }

}
------

Is it optimal and how do you implement this stuf?

Full exaple with comparation of algorithms (SIMD, naive, binary search):
http://dpaste.dzfl.pl/f3b8989841e3

Benchmark result on dpaste.dzfl.pl:
SIMD:   TickDuration(157000)
Binary: TickDuration(472000)
Naive:  TickDuration(437000)

At home with defult dub config "dub run --build=release":
SIMD:    TickDuration(241566)
Binary:  TickDuration(450515)
Naive:   TickDuration(450371)
October 25, 2015
On 10/25/2015 03:37 PM, Iakh wrote:
> Here is my implementatation of SIMD find. Function returns index of
> ubyte in static 16 byte array with unique values.
>
> ------
> immutable size_t ArraySize = 16;
> int simdIndexOf(ubyte niddle, ref const ubyte[ArraySize] haystack)
> {
>      ubyte16 arr;
>      arr.array = haystack[];
>      ubyte16 niddles;
>      niddles.array[] = niddle;
>      ubyte16 result;
>      result = __simd_sto(XMM.PCMPEQB, arr, niddles);
>      alias Mask = ulong;
>      static assert(2 * Mask.sizeof == result.sizeof);
>      immutable BitsInByte = 8;
>
>      if (Mask mask = *cast(Mask*)(result.array.ptr))
>      {
>          return bsf(mask) / BitsInByte;
>      }
>      else if (Mask mask = *cast(Mask*)(result.array.ptr + Mask.sizeof))
>      {
>          return bsf(mask) / BitsInByte + cast(int)Mask.sizeof;
>      }
>      else
>      {
>          return -1;
>      }
>
> }
> ------
>
> Is it optimal and how do you implement this stuf?
>
> Full exaple with comparation of algorithms (SIMD, naive, binary search):
> http://dpaste.dzfl.pl/f3b8989841e3
>
> Benchmark result on dpaste.dzfl.pl:
> SIMD:   TickDuration(157000)
> Binary: TickDuration(472000)
> Naive:  TickDuration(437000)
>
> At home with defult dub config "dub run --build=release":
> SIMD:    TickDuration(241566)
> Binary:  TickDuration(450515)
> Naive:   TickDuration(450371)

This is compelling but needs a bit of work to integrate. Care to work on a PR that makes std.algorithm use it? Thanks! -- Andrei

October 25, 2015
On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
> Is it optimal and how do you implement this stuf?
>

I think it's better to use PMOVMSKB to avoid storing the PCMPEQB result in memory and you need only one BSF instruction.
October 25, 2015
On Sunday, 25 October 2015 at 22:17:58 UTC, Matthias Bentrup wrote:
> On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
>> Is it optimal and how do you implement this stuf?
>>
>
> I think it's better to use PMOVMSKB to avoid storing the PCMPEQB result in memory and you need only one BSF instruction.

Yeah but PMOVMSKB not implemented in core.simd.

Bit more comprehensive here:
http://forum.dlang.org/post/20150923115833.054fdb09@marco-toshiba
October 25, 2015
On Sunday, 25 October 2015 at 21:13:56 UTC, Andrei Alexandrescu wrote:
> [...]
> This is compelling but needs a bit of work to integrate. Care to work on a PR that makes std.algorithm use it? Thanks! -- Andrei

First of all I need sort of investigation about PRs and std.algorithm. But in general challenge accepted.
October 26, 2015
On 25.10.2015 20:37, Iakh wrote:
> Full exaple with comparation of algorithms (SIMD, naive, binary search):
> http://dpaste.dzfl.pl/f3b8989841e3

From there:
> void runBinary()
> {
>     static int i = 0;
>     naiveIndexOf(cast(ubyte)(i++/ArraySize + 1), arr);
> }

runBinary calls naiveIndexOf. You're not testing binaryIndexOf.
October 26, 2015
On 10/25/15 6:57 PM, Iakh wrote:
> On Sunday, 25 October 2015 at 21:13:56 UTC, Andrei Alexandrescu wrote:
>> [...]
>> This is compelling but needs a bit of work to integrate. Care to work
>> on a PR that makes std.algorithm use it? Thanks! -- Andrei
>
> First of all I need sort of investigation about PRs and std.algorithm.
> But in general challenge accepted.

Thanks! -- Andrei
October 26, 2015
On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
> runBinary calls naiveIndexOf. You're not testing binaryIndexOf.

You are right.
This is fixed example:
http://dpaste.dzfl.pl/f7a54b789a21

and results at dpaste.dzfl.pl:
-----
SIMD:   TickDuration(151000)
Binary: TickDuration(255000)
Naive:  TickDuration(459000)

So SIMD version ~1.68 faster than binary
October 26, 2015
On Monday, 26 October 2015 at 09:49:00 UTC, Iakh wrote:
> On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
>> runBinary calls naiveIndexOf. You're not testing binaryIndexOf.
>
> You are right.
> This is fixed example:
> http://dpaste.dzfl.pl/f7a54b789a21
>
> and results at dpaste.dzfl.pl:
> -----
> SIMD:   TickDuration(151000)
> Binary: TickDuration(255000)
> Naive:  TickDuration(459000)
>
> So SIMD version ~1.68 faster than binary

At home with defult dub config "dub run --build=release":
-----
SIMD:    TickDuration(350644)
Binary:  TickDuration(434014)
Naive:   TickDuration(657548)


~1.24 times faster than binary and
~1.87 times faster than naive
October 26, 2015
On Monday, 26 October 2015 at 11:10:31 UTC, Iakh wrote:
> On Monday, 26 October 2015 at 09:49:00 UTC, Iakh wrote:
>> On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
>>> runBinary calls naiveIndexOf. You're not testing binaryIndexOf.
>>
>> You are right.
>> This is fixed example:
>> http://dpaste.dzfl.pl/f7a54b789a21
>>
>> and results at dpaste.dzfl.pl:
>> -----
>> SIMD:   TickDuration(151000)
>> Binary: TickDuration(255000)
>> Naive:  TickDuration(459000)
>>
>> So SIMD version ~1.68 faster than binary
>
> At home with defult dub config "dub run --build=release":
> -----
> SIMD:    TickDuration(350644)
> Binary:  TickDuration(434014)
> Naive:   TickDuration(657548)
>
>
> ~1.24 times faster than binary and
> ~1.87 times faster than naive

It's good to see work like this. Have you tried with gdc/ldc (-march=native / -mcpu=native respectively if you want to use the full available instruction set)?
« First   ‹ Prev
1 2 3