October 26, 2015
On 10/26/2015 05:48 AM, Iakh wrote:
> On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
>> runBinary calls naiveIndexOf. You're not testing binaryIndexOf.
>
> You are right.
> Here is the fixed example:
> http://dpaste.dzfl.pl/f7a54b789a21
>
> and results at dpaste.dzfl.pl:
> -----
> SIMD:   TickDuration(151000)
> Binary: TickDuration(255000)
> Naive:  TickDuration(459000)
>
> So the SIMD version is ~1.68x faster than the binary one.

That's a healthy margin. It may get eroded by the startup/finish code that needs to get to the first aligned chunk and handle the misaligned data at the end, etc. But it's a solid proof of concept. -- Andrei
October 26, 2015
On Monday, 26 October 2015 at 11:47:56 UTC, Andrei Alexandrescu wrote:
> On 10/26/2015 05:48 AM, Iakh wrote:
>> On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
>>> runBinary calls naiveIndexOf. You're not testing binaryIndexOf.
>>
>> You are right.
>> Here is the fixed example:
>> http://dpaste.dzfl.pl/f7a54b789a21
>>
>> and results at dpaste.dzfl.pl:
>> -----
>> SIMD:   TickDuration(151000)
>> Binary: TickDuration(255000)
>> Naive:  TickDuration(459000)
>>
>> So the SIMD version is ~1.68x faster than the binary one.
>
> That's a healthy margin. It may get eroded by the startup/finish code that needs to get to the first aligned chunk and handle the misaligned data at the end, etc. But it's a solid proof of concept. -- Andrei

Interpolation search is even faster :)

http://dpaste.dzfl.pl/4443c5753454

-----
SIMD:          TickDuration(144000)
Binary:        TickDuration(229000)
Naive:         TickDuration(472000)
Interpolation: TickDuration(92000)
-----
SIMD:          TickDuration(145000)
Binary:        TickDuration(247000)
Naive:         TickDuration(463000)
Interpolation: TickDuration(91000)
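For readers without access to the paste, interpolation search can be sketched as below. This is an illustrative reconstruction in C, not the code from dpaste, and `interpolationIndexOf` is a hypothetical name: instead of always probing the middle of the range, it estimates the key's likely position from its relative magnitude, which is why it wins on uniformly distributed data.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch (not the dpaste code): interpolation search over a
 * sorted ubyte array. Instead of probing the middle, estimate the key's
 * likely position from its relative magnitude within the current range. */
ptrdiff_t interpolationIndexOf(const unsigned char *a, size_t n,
                               unsigned char key)
{
    size_t lo = 0, hi = n;                 /* search within [lo, hi) */
    while (lo < hi) {
        unsigned char vlo = a[lo], vhi = a[hi - 1];
        if (key < vlo || key > vhi)
            return -1;                     /* key outside the range */
        /* Linear estimate of the position; reduces to lo on a flat run. */
        size_t probe = (vlo == vhi)
            ? lo
            : lo + (hi - 1 - lo) * (size_t)(key - vlo) / (size_t)(vhi - vlo);
        if (a[probe] == key) return (ptrdiff_t)probe;
        if (a[probe] < key)  lo = probe + 1;
        else                 hi = probe;
    }
    return -1;                             /* not found */
}
```

The worst case degrades to linear probing on skewed distributions, which is why the benchmark above (uniform byte values) flatters it.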



October 26, 2015
On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
> Here is my implementation of SIMD find. The function returns the index of a ubyte in a static 16-byte array with unique values.

[snip]

You need to be very careful with doing benchmarks on tiny test cases, they can be very misleading.

Be aware that the speed of bsf() and bsr() is very strongly processor dependent. On some machines it is utterly pathetic: on AMD K7, BSR is 23 micro-operations; on the original Pentium it was up to 73 (!); even on AMD Bobcat it is 11 micro-ops; but on recent Intel it is one micro-op. That factor of 73 can totally screw up your performance comparisons.

Just because it is a single machine instruction does not mean it is fast.
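To make Don's point concrete: on CPUs where BSF is slow, a branch-free software fallback can actually win. The classic de Bruijn multiply from the well-known "Bit Twiddling Hacks" collection computes count-trailing-zeros without any bit-scan instruction; this is an illustrative C sketch, not code from the thread:

```c
#include <assert.h>
#include <stdint.h>

/* Portable count-trailing-zeros via a de Bruijn multiply. On CPUs where
 * BSF costs dozens of micro-ops, a short sequence like this can win;
 * on recent Intel, BSF/TZCNT is a single micro-op and the hardware
 * instruction is faster. v must be nonzero. */
static const int debruijn_tab[32] = {
     0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

int ctz32(uint32_t v)
{
    /* v & -v isolates the lowest set bit; multiplying by the de Bruijn
     * constant places a unique 5-bit pattern in the top bits, which
     * indexes the lookup table. */
    return debruijn_tab[((v & -v) * 0x077CB531u) >> 27];
}
```

Which side wins is exactly the kind of thing a benchmark on one machine cannot tell you.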

October 26, 2015
On 25 Oct 2015 11:50 pm, "Iakh via Digitalmars-d" <digitalmars-d@puremagic.com> wrote:
>
> On Sunday, 25 October 2015 at 22:17:58 UTC, Matthias Bentrup wrote:
>>
>> On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
>>>
>>> Is it optimal and how do you implement this stuff?
>>>
>>
>> I think it's better to use PMOVMSKB to avoid storing the PCMPEQB result
>> in memory, and you need only one BSF instruction.
>
>
> Yeah, but PMOVMSKB is not implemented in core.simd.
>

Don't use core.simd, push for getting std.simd in, then leverage the generics exposed through that module.
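Matthias's sequence (PCMPEQB to compare all 16 bytes at once, PMOVMSKB to collapse the comparison to a 16-bit mask, one BSF to find the first set bit) can be modeled in portable scalar C. The sketch below only illustrates what each instruction computes; a real version would use core.simd/std.simd or compiler intrinsics such as `_mm_cmpeq_epi8` and `_mm_movemask_epi8`:

```c
#include <assert.h>

/* Scalar model of the SSE2 sequence PCMPEQB -> PMOVMSKB -> BSF.
 * Each loop below stands in for one instruction; a real SIMD version
 * does the compare and the mask extraction in one instruction each. */
int simdModelIndexOf(const unsigned char block[16], unsigned char needle)
{
    unsigned mask = 0;
    for (int i = 0; i < 16; ++i)   /* PCMPEQB: 16 byte-compares at once */
        if (block[i] == needle)
            mask |= 1u << i;       /* PMOVMSKB: one bit per byte lane */
    if (mask == 0)
        return -1;                 /* needle absent from the block */
    int idx = 0;                   /* BSF: index of the lowest set bit */
    while (!(mask & 1u)) {
        mask >>= 1;
        ++idx;
    }
    return idx;
}
```

The point of the PMOVMSKB step is that the 128-bit compare result never has to round-trip through memory; everything after the compare is cheap scalar integer work.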


October 26, 2015
On 26 Oct 2015 1:40 pm, "Don via Digitalmars-d" <digitalmars-d@puremagic.com> wrote:
>
> On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
>>
>> Here is my implementation of SIMD find. The function returns the index of
>> a ubyte in a static 16-byte array with unique values.
>
>
> [snip]
>
> You need to be very careful with doing benchmarks on tiny test cases,
> they can be very misleading.
>
> Be aware that the speed of bsf() and bsr() is very strongly
> processor dependent. On some machines it is utterly pathetic.

Some compilers are smart about uses of bsf and bsr.  :-)


October 26, 2015
On Monday, 26 October 2015 at 12:10:41 UTC, rumbu wrote:
> Interpolation search is even faster :)
>
> http://dpaste.dzfl.pl/4443c5753454

Funnily enough, it's also called Newton search: you can use higher-order interpolation, which can be more efficient if the data has a trend.
October 26, 2015
On Monday, 26 October 2015 at 11:47:56 UTC, Andrei Alexandrescu wrote:
> On 10/26/2015 05:48 AM, Iakh wrote:
>> On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
>>> runBinary calls naiveIndexOf. You're not testing binaryIndexOf.
>>
>> You are right.
>> Here is the fixed example:
>> http://dpaste.dzfl.pl/f7a54b789a21
>>
>> and results at dpaste.dzfl.pl:
>> -----
>> SIMD:   TickDuration(151000)
>> Binary: TickDuration(255000)
>> Naive:  TickDuration(459000)
>>
>> So the SIMD version is ~1.68x faster than the binary one.
>
> That's a healthy margin. It may get eroded by the startup/finish code that needs to get to the first aligned chunk and handle the misaligned data at the end, etc. But it's a solid proof of concept. -- Andrei

(Binary) searching a large sorted contiguous array, e.g. 1 MB of bytes:
1 MB = 2 ^^ 20 bytes
Imagine 4 binary levels pass in 4 ticks, so 20 levels take 20 ticks. With SIMD, the last 4 levels would be done in 2 or 3 ticks, so 20 levels take 18-19 ticks. The overall improvement is thus only 5-10%. Furthermore, think of the cache and memory page misses on the first levels.

IMO SIMD is needed for unsorted data (strings?) or for trees.
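The trade-off described above (binary levels until the remaining range fits one 16-byte register, then a single vector compare) can be sketched as follows. This is an illustrative C sketch with a hypothetical name, and the final scalar scan stands in for the one PCMPEQB/PMOVMSKB/BSF step:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a hybrid search over a sorted ubyte array: binary steps
 * narrow the range until it fits a 16-byte SIMD register, then the
 * final window is checked in one go (modeled here by a scalar loop). */
ptrdiff_t hybridIndexOf(const unsigned char *a, size_t n, unsigned char key)
{
    size_t lo = 0, hi = n;                 /* candidate range [lo, hi) */
    while (hi - lo > 16) {                 /* the "binary levels" */
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid + 1;                  /* keep a[mid] in the range */
    }
    for (size_t i = lo; i < hi; ++i)       /* one SIMD-sized window */
        if (a[i] == key)
            return (ptrdiff_t)i;
    return -1;
}
```

As the post argues, SIMD only replaces the cheap final log2(16) = 4 levels; the expensive early levels (the ones taking cache and page misses) are untouched.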
October 26, 2015
On Monday, 26 October 2015 at 15:03:12 UTC, Iakh wrote:

> (Binary) searching a large sorted contiguous array, e.g. 1 MB of bytes:
> 1 MB = 2 ^^ 20 bytes
> Imagine 4 binary levels pass in 4 ticks, so 20 levels take 20 ticks. With SIMD, the last 4 levels would be done in 2 or 3 ticks, so 20 levels take 18-19 ticks. The overall improvement is thus only 5-10%. Furthermore, think of the cache and memory page misses on the first levels.
>
> IMO SIMD is needed for unsorted data (strings?) or for trees.

But yeah... Who needs 1_000_000 sorted values in 0..256? :D
October 26, 2015
On 10/26/2015 08:35 AM, Don wrote:
> On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
>> Here is my implementation of SIMD find. The function returns the index of
>> a ubyte in a static 16-byte array with unique values.
>
> [snip]
>
> You need to be very careful with doing benchmarks on tiny test cases,
> they can be very misleading.
>
> Be aware that the speed of bsf() and bsr() is very strongly
> processor dependent. On some machines it is utterly pathetic: on AMD
> K7, BSR is 23 micro-operations; on the original Pentium it was up to 73 (!);
> even on AMD Bobcat it is 11 micro-ops; but on recent Intel it is one
> micro-op. That factor of 73 can totally screw up your performance
> comparisons.
>
> Just because it is a single machine instruction does not mean it is fast.

One other note: don't compare with binary search; it's not an appropriate baseline. Use it only if you've implemented a SIMD-based binary search.

Good baselines: std.find, memchr, a naive version with pointers (no bounds checking).


Andrei
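Those baselines might look like the following; `naiveIndexOf` and `memchrIndexOf` are illustrative C stand-ins (std.find would be the D-level equivalent), with the naive version deliberately doing no bounds checks inside the loop:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive pointer-based baseline: no bounds check inside the loop. */
ptrdiff_t naiveIndexOf(const unsigned char *a, size_t n, unsigned char key)
{
    const unsigned char *p = a, *end = a + n;
    for (; p != end; ++p)
        if (*p == key)
            return p - a;
    return -1;
}

/* memchr is typically hand-vectorized inside the libc, so it is the
 * realistic baseline any custom SIMD find has to beat. */
ptrdiff_t memchrIndexOf(const unsigned char *a, size_t n, unsigned char key)
{
    const unsigned char *p = memchr(a, key, n);
    return p ? (ptrdiff_t)(p - a) : -1;
}
```

Beating a bounds-checked naive loop says little; beating memchr on the same data would actually demonstrate that the SIMD version earns its complexity.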


October 26, 2015
On Monday, 26 October 2015 at 12:35:39 UTC, Don wrote:
>
> You need to be very careful with doing benchmarks on tiny test cases, they can be very misleading.

Can you advise some reading about benchmarking?