Thread overview
Fast linear search in D array (slice)
- Nov 07, 2018 · Per Nordlöw
- Nov 07, 2018 · 12345swordy
- Nov 07, 2018 · H. S. Teoh
- Nov 07, 2018 · Per Nordlöw
- Nov 08, 2018 · Stanislav Blinov
- Nov 08, 2018 · Per Nordlöw
November 07, 2018
I'm currently implementing overloads of some of Phobos's search algorithms for the case where the haystack is a `char[]` and the needle is an ASCII-clean `char`. I plan to integrate them into Phobos later down the road.

One such overload is at

https://github.com/nordlow/phobos-next/blob/master/src/find_split_ex.d#L10

which currently implements the linear search as

foreach (immutable offset; 0 .. haystack.length)
{
    const bool hit = haystack[offset] == needles[0];
    if (hit)
    {
        return Result(haystack, offset);
    }
}

I now have a couple of questions regarding the performance of the above foreach loop:

- Is this the preferred way of implementing a linear search in D?
- Andrei has mentioned that sentinel-based search can be faster if the haystack is allowed to be temporarily mutated at the end. But I presume such a solution is only faster above a certain length for the haystack. Has anybody benchmarked and found a suitable length limit for this?
- As a special-case of the above there's also the possibility of mutating the last character to be a null and then reusing `strchr`.
- When is it preferred to search in blocks of 2, 4, 8 or 16 bytes? I presume for a certain alignment and, again, length limit.
- Would it be relevant to make such a function a compiler intrinsic that handles all these cases for the generic non-decoding case where haystack is a `T[]` and needle a `T`?
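
For reference, the sentinel idea from the second bullet could be sketched in C roughly as follows. This is an illustrative sketch with made-up names, not a benchmarked implementation; it assumes the haystack buffer is mutable and returns `length` for "not found":

```c
#include <stddef.h>

/* Sentinel-based linear search: temporarily overwrite the last element
 * with the needle so the inner loop needs no bounds check per byte.
 * Requires a mutable buffer; the original last byte is restored before
 * returning. Illustrative sketch only. */
static size_t find_with_sentinel(char *haystack, size_t length, char needle)
{
    if (length == 0)
        return 0;                         /* "not found" is length (== 0) */
    const char last = haystack[length - 1];
    haystack[length - 1] = needle;        /* plant the sentinel */
    size_t i = 0;
    while (haystack[i] != needle)         /* no `i < length` test needed */
        ++i;
    haystack[length - 1] = last;          /* restore the original byte */
    if (i == length - 1 && last != needle)
        return length;                    /* hit was the sentinel, not a match */
    return i;
}
```

The same shape should translate directly to D, modulo `@trusted`-ing the temporary mutation of the last element.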
November 07, 2018
On Wednesday, 7 November 2018 at 21:06:14 UTC, Per Nordlöw wrote:
> [...]

Have you thought about using SIMD for this?
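
For instance, with SSE2 intrinsics in C it could look something like this (an illustrative sketch only, not benchmarked; it assumes x86 with SSE2 and a GCC/Clang-style `__builtin_ctz`):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Scan 16 bytes at a time: compare a whole block against the needle
 * broadcast into every lane, then use the movemask to locate the first
 * matching byte. The unaligned tail falls back to a scalar loop. */
static size_t find_simd(const char *haystack, size_t length, char needle)
{
    const __m128i pattern = _mm_set1_epi8(needle);
    size_t i = 0;
    for (; i + 16 <= length; i += 16) {
        __m128i block = _mm_loadu_si128((const __m128i *)(haystack + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, pattern));
        if (mask != 0)
            return i + (size_t)__builtin_ctz(mask); /* first set bit = first hit */
    }
    for (; i < length; ++i)          /* scalar tail, < 16 bytes */
        if (haystack[i] == needle)
            return i;
    return length;                   /* not found */
}
```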
November 07, 2018
On Wed, Nov 07, 2018 at 09:06:14PM +0000, Per Nordlöw via Digitalmars-d wrote: [...]
> - Is this the preferred way of implementing a linear search in D?
> - Andrei has mentioned that sentinel-based search can be faster if the
> haystack is allowed to be temporarily mutated at the end. But I
> presume such a solution is only faster above a certain length for the
> haystack. Has anybody benchmarked and found a suitable length limit
> for this?

I would think the suitable length limit would be highly CPU-dependent and memory-configuration-dependent.  There is no single magic number that's going to work everywhere.  We can sample a wide variety of hardware configurations and arbitrarily choose a middle-ground figure, I suppose, but it's not going to be optimal for everybody, and it will probably become quickly outdated with every new release of hardware.


> - As a special-case of the above there's also the possibility of mutating the last character to be a null and then reusing `strchr`.

That's only because strchr is in the C library, and the assumption is that whoever implemented strchr for that platform has already done his homework to find the best-performing implementation for that platform. If you really wanted to, you could do the same in D and figure out the "best" implementation that way.


> - When is it preferred to search in blocks of 2, 4, 8 or 16 bytes? I presume for a certain alignment and, again, length limit.

I don't have the luxury of being able to compare performance on a wide variety of platforms, but I did peruse the Glibc source code for memchr once, and it basically used a clever bit-mask hack to scan for byte patterns 64 bits at a time, with extra code at the start/end to account for non-aligned boundaries of the array.  I ran some tests against it, and found that while it was definitely faster when the byte being searched occurred deep into the array, it was actually slower when the byte occurred near the beginning, because of the complexity of setting up the bit-mask hack and the starting/ending boundary code.  When the byte being searched for occurs very frequently in a series of repeated calls to memchr, the glibc memchr actually performed worse than a naïve for-loop that compared byte by byte.

The bottom line is that even supposedly optimized-to-hell-and-back code like glibc's memchr may have poor performance characteristics for certain use cases.  As with all things, it's always a matter of compromise, and the best solution for you will be heavily dependent on exactly what you're going to be using the code for.
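
For the curious, the bit-mask hack in question is the classic "does this word contain a zero byte" trick. A simplified sketch (my own illustrative code, without glibc's alignment handling, so not representative of its real performance characteristics):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Word-at-a-time scan: XOR each 64-bit word with the needle repeated in
 * every byte, so matching bytes become zero, then detect any zero byte
 * with the exact haszero test (x - 0x01..01) & ~x & 0x80..80. */
static size_t find_wordwise(const unsigned char *s, size_t n, unsigned char c)
{
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    const uint64_t mask  = ones * c;            /* needle in every byte */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t word;
        memcpy(&word, s + i, 8);                /* unaligned-safe load */
        uint64_t x = word ^ mask;               /* matching bytes -> 0 */
        if ((x - ones) & ~x & highs) {          /* some byte is zero */
            for (size_t j = i; j < i + 8; ++j)  /* pinpoint it bytewise */
                if (s[j] == c)
                    return j;
        }
    }
    for (; i < n; ++i)                          /* scalar tail */
        if (s[i] == c)
            return i;
    return n;                                   /* not found */
}
```

The setup cost H. S. Teoh describes is visible even here: broadcasting the needle and the per-word masking only pay off once the match is far enough into the array.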


> - Would it be relevant to make such a function a compiler intrinsic that handles all these cases for the generic non-decoding case where haystack is a `T[]` and needle a `T`?

I thought somebody had already put in some specializations in std.algorithm.find to actually call C's memchr when the haystack is an array and the needle is a 1-byte value?  I'm not sure a compiler intrinsic is necessary for something like this... sounds like something that the standard library (i.e., Phobos) should be responsible for. Have you tried profiling std.algorithm.find to see how it performs on your use case?

And if indeed std.algorithm.find is suboptimal, we could consider whether your use case is general enough to be worth adding another specialization to handle it, and that way it would benefit everybody, instead of everyone rolling their own probably-suboptimal version.

(One thing to watch out for with std.algorithm.find is that it may trigger the dreaded autodecoding machinery. You may want to cast to ubyte[] just to eliminate that possibility. Or use .byCodeUnit, but I'm unsure if .find is able to recognize that as an array, so the optimized path may not be chosen.)


T

-- 
Let's call it an accidental feature. -- Larry Wall
November 07, 2018
On Wednesday, 7 November 2018 at 21:06:14 UTC, Per Nordlöw wrote:
> - As a special-case of the above there's also the possibility of mutating the last character to be a null and then reusing `strchr`.

Correction: in place of `strchr`, it's presumably better to use

       void *memchr(const void *s, int c, size_t n);

and

       void *rawmemchr(const void *s, int c);

for the sentinel-at-the-end case.

For the non-CTFE case, that is.
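
A small wrapper shows why `memchr` avoids the sentinel mutation entirely (sketch; `find_via_memchr` is a made-up name, and `rawmemchr` is a GNU extension so it's omitted here):

```c
#include <string.h>
#include <stddef.h>

/* memchr searches exactly n bytes, so no terminating NUL (and no temporary
 * mutation of the last byte of the haystack) is required, unlike strchr.
 * Returns the offset of the first hit, or n if not found. */
static size_t find_via_memchr(const char *haystack, size_t n, char needle)
{
    const char *hit = memchr(haystack, (unsigned char)needle, n);
    return hit ? (size_t)(hit - haystack) : n;
}
```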
November 08, 2018
On Wednesday, 7 November 2018 at 21:06:14 UTC, Per Nordlöw wrote:

> - Andrei has mentioned that sentinel-based search can be faster if the haystack is allowed to be temporarily mutated at the end. But I presume such a solution is only faster above a certain length for the haystack. Has anybody benchmarked and found a suitable length limit for this?

Speaking about x86(64), when your haystack fits in the cache entirely and is already in it, you'd be hard pressed to beat linear search. If not, a naive sentinel search is more likely to *degrade* performance, unless the needle tends to appear in the last cache line's worth of data. That's because you'll be performing unnecessary memory accesses. A sentinel search by cache line may be a better idea.
So there isn't a particular length limit. When performance is key, your program has to decide which approach to take based on your data layout and access patterns. A single generic library function can't do that.

> - As a special-case of the above there's also the possibility of mutating the last character to be a null and then reusing `strchr`.

D libraries should reduce dependencies on C libraries, not breed them.
November 08, 2018
On Thursday, 8 November 2018 at 01:47:05 UTC, Stanislav Blinov wrote:
> So there isn't a particular length limit. When performance is key, your program has to decide which approach to take based on your data layout and access patterns. A single generic library function can't do that.

> D libraries should reduce dependencies on C libraries, not breed them.

Yep, agree.

Thanks.