May 28, 2016
On Saturday, 28 May 2016 at 12:47:59 UTC, Andrei Alexandrescu wrote:
> On 5/28/16 6:56 AM, qznc wrote:
>> The sentinel value is `needleBack+1`, but range elements need not
>> support addition. Finding a sentinel is hard and most certainly requires
>> more assumptions about the ranges.
>
> No need for a sentinel really so long as you first search for the last element of the needle, in the haystack.
>
> Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work:

A nice one. Here are the numbers I got:

LDC:
    std find:	156 ±33
 manual find:	117 ±24
   qznc find:	114 ±14
  Chris find:	135 ±24
 Andrei find:	112 ±27

DMD:
    std find:	177 ±31
 manual find:	140 ±28
   qznc find:	102 ±4
  Chris find:	165 ±31
 Andrei find:	130 ±25

My sentinel idea is to have only one conditional branch in the inner loop. Andrei find still has two `if`s in its inner loop. My inner loop looks like this:

  haystack[scout] = cast(typeof(needleBack)) (needleBack+1); // place sentinel
  for (size_t j = scout + 1 - needleLength, k = 0;; ++j, ++k) { // no loop condition
    if (!binaryFun!pred(haystack[j], needle[k])) { // the only condition
      // ... another branch follows in here, but it is not as costly
    }
  }

As long as you are matching the needle, there is only one condition to check. There is no check for `k < needleLength`, because the comparison is guaranteed to fail at `haystack[scout]` (the position corresponding to `needle[$-1]`) due to the sentinel.

I thought it might be slower, since we need to write to memory twice (placing and then restoring the sentinel) instead of only reading. It turns out it is faster.
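
For context, here is a minimal self-contained sketch of how the whole routine might look (hypothetical helper name, plain `==` instead of `binaryFun!pred`, a mutable `ubyte[]` haystack since the sentinel must be written and restored, and a naive advance-by-one after a failed attempt):

    ubyte[] sentinelFind(ubyte[] haystack, const(ubyte)[] needle)
    {
        if (needle.length == 0) return haystack;
        if (haystack.length < needle.length) return haystack[$ .. $];
        immutable needleLength = needle.length;
        immutable needleBack = needle[$ - 1];
        for (size_t scout = needleLength - 1; scout < haystack.length; ++scout)
        {
            if (haystack[scout] != needleBack)
                continue; // first scan for the needle's last element
            immutable saved = haystack[scout];
            haystack[scout] = cast(ubyte)(needleBack + 1); // place sentinel
            size_t j = scout + 1 - needleLength, k = 0;
            for (;; ++j, ++k) // no loop condition needed
                if (haystack[j] != needle[k])
                    break; // a real mismatch, or the sentinel stopped us
            haystack[scout] = saved; // restore the overwritten element
            if (k == needleLength - 1) // only the sentinel failed: full match
                return haystack[scout + 1 - needleLength .. $];
        }
        return haystack[$ .. $]; // not found
    }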
May 28, 2016
On Saturday, 28 May 2016 at 14:18:10 UTC, Chris wrote:
> I used dmd, because I don't have ldc on my laptop. qznc's find is clearly the winner.

DMD performance feels flaky to me.

If performance is important, you should use ldc or gdc. Alternatively, if you are Walter Bright, you can simply optimize dmd. ;)
May 29, 2016
I played around with the benchmark. Some more numbers:

$ make ldc
ldmd2 -O -release -inline -noboundscheck *.d -ofbenchmark.ldc
./benchmark.ldc
E: wrong result with Chris find
E: wrong result with Chris find
E: wrong result with Chris find
    std find: 153 ±25    +66 (1934)  -15 (7860)
 manual find: 122 ±28    +80 (1812)  -17 (8134)
   qznc find: 125 ±16    +18 (4644)  -15 (5126)
  Chris find: 148 ±29    +75 (1976)  -18 (7915)
 Andrei find: 114 ±23   +100 (1191)  -13 (8770)
 (avg slowdown vs fastest; absolute deviation)
$ make dmd
dmd -O -release -inline -noboundscheck *.d -ofbenchmark.dmd
./benchmark.dmd
E: wrong result with Chris find
E: wrong result with Chris find
E: wrong result with Chris find
    std find: 160 ±27    +44 (3162)  -20 (6709)
 manual find: 148 ±28    +54 (2701)  -19 (7178)
   qznc find: 102 ±3     +27 ( 766)   -1 (9136)
  Chris find: 175 ±30    +55 (2796)  -21 (7106)
 Andrei find: 122 ±22    +46 (2554)  -14 (7351)
 (avg slowdown vs fastest; absolute deviation)

The additional numbers on the right split the ±MAD into the mean deviations above and below the mean, together with the number of runs in each group. For example, Andrei find with ldc:

  Andrei find: 114 ±23   +100 (1191)  -13 (8770)

The mean slowdown is 114, which means it is 14% slower than the fastest variant. The mean absolute deviation (MAD) is 23. More precisely, the runs above the mean slowdown deviate from it by 100 on average, and the runs below it by -13. 1191 of the 10000 runs were above the mean slowdown and 8770 below. The 39 missing runs are exactly equal to the mean slowdown.
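
To make the format concrete, the statistics could be computed roughly like this (an illustrative sketch, not the actual benchmark code):

    import std.algorithm : filter, map, sum;
    import std.array : array;
    import std.stdio : writefln;

    // slowdowns are relative to the fastest run, e.g. 114 means 14% slower
    void report(double[] slowdowns)
    {
        immutable mean = slowdowns.sum / slowdowns.length;
        auto above = slowdowns.filter!(x => x > mean).array;
        auto below = slowdowns.filter!(x => x < mean).array;
        // overall mean absolute deviation (MAD)
        immutable mad = slowdowns.map!(x => x < mean ? mean - x : x - mean).sum
                        / slowdowns.length;
        // the MAD split into its above-mean and below-mean parts
        immutable madAbove = above.length ? above.map!(x => x - mean).sum / above.length : 0;
        immutable madBelow = below.length ? below.map!(x => mean - x).sum / below.length : 0;
        writefln("%.0f ±%.0f   +%.0f (%s)  -%.0f (%s)",
                 mean, mad, madAbove, above.length, madBelow, below.length);
    }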

What bothers me is that changing the alphabet changes the numbers so much. If you restrict the alphabets for haystack and needle, the results shift noticeably. The benchmark already picks a random alphabet subset on each run, but there is definitely a bias.
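
For illustration, the per-run alphabet randomization could look roughly like this (a hypothetical sketch, not the actual benchmark code):

    import std.array : array;
    import std.random : randomSample, uniform;

    // pick a random subset of the lowercase letters as this run's alphabet
    auto letters = cast(ubyte[]) "abcdefghijklmnopqrstuvwxyz".dup;
    auto alphabet = randomSample(letters, 8).array;

    // draw every haystack element uniformly from that alphabet
    auto haystack = new ubyte[](1_000_000);
    foreach (ref c; haystack)
        c = alphabet[uniform(0, alphabet.length)];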
May 29, 2016
On Sunday, 29 May 2016 at 12:22:23 UTC, qznc wrote:
> I played around with the benchmark. Some more numbers:
>
> E: wrong result with Chris find
> [...]

You can avoid the "E: wrong result with Chris find" errors by using

outer:
    for (auto i = 0; i < haystack.length-needle.length; i++)
    {
        if (haystack[i] != needle[0])      // quick reject on the first element
            continue;
        for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
            if (haystack[j] != needle[k])
                continue outer;            // mismatch: try the next start index
        return haystack[i..$];             // full match
    }

It's a tad faster.

I'm planning to test on more varied data and see where a bias may occur.
May 29, 2016
On Sunday, 29 May 2016 at 12:22:23 UTC, qznc wrote:
> I played around with the benchmark. Some more numbers:
>
> The mean slowdown is 114, which means it is 14% slower than the fastest variant. The mean absolute deviation (MAD) is 23. More precisely, the runs above the mean slowdown deviate from it by 100 on average, and the runs below it by -13. 1191 of the 10000 runs were above the mean slowdown and 8770 below. The 39 missing runs are exactly equal to the mean slowdown.
>
A minor thing - you might consider also calculating the median and the median version of MAD (the median of absolute deviations from the median). The reason is that benchmarks often have outliers in the max-time direction, and the median does a better job of reducing the effect of those outliers than the mean. Your benchmark code could publish both forms.
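
For instance, a small sketch of the two statistics (illustrative code, not from the benchmark):

    import std.algorithm : map, sort;
    import std.array : array;
    import std.math : abs;

    double median(const double[] xs)
    {
        auto s = xs.dup;
        s.sort();
        immutable n = s.length;
        return n % 2 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // the median of absolute deviations from the median
    double medianAbsDeviation(const double[] xs)
    {
        immutable m = median(xs);
        return median(xs.map!(x => abs(x - m)).array);
    }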

--Jon
May 29, 2016
On Sunday, 29 May 2016 at 18:50:40 UTC, Jon Degenhardt wrote:
> A minor thing - you might consider also calculating the median and median version of MAD (median of absolute deviations from the median). The reason is that benchmarks often have outliers in the max time dimension, median will do a job reducing the effect of those outliers than mean. Your benchmark code could publish both forms.

I don't think that would help. This is not a normal distribution, so mean and median don't match anyway. What would you learn from the median?

When looking at the assembly, I don't like the single-byte loads. Since string (ubyte[] here) is of extraordinary importance, it should be worthwhile to use word loads [0] instead. Really fancy would be SSE.

However, this is more advanced and requires further introspection.

[0] http://forum.dlang.org/post/aahhopdcrengakvoemrn@forum.dlang.org
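
To make the idea concrete, a word-wise equality check might look roughly like this (a hypothetical sketch; x86 tolerates the unaligned loads, other targets may not, and endianness is glossed over):

    // compare one machine word at a time, falling back to bytes for the tail
    bool wordwiseEqual(const(ubyte)[] a, const(ubyte)[] b)
    {
        if (a.length != b.length) return false;
        size_t i = 0;
        for (; i + size_t.sizeof <= a.length; i += size_t.sizeof)
            if (*cast(const(size_t)*)(a.ptr + i) != *cast(const(size_t)*)(b.ptr + i))
                return false; // some byte in this word differs
        for (; i < a.length; ++i) // byte-wise tail
            if (a[i] != b[i]) return false;
        return true;
    }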
May 29, 2016
On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:
> On Sunday, 29 May 2016 at 18:50:40 UTC, Jon Degenhardt wrote:
>> A minor thing - you might consider also calculating the median and median version of MAD (median of absolute deviations from the median). The reason is that benchmarks often have outliers in the max time dimension, median will do a job reducing the effect of those outliers than mean. Your benchmark code could publish both forms.
>
> I don't think that would help. This is not a normal distribution, so mean and median don't match anyways. What would you learn from the median?
>

Oh, I see. The benchmark varies the data on each run and aggregates, is that right? Sorry, I missed that.
May 30, 2016
It's weird: on my machine, my find function is consistently faster than `manual find`.

LDC:

    std find: 137 ±22    +33 (3580)  -17 (6251)
 manual find: 137 ±32    +53 (3105)  -23 (6834)
   qznc find: 106 ±8     +17 (2608)   -5 (7195)
  Chris find: 132 ±32    +58 (2803)  -23 (7131)
 Andrei find: 120 ±26    +54 (2466)  -17 (7454)

===

    std find: 136 ±22    +33 (3503)  -17 (6304)
 manual find: 137 ±33    +55 (3000)  -23 (6920)
   qznc find: 106 ±8     +17 (2535)   -5 (7287)
  Chris find: 132 ±33    +59 (2803)  -23 (7137)
 Andrei find: 119 ±25    +51 (2569)  -16 (7374)


It's negligible, but I wonder whether it's the extra initialization:

`size_t end = haystack.length - needle.length + 1;`

and

`size_t i = 0` is defined before we know whether it's a legitimate search. But that shouldn't matter, should it?

Or does it depend on the machine?

DMD slows all searches down considerably (except for `qznc find` - well done! :))

DMD:

    std find: 161 ±36    +46 (4015)  -31 (5847)
 manual find: 147 ±40    +68 (3001)  -28 (6917)
   qznc find: 106 ±8     +15 (2623)   -5 (7172)
  Chris find: 201 ±41    +73 (2830)  -28 (7120)
 Andrei find: 150 ±40    +61 (3347)  -30 (6575)


PS I've corrected the upper bound of the search loop to

outer:
    for (auto i = 0; i < haystack.length-(needle.length-1); i++)

otherwise it will not match at the end of a string. (The last valid start index is `haystack.length - needle.length`, which the old bound excluded.)
May 30, 2016
On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:
> When looking at the assembly I don't like the single-byte loads. Since string (ubyte[] here) is of extraordinary importance, it should be worthwhile to use word loads [0] instead. Really fancy would be SSE.

So far, the results look disappointing. Andrei find does not get faster with wordwise matching:

./benchmark.ldc
    std find: 133 ±25    +38 (3384)  -19 (6486)
 manual find: 140 ±37    +64 (2953)  -25 (6962)
   qznc find: 114 ±17    +33 (2610)  -11 (7262)
  Chris find: 146 ±39    +66 (3045)  -28 (6873)
 Andrei find: 126 ±29    +54 (2720)  -19 (7189)
Wordwise find: 130 ±30    +53 (2934)  -21 (6980)

Interesting side note: on my laptop Andrei find is faster than qznc find (with LDC), but on my desktop it is the other way around (see above). Both are Intel i7. I need to find a simpler processor; maybe wordwise is faster there. Alternatively, find is purely memory bound and the L1 cache makes every difference disappear.

Also, note how std find is faster than manual find! Finding a reliable benchmark is hard. :/
May 30, 2016
On 05/30/2016 05:31 AM, qznc wrote:
> On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:
>> When looking at the assembly I don't like the single-byte loads. [...]
>
> So far, the results look disappointing. Andrei find does not get faster
> with wordwise matching: [...]

Please throw this hat into the ring as well: it should improve the average search time on a large vocabulary dramatically.

https://dpaste.dzfl.pl/dc8dc6e1eb53

It uses a BM (Boyer-Moore) inspired trick - once the last character has matched, if the match subsequently fails, it needn't restart from the very next character in the haystack. The "skip" is computed lazily and in a separate function so as to keep the loop tight. All in all a routine worth a look. I've wanted to write this for a long time. -- Andrei
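
PS: For readers who don't follow the link, the lazily computed "skip" is in the spirit of the Boyer-Moore bad-character rule. An illustrative sketch (not the actual dpaste code):

    // Distance to shift the window once the needle's last element matched
    // but the full match failed: align the matched haystack element with
    // the nearest earlier occurrence of that element inside the needle.
    size_t computeSkip(const(ubyte)[] needle)
    {
        immutable last = needle[$ - 1];
        foreach_reverse (idx; 0 .. needle.length - 1)
            if (needle[idx] == last)
                return needle.length - 1 - idx;
        return needle.length; // last element is unique: skip the whole needle
    }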