April 01, 2014
On 4/1/2014 12:23 PM, John Colvin wrote:
> It's not really possible to tell anything from that benchmark, especially with
> fancy modern optimisers and branch predictors.

I disagree, read on.

> 1) ldc and gdc both optimise away some/all of the tests respectively.
> 2) once isIdentifierCharX is inlined, the compiler can determine what the
> results will be before-hand.

This is always an issue when benchmarking to find the best way to write an algorithm. I just adjust the benchmark so the particular compiler being used can't throw the work away.
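For instance (a sketch with a stand-in classifier, not the benchmark code itself): fold every result into a sink that is printed afterwards, so the optimizer cannot prove the calls are dead:

```d
import std.stdio : writeln;

// Stand-in for one of the classifiers under test.
bool isIdentifierChar1(ubyte c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '_';
}

void main()
{
    size_t sink;
    foreach (i; 0 .. 1_000_000)
        sink += isIdentifierChar1(cast(ubyte)(i & 0xFF));
    writeln(sink); // the result is observed, so the loop can't be removed
}
```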


> 3) Even without 1) and 2), the results are unrealistically (very!) correlated,
> leading to a branch-prediction bonanza. I'm sure you know how significant that
> can be on modern CPUs. It is also very friendly to pre-fetching of the lookup
> table*

This is exactly why I wanted to test it.


> Perhaps have the same benchmark, but working on realistic data from a file.

Switching to it in Warp produced a very significant speedup on real files.

https://github.com/facebook/warp/pull/5

April 01, 2014
On 04/01/14 21:35, JR wrote:
> On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
>> immutable bool[256] tab2;
>> static this()
>> {
>>     for (size_t u = 0; u < 0x100; ++u)
>>     {
>>         tab2[u] = isIdentifierChar0(cast(ubyte)u);
>>     }
>> }
> 
> Table lookups are awesome.
> 
> While tab2 there lends itself to being populated in a static this(), it would really lower the bar if we could define static immutable AA literals at compile time, or at least CTFE-fill them. Having to rely on static this() for such is unfortunate and hacky.

You mention AAs, so you may already realize this, but there is no need for a mod ctor in the above example:

   immutable bool[256] tab2 = {
      bool[256] t;
      for (size_t u = 0; u < 0x100; ++u)
          t[u] = isIdentifierChar0(cast(ubyte)u);
      return t;
   }();

"Static immutable AAs" can be built at CT likewise; it's just that their type will be different from "std" RT AAs.
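A sketch of what I mean (the type and keys here are illustrative): declare the AA as an enum, so it is built entirely during CTFE. The caveat is that each run-time use of an enum AA copies the literal, which is the "different type" issue above:

```d
// CTFE-built AA; `enum` forces compile-time construction.
enum bool[ubyte] identAA = {
    bool[ubyte] aa;
    foreach (u; 0 .. 0x100)
        if ((u >= 'a' && u <= 'z') || (u >= 'A' && u <= 'Z')
            || (u >= '0' && u <= '9') || u == '_')
            aa[cast(ubyte)u] = true;
    return aa;
}();

// Compile-time use is free; run-time use re-allocates the literal.
static assert(cast(ubyte)'a' in identAA);
```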

artur
April 01, 2014
On 4/1/14, Artur Skawina <art.08.09@gmail.com> wrote:
> You mention AAs, so you may already realize this, but there is no need for a mod ctor in the above example:
>
>    immutable bool[256] tab2 = {
>       bool[256] t;
>       for (size_t u = 0; u < 0x100; ++u)
>           t[u] = isIdentifierChar0(cast(ubyte)u);
>       return t;
>    }();

Also (untested; needs these imports):

import std.algorithm : map;
import std.array : array;
import std.range : iota;

immutable bool[256] tab2 =
    iota(0, 0x100)
    .map!(i => isIdentifierChar0(cast(ubyte)i))
    .array;
April 01, 2014
> Switching to it in Warp produced a very significant speedup on real files.
>
> https://github.com/facebook/warp/pull/5

That's to be expected, as warp will call it often and the table will stay in the cache. For a program that calls it less often, the other methods may be faster.

Basically, it depends on the use case.

Glad to see it's helped in warp. Any idea how it compares to clang with this improvement?
April 01, 2014
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:
> ---------------------------------

10_000 iterations:
Milliseconds 7 10 2
Milliseconds 5 5 1
Milliseconds 8 10 2

100_000 iterations:
Milliseconds 78 93 21
Milliseconds 78 94 21
Milliseconds 78 92 21

1_000_000 iterations:
Milliseconds 584 614 146
Milliseconds 541 612 147
Milliseconds 502 609 146
Milliseconds 529 613 147
Milliseconds 539 606 147

This is DMD 32-bit 2.065.0 on an Intel Core-i7.
The options are -O -release -inline -noboundscheck.

A few comments:

1. The benchmark seems rather artificial to me.  The natural order of input is fixed, which is nice for both the branch predictor (long streaks of each branch) and the array access (sequential).

2. The latter function wins; no wonder, since there's no branching.  (I tried to augment it by actually using the cache in between, but it still wins against my attempts so far.)

Ivan Kazmenko.
April 01, 2014
On Tuesday, 1 April 2014 at 21:34:29 UTC, Ivan Kazmenko wrote:
> On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:

> 2. The latter function wins, no wonder though, since there's no branching.  (I tried to augment it by actually using the cache in between, but it still wins against my attempts so far.)

The "it" being augmented is the benchmark; the "it" that wins is the function isIdentifierChar2.

This modification of the benchmark pretends to use a stack-allocated array alongside calling the functions of interest:

http://dpaste.dzfl.pl/1ed247445308

The timings are:
Milliseconds 623 590 229
Milliseconds 629 592 231
Milliseconds 619 596 233
Milliseconds 524 592 283

Funnily, on DPaste, the picture changes:
Milliseconds 1864 1349 1303

Perhaps there is no "-O -release -inline -noboundscheck" tuning there.

Ivan Kazmenko.
April 01, 2014
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:

I modified it a bit to use the Phobos source code as the test data:
http://dump.thecybershadow.net/7a685ffb8588ec836e3ae0dba14e0618/test.d

My results (DMD git HEAD, -O -inline -release):

x86: 41070 29642 5169
x64: 42975 29759 5822

I've been using table lookups in my library for a while, e.g. for ASCII case conversions:

https://github.com/CyberShadow/ae/blob/master/utils/text.d#L323-L341

or escaping JSON strings:

https://github.com/CyberShadow/ae/blob/master/utils/json.d#L150-L188
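For readers who don't want to follow the links, the general shape of such a case-conversion table is roughly this (a sketch, not the actual ae code):

```d
// 256-entry ASCII lower-casing table: one indexed load per character
// instead of a range check and branch.
immutable char[256] lowerTab = {
    char[256] t;
    foreach (u; 0 .. 0x100)
        t[u] = (u >= 'A' && u <= 'Z') ? cast(char)(u + 32) : cast(char)u;
    return t;
}();

char asciiToLower(char c) pure nothrow
{
    return lowerTab[c];
}
```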
April 01, 2014
On 4/1/2014 3:00 PM, Vladimir Panteleev wrote:
> I modified it a bit to use the Phobos source code as the test data:
> http://dump.thecybershadow.net/7a685ffb8588ec836e3ae0dba14e0618/test.d

Nice. I especially like the use of dirEntries to read all the text. That should go into the phobos documentation!

April 01, 2014
On 4/1/2014 2:30 PM, Peter Alexander wrote:
> Any idea how it is comparing to clang with this improvement?

No data yet. Stay tuned.

April 01, 2014
On 4/1/2014 12:46 PM, Dmitry Olshansky wrote:
> Would strongly suggest to use 2 kinds of data - randomized and some text. And,
> of course, it must be loaded at run-time.

See Vladimir's post.