April 01, 2014
On 4/1/2014 12:23 PM, John Colvin wrote:
> It's not really possible to tell anything from that benchmark, especially with
> fancy modern optimisers and branch predictors.

I disagree, read on.

> 1) ldc and gdc both optimise away some/all of the tests respectively.
> 2) once isIdentifierCharX is inlined, the compiler can determine what the
> results will be before-hand.

This is always an issue when benchmarking to find the best way to write an algorithm. I just adjust the benchmark so the particular compiler being used can't throw the work away.
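For instance (a sketch with a stand-in classifier, not the benchmark code itself): fold every result into a sink that is printed afterwards, so the optimizer cannot prove the calls are dead:

```d
import std.stdio : writeln;

// Stand-in for one of the classifiers under test.
bool isIdentifierChar1(ubyte c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '_';
}

void main()
{
    size_t sink;
    foreach (i; 0 .. 1_000_000)
        sink += isIdentifierChar1(cast(ubyte)(i & 0xFF));
    writeln(sink); // the result is observed, so the loop can't be removed
}
```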


> 3) Even without 1) and 2), the results are unrealistically (very!) correlated,
> leading to a branch-prediction bonanza. I'm sure you know how significant that
> can be on modern CPUs. It is also very friendly to pre-fetching of the lookup
> table*

This is exactly why I wanted to test it.


> Perhaps have the same benchmark, but working on realistic data from a file.

Switching to it in Warp produced a very significant speedup on real files.

https://github.com/facebook/warp/pull/5

April 01, 2014
On 04/01/14 21:35, JR wrote:
> On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
>> immutable bool[256] tab2;
>> static this()
>> {
>>     for (size_t u = 0; u < 0x100; ++u)
>>     {
>>         tab2[u] = isIdentifierChar0(cast(ubyte)u);
>>     }
>> }
> 
> Table lookups are awesome.
> 
> While tab2 there lends itself to being populated in a static this(), it would really lower the bar if we could define static immutable AA literals at compile time, or at least CTFE-fill them. Having to rely on static this() for such is unfortunate and hacky.

You mention AAs, so you may already realize this, but there is no need for a mod ctor in the above example:

   immutable bool[256] tab2 = {
      bool[256] t;
      for (size_t u = 0; u < 0x100; ++u)
          t[u] = isIdentifierChar0(cast(ubyte)u);
      return t;
   }();

"Static immutable AAs" can be built at CT likewise; it's just that their type will be different from "std" RT AAs.
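A sketch of what I mean (the type and keys here are illustrative): declare the AA as an enum, so it is built entirely during CTFE. The caveat is that each run-time use of an enum AA copies the literal, which is the "different type" issue above:

```d
// CTFE-built AA; `enum` forces compile-time construction.
enum bool[ubyte] identAA = {
    bool[ubyte] aa;
    foreach (u; 0 .. 0x100)
        if ((u >= 'a' && u <= 'z') || (u >= 'A' && u <= 'Z')
            || (u >= '0' && u <= '9') || u == '_')
            aa[cast(ubyte)u] = true;
    return aa;
}();

// Compile-time use is free; run-time use re-allocates the literal.
static assert(cast(ubyte)'a' in identAA);
```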

artur
April 01, 2014
On 4/1/14, Artur Skawina <art.08.09@gmail.com> wrote:
> You mention AAs, so you may already realize this, but there is no need for a mod ctor in the above example:
>
>    immutable bool[256] tab2 = {
>       bool[256] t;
>       for (size_t u = 0; u < 0x100; ++u)
>           t[u] = isIdentifierChar0(cast(ubyte)u);
>       return t;
>    }();

Also (untested; needs these imports):

import std.algorithm : map;
import std.array : array;
import std.range : iota;

immutable bool[256] tab2 =
    iota(0, 0x100)
    .map!(i => isIdentifierChar0(cast(ubyte)i))
    .array;
April 01, 2014
> Switching to it in Warp produced a very significant speedup on real files.
>
> https://github.com/facebook/warp/pull/5

That's to be expected, as warp will call it often and the table will stay in the cache. For a program that calls it less often, the other methods may be faster.

Basically, it depends on the use case.

Glad to see it's helped in warp. Any idea how it compares to clang with this improvement?
April 01, 2014
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:
> ---------------------------------

10_000 iterations:
Milliseconds 7 10 2
Milliseconds 5 5 1
Milliseconds 8 10 2

100_000 iterations:
Milliseconds 78 93 21
Milliseconds 78 94 21
Milliseconds 78 92 21

1_000_000 iterations:
Milliseconds 584 614 146
Milliseconds 541 612 147
Milliseconds 502 609 146
Milliseconds 529 613 147
Milliseconds 539 606 147

This is DMD 32-bit 2.065.0 on an Intel Core-i7.
The options are -O -release -inline -noboundscheck.

A few comments:

1. The benchmark seems rather artificial to me.  The natural order of input is fixed, which is nice for both the branch predictor (long streaks of each branch) and the array access (sequential).

2. The latter function wins; no wonder, since there's no branching.  (I tried to augment it by actually using the cache in between, but it still wins against my attempts so far.)

Ivan Kazmenko.
April 01, 2014
On Tuesday, 1 April 2014 at 21:34:29 UTC, Ivan Kazmenko wrote:
> On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:

> 2. The latter function wins, no wonder though, since there's no branching.  (I tried to augment it by actually using the cache in between, but it still wins against my attempts so far.)

The "it" being augmented is the benchmark; the "it" that wins is the function isIdentifierChar2.

This modification of the benchmark pretends to use a stack-allocated array alongside calling the functions of interest:

http://dpaste.dzfl.pl/1ed247445308

The timings are:
Milliseconds 623 590 229
Milliseconds 629 592 231
Milliseconds 619 596 233
Milliseconds 524 592 283

Funnily, on DPaste, the picture changes:
Milliseconds 1864 1349 1303

Perhaps there is no "-O -release -inline -noboundscheck" tuning there.

Ivan Kazmenko.
April 01, 2014
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:

I modified it a bit to use the Phobos source code as the test data:
http://dump.thecybershadow.net/7a685ffb8588ec836e3ae0dba14e0618/test.d

My results (DMD git HEAD, -O -inline -release):

x86: 41070 29642 5169
x64: 42975 29759 5822

I've been using table lookups in my library for a while, e.g. for ASCII case conversions:

https://github.com/CyberShadow/ae/blob/master/utils/text.d#L323-L341

or escaping JSON strings:

https://github.com/CyberShadow/ae/blob/master/utils/json.d#L150-L188
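For readers who don't want to follow the links, the general shape of such a case-conversion table is roughly this (a sketch, not the actual ae code):

```d
// 256-entry ASCII lower-casing table: one indexed load per character
// instead of a range check and branch.
immutable char[256] lowerTab = {
    char[256] t;
    foreach (u; 0 .. 0x100)
        t[u] = (u >= 'A' && u <= 'Z') ? cast(char)(u + 32) : cast(char)u;
    return t;
}();

char asciiToLower(char c) pure nothrow
{
    return lowerTab[c];
}
```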
April 01, 2014
On 4/1/2014 3:00 PM, Vladimir Panteleev wrote:
> I modified it a bit to use the Phobos source code as the test data:
> http://dump.thecybershadow.net/7a685ffb8588ec836e3ae0dba14e0618/test.d

Nice. I especially like the use of dirEntries to read all the text. That should go into the phobos documentation!

April 01, 2014
On 4/1/2014 2:30 PM, Peter Alexander wrote:
> Any idea how it is comparing to clang with this improvement?

No data yet. Stay tuned.

April 01, 2014
On 4/1/2014 12:46 PM, Dmitry Olshansky wrote:
> Would strongly suggest to use 2 kinds of data - randomized and some text. And,
> of course, it must be loaded at run-time.

See Vladimir's post.