Thread overview
Strange counter-performance in an alternative `decimalLength9` function
Basile B. (Feb 26, 2020)
H. S. Teoh (Feb 26, 2020)
Basile B. (Feb 26, 2020)
Stefan Koch (Feb 26, 2020)
Basile B. (Feb 26, 2020)
Bruce Carneal (Feb 26, 2020)
Bruce Carneal (Feb 26, 2020)
Basile B. (Feb 26, 2020)
Bruce Carneal (Feb 27, 2020)
Bruce Carneal (Feb 27, 2020)
Basile B. (Feb 27, 2020)
Basile B. (Feb 27, 2020)
Bruce Carneal (Feb 27, 2020)
Bruce Carneal (Feb 27, 2020)
Basile B. (Feb 27, 2020)
Bruce Carneal (Feb 27, 2020)
Basile B. (Feb 27, 2020)
Bruce Carneal (Feb 27, 2020)
Basile B. (Feb 28, 2020)
Johan (Feb 26, 2020)
Basile B. (Feb 26, 2020)
Basile B. (Feb 27, 2020)
Basile B. (Feb 27, 2020)
Dennis Cote (Feb 27, 2020)
Basile B. (Feb 27, 2020)
Basile B. (Feb 27, 2020)
9il (Feb 28, 2020)
Bruce Carneal (Feb 28, 2020)
Bruce Carneal (Feb 28, 2020)
kinke (Feb 28, 2020)
9il (Feb 28, 2020)
February 26, 2020
So after reading the translation of RYU I was interested to see whether the decimalLength() function could be written to be faster, since it cascades through up to 8 CMPs.

After struggling with bad ideas I finally found something that looks nice:

- count the leading zeros of the input
- switch() on this count so that in the worst case only 1 CMP remains (for the bit counts where the digit count can change, e.g. 9 -> 10 when bsr(input) == 4)

After writing out all the values that pin down the cases where a comparison is necessary...

  min2 = 0b10
  max2 = 0b11
  min3 = 0b100
  max3 = 0b111
  ...
  ...
  min32 = 0b100.......0
  max32 = 0b111.......1

...I finally wrote the "nice" thing.

---
#!dmd -boundscheck=off -O -release -inline
import std.stdio;

ubyte decimalLength9(const uint v)
{
    if (v >= 100000000) { return 9; }
    if (v >= 10000000) { return 8; }
    if (v >= 1000000) { return 7; }
    if (v >= 100000) { return 6; }
    if (v >= 10000) { return 5; }
    if (v >= 1000) { return 4; }
    if (v >= 100) { return 3; }
    if (v >= 10) { return 2; }
    return 1;
}

ubyte fdecimalLength9(const uint v) pure nothrow
{
    import core.bitop;
    const ubyte lzc = cast(ubyte) (bsr(v) + 1); // n.b. bsr(0) is undefined; the benchmark's inputs start at 1
    ubyte result;
    switch (lzc)
    {
        case 0 :
        case 1 :
        case 2 :
        case 3 : result =  1; break;
        case 4 : result =  v >= 10  ? 2 : 1; break;
        case 5 :
        case 6 : result =  2; break;
        case 7 : result =  v >= 100  ? 3 : 2; break;
        case 8 :
        case 9 : result =  3; break;
        case 10: result =  v >= 1000  ? 4 : 3; break;
        case 11:
        case 12:
        case 13: result =  4; break;
        case 14: result =  v >= 10000  ? 5 : 4; break;
        case 15:
        case 16: result =  5; break;
        case 17: result =  v >= 100000  ? 6 : 5; break;
        case 18:
        case 19: result =  6; break;
        case 20: result =  v >= 1000000  ? 7 : 6; break;
        case 21:
        case 22:
        case 23: result =  7; break;
        case 24: result =  v >= 10000000  ? 8 : 7; break;
        case 25:
        case 26: result =  8; break;
        case 27: result =  v >= 100000000  ? 9 : 8; break;
        case 28:
        case 29:
        case 30:
        case 31:
        case 32: result =  9; break;
        default: assert(false);
    }
    return result;
}

private ubyte ffdecimalLength9(const uint v) pure nothrow
{
    import core.bitop;
    const ubyte lzc = cast(ubyte) (bsr(v) + 1);
    static immutable pure nothrow ubyte function(uint)[33] tbl =
    [
    0 : (uint a) => ubyte(1),
    1 : (uint a) => ubyte(1),
    2 : (uint a) => ubyte(1),
    3 : (uint a) => ubyte(1),
    4 : (uint a) => a >= 10  ? ubyte(2) : ubyte(1),
    5 : (uint a) => ubyte(2),
    6 : (uint a) => ubyte(2),
    7 : (uint a) => a >= 100  ? ubyte(3) : ubyte(2),
    8 : (uint a) => ubyte(3),
    9 : (uint a) => ubyte(3),
    10: (uint a) => a >= 1000  ? ubyte(4) : ubyte(3),
    11: (uint a) => ubyte(4),
    12: (uint a) => ubyte(4),
    13: (uint a) => ubyte(4),
    14: (uint a) => a >= 10000  ? ubyte(5) : ubyte(4),
    15: (uint a) => ubyte(5),
    16: (uint a) => ubyte(5),
    17: (uint a) => a >= 100000  ? ubyte(6) : ubyte(5),
    18: (uint a) => ubyte(6),
    19: (uint a) => ubyte(6),
    20: (uint a) => a >= 1000000  ? ubyte(7) : ubyte(6),
    21: (uint a) => ubyte(7),
    22: (uint a) => ubyte(7),
    23: (uint a) => ubyte(7),
    24: (uint a) => a >= 10000000  ? ubyte(8) : ubyte(7),
    25: (uint a) => ubyte(8),
    26: (uint a) => ubyte(8),
    27: (uint a) => a >= 100000000  ? ubyte(9) : ubyte(8),
    28: (uint a) => ubyte(9),
    29: (uint a) => ubyte(9),
    30: (uint a) => ubyte(9),
    31: (uint a) => ubyte(9),
    32: (uint a) => ubyte(9),
    ];
    return tbl[lzc](v);
}

void main()
{
    import std.datetime.stopwatch, std.range, std.algorithm;

    int s1, s2, s3;
    benchmark!({ iota(100000000u).each!(a => s1 += decimalLength9(a+1)); })(10).writeln;
    benchmark!({ iota(100000000u).each!(a => s2 += fdecimalLength9(a+1));})(10).writeln;
    benchmark!({ iota(100000000u).each!(a => s3 += ffdecimalLength9(a+1));})(10).writeln;
    assert(s1 == s2);
    assert(s1 == s3);
}
---

Then, bad surprise. Even with ldmd (so ldc2, basically) fed with the args from the script line, maybe the fdecimalLength9 version is slightly faster. Only *slightly*. Good news: I've wasted my time. So I tried an alternative version that uses a table of delegates instead of a switch (ffdecimalLength9) and, surprise, "tada", it is like **100x** slower than the two others.

How is that possible ?
February 25, 2020
On Wed, Feb 26, 2020 at 12:50:35AM +0000, Basile B. via Digitalmars-d-learn wrote: [...]
> #!dmd -boundscheck=off -O -release -inline
[...]

TBH, I'm skeptical of any performance results using dmd.  I wouldn't pay attention to performance numbers obtained this way, and rather look at the ldmd/ldc2 numbers.


[...]
> Then bad surprise. Even with ldmd (so ldc2 basically) fed with the args from the script line. Maybe the fdecimalLength9 version is slightly faster.  Only *slightly*. Good news, I've lost my time. So I try an alternative version that uses a table of delegates instead of a switch (ffdecimalLength9) and surprise, "tada", it is like **100x** slower than the two others.
> 
> How is that possible ?

Did you check the assembly output to see what the difference is?

Delegates involve a function call, which involves function call overhead, including a CPU pipeline hazard.  Worse yet, it's an indirect call, meaning you're defeating the CPU branch predictor and polluting the instruction cache. And on top of that, delegates involve allocating a context, and you *really* do not want allocations inside an inner loop.

And of course, normal function calls are easier for compilers to inline, because the destination is fixed. Indirect calls involving delegates are hard to predict, and the optimizer is more liable to just give up.
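This inlining point can be sketched in C (illustrative only; the names are mine, and note the D code above actually uses `function` pointers rather than true delegates): the direct form has a constant call target the optimizer can fold away, while the table-driven form generally compiles to a load plus an indirect `call`.

```c
#include <stdint.h>

typedef uint8_t (*LenFn)(uint32_t);

static uint8_t lenOne(uint32_t a)   { (void)a; return 1; }
static uint8_t lenSmall(uint32_t a) { return a >= 10 ? 2 : 1; }

/* Direct call: the target is a compile-time constant, so the
   optimizer can inline it and the branch predictor sees nothing. */
uint8_t viaDirect(uint32_t v)
{
    return lenSmall(v);
}

/* Indirect call: the target is loaded from memory at run time.
   Unless the optimizer can prove which slot is selected, this stays
   an indirect call: harder to predict, and a barrier to inlining. */
uint8_t viaTable(uint32_t v, const LenFn tbl[2])
{
    return tbl[v >= 10](v);
}
```

Both forms compute the same results; the difference only shows up in the generated code and branch-predictor behavior, which is why inspecting the assembly settles it.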

These are just my educated guesses, of course.  For the real answer, look at the assembly output. :-D


T

-- 
What are you when you run out of Monet? Baroque.
February 26, 2020
On Wednesday, 26 February 2020 at 01:10:07 UTC, H. S. Teoh wrote:
> On Wed, Feb 26, 2020 at 12:50:35AM +0000, Basile B. via Digitalmars-d-learn wrote: [...]
>> #!dmd -boundscheck=off -O -release -inline
> [...]
>
> TBH, I'm skeptical of any performance results using dmd.  I wouldn't pay attention to performance numbers obtained this way, and rather look at the ldmd/ldc2 numbers.

I didn't use DMD. The script line is actually interpreted by the IDE: it drops the compiler name, just parses the args, and passes them to a compiler that's defined by non-overridable options. In my case I used LDMD (this is LDC, you see, but with a DMD-like option syntax).
February 26, 2020
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
> So after reading the translation of RYU I was interested too see if the decimalLength() function can be written to be faster, as it cascades up to 8 CMP.
>
> [...]

It can be made faster using binary search. Not by much though.

You'll get ceil(log2(numberofplaces)) cmps.
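That suggestion, sketched in C for illustration (the nesting and the name are my own guesses at the shape meant here, not Stefan's code): probe a middle power-of-ten threshold first, so every input resolves in at most 3 comparisons instead of a cascade of up to 8.

```c
#include <stdint.h>

/* Digit count of v < 10^9, by binary search over the
   power-of-ten thresholds: at most 3 comparisons per call. */
uint8_t decimalLength9_bsearch(uint32_t v)
{
    if (v >= 100000) {                      /* 6..9 digits */
        if (v >= 10000000)                  /* 8..9 digits */
            return v >= 100000000 ? 9 : 8;
        return v >= 1000000 ? 7 : 6;
    }
    if (v >= 100) {                         /* 3..5 digits */
        if (v >= 10000) return 5;
        return v >= 1000 ? 4 : 3;
    }
    return v >= 10 ? 2 : 1;                 /* 1..2 digits */
}
```

The comparisons are still data-dependent branches, though, so the later observations in this thread about input entropy apply here too.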
February 26, 2020
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
> How is that possible ?

It turns out that there's a problem with the benchmarking method. When the inputs come from a command-line argument, LLVM's optimization passes can't play games with the literal constants.

It appears that none of the alternatives based on the "most significant bit trick" are faster (at least when covering a decent range of numbers):

---
#!ldmd -boundscheck=off -release -inline -O -mcpu=native -mattr=+sse2,+sse3,+sse4.1,+sse4.2,+fast-lzcnt,+avx,+avx2,+cmov,+bmi,+bmi2
import core.memory;
import core.bitop;
import std.stdio;
import std.range;
import std.algorithm;
import std.getopt;

ubyte decimalLength9_0(const uint v)
{
    if (v >= 100000000) { return 9; }
    if (v >= 10000000) { return 8; }
    if (v >= 1000000) { return 7; }
    if (v >= 100000) { return 6; }
    if (v >= 10000) { return 5; }
    if (v >= 1000) { return 4; }
    if (v >= 100) { return 3; }
    if (v >= 10) { return 2; }
    return 1;
}

ubyte decimalLength9_1(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    const ubyte lzc = cast(ubyte) bsr(v);
    ubyte result;
    switch (lzc)
    {
        case 0 :
        case 1 :
        case 2 : result =  1; break;
        case 3 : result =  v >= 10 ? 2 : 1; break;
        case 4 :
        case 5 : result =  2; break;
        case 6 : result =  v >= 100 ? 3 : 2; break;
        case 7 :
        case 8 : result =  3; break;
        case 9 : result =  v >= 1000 ? 4 : 3; break;
        case 10:
        case 11:
        case 12: result =  4; break;
        case 13: result =  v >= 10000 ? 5 : 4; break;
        case 14:
        case 15: result =  5; break;
        case 16: result =  v >= 100000 ? 6 : 5; break;
        case 17:
        case 18: result =  6; break;
        case 19: result =  v >= 1000000 ? 7 : 6; break;
        case 20:
        case 21:
        case 22: result =  7; break;
        case 23: result =  v >= 10000000 ? 8 : 7; break;
        case 24:
        case 25: result =  8; break;
        case 26: result =  v >= 100000000 ? 9 : 8; break;
        case 27:
        case 28:
        case 29:
        case 30:
        case 31: result =  9; break;

        default: assert(false);
    }
    return result;
}

private ubyte decimalLength9_2(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    const ubyte lzc = cast(ubyte) bsr(v);
    static immutable pure nothrow ubyte function(uint)[32] tbl =
    [
    0 : (uint a) => ubyte(1),
    1 : (uint a) => ubyte(1),
    2 : (uint a) => ubyte(1),
    3 : (uint a) => a >= 10  ? ubyte(2) : ubyte(1),
    4 : (uint a) => ubyte(2),
    5 : (uint a) => ubyte(2),
    6 : (uint a) => a >= 100  ? ubyte(3) : ubyte(2),
    7 : (uint a) => ubyte(3),
    8 : (uint a) => ubyte(3),
    9 : (uint a) => a >= 1000  ? ubyte(4) : ubyte(3),
    10: (uint a) => ubyte(4),
    11: (uint a) => ubyte(4),
    12: (uint a) => ubyte(4),
    13: (uint a) => a >= 10000  ? ubyte(5) : ubyte(4),
    14: (uint a) => ubyte(5),
    15: (uint a) => ubyte(5),
    16: (uint a) => a >= 100000  ? ubyte(6) : ubyte(5),
    17: (uint a) => ubyte(6),
    18: (uint a) => ubyte(6),
    19: (uint a) => a >= 1000000  ? ubyte(7) : ubyte(6),
    20: (uint a) => ubyte(7),
    21: (uint a) => ubyte(7),
    22: (uint a) => ubyte(7),
    23: (uint a) => a >= 10000000  ? ubyte(8) : ubyte(7),
    24: (uint a) => ubyte(8),
    25: (uint a) => ubyte(8),
    26: (uint a) => a >= 100000000  ? ubyte(9) : ubyte(8),
    27: (uint a) => ubyte(9),
    28: (uint a) => ubyte(9),
    29: (uint a) => ubyte(9),
    30: (uint a) => ubyte(9),
    31: (uint a) => ubyte(9),
    ];
    return tbl[lzc](v);
}

ubyte decimalLength9_3(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    ubyte result;
    const ubyte lzc = cast(ubyte) bsr(v);
    const ubyte[32] decimalLength9LUT =
    [
    0 : ubyte(1),
    1 : ubyte(1),
    2 : ubyte(1),
    3 : ubyte(10),
    4 : ubyte(2),
    5 : ubyte(2),
    6 : ubyte(11),
    7 : ubyte(3),
    8 : ubyte(3),
    9 : ubyte(12),
    10: ubyte(4),
    11: ubyte(4),
    12: ubyte(4),
    13: ubyte(12),
    14: ubyte(5),
    15: ubyte(5),
    16: ubyte(13),
    17: ubyte(6),
    18: ubyte(6),
    19: ubyte(14),
    20: ubyte(7),
    21: ubyte(7),
    22: ubyte(7),
    23: ubyte(15),
    24: ubyte(8),
    25: ubyte(8),
    26: ubyte(16),
    27: ubyte(9),
    28: ubyte(9),
    29: ubyte(9),
    30: ubyte(9),
    31: ubyte(9),
    ];
    ubyte resultOrSelector = decimalLength9LUT[lzc];
    if (resultOrSelector < 10)
        result = resultOrSelector;
    else switch (lzc)
    {
        case 3 : result =  v >= 10 ? 2 : 1; break;
        case 6 : result =  v >= 100 ? 3 : 2; break;
        case 9 : result =  v >= 1000 ? 4 : 3; break;
        case 13: result =  v >= 10000 ? 5 : 4; break;
        case 16: result =  v >= 100000 ? 6 : 5; break;
        case 19: result =  v >= 1000000 ? 7 : 6; break;
        case 23: result =  v >= 10000000 ? 8 : 7; break;
        case 26: result =  v >= 100000000 ? 9 : 8; break;
        default: assert(false);
    }
    return result;
}

void main(string[] args)
{
    uint    sum;
    uint    count;
    ubyte   func;

    GC.disable;
    getopt(args, config.passThrough, "c|count", &count, "f|func", &func);
    static const funcs = [&decimalLength9_0, &decimalLength9_1, &decimalLength9_2, &decimalLength9_3];

    foreach (i; 0 .. count)
        sum += funcs[func](i);
    writeln(sum);
}
---

Results:

---
[mr_x@pc1 temp]$ time ./declen --count=100000000 --func=0
788888890

real    0m0,207s
user    0m0,203s
sys     0m0,003s
[mr_x@pc1 temp]$ time ./declen --count=100000000 --func=1
788888890

real    0m0,369s
user    0m0,366s
sys     0m0,002s
[mr_x@pc1 temp]$ time ./declen --count=100000000 --func=2
788888890

real    0m0,271s
user    0m0,264s
sys     0m0,003s
[mr_x@pc1 temp]$ time ./declen --count=100000000 --func=3
788888890

real    0m0,253s
user    0m0,250s
sys     0m0,002s
---

That's pathetic ;)

February 26, 2020
On Wednesday, 26 February 2020 at 13:50:11 UTC, Basile B. wrote:
> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
> ...
>     foreach (i; 0 .. count)
>         sum += funcs[func](i);

The input stream is highly predictable and strongly skewed towards higher digits.

The winning function implementation lines up with that distribution.  It would not fare as well with higher entropy input.
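This skew is easy to quantify: the benchmark feeds 0 .. count-1 in ascending order, and for count = 10^8 roughly 90% of those values have exactly 8 digits, so one branch path dominates and the predictor is nearly perfect. A small C sketch of the arithmetic (my own illustration, not code from the thread):

```c
#include <stdint.h>

/* How many integers in [1, n] have exactly d decimal digits?
   The d-digit range is [10^(d-1), 10^d - 1]. */
uint64_t countWithDigits(uint64_t n, int d)
{
    static const uint64_t pow10[] = {
        1ULL, 10ULL, 100ULL, 1000ULL, 10000ULL, 100000ULL,
        1000000ULL, 10000000ULL, 100000000ULL, 1000000000ULL
    };
    uint64_t lo = pow10[d - 1];
    uint64_t hi = pow10[d] - 1;
    if (n < lo) return 0;
    return (n < hi ? n : hi) - lo + 1;
}
```

For n = 10^8 this gives 90,000,000 eight-digit values, i.e. 90% of the inputs resolve on the same early comparisons in decimalLength9_0.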







February 26, 2020
On Wednesday, 26 February 2020 at 19:44:05 UTC, Bruce Carneal wrote:
> On Wednesday, 26 February 2020 at 13:50:11 UTC, Basile B. wrote:
>> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
>> ...
>>     foreach (i; 0 .. count)
>>         sum += funcs[func](i);
>
> The input stream is highly predictable and strongly skewed towards higher digits.
>
> The winning function implementation lines up with that distribution.  It would not fare as well with higher entropy input.

Using sorted equi-probable inputs (N 1 digit numbers, N 2 digit numbers, ...) decimalLength9_0 beats a simple branchless implementation by about 10%.

After shuffling the input, branchless wins by 2.4X (240%).
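The branchless version isn't shown here, but a classic shape for it (my guess at what is meant, not Bruce's code; it relies on GCC/Clang's `__builtin_clz`) combines the bit length with a power-of-ten table so the only data-dependent operation is a flag-setting compare:

```c
#include <stdint.h>

/* Branchless digit count: estimate from the bit length
   (bits * log10(2) ~= bits * 1233 / 4096), then correct by one
   with a compare that typically compiles to setcc, not a jump.
   Note: returns 0 for v == 0; the thread's code special-cases 0. */
uint8_t decimalLengthBranchless(uint32_t v)
{
    static const uint32_t pow10[10] = {
        1u, 10u, 100u, 1000u, 10000u, 100000u,
        1000000u, 10000000u, 100000000u, 1000000000u
    };
    int bits = 32 - __builtin_clz(v | 1);   /* v|1 keeps clz defined at 0 */
    int t = (bits * 1233) >> 12;
    return (uint8_t)(t + 1 - (v < pow10[t]));
}
```

Its per-call cost is flat regardless of the input, which is consistent with the numbers above: it loses slightly to the predictable cascade on sorted input and wins decisively once the input is shuffled.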

February 26, 2020
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
> So after reading the translation of RYU I was interested too see if the decimalLength() function can be written to be faster, as it cascades up to 8 CMP.
>
> ...
>
> Then bad surprise. Even with ldmd (so ldc2 basically) feeded with the args from the script line. Maybe the fdecimalLength9 version is slightly faster. Only *slightly*. Good news, I've lost my time. So I try an alternative version that uses a table of delegates instead of a switch (ffdecimalLength9) and surprise, "tada", it is like **100x** slower then the two others.
>
> How is that possible ?

Hi Basile,
  I recently saw this presentation: https://www.youtube.com/watch?v=Czr5dBfs72U
It has some ideas that may help you make sure your measurements are good and may give you ideas to find the performance bottleneck or where to optimize.
llvm-mca is featured on godbolt.org: https://mca.godbolt.org/z/YWp3yv

cheers,
  Johan

February 26, 2020
On Wednesday, 26 February 2020 at 22:07:30 UTC, Johan wrote:
> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
>> [...]
>
> Hi Basile,
>   I recently saw this presentation: https://www.youtube.com/watch?v=Czr5dBfs72U
> It has some ideas that may help you make sure your measurements are good and may give you ideas to find the performance bottleneck or where to optimize.
> llvm-mca is featured on godbolt.org: https://mca.godbolt.org/z/YWp3yv
>
> cheers,
>   Johan

Yes, llvm-mca looks excellent, although I don't know if it's worth continuing... You see, this function is certainly not a bottleneck; it's just that I wanted to try to beat the naive implementation.

Fundamentally the problem is that
1. the original is smaller and faster to decode
2. the alternatives (esp. the 3rd) are conceptually better, but the cost of the jump table + lzcnt wastes the gain.
February 26, 2020
On Wednesday, 26 February 2020 at 20:44:31 UTC, Bruce Carneal wrote:
>> The winning function implementation lines up with that distribution.  It would not fare as well with higher entropy input.
>
> Using sorted equi-probable inputs (N 1 digit numbers, N 2 digit numbers, ...) decimalLength9_0 beats a simple branchless implementation by about 10%.
>
> After shuffling the input, branchless wins by 2.4X (240%).

I've replaced the input with the front of a rndGen (popped count times, starting with a custom seed) and I never see decimalLength9_3 (which seems to be the closest to the original in terms of performance) doing better.

Maybe you were talking about another implementation of decimalLength9?