February 26, 2020 Strange counter-performance in an alternative `decimalLength9` function
So after reading the translation of Ryu I was interested to see if the decimalLength() function could be written to be faster, as it cascades up to 8 CMPs. After struggling with bad ideas I finally found something that looked nice:

- count the leading zeroes of the input
- switch() on this count so that in the worst case only 1 CMP is needed (when the bit count may change the digit count, e.g. 9 -> 10, i.e. when bsr(input) == 4)

After writing out all the values, to identify the cases where a comparison is necessary...

min2 = 0b10
max2 = 0b11
min3 = 0b100
max3 = 0b111
...
min32 = 0b100.......0
max32 = 0b111.......1

...I finally wrote the "nice" thing.

---
#!dmd -boundscheck=off -O -release -inline
import std.stdio;

ubyte decimalLength9(const uint v)
{
    if (v >= 100000000) { return 9; }
    if (v >= 10000000) { return 8; }
    if (v >= 1000000) { return 7; }
    if (v >= 100000) { return 6; }
    if (v >= 10000) { return 5; }
    if (v >= 1000) { return 4; }
    if (v >= 100) { return 3; }
    if (v >= 10) { return 2; }
    return 1;
}

ubyte fdecimalLength9(const uint v) pure nothrow
{
    import core.bitop;
    const ubyte lzc = cast(ubyte) (bsr(v) + 1);
    ubyte result;
    switch (lzc)
    {
        case 0: case 1: case 2: case 3: result = 1; break;
        case 4: result = v >= 10 ? 2 : 1; break;
        case 5: case 6: result = 2; break;
        case 7: result = v >= 100 ? 3 : 2; break;
        case 8: case 9: result = 3; break;
        case 10: result = v >= 1000 ? 4 : 3; break;
        case 11: case 12: case 13: result = 4; break;
        case 14: result = v >= 10000 ? 5 : 4; break;
        case 15: case 16: result = 5; break;
        case 17: result = v >= 100000 ? 6 : 5; break;
        case 18: case 19: result = 6; break;
        case 20: result = v >= 1000000 ? 7 : 6; break;
        case 21: case 22: case 23: result = 7; break;
        case 24: result = v >= 10000000 ? 8 : 7; break;
        case 25: case 26: result = 8; break;
        case 27: result = v >= 100000000 ? 9 : 8; break;
        case 28: case 29: case 30: case 31: case 32: result = 9; break;
        default: assert(false);
    }
    return result;
}

private ubyte ffdecimalLength9(const uint v) pure nothrow
{
    import core.bitop;
    const ubyte lzc = cast(ubyte) (bsr(v) + 1);
    static immutable pure nothrow ubyte function(uint)[33] tbl =
    [
        0 : (uint a) => ubyte(1),
        1 : (uint a) => ubyte(1),
        2 : (uint a) => ubyte(1),
        3 : (uint a) => ubyte(1),
        4 : (uint a) => a >= 10 ? ubyte(2) : ubyte(1),
        5 : (uint a) => ubyte(2),
        6 : (uint a) => ubyte(2),
        7 : (uint a) => a >= 100 ? ubyte(3) : ubyte(2),
        8 : (uint a) => ubyte(3),
        9 : (uint a) => ubyte(3),
        10: (uint a) => a >= 1000 ? ubyte(4) : ubyte(3),
        11: (uint a) => ubyte(4),
        12: (uint a) => ubyte(4),
        13: (uint a) => ubyte(4),
        14: (uint a) => a >= 10000 ? ubyte(5) : ubyte(4),
        15: (uint a) => ubyte(5),
        16: (uint a) => ubyte(5),
        17: (uint a) => a >= 100000 ? ubyte(6) : ubyte(5),
        18: (uint a) => ubyte(6),
        19: (uint a) => ubyte(6),
        20: (uint a) => a >= 1000000 ? ubyte(7) : ubyte(6),
        21: (uint a) => ubyte(7),
        22: (uint a) => ubyte(7),
        23: (uint a) => ubyte(7),
        24: (uint a) => a >= 10000000 ? ubyte(8) : ubyte(7),
        25: (uint a) => ubyte(8),
        26: (uint a) => ubyte(8),
        27: (uint a) => a >= 100000000 ? ubyte(9) : ubyte(8),
        28: (uint a) => ubyte(9),
        29: (uint a) => ubyte(9),
        30: (uint a) => ubyte(9),
        31: (uint a) => ubyte(9),
        32: (uint a) => ubyte(9),
    ];
    return tbl[lzc](v);
}

void main()
{
    import std.datetime.stopwatch, std.range, std.algorithm;
    int s1, s2, s3;
    benchmark!({ iota(100000000u).each!(a => s1 += decimalLength9(a+1)); })(10).writeln;
    benchmark!({ iota(100000000u).each!(a => s2 += fdecimalLength9(a+1)); })(10).writeln;
    benchmark!({ iota(100000000u).each!(a => s3 += ffdecimalLength9(a+1)); })(10).writeln;
    assert(s1 == s2);
    assert(s1 == s3);
}
---

Then bad surprise. Even with ldmd (so ldc2, basically) fed with the args from the script line, the fdecimalLength9 version is maybe slightly faster. Only *slightly*. Good news: I've lost my time.

So I tried an alternative version that uses a table of delegates instead of a switch (ffdecimalLength9) and, surprise, "tada", it is like **100x** slower than the two others.

How is that possible?
February 26, 2020 Re: Strange counter-performance in an alternative `decimalLength9` function
Posted in reply to Basile B. | On Wed, Feb 26, 2020 at 12:50:35AM +0000, Basile B. via Digitalmars-d-learn wrote:
[...]
> #!dmd -boundscheck=off -O -release -inline
[...]

TBH, I'm skeptical of any performance results using dmd. I wouldn't pay attention to performance numbers obtained this way, and rather look at the ldmd/ldc2 numbers.

[...]
> Then bad surprise. Even with ldmd (so ldc2, basically) fed with the args
> from the script line, the fdecimalLength9 version is maybe slightly
> faster. Only *slightly*. Good news: I've lost my time. So I tried an
> alternative version that uses a table of delegates instead of a switch
> (ffdecimalLength9) and, surprise, "tada", it is like **100x** slower
> than the two others.
>
> How is that possible?

Did you check the assembly output to see what the difference is?

Delegates involve a function call, which involves function call overhead, which includes a CPU pipeline hazard. Worse yet, it's an indirect call, meaning you're defeating the CPU branch predictor and invalidating the instruction cache. And on top of that, delegates involve allocating a context, and you *really* do not want allocations inside an inner loop.

And of course, normal function calls are easier for compilers to inline, because the destination is fixed. Indirect calls involving delegates are hard to predict, and the optimizer is more liable to just give up.

These are just my educated guesses, of course. For the real answer, look at the assembly output. :-D

T

--
What are you when you run out of Monet? Baroque.
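The direct-vs-indirect distinction above can be seen in miniature with a C sketch (illustrative only, not code from the thread; the function names are mine). The direct version exposes fixed call targets the optimizer can inline, while the table version forces a load plus an indirect call on every use:

```c
#include <stdint.h>

typedef uint8_t (*len_fn)(uint32_t);

static uint8_t one(uint32_t v) { (void)v; return 1; }
static uint8_t two_or_one(uint32_t v) { return v >= 10 ? 2 : 1; }

/* Direct dispatch: targets are known at compile time, so calls can be
   inlined and the single branch is easy to predict. */
static uint8_t direct(uint32_t v)
{
    return v < 10 ? one(v) : two_or_one(v);
}

/* Indirect dispatch: the target is loaded from a table at run time;
   the CPU must predict the destination address, and the optimizer
   usually cannot inline through the function pointer. */
static uint8_t indirect(uint32_t v)
{
    static const len_fn tbl[2] = { one, two_or_one };
    return tbl[v >= 10](v);
}
```

Both return the same results; only the dispatch mechanism differs, which is exactly the cost being discussed here.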
February 26, 2020 Re: Strange counter-performance in an alternative `decimalLength9` function
Posted in reply to H. S. Teoh | On Wednesday, 26 February 2020 at 01:10:07 UTC, H. S. Teoh wrote:
> On Wed, Feb 26, 2020 at 12:50:35AM +0000, Basile B. via Digitalmars-d-learn wrote: [...]
>> #!dmd -boundscheck=off -O -release -inline
> [...]
>
> TBH, I'm skeptical of any performance results using dmd. I wouldn't pay attention to performance numbers obtained this way, and rather look at the ldmd/ldc2 numbers.
I didn't use DMD. The script line is actually interpreted by the IDE: it drops the compiler name, parses the args, and passes them to a compiler that's defined by non-overridable options. In my case I used LDMD (this is LDC, you see, but with a DMD-like option syntax).
February 26, 2020 Re: Strange counter-performance in an alternative `decimalLength9` function
Posted in reply to Basile B. | On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
> So after reading the translation of RYU I was interested too see if the decimalLength() function can be written to be faster, as it cascades up to 8 CMP.
>
> [...]
It can be made faster using binary search. Not by much though.
You'll get ceil(log2(numberofplaces)) cmps.
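For illustration, one plausible shape of that binary search (a C sketch, not code from the thread; name and layout are mine): each comparison splits the nine possible digit counts roughly in half, so no input needs more than ceil(log2(9)) = 4 CMPs, versus up to 8 for the cascade.

```c
#include <stdint.h>

/* Hypothetical binary-search variant of decimalLength9: narrow the
   1..9 digit range by halving instead of cascading linearly.
   Worst case is 4 comparisons (e.g. any 1-digit input). */
static uint8_t decimal_length9_bsearch(uint32_t v)
{
    if (v >= 100000) {                      /* 6..9 digits */
        if (v >= 10000000)                  /* 8..9 digits */
            return v >= 100000000 ? 9 : 8;
        return v >= 1000000 ? 7 : 6;        /* 6..7 digits */
    }
    if (v >= 1000)                          /* 4..5 digits */
        return v >= 10000 ? 5 : 4;
    if (v >= 100)
        return 3;
    return v >= 10 ? 2 : 1;                 /* 1..2 digits */
}
```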
February 26, 2020 Re: Strange counter-performance in an alternative `decimalLength9` function
Posted in reply to Basile B. | On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
> How is that possible ?

It turns out that there's a problem with the benchmarking method. With command line arguments, the different optimization passes of LLVM can't mess with the literal constants. It appears that none of the alternatives based on the "most significant bit trick" are faster (at least when covering a decent range of numbers):

---
#!ldmd -boundscheck=off -release -inline -O -mcpu=native -mattr=+sse2,+sse3,+sse4.1,+sse4.2,+fast-lzcnt,+avx,+avx2,+cmov,+bmi,+bmi2
import core.memory;
import core.bitop;
import std.stdio;
import std.range;
import std.algorithm;
import std.getopt;

ubyte decimalLength9_0(const uint v)
{
    if (v >= 100000000) { return 9; }
    if (v >= 10000000) { return 8; }
    if (v >= 1000000) { return 7; }
    if (v >= 100000) { return 6; }
    if (v >= 10000) { return 5; }
    if (v >= 1000) { return 4; }
    if (v >= 100) { return 3; }
    if (v >= 10) { return 2; }
    return 1;
}

ubyte decimalLength9_1(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    const ubyte lzc = cast(ubyte) bsr(v);
    ubyte result;
    switch (lzc)
    {
        case 0: case 1: case 2: result = 1; break;
        case 3: result = v >= 10 ? 2 : 1; break;
        case 4: case 5: result = 2; break;
        case 6: result = v >= 100 ? 3 : 2; break;
        case 7: case 8: result = 3; break;
        case 9: result = v >= 1000 ? 4 : 3; break;
        case 10: case 11: case 12: result = 4; break;
        case 13: result = v >= 10000 ? 5 : 4; break;
        case 14: case 15: result = 5; break;
        case 16: result = v >= 100000 ? 6 : 5; break;
        case 17: case 18: result = 6; break;
        case 19: result = v >= 1000000 ? 7 : 6; break;
        case 20: case 21: case 22: result = 7; break;
        case 23: result = v >= 10000000 ? 8 : 7; break;
        case 24: case 25: result = 8; break;
        case 26: result = v >= 100000000 ? 9 : 8; break;
        case 27: case 28: case 29: case 30: case 31: result = 9; break;
        default: assert(false);
    }
    return result;
}

private ubyte decimalLength9_2(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    const ubyte lzc = cast(ubyte) bsr(v);
    static immutable pure nothrow ubyte function(uint)[32] tbl =
    [
        0 : (uint a) => ubyte(1),
        1 : (uint a) => ubyte(1),
        2 : (uint a) => ubyte(1),
        3 : (uint a) => a >= 10 ? ubyte(2) : ubyte(1),
        4 : (uint a) => ubyte(2),
        5 : (uint a) => ubyte(2),
        6 : (uint a) => a >= 100 ? ubyte(3) : ubyte(2),
        7 : (uint a) => ubyte(3),
        8 : (uint a) => ubyte(3),
        9 : (uint a) => a >= 1000 ? ubyte(4) : ubyte(3),
        10: (uint a) => ubyte(4),
        11: (uint a) => ubyte(4),
        12: (uint a) => ubyte(4),
        13: (uint a) => a >= 10000 ? ubyte(5) : ubyte(4),
        14: (uint a) => ubyte(5),
        15: (uint a) => ubyte(5),
        16: (uint a) => a >= 100000 ? ubyte(6) : ubyte(5),
        17: (uint a) => ubyte(6),
        18: (uint a) => ubyte(6),
        19: (uint a) => a >= 1000000 ? ubyte(7) : ubyte(6),
        20: (uint a) => ubyte(7),
        21: (uint a) => ubyte(7),
        22: (uint a) => ubyte(7),
        23: (uint a) => a >= 10000000 ? ubyte(8) : ubyte(7),
        24: (uint a) => ubyte(8),
        25: (uint a) => ubyte(8),
        26: (uint a) => a >= 100000000 ? ubyte(9) : ubyte(8),
        27: (uint a) => ubyte(9),
        28: (uint a) => ubyte(9),
        29: (uint a) => ubyte(9),
        30: (uint a) => ubyte(9),
        31: (uint a) => ubyte(9),
    ];
    return tbl[lzc](v);
}

ubyte decimalLength9_3(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    ubyte result;
    const ubyte lzc = cast(ubyte) bsr(v);
    const ubyte[32] decimalLength9LUT =
    [
        0 : ubyte(1),  1 : ubyte(1),  2 : ubyte(1),  3 : ubyte(10),
        4 : ubyte(2),  5 : ubyte(2),  6 : ubyte(11), 7 : ubyte(3),
        8 : ubyte(3),  9 : ubyte(12), 10: ubyte(4),  11: ubyte(4),
        12: ubyte(4),  13: ubyte(12), 14: ubyte(5),  15: ubyte(5),
        16: ubyte(13), 17: ubyte(6),  18: ubyte(6),  19: ubyte(14),
        20: ubyte(7),  21: ubyte(7),  22: ubyte(7),  23: ubyte(15),
        24: ubyte(8),  25: ubyte(8),  26: ubyte(16), 27: ubyte(9),
        28: ubyte(9),  29: ubyte(9),  30: ubyte(9),  31: ubyte(9),
    ];
    ubyte resultOrSelector = decimalLength9LUT[lzc];
    if (resultOrSelector < 10)
        result = resultOrSelector;
    else switch (lzc)
    {
        case 3 : result = v >= 10 ? 2 : 1; break;
        case 6 : result = v >= 100 ? 3 : 2; break;
        case 9 : result = v >= 1000 ? 4 : 3; break;
        case 13: result = v >= 10000 ? 5 : 4; break;
        case 16: result = v >= 100000 ? 6 : 5; break;
        case 19: result = v >= 1000000 ? 7 : 6; break;
        case 23: result = v >= 10000000 ? 8 : 7; break;
        case 26: result = v >= 100000000 ? 9 : 8; break;
        default: assert(false);
    }
    return result;
}

void main(string[] args)
{
    uint sum;
    uint count;
    ubyte func;
    GC.disable;
    getopt(args, config.passThrough, "c|count", &count, "f|func", &func);
    static const funcs = [&decimalLength9_0, &decimalLength9_1,
                          &decimalLength9_2, &decimalLength9_3];
    foreach (i; 0 .. count)
        sum += funcs[func](i);
    writeln(sum);
}
---

Results:

---
[mr_x@pc1 temp]$ time ./declen --count=100000000 --func=0
788888890

real    0m0,207s
user    0m0,203s
sys     0m0,003s

[mr_x@pc1 temp]$ time ./declen --count=100000000 --func=1
788888890

real    0m0,369s
user    0m0,366s
sys     0m0,002s

[mr_x@pc1 temp]$ time ./declen --count=100000000 --func=2
788888890

real    0m0,271s
user    0m0,264s
sys     0m0,003s

[mr_x@pc1 temp]$ time ./declen --count=100000000 --func=3
788888890

real    0m0,253s
user    0m0,250s
sys     0m0,002s
---

That's pathetic ;)
February 26, 2020 Re: Strange counter-performance in an alternative `decimalLength9` function
Posted in reply to Basile B. | On Wednesday, 26 February 2020 at 13:50:11 UTC, Basile B. wrote:
> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
> ...
> foreach (i; 0 .. count)
> sum += funcs[func](i);
The input stream is highly predictable and strongly skewed towards higher digits.
The winning function implementation lines up with that distribution. It would not fare as well with higher entropy input.
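That skew is easy to quantify for the benchmark loop, which feeds the values 0 .. 10^8 - 1 in order: there are 9·10^(d-1) values with exactly d digits, so the 8-digit values alone make up 90% of the stream. A small helper (illustrative C, not code from the thread) makes the tally concrete:

```c
#include <stdint.h>

/* Number of values in 0 .. 10^8 - 1 with exactly d decimal digits:
   9 * 10^(d-1) for d >= 2, and 10 for d == 1 (the values 0..9).
   For d == 8 this is 90,000,000 of the 100,000,000 inputs, i.e. 90%,
   so the digit-count branches are almost always resolved the same way. */
static uint64_t count_with_digits(int d)
{
    uint64_t nd = 9;
    for (int i = 1; i < d; ++i)
        nd *= 10;
    return d == 1 ? 10 : nd;
}
```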
February 26, 2020 Re: Strange counter-performance in an alternative `decimalLength9` function
Posted in reply to Bruce Carneal | On Wednesday, 26 February 2020 at 19:44:05 UTC, Bruce Carneal wrote:
> On Wednesday, 26 February 2020 at 13:50:11 UTC, Basile B. wrote:
>> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
>> ...
>> foreach (i; 0 .. count)
>> sum += funcs[func](i);
>
> The input stream is highly predictable and strongly skewed towards higher digits.
>
> The winning function implementation lines up with that distribution. It would not fare as well with higher entropy input.
Using sorted equi-probable inputs (N 1 digit numbers, N 2 digit numbers, ...) decimalLength9_0 beats a simple branchless implementation by about 10%.
After shuffling the input, branchless wins by 2.4X (240%).
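One form such a "simple branchless implementation" could take (a hypothetical C sketch, not Bruce's actual code): each comparison result is a 0-or-1 value that is summed, so the compiler can use flag-setting instructions instead of conditional jumps, and throughput no longer depends on how predictable the inputs are.

```c
#include <stdint.h>

/* Hypothetical branchless digit count: sum the boolean results of the
   eight threshold comparisons. With no data-dependent branches, cost
   is constant whether the input stream is sorted or shuffled. */
static uint8_t decimal_length9_branchless(uint32_t v)
{
    return (uint8_t)(1
        + (v >= 10u)
        + (v >= 100u)
        + (v >= 1000u)
        + (v >= 10000u)
        + (v >= 100000u)
        + (v >= 1000000u)
        + (v >= 10000000u)
        + (v >= 100000000u));
}
```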
February 26, 2020 Re: Strange counter-performance in an alternative `decimalLength9` function
Posted in reply to Basile B. | On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
> So after reading the translation of Ryu I was interested to see if the decimalLength() function could be written to be faster, as it cascades up to 8 CMPs.
>
> ...
>
> Then bad surprise. Even with ldmd (so ldc2, basically) fed with the args from the script line, the fdecimalLength9 version is maybe slightly faster. Only *slightly*. Good news: I've lost my time. So I tried an alternative version that uses a table of delegates instead of a switch (ffdecimalLength9) and, surprise, "tada", it is like **100x** slower than the two others.
>
> How is that possible?

Hi Basile,
I recently saw this presentation: https://www.youtube.com/watch?v=Czr5dBfs72U
It has some ideas that may help you make sure your measurements are good, and may give you ideas to find the performance bottleneck or where to optimize.
llvm-mca is featured on godbolt.org: https://mca.godbolt.org/z/YWp3yv

cheers,
Johan
February 26, 2020 Re: Strange counter-performance in an alternative `decimalLength9` function
Posted in reply to Johan | On Wednesday, 26 February 2020 at 22:07:30 UTC, Johan wrote:
> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
>> [...]
>
> Hi Basile,
> I recently saw this presentation: https://www.youtube.com/watch?v=Czr5dBfs72U
> It has some ideas that may help you make sure your measurements are good and may give you ideas to find the performance bottleneck or where to optimize.
> llvm-mca is featured on godbolt.org: https://mca.godbolt.org/z/YWp3yv
>
> cheers,
> Johan
yes, llvm-mca looks excellent, although I don't know if it's worth continuing... You see, this function is certainly not a bottleneck; it's just that I wanted to try to do better than the naive implementation.
Fundamentally the problem is that
1. the original is smaller and faster to decode
2. the alternatives (esp. the 3rd) are conceptually better, but the cost of the jump table + LZCNT wastes the gain.
February 26, 2020 Re: Strange counter-performance in an alternative `decimalLength9` function
Posted in reply to Bruce Carneal | On Wednesday, 26 February 2020 at 20:44:31 UTC, Bruce Carneal wrote:
>> The winning function implementation lines up with that distribution. It would not fare as well with higher entropy input.
>
> Using sorted equi-probable inputs (N 1 digit numbers, N 2 digit numbers, ...) decimalLength9_0 beats a simple branchless implementation by about 10%.
>
> After shuffling the input, branchless wins by 2.4X (240%).
I've replaced the input with the front of a rndGen (popped count times, starting from a custom seed) and I never see decimalLength9_3 (which seems to be the closest to the original in terms of performance) doing better.
Maybe you were talking about another implementation of decimalLength9?
Copyright © 1999-2021 by the D Language Foundation