February 27, 2020
On Wednesday, 26 February 2020 at 23:09:34 UTC, Basile B. wrote:
> On Wednesday, 26 February 2020 at 20:44:31 UTC, Bruce Carneal wrote:
>>
>> After shuffling the input, branchless wins by 2.4X (240%).
>
> I've replaced the input by the front of a rndGen (that pops for count times and starting with a custom seed) and I never see the decimalLength9_3 (which seems to be the closest to the original in terms of performance) doing better.

The uniform random uint distribution yields a highly skewed digits distribution.  Note, for example, that only 10 of the 2**32 possible outputs from a uint PRNG can be represented by a single decimal digit.

Your current winner will be very hard to beat if the inputs are uniform random.  That implementation's high-to-low cascade of early exit ifs aligns with PRNG output.
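
To put numbers on that skew, here is a minimal sketch that counts how many of the 2**32 uint values have each decimal length. The helper name `countWithDigits` is made up for illustration.

```d
import std.stdio : writefln;

/// Number of uint values whose decimal representation has exactly `digits` digits.
/// Hypothetical helper, for illustration only.
ulong countWithDigits(uint digits)
{
    enum ulong limit = 1UL << 32;  // one past uint.max
    ulong lo = 0;                  // smallest value with this many digits (0 counts as 1 digit)
    ulong hi = 10;                 // one past the largest such value
    foreach (_; 1 .. digits)
    {
        lo = hi;
        hi *= 10;
    }
    if (lo >= limit)
        return 0;
    return (hi < limit ? hi : limit) - lo;
}

void main()
{
    enum double total = 4294967296.0; // 2**32
    foreach (uint digits; 1 .. 11)
    {
        const n = countWithDigits(digits);
        writefln("%2d digits: %10d values (%10.6f%%)", digits, n, 100.0 * n / total);
    }
}
```

About 76.7% of the values have 10 digits and another 21.0% have 9, so with a uniform PRNG roughly 97.7% of the calls leave a high-to-low cascade at its first comparison (v >= 100000000).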

If you aren't sure what the input distribution is, or want guaranteed good worst case behavior, then branchless may be a better way to go.

>
> Maybe you talked about another implementation of decimalLength9 ?

Yes.  It's one I wrote after I saw your post. Pseudo-code here:

  auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 100) ... }

Using ldc to target an x86 with the above yields a series of cmpl, seta instruction pairs in the function body followed by a summing and a return.  No branching.

Let me know if the above is unclear or insufficient.

February 27, 2020
On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal wrote:
> On Wednesday, 26 February 2020 at 23:09:34 UTC, Basile B. wrote:
>> On Wednesday, 26 February 2020 at 20:44:31 UTC, Bruce Carneal wrote:
>>>
>>> After shuffling the input, branchless wins by 2.4X (240%).
>>
>> 
> snip
>
> Let me know if the above is unclear or insufficient.

The 2.4X improvement came when using a shuffled uniform digits distribution.  Equal numbers of 1 digit values, 2 digit values, ... put into an array and shuffled.

Branchless *loses* by 8.5% or so on my Zen1 when the array is not shuffled (when branch predictions are nearly perfect).

February 27, 2020
On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal wrote:
>>
>> Maybe you talked about another implementation of decimalLength9 ?
>
> Yes.  It's one I wrote after I saw your post. Pseudo-code here:
>
>   auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 100) ... }
>
> Using ldc to target an x86 with the above yields a series of cmpl, seta instruction pairs in the function body followed by a summing and a return.  No branching.
>
> Let me know if the above is unclear or insufficient.

No thanks, it's crystal clear now.
February 27, 2020
On Thursday, 27 February 2020 at 04:44:56 UTC, Basile B. wrote:
> On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal wrote:
>>>
>>> Maybe you talked about another implementation of decimalLength9 ?
>>
>> Yes.  It's one I wrote after I saw your post. Pseudo-code here:
>>
>>   auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 100) ... }
>>
>> Using ldc to target an x86 with the above yields a series of cmpl, seta instruction pairs in the function body followed by a summing and a return.  No branching.
>>
>> Let me know if the above is unclear or insufficient.
>
> No thanks, it's crystal clear now.

although I don't see the performance gain. Now for me a hybrid version based on a LUT and on the branchless one is the fastest (decimalLength9_5), although still slower than the original.

updated program, incl your branchless version (decimalLength9_4):

---
#!ldmd -boundscheck=off -release -inline -O -mcpu=native -mattr=+sse2,+sse3,+sse4.1,+sse4.2,+fast-lzcnt,+avx,+avx2,+cmov,+bmi,+bmi2
import core.memory;
import core.bitop;
import std.stdio;
import std.range;
import std.algorithm;
import std.getopt;
import std.random;

ubyte decimalLength9_0(const uint v)
{
    if (v >= 100000000) { return 9; }
    if (v >= 10000000) { return 8; }
    if (v >= 1000000) { return 7; }
    if (v >= 100000) { return 6; }
    if (v >= 10000) { return 5; }
    if (v >= 1000) { return 4; }
    if (v >= 100) { return 3; }
    if (v >= 10) { return 2; }
    return 1;
}

ubyte decimalLength9_1(const uint v) pure nothrow
{
    if (v == 0) // bsr's result is undefined when the input is 0
        return 1;
    const ubyte lzc = cast(ubyte) bsr(v);
    ubyte result;
    switch (lzc)
    {
        case 0 :
        case 1 :
        case 2 : result =  1; break;
        case 3 : result =  v >= 10 ? 2 : 1; break;
        case 4 :
        case 5 : result =  2; break;
        case 6 : result =  v >= 100 ? 3 : 2; break;
        case 7 :
        case 8 : result =  3; break;
        case 9 : result =  v >= 1000 ? 4 : 3; break;
        case 10:
        case 11:
        case 12: result =  4; break;
        case 13: result =  v >= 10000 ? 5 : 4; break;
        case 14:
        case 15: result =  5; break;
        case 16: result =  v >= 100000 ? 6 : 5; break;
        case 17:
        case 18: result =  6; break;
        case 19: result =  v >= 1000000 ? 7 : 6; break;
        case 20:
        case 21:
        case 22: result =  7; break;
        case 23: result =  v >= 10000000 ? 8 : 7; break;
        case 24:
        case 25: result =  8; break;
        case 26: result =  v >= 100000000 ? 9 : 8; break;
        case 27:
        case 28:
        case 29:
        case 30:
        case 31: result =  9; break;
        default: assert(false);
    }
    return result;
}

private ubyte decimalLength9_2(const uint v) pure nothrow
{
    if (v == 0) // bsr's result is undefined when the input is 0
        return 1;
    const ubyte lzc = cast(ubyte) bsr(v);
    static immutable pure nothrow ubyte function(uint)[32] tbl =
    [
    0 : (uint a) => ubyte(1),
    1 : (uint a) => ubyte(1),
    2 : (uint a) => ubyte(1),
    3 : (uint a) => a >= 10  ? ubyte(2) : ubyte(1),
    4 : (uint a) => ubyte(2),
    5 : (uint a) => ubyte(2),
    6 : (uint a) => a >= 100  ? ubyte(3) : ubyte(2),
    7 : (uint a) => ubyte(3),
    8 : (uint a) => ubyte(3),
    9 : (uint a) => a >= 1000  ? ubyte(4) : ubyte(3),
    10: (uint a) => ubyte(4),
    11: (uint a) => ubyte(4),
    12: (uint a) => ubyte(4),
    13: (uint a) => a >= 10000  ? ubyte(5) : ubyte(4),
    14: (uint a) => ubyte(5),
    15: (uint a) => ubyte(5),
    16: (uint a) => a >= 100000  ? ubyte(6) : ubyte(5),
    17: (uint a) => ubyte(6),
    18: (uint a) => ubyte(6),
    19: (uint a) => a >= 1000000  ? ubyte(7) : ubyte(6),
    20: (uint a) => ubyte(7),
    21: (uint a) => ubyte(7),
    22: (uint a) => ubyte(7),
    23: (uint a) => a >= 10000000  ? ubyte(8) : ubyte(7),
    24: (uint a) => ubyte(8),
    25: (uint a) => ubyte(8),
    26: (uint a) => a >= 100000000  ? ubyte(9) : ubyte(8),
    27: (uint a) => ubyte(9),
    28: (uint a) => ubyte(9),
    29: (uint a) => ubyte(9),
    30: (uint a) => ubyte(9),
    31: (uint a) => ubyte(9),
    ];
    return tbl[lzc](v);
}

ubyte decimalLength9_3(const uint v) pure nothrow
{
    if (v == 0) // bsr's result is undefined when the input is 0
        return 1;
    ubyte result;
    enum ubyte doSwitch = ubyte(0);
    const ubyte lzc = cast(ubyte) bsr(v);
    const ubyte[32] decimalLength9LUT =
    [
    0 : ubyte(1),
    1 : ubyte(1),
    2 : ubyte(1),
    3 : doSwitch,
    4 : ubyte(2),
    5 : ubyte(2),
    6 : doSwitch,
    7 : ubyte(3),
    8 : ubyte(3),
    9 : doSwitch,
    10: ubyte(4),
    11: ubyte(4),
    12: ubyte(4),
    13: doSwitch,
    14: ubyte(5),
    15: ubyte(5),
    16: doSwitch,
    17: ubyte(6),
    18: ubyte(6),
    19: doSwitch,
    20: ubyte(7),
    21: ubyte(7),
    22: ubyte(7),
    23: doSwitch,
    24: ubyte(8),
    25: ubyte(8),
    26: doSwitch,
    27: ubyte(9),
    28: ubyte(9),
    29: ubyte(9),
    30: ubyte(9),
    31: ubyte(9),
    ];
    ubyte resultOrSelector = decimalLength9LUT[lzc];
    if (resultOrSelector != doSwitch)
        result = resultOrSelector;
    else switch (lzc)
    {
        case 3 : result =  v >= 10 ? 2 : 1; break;
        case 6 : result =  v >= 100 ? 3 : 2; break;
        case 9 : result =  v >= 1000 ? 4 : 3; break;
        case 13: result =  v >= 10000 ? 5 : 4; break;
        case 16: result =  v >= 100000 ? 6 : 5; break;
        case 19: result =  v >= 1000000 ? 7 : 6; break;
        case 23: result =  v >= 10000000 ? 8 : 7; break;
        case 26: result =  v >= 100000000 ? 9 : 8; break;
        default: assert(false);
    }
    return result;
}

ubyte decimalLength9_4(const uint v) pure nothrow
{
    return 1 +  (v >= 10) +
                (v >= 100) +
                (v >= 1000) +
                (v >= 10000) +
                (v >= 100000) +
                (v >= 1000000) +
                (v >= 10000000) +
                (v >= 100000000) ;
}

ubyte decimalLength9_5(const uint v) pure nothrow
{
    if (v == 0) // bsr's result is undefined when the input is 0
        return 1;
    ubyte result;
    enum ubyte doBranchlessWay = ubyte(0);
    const ubyte lzc = cast(ubyte) bsr(v);
    const ubyte[32] decimalLength9LUT =
    [
    0 : ubyte(1),
    1 : ubyte(1),
    2 : ubyte(1),
    3 : doBranchlessWay,
    4 : ubyte(2),
    5 : ubyte(2),
    6 : doBranchlessWay,
    7 : ubyte(3),
    8 : ubyte(3),
    9 : doBranchlessWay,
    10: ubyte(4),
    11: ubyte(4),
    12: ubyte(4),
    13: doBranchlessWay,
    14: ubyte(5),
    15: ubyte(5),
    16: doBranchlessWay,
    17: ubyte(6),
    18: ubyte(6),
    19: doBranchlessWay,
    20: ubyte(7),
    21: ubyte(7),
    22: ubyte(7),
    23: doBranchlessWay,
    24: ubyte(8),
    25: ubyte(8),
    26: doBranchlessWay,
    27: ubyte(9),
    28: ubyte(9),
    29: ubyte(9),
    30: ubyte(9),
    31: ubyte(9),
    ];
    ubyte resultOrSelector = decimalLength9LUT[lzc];
    if (resultOrSelector != doBranchlessWay)
        result = resultOrSelector;
    else
    {
    result = 1 + (v >= 10) +
                 (v >= 100) +
                 (v >= 1000) +
                 (v >= 10000) +
                 (v >= 100000) +
                 (v >= 1000000) +
                 (v >= 10000000) +
                 (v >= 100000000) ;
    }
    return result;
}

void main(string[] args)
{
    uint    sum;
    ulong   count;
    uint    seed;
    ubyte   func;

    GC.disable;
    getopt(args, config.passThrough, "c|count", &count, "f|func", &func, "s|seed", &seed);
    static const funcs = [&decimalLength9_0, &decimalLength9_1, &decimalLength9_2, &decimalLength9_3, &decimalLength9_4, &decimalLength9_5];

    rndGen.seed(seed);
    foreach (const ulong i; 0 .. count)
    {
        sum += funcs[func](rndGen.front());
        rndGen.popFront();
    }
    writeln(sum);
}
---

Could you share your benchmarking method maybe ?
February 27, 2020
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
> So after reading the translation of RYU I was interested to see if the decimalLength() function can be written to be faster, as it cascades up to 8 CMP.

Perhaps you could try something like this.

<code>
int decimalDigitLength(ulong n) {
	if (n < 10000)
		if (n < 100)
			return n < 10 ? 1 : 2;
		else
			return n < 1000 ? 3 : 4;
	else
		if (n < 100000000)
			if (n < 1000000)
				return n < 100000 ? 5 : 6;
			else
				return n < 10000000 ? 7 : 8;
		else
			if (n < 1000000000000)
				if (n < 10000000000)
					return n < 1000000000 ? 9 : 10;
				else
					return n < 100000000000 ? 11 : 12;
			else
				if (n < 10000000000000000)
					if (n < 100000000000000)
						return n < 10000000000000 ? 13 : 14;
					else
						return n < 1000000000000000 ? 15 : 16;
				else
					if (n < 1000000000000000000)
						return n < 100000000000000000 ? 17 : 18;
					else
						return n < 10000000000000000000 ? 19 : 20;
}										
</code>

This uses at most 6 compares for any 64 bit number and only 3 for the most common small numbers less than 10000.

I was glad to see that with ldc at run.dlang.io, using the -O3 optimization, I could change the function signature to match yours and the compiler eliminated all the unreachable dead code for larger values. The compiler produced the following assembly:

<code>
	.section	.text.ubyte onlineapp.decimalLength9(uint),"axG",@progbits,ubyte onlineapp.decimalLength9(uint),comdat
	.globl	ubyte onlineapp.decimalLength9(uint)
	.p2align	4, 0x90
	.type	ubyte onlineapp.decimalLength9(uint),@function
ubyte onlineapp.decimalLength9(uint):
	.cfi_startproc
	cmpl	$9999, %edi
	ja	.LBB1_5
	cmpl	$99, %edi
	ja	.LBB1_4
	cmpl	$10, %edi
	movb	$2, %al
	sbbb	$0, %al
	retq
.LBB1_5:
	cmpl	$99999999, %edi
	ja	.LBB1_9
	cmpl	$999999, %edi
	ja	.LBB1_8
	cmpl	$100000, %edi
	movb	$6, %al
	sbbb	$0, %al
	retq
.LBB1_4:
	cmpl	$1000, %edi
	movb	$4, %al
	sbbb	$0, %al
	retq
.LBB1_9:
	cmpl	$1000000000, %edi
	movb	$10, %al
	sbbb	$0, %al
	retq
.LBB1_8:
	cmpl	$10000000, %edi
	movb	$8, %al
	sbbb	$0, %al
	retq
.Lfunc_end1:
	.size	ubyte onlineapp.decimalLength9(uint), .Lfunc_end1-ubyte onlineapp.decimalLength9(uint)
	.cfi_endproc
</code>

for the same body with signature ubyte decimalLength9(uint n).

This may be faster than your sequential comparison function depending upon the distribution of numbers. In real applications, small numbers are far more common so the reduced number of compares for those values should be beneficial in most cases.

February 27, 2020
On Thursday, 27 February 2020 at 09:33:28 UTC, Dennis Cote wrote:
> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
>> So after reading the translation of RYU I was interested to see if the decimalLength() function can be written to be faster, as it cascades up to 8 CMP.
>
> [...]

Sorry but no. I think that you have missed how this has changed since the first message.
1. the way it was tested initially was wrong because LLVM was optimizing some stuff in some tests and not others, due to literal constants.
2. Apparently there would be a branchless version that's fast when testing with unbiased input (to be verified)

this version is:

---
ubyte decimalLength9_4(const uint v) pure nothrow
{
    return 1 +  (v >= 10) +
                (v >= 100) +
                (v >= 1000) +
                (v >= 10000) +
                (v >= 100000) +
                (v >= 1000000) +
                (v >= 10000000) +
                (v >= 100000000) ;
}
---

but I cannot see the improvement when I use time on the test program with 100000000 calls fed with random numbers.

see https://forum.dlang.org/post/ctidwrnxvwwkouprjszw@forum.dlang.org for the latest evolution of the discussion.
February 27, 2020
On Thursday, 27 February 2020 at 09:41:20 UTC, Basile B. wrote:
> On Thursday, 27 February 2020 at 09:33:28 UTC, Dennis Cote wrote:
>> [...]
>
> Sorry but no. I think that you have missed how this has changed since the first message.
> 1. the way it was tested initially was wrong because LLVM was optimizing some stuff in some tests and not others, due to literal constants.
> 2. Apparently there would be a branchless version that's fast when testing with unbiased input (to be verified)
>
> this version is:
>
> ---
> ubyte decimalLength9_4(const uint v) pure nothrow
> {
>     return 1 +  (v >= 10) +
>                 (v >= 100) +
>                 (v >= 1000) +
>                 (v >= 10000) +
>                 (v >= 100000) +
>                 (v >= 1000000) +
>                 (v >= 10000000) +
>                 (v >= 100000000) ;
> }
> ---
>
> but I cannot see the improvement when I use time on the test program with 100000000 calls fed with random numbers.
>
> see https://forum.dlang.org/post/ctidwrnxvwwkouprjszw@forum.dlang.org for the latest evolution of the discussion.

maybe just add your version to the test program and run

time ./declen -c100000000 -f0 -s137 // original
time ./declen -c100000000 -f4 -s137 // the 100% branchless
time ./declen -c100000000 -f5 -s137 // the LUT + branchless for the bit num that need attention
time ./declen -c100000000 -f6 -s137 // assumed to be your version

to see if it beats the original. Thing is that I cannot do it right now, but otherwise I will try tomorrow.
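
For reference, a minimal sketch of what that sixth candidate could look like: the compare tree from the quoted post cut down to uint. The name `decimalLength9_6` is hypothetical, chosen to match `-f6`, and it is an assumption that it should return 9 for everything from 100000000 up, like the other candidates, so that the printed sums stay comparable.

```d
import std.stdio : writeln;

// The original cascade from the test program, kept here as the reference.
ubyte decimalLength9_0(const uint v)
{
    if (v >= 100000000) { return 9; }
    if (v >= 10000000) { return 8; }
    if (v >= 1000000) { return 7; }
    if (v >= 100000) { return 6; }
    if (v >= 10000) { return 5; }
    if (v >= 1000) { return 4; }
    if (v >= 100) { return 3; }
    if (v >= 10) { return 2; }
    return 1;
}

// Hypothetical decimalLength9_6: compare tree, at most 4 compares per call.
// Assumption: returns 9 for anything >= 100000000, like the other candidates.
ubyte decimalLength9_6(const uint v) pure nothrow
{
    if (v < 10_000)
    {
        if (v < 100)
            return v < 10 ? 1 : 2;
        return v < 1_000 ? 3 : 4;
    }
    if (v < 100_000_000)
    {
        if (v < 1_000_000)
            return v < 100_000 ? 5 : 6;
        return v < 10_000_000 ? 7 : 8;
    }
    return 9;
}

void main()
{
    // Compare both functions just below and at every power of ten up to 10^9,
    // plus the two extremes of the uint range.
    uint mismatches;
    for (ulong p = 10; p <= 1_000_000_000; p *= 10)
    {
        foreach (const uint v; [cast(uint) p - 1, cast(uint) p])
            mismatches += decimalLength9_6(v) != decimalLength9_0(v);
    }
    mismatches += decimalLength9_6(0) != decimalLength9_0(0);
    mismatches += decimalLength9_6(uint.max) != decimalLength9_0(uint.max);
    writeln(mismatches); // 0 means the two agree on every boundary checked
}
```

To run it from the test program, append `&decimalLength9_6` to the `funcs` array so that `-f6` selects it.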
February 27, 2020
On Wednesday, 26 February 2020 at 22:07:30 UTC, Johan wrote:
> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
>> [...]
>
> Hi Basile,
>   I recently saw this presentation: https://www.youtube.com/watch?v=Czr5dBfs72U

Andrei gave a talk about this too a few years ago.

> It has some ideas that may help you make sure your measurements are good and may give you ideas to find the performance bottleneck or where to optimize.
> llvm-mca is featured on godbolt.org: https://mca.godbolt.org/z/YWp3yv
>
> cheers,
>   Johan

February 27, 2020
On Thursday, 27 February 2020 at 08:52:09 UTC, Basile B. wrote:
> On Thursday, 27 February 2020 at 04:44:56 UTC, Basile B. wrote:
>> On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal wrote:
>>>>
>>>> Maybe you talked about another implementation of decimalLength9 ?
>>>
>>> Yes.  It's one I wrote after I saw your post. Pseudo-code here:
>>>
>>>   auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 100) ... }
>>>
>>> Using ldc to target an x86 with the above yields a series of cmpl, seta instruction pairs in the function body followed by a summing and a return.  No branching.
>>>
>>> Let me know if the above is unclear or insufficient.
>>
>> No thanks, it's crystal clear now.
>
> although I don't see the performance gain.
> snip

> ubyte decimalLength9_0(const uint v)
> {
>     if (v >= 100000000) { return 9; }
>     if (v >= 10000000) { return 8; }
>     if (v >= 1000000) { return 7; }
>     if (v >= 100000) { return 6; }
>     if (v >= 10000) { return 5; }
>     if (v >= 1000) { return 4; }
>     if (v >= 100) { return 3; }
>     if (v >= 10) { return 2; }
>     return 1;
> }
>
> snip

> Could you share your benchmarking method maybe ?

OK. I'm working with equi-probable digits.  You're working with equi-probable values.  The use-case may be either of these or, more likely, a third as yet unknown/unspecified distribution.

The predictability of the test input sequence clearly influences the performance of the branching implementations.  The uniform random input is highly predictable wrt digits. (there are *many* more high digit numbers than low digit numbers)

Take the implementation above as an example.  If, as in the random number (equi-probable value) scenario, your inputs skew *strongly* toward a higher number of digits, then the code should perform very well.

If you believe the function will see uniform random values as input, then testing with uniform random inputs is the way to go.  If you believe some digits distribution is what will be seen, then you should alter your inputs to match.  (testing uniform random in addition to the supposed target distribution is almost always a good idea for sensitivity analysis).

My testing methodology: To obtain an equi-digit distribution, I fill an array with N 1-digit values, followed by N 2-digit values, ... followed by N 9-digit values.  I use N == 1000 and loop over the total array until I've presented about 1 billion values to the function under test.

I test against that equi-digit array before shuffling (this favors branching implementations) and then again after shuffling (this favors branchless).

If you wish to work with equi-digit distributions I'd prefer that you implement the functionality anew.  This is some small duplicated effort but will help guard against my having screwed up even that simple task (less than 6 LOC IIRC).

I will post my code if there is any meaningful difference in your subsequent results.
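
A minimal sketch of the equi-digit methodology described above, written independently. It is not the code behind the 2.4X and 8.5% figures; the names `equiDigitInputs`, `timeIt`, `cascade` and `branchless`, the seed, and the pass count are all assumptions made for illustration.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.random : Random, randomShuffle, uniform;
import std.stdio : writefln;

// High-to-low cascade of early exits (same body as decimalLength9_0).
ubyte cascade(const uint v) pure nothrow
{
    if (v >= 100000000) { return 9; }
    if (v >= 10000000) { return 8; }
    if (v >= 1000000) { return 7; }
    if (v >= 100000) { return 6; }
    if (v >= 10000) { return 5; }
    if (v >= 1000) { return 4; }
    if (v >= 100) { return 3; }
    if (v >= 10) { return 2; }
    return 1;
}

// Branchless sum of comparisons.
ubyte branchless(const uint v) pure nothrow
{
    return cast(ubyte)(1 + (v >= 10) + (v >= 100) + (v >= 1000) + (v >= 10000)
        + (v >= 100000) + (v >= 1000000) + (v >= 10000000) + (v >= 100000000));
}

/// n values of each decimal length 1 through 9, grouped by length (unshuffled).
uint[] equiDigitInputs(size_t n, ref Random rng)
{
    uint[] inputs;
    inputs.reserve(9 * n);
    uint lo = 0;   // 0 counts as a 1-digit value
    uint hi = 10;  // exclusive upper bound for the current length
    foreach (digits; 1 .. 10)
    {
        foreach (_; 0 .. n)
            inputs ~= uniform(lo, hi, rng); // uniform in [lo, hi)
        lo = hi;
        if (digits < 9)
            hi *= 10;
    }
    return inputs;
}

/// Milliseconds needed to run `passes` passes of `fun` over `inputs`.
long timeIt(alias fun)(const uint[] inputs, size_t passes, ref ulong sink)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. passes)
        foreach (v; inputs)
            sink += fun(v);
    sw.stop();
    return sw.peek.total!"msecs";
}

void main()
{
    auto rng = Random(137);            // arbitrary fixed seed
    auto inputs = equiDigitInputs(1000, rng);
    enum size_t passes = 100_000;      // 9000 * 100_000 = 900 million calls
    ulong sink;                        // printed below so the calls stay live

    foreach (shuffled; [false, true])
    {
        if (shuffled)
            randomShuffle(inputs, rng);
        writefln("shuffled=%s cascade=%d ms branchless=%d ms", shuffled,
                 timeIt!cascade(inputs, passes, sink),
                 timeIt!branchless(inputs, passes, sink));
    }
    writefln("checksum: %d", sink);
}
```

Timings will depend on the compiler flags and the CPU, and an optimizer may inline or vectorize the inner loop differently for each function, so treat the output as a starting point rather than a verdict.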

February 27, 2020
On Thursday, 27 February 2020 at 14:12:35 UTC, Basile B. wrote:
> On Wednesday, 26 February 2020 at 22:07:30 UTC, Johan wrote:
>> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
>>> [...]
>>
>> Hi Basile,
>>   I recently saw this presentation: https://www.youtube.com/watch?v=Czr5dBfs72U
>
> Andrei gave a talk about this too a few years ago.
>
>> It has some ideas that may help you make sure your measurements are good and may give you ideas to find the performance bottleneck or where to optimize.
>> llvm-mca is featured on godbolt.org: https://mca.godbolt.org/z/YWp3yv
>>
>> cheers,
>>   Johan

https://www.youtube.com/watch?v=Qq_WaiwzOtI