Thread overview | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
June 03, 2020 Counting bits in a ulong | ||||
---|---|---|---|---|
| ||||
So recently I needed to implement a function to count the number of 1 bits in a ulong inside a (very) hot inner loop. There's of course the naïve method (`val` is the input ulong): uint count; foreach (i; 0 .. 64) { if (val & (1L << i)) count++; } return count; Knowing that this code is inside a hotspot, I wrote a parallel bit counter based on the idea from [1]: // Parallel bit count. // We count upper/lower halves separately, interspersed to maximize // hyperthreading. >:-) uint v1 = cast(uint)(val >> 32); uint v2 = cast(uint)val & 0xFFFF_FFFF; v1 = v1 - ((v1 >> 1) & 0x5555_5555); v2 = v2 - ((v2 >> 1) & 0x5555_5555); v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333); v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333); v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F); v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F); v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF); v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF); v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF); v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF); return cast(ulong)v1 + cast(ulong)v2; [1] https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel The basic idea behind this trick is to add bits in pairs, then in 4's, then in 8's, and so on, inside the ulong itself. I furthermore split it into upper and lower halves and interleaved the two halves to cut the data dependency between instructions and (hopefully) increase the opportunity for hyperthreading / increase out-of-order instruction execution. This is completely branchless and in theory should maximize throughput. Unfortunately it comes at the cost of high instruction count. The third alternative is known as Kernighan's trick, which involves a loop that executes exactly as many times as the number of 1 bits: uint count; for (count = 0; val; count++) { val &= val - 1; } return count; (How it does this is left as an exercise for the reader. It's really quite clever.) This comes with the advantage of a very low instruction count, but it does have a branch that isn't easily predictable, so actual execution time is marginally slower. I did some measurements with real data (not just a micro benchmark, this is the actual algorithm with the bitcount function in its hotspot), and as expected the naïve bitcount performed the worst, about 2x slower. However, between Kernighan and the parallel bit counter, the results depend on the kind of input given. For program inputs where there is a large number of 1's in the average ulong to be counted, the parallel bit counter performs better. Here's one example of such an input case: Naïve: real 0m4.159s user 0m4.172s sys 0m0.024s Kernighan: real 0m2.114s user 0m2.129s sys 0m0.020s Parallel: real 0m2.024s user 0m2.039s sys 0m0.028s As you can see, the advantage of the parallel count is only slightly (albeit consistently) better than Kernighan. However, when the input tends to generate ulongs with a low saturation of 1's, Kernighan wins out by quite a large margin: Naïve: real 0m5.661s user 0m5.706s sys 0m0.054s Kernighan: real 0m3.080s user 0m3.123s sys 0m0.058s Parallel: real 0m3.805s user 0m3.844s sys 0m0.056s This illustrates how optimization is often input-dependent, and what works well in one case may work not so well in another case. And even an overall better choice (in this case Kernighan's trick) will in some niche cases perform less well. This goes to reinforce the point Andrei recently made, that sometimes if you optimize for a micro-benchmark, you can end up with code that looks good on the benchmark but not so good on real-world data -- you end up optimizing for the benchmark rather than for the real-world use case. Real-world optimization isn't so straightforward as minimizing a single number; it's often a multi-dimensional problem. :-P T -- Amateurs built the Ark; professionals built the Titanic. |
June 04, 2020 Re: Counting bits in a ulong | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote: > So recently I needed to implement a function to count the number of 1 > bits in a ulong inside a (very) hot inner loop. There's of course the > naïve method (`val` is the input ulong): [...] > This illustrates how optimization is often input-dependent, and what works well in one case may work not so well in another case. And even an overall better choice (in this case Kernighan's trick) will in some niche cases perform less well. Interesting stuff. Did you compare your implementations with the popcnt function in core.bitop? There could be a potential improvement here. |
June 04, 2020 Re: Counting bits in a ulong | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | H. S. Teoh wrote:
> The third alternative is known as Kernighan's trick, which involves a
> loop that executes exactly as many times as the number of 1 bits:
>
> uint count;
> for (count = 0; val; count++)
> {
> val &= val - 1;
> }
> return count;
>
> (How it does this is left as an exercise for the reader. It's really
> quite clever.) This comes with the advantage of a very low instruction
> count, but it does have a branch that isn't easily predictable, so
> actual execution time is marginally slower.
actually, that branch has very good prediction rate. it should be predicted most of the time, and when predictor failed, it is mostly doesn't matter, because we're done with the loop. coincidentally, this is what your benchmarking results demonstrates. ;-)
|
June 04, 2020 Re: Counting bits in a ulong | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> So recently I needed to implement a function to count the number of 1
> bits in a ulong inside a (very) hot inner loop. There's of course the
> naïve method (`val` is the input ulong):
>
> [...]
You should try to use the popcnt instruction of the cpu (in asm or with intrinsics). It's at least 3 or 4 order of magnitude (i.e. between 1000 and 10000 times) faster than any of these tricks. I made the change last year in my work codebase since we've left the SPARC servers (they had a very slow implementation of popcount).
All modern CPU's have the instruction.
|
June 04, 2020 Re: Counting bits in a ulong | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> I did some measurements with real data (not just a micro benchmark, this is the actual algorithm with the bitcount function in its hotspot), and as expected the naïve bitcount performed the worst, about 2x slower. However, between Kernighan and the parallel bit counter, the results depend on the kind of input given. For program inputs where there is a large number of 1's in the average ulong to be counted, the parallel bit counter performs better. Here's one example of such an input case:
>
> Naïve:
> real 0m4.159s
> user 0m4.172s
> sys 0m0.024s
>
> Kernighan:
> real 0m2.114s
> user 0m2.129s
> sys 0m0.020s
>
> Parallel:
> real 0m2.024s
> user 0m2.039s
> sys 0m0.028s
>
> As you can see, the advantage of the parallel count is only slightly (albeit consistently) better than Kernighan.
Did you run the benchmark on different CPU architectures ?
I recently benchmarked code which was performing a calculation one of 2 ways.
1st way was to calculate for real, the other used look up tables.
I ran the test on an i7 haswell and an i5 skylake.
The i5 was performing faster with look up tables, the i7 performance was worse using look up tables.
The performance impact was similar to your 1st benchmark Parallel vs Kernighan.
|
June 04, 2020 Re: Counting bits in a ulong | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> So recently I needed to implement a function to count the number of 1
> bits in a ulong inside a (very) hot inner loop. There's of course the
> naïve method (`val` is the input ulong):
>
> uint count;
> foreach (i; 0 .. 64)
> {
> if (val & (1L << i))
> count++;
> }
> return count;
>
> Knowing that this code is inside a hotspot, I wrote a parallel bit counter based on the idea from [1]:
>
> // Parallel bit count.
> // We count upper/lower halves separately, interspersed to maximize
> // hyperthreading. >:-)
> uint v1 = cast(uint)(val >> 32);
> uint v2 = cast(uint)val & 0xFFFF_FFFF;
>
> v1 = v1 - ((v1 >> 1) & 0x5555_5555);
> v2 = v2 - ((v2 >> 1) & 0x5555_5555);
> v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
> v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
> v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
> v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
> v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
> v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
> v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
> v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);
wikipedia recommends this:
x -= (x >> 1) & m1;
x = (x & m2) + ((x >> 2) & m2);
x = (x + (x >> 4)) & m4;
x += x >> 8; //put count of each 16 bits into their lowest 8 bits
x += x >> 16; //put count of each 32 bits into their lowest 8 bits
x += x >> 32; //put count of each 64 bits into their lowest 8 bits
return x & 0x7f;
so in the last 4 lines you do too much work.
better use the intrinsics: popcnt(val>>32)+popcnt(val & uint.max)
|
June 04, 2020 Re: Counting bits in a ulong | ||||
---|---|---|---|---|
| ||||
On Thu, Jun 04, 2020 at 06:40:25AM -0700, H. S. Teoh via Digitalmars-d wrote: > On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote: [...] > > Interesting stuff. Did you compare your implementations with the popcnt function in core.bitop? There could be a potential improvement here. > > On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d wrote: [...] > > You should try to use the popcnt instruction of the cpu (in asm or > > with intrinsics). It's at least 3 or 4 order of magnitude (i.e. > > between 1000 and 10000 times) faster than any of these tricks. [...] > As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the > instruction. :-( [...] Actually, I was wrong, my CPU *does* have this instruction, but I needed to pass the right -mcpu flag to LDC before it will emit it. After turning it on, I got a huge performance boost; it completely moved the function that calls popcnt out of the hotspot into the background! Now it's other functions that have become the bottleneck. Cool stuff! T -- What do you mean the Internet isn't filled with subliminal messages? What about all those buttons marked "submit"?? |
June 04, 2020 Re: Counting bits in a ulong | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On Thursday, 4 June 2020 at 17:38:48 UTC, H. S. Teoh wrote: > On Thu, Jun 04, 2020 at 06:40:25AM -0700, H. S. Teoh via Digitalmars-d wrote: >> On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote: [...] >> > Interesting stuff. Did you compare your implementations with the popcnt function in core.bitop? There could be a potential improvement here. >> >> On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d wrote: [...] >> > You should try to use the popcnt instruction of the cpu (in asm or >> > with intrinsics). It's at least 3 or 4 order of magnitude (i.e. >> > between 1000 and 10000 times) faster than any of these tricks. > [...] >> As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the >> instruction. :-( > [...] > > Actually, I was wrong, my CPU *does* have this instruction, but I needed to pass the right -mcpu flag to LDC before it will emit it. After turning it on, I got a huge performance boost; it completely moved the function that calls popcnt out of the hotspot into the background! Now it's other functions that have become the bottleneck. Cool stuff! > Cool, I just wanted to comment as I have myself a Phenom II X6 1090T and was wondering. Btw, if you want to use the Kernighan function for values with a lot of set bits, count the inverted value: That's what I had in my code (gcc C) DEFINE_INLINE int PopCountFewClr(uint64_t w) { #ifdef __POPCNT__ return __builtin_popcountll(w); #else return CHAR_BIT*sizeof w - PopCountFewSet(~w); #endif } |
June 05, 2020 Re: Counting bits in a ulong | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote: > So recently I needed to implement a function to count the number of 1 bits in a ulong inside a (very) hot inner loop. This made me wonder how well a lookup table approach would compare to these logical methods (and it was an excuse to blow off work a bit to fiddle with D). The popcnt instruction is going to blow away all of these, of course, but it still seems like a worthwhile exercise, if only to get a feel for how purely in-register logic performance compares to cache access performance on modern "big" CPUs. ttk@kirov:/home/ttk/prog/D$ ./popcount Starting tests, which take about 10 seconds each Naive: 5596740 iter/sec Parallel: 166352970 iter/sec Kernighan: 47668800 iter/sec Lookup 8-bit: 467161540 iter/sec Lookup 16-bit: 826179570 iter/sec It surprised me a bit that the 16-bit lookup wasn't closer to 2.0x the performance of the 8-bit lookup, since both lookup tables fit well within the L1 cache. Source is here .. a bit large, since I manually unrolled loops 100x to minimize the impact of looping logic (particularly the now() call): http://ciar.org/h/popcount.d I ran this on an Intel Core i7-9750M @ 2.6GHz, with 192KB L1 / 1.5MB L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O popcount.d" on Slackware-current. |
June 06, 2020 Re: Counting bits in a ulong | ||||
---|---|---|---|---|
| ||||
Posted in reply to ttk | On Friday, 5 June 2020 at 19:12:43 UTC, ttk wrote:
> Source is here .. a bit large, since I manually unrolled loops 100x to minimize the impact of looping logic (particularly the now() call):
>
> http://ciar.org/h/popcount.d
Jesus. Shouldn't you be able to generate that instead of
copy-pasting? It's D, after all.
|
Copyright © 1999-2021 by the D Language Foundation