Counting bits in a ulong
H. S. Teoh
June 03, 2020
So recently I needed to implement a function to count the number of 1
bits in a ulong inside a (very) hot inner loop.  There's of course the
naïve method (`val` is the input ulong):

        uint count;
        foreach (i; 0 .. 64)
        {
            if (val & (1L << i))
                count++;
        }
        return count;

Knowing that this code is inside a hotspot, I wrote a parallel bit counter based on the idea from [1]:

        // Parallel bit count.
        // We count upper/lower halves separately, interspersed to maximize
        // hyperthreading. >:-)
        uint v1 = cast(uint)(val >> 32);
        uint v2 = cast(uint)val & 0xFFFF_FFFF;

        v1 = v1 - ((v1 >> 1) & 0x5555_5555);
        v2 = v2 - ((v2 >> 1) & 0x5555_5555);
        v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
        v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
        v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
        v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
        v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
        v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
        v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
        v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);

        return cast(ulong)v1 + cast(ulong)v2;

[1] https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

The basic idea behind this trick is to add bits in pairs, then in 4's, then in 8's, and so on, inside the ulong itself.  I furthermore split it into upper and lower halves and interleaved the two halves to cut the data dependency between instructions and (hopefully) increase the opportunity for hyperthreading / increase out-of-order instruction execution.  This is completely branchless and in theory should maximize throughput. Unfortunately it comes at the cost of high instruction count.

The third alternative is known as Kernighan's trick, which involves a loop that executes exactly as many times as the number of 1 bits:

        uint count;
        for (count = 0; val; count++)
        {
            val &= val - 1;
        }
        return count;

(How it does this is left as an exercise for the reader. It's really quite clever.)  This comes with the advantage of a very low instruction count, but it does have a branch that isn't easily predictable, so actual execution time is marginally slower.


I did some measurements with real data (not just a micro benchmark, this is the actual algorithm with the bitcount function in its hotspot), and as expected the naïve bitcount performed the worst, about 2x slower. However, between Kernighan and the parallel bit counter, the results depend on the kind of input given.  For program inputs where there is a large number of 1's in the average ulong to be counted, the parallel bit counter performs better. Here's one example of such an input case:

	Naïve:
	real	0m4.159s
	user	0m4.172s
	sys	0m0.024s

	Kernighan:
	real	0m2.114s
	user	0m2.129s
	sys	0m0.020s

	Parallel:
	real	0m2.024s
	user	0m2.039s
	sys	0m0.028s

As you can see, the parallel count is only slightly (albeit consistently) faster than Kernighan.

However, when the input tends to generate ulongs with a low saturation of 1's, Kernighan wins out by quite a large margin:

	Naïve:
	real	0m5.661s
	user	0m5.706s
	sys	0m0.054s

	Kernighan:
	real	0m3.080s
	user	0m3.123s
	sys	0m0.058s

	Parallel:
	real	0m3.805s
	user	0m3.844s
	sys	0m0.056s

This illustrates how optimization is often input-dependent: what works well for one kind of input may not work so well for another. And even an overall better choice (in this case Kernighan's trick) will perform less well in some niche cases.

This goes to reinforce the point Andrei recently made, that sometimes if you optimize for a micro-benchmark, you can end up with code that looks good on the benchmark but not so good on real-world data -- you end up optimizing for the benchmark rather than for the real-world use case. Real-world optimization isn't so straightforward as minimizing a single number; it's often a multi-dimensional problem. :-P


T

-- 
Amateurs built the Ark; professionals built the Titanic.
June 04, 2020
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> So recently I needed to implement a function to count the number of 1
> bits in a ulong inside a (very) hot inner loop.  There's of course the
> naïve method (`val` is the input ulong):
[...]
> This illustrates how optimization is often input-dependent, and what works well in one case may work not so well in another case. And even an overall better choice (in this case Kernighan's trick) will in some niche cases perform less well.

Interesting stuff. Did you compare your implementations with the popcnt function in core.bitop? There could be an improvement to be had there.
June 04, 2020
H. S. Teoh wrote:

> The third alternative is known as Kernighan's trick, which involves a
> loop that executes exactly as many times as the number of 1 bits:
>
>         uint count;
>         for (count = 0; val; count++)
>         {
>             val &= val - 1;
>         }
>         return count;
>
> (How it does this is left as an exercise for the reader. It's really
> quite clever.)  This comes with the advantage of a very low instruction
> count, but it does have a branch that isn't easily predictable, so
> actual execution time is marginally slower.

actually, that branch has a very good prediction rate. it will be predicted correctly most of the time, and when the predictor fails, it mostly doesn't matter, because we're done with the loop. coincidentally, this is what your benchmarking results demonstrate. ;-)
June 04, 2020
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> So recently I needed to implement a function to count the number of 1
> bits in a ulong inside a (very) hot inner loop.  There's of course the
> naïve method (`val` is the input ulong):
>
> [...]

You should try to use the popcnt instruction of the CPU (in asm or with intrinsics). It's at least 3 or 4 orders of magnitude (i.e. between 1000 and 10000 times) faster than any of these tricks. I made the change last year in my work codebase since we've left the SPARC servers (they had a very slow implementation of popcount).
All modern CPUs have the instruction.
June 04, 2020
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> I did some measurements with real data (not just a micro benchmark, this is the actual algorithm with the bitcount function in its hotspot), and as expected the naïve bitcount performed the worst, about 2x slower. However, between Kernighan and the parallel bit counter, the results depend on the kind of input given.  For program inputs where there is a large number of 1's in the average ulong to be counted, the parallel bit counter performs better. Here's one example of such an input case:
>
> 	Naïve:
> 	real	0m4.159s
> 	user	0m4.172s
> 	sys	0m0.024s
>
> 	Kernighan:
> 	real	0m2.114s
> 	user	0m2.129s
> 	sys	0m0.020s
>
> 	Parallel:
> 	real	0m2.024s
> 	user	0m2.039s
> 	sys	0m0.028s
>
> As you can see, the advantage of the parallel count is only slightly (albeit consistently) better than Kernighan.


Did you run the benchmark on different CPU architectures ?

I recently benchmarked code which was performing a calculation one of 2 ways.
1st way was to calculate for real, the other used look up tables.
I ran the test on an i7 haswell and an i5 skylake.
The i5 was performing faster with look up tables, the i7 performance was worse using look up tables.
The performance impact was similar to your 1st benchmark Parallel vs Kernighan.
June 04, 2020
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> So recently I needed to implement a function to count the number of 1
> bits in a ulong inside a (very) hot inner loop.  There's of course the
> naïve method (`val` is the input ulong):
>
>         uint count;
>         foreach (i; 0 .. 64)
>         {
>             if (val & (1L << i))
>                 count++;
>         }
>         return count;
>
> Knowing that this code is inside a hotspot, I wrote a parallel bit counter based on the idea from [1]:
>
>         // Parallel bit count.
>         // We count upper/lower halves separately, interspersed to maximize
>         // hyperthreading. >:-)
>         uint v1 = cast(uint)(val >> 32);
>         uint v2 = cast(uint)val & 0xFFFF_FFFF;
>
>         v1 = v1 - ((v1 >> 1) & 0x5555_5555);
>         v2 = v2 - ((v2 >> 1) & 0x5555_5555);
>         v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
>         v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
>         v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
>         v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
>         v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
>         v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
>         v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
>         v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);
wikipedia recommends this (with x a 64-bit value and the usual masks m1 = 0x5555_5555_5555_5555, m2 = 0x3333_3333_3333_3333, m4 = 0x0F0F_0F0F_0F0F_0F0F):
    x -= (x >> 1) & m1;             //put count of each 2 bits into those 2 bits
    x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits
    x = (x + (x >> 4)) & m4;        //put count of each 8 bits into those 8 bits
    x += x >>  8;  //put count of each 16 bits into their lowest 8 bits
    x += x >> 16;  //put count of each 32 bits into their lowest 8 bits
    x += x >> 32;  //put count of each 64 bits into their lowest 8 bits
    return x & 0x7f;
so in the last 4 lines you do too much work.
better use the intrinsics: popcnt(val>>32)+popcnt(val & uint.max)

June 04, 2020
On Thu, Jun 04, 2020 at 06:40:25AM -0700, H. S. Teoh via Digitalmars-d wrote:
> On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote: [...]
> > Interesting stuff. Did you compare your implementations with the popcnt function in core.bitop? There could be a potential improvement here.
> 
> On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
> > You should try to use the popcnt instruction of the cpu (in asm or
> > with intrinsics). It's at least 3 or 4 order of magnitude (i.e.
> > between 1000 and 10000 times) faster than any of these tricks.
[...]
> As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the
> instruction. :-(
[...]

Actually, I was wrong, my CPU *does* have this instruction, but I needed to pass the right -mcpu flag to LDC before it will emit it.  After turning it on, I got a huge performance boost; it completely moved the function that calls popcnt out of the hotspot into the background! Now it's other functions that have become the bottleneck.  Cool stuff!
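(For anyone wanting to reproduce this: one way to get LDC to target the host CPU's full instruction set, POPCNT included, is -mcpu=native. The exact program name below is hypothetical; the flags are standard LDC options.)

```shell
# Let LDC emit instructions for the host CPU, including POPCNT
# on CPUs that support it:
ldc2 -O3 -release -mcpu=native myprog.d
```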


T

-- 
What do you mean the Internet isn't filled with subliminal messages? What about all those buttons marked "submit"??
June 04, 2020
On Thursday, 4 June 2020 at 17:38:48 UTC, H. S. Teoh wrote:
> On Thu, Jun 04, 2020 at 06:40:25AM -0700, H. S. Teoh via Digitalmars-d wrote:
>> On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote: [...]
>> > Interesting stuff. Did you compare your implementations with the popcnt function in core.bitop? There could be a potential improvement here.
>> 
>> On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
>> > You should try to use the popcnt instruction of the cpu (in asm or
>> > with intrinsics). It's at least 3 or 4 order of magnitude (i.e.
>> > between 1000 and 10000 times) faster than any of these tricks.
> [...]
>> As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the
>> instruction. :-(
> [...]
>
> Actually, I was wrong, my CPU *does* have this instruction, but I needed to pass the right -mcpu flag to LDC before it will emit it.  After turning it on, I got a huge performance boost; it completely moved the function that calls popcnt out of the hotspot into the background! Now it's other functions that have become the bottleneck.  Cool stuff!
>
Cool, I just wanted to comment as I have myself a Phenom II X6 1090T and was wondering.
Btw, if you want to use the Kernighan function for values with a lot of set bits, count the inverted value:

That's what I had in my code (gcc C)

DEFINE_INLINE  int PopCountFewClr(uint64_t w)
{
#ifdef __POPCNT__
  return __builtin_popcountll(w);
#else
  return CHAR_BIT*sizeof w - PopCountFewSet(~w);
#endif
}

June 05, 2020
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> So recently I needed to implement a function to count the number of 1 bits in a ulong inside a (very) hot inner loop.

This made me wonder how well a lookup table approach would compare to these logical methods (and it was an excuse to blow off work a bit to fiddle with D).

The popcnt instruction is going to blow away all of these, of course, but it still seems like a worthwhile exercise, if only to get a feel for how purely in-register logic performance compares to cache access performance on modern "big" CPUs.

ttk@kirov:/home/ttk/prog/D$ ./popcount
Starting tests, which take about 10 seconds each
Naive:         5596740 iter/sec
Parallel:      166352970 iter/sec
Kernighan:     47668800 iter/sec
Lookup 8-bit:  467161540 iter/sec
Lookup 16-bit: 826179570 iter/sec

It surprised me a bit that the 16-bit lookup wasn't closer to 2.0x the performance of the 8-bit lookup, since both lookup tables fit well within the L1 cache.

Source is here .. a bit large, since I manually unrolled loops 100x to minimize the impact of looping logic (particularly the now() call):

http://ciar.org/h/popcount.d

I ran this on an Intel Core i7-9750M @ 2.6GHz, with 192KB L1 / 1.5MB L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O popcount.d" on Slackware-current.

June 06, 2020
On Friday, 5 June 2020 at 19:12:43 UTC, ttk wrote:
> Source is here .. a bit large, since I manually unrolled loops 100x to minimize the impact of looping logic (particularly the now() call):
>
> http://ciar.org/h/popcount.d

Jesus. Shouldn't you be able to generate that instead of copy-pasting? It's D, after all.