June 04, 2020
On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote: [...]
> Interesting stuff. Did you compare your implementations with the popcnt function in core.bitop? There could be a potential improvement here.

On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
> You should try to use the popcnt instruction of the cpu (in asm or
> with intrinsics). It's at least 3 or 4 order of magnitude (i.e.
> between 1000 and 10000 times) faster than any of these tricks. I made
> the change last year in my work codebase since we've left the SPARC
> servers (they had a very slow implementation of popcount).
> All modern CPU's have the instruction.

Haha, I learn something new every day. Didn't even know there was such a CPU instruction as popcnt. :-)

But what's more interesting is the result of using core.bitop.popcnt: as it turns out, it runs faster than Kernighan when the input ulongs have high bit count, but *slower* than Kernighan when the inputs have low bit density. This was very puzzling to me, since a CPU instruction is supposed to be orders of magnitude faster, as Patrick said, so I looked at the disassembly.

As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the instruction. :-(  So the LDC intrinsic reverted to a software implementation, which is essentially the parallel bitcount I wrote, except better. Evidently, I miscalculated the impact of data dependency between instructions, or more likely, the AMD CPU doesn't have hyperthreading so splitting the ulong into upper/lower halves did more harm than good for the calculation.

But nonetheless, even the better implementation of parallel bitcount loses out to Kernighan when the input does not have a high bit density.

I do have a VPS that runs on an Intel CPU, but apparently also an older one that doesn't support SSE4 (either that, or the hypervisor filtered that out for whatever reason, but I doubt this).

So again we see the complex landscape of optimization: not only the optimality depend on input data, it also depends on the hardware. :-)


T

-- 
Век живи - век учись. А дураком помрёшь.