June 04, 2020 Re: Counting bits in a ulong | ||||
---|---|---|---|---|
| ||||
On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote: [...] > Interesting stuff. Did you compare your implementations with the popcnt function in core.bitop? There could be a potential improvement here. On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d wrote: [...] > You should try to use the popcnt instruction of the cpu (in asm or > with intrinsics). It's at least 3 or 4 order of magnitude (i.e. > between 1000 and 10000 times) faster than any of these tricks. I made > the change last year in my work codebase since we've left the SPARC > servers (they had a very slow implementation of popcount). > All modern CPU's have the instruction. Haha, I learn something new every day. Didn't even know there was such a CPU instruction as popcnt. :-) But what's more interesting is the result of using core.bitop.popcnt: as it turns out, it runs faster than Kernighan when the input ulongs have high bit count, but *slower* than Kernighan when the inputs have low bit density. This was very puzzling to me, since a CPU instruction is supposed to be orders of magnitude faster, as Patrick said, so I looked at the disassembly. As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the instruction. :-( So the LDC intrinsic reverted to a software implementation, which is essentially the parallel bitcount I wrote, except better. Evidently, I miscalculated the impact of data dependency between instructions, or more likely, the AMD CPU doesn't have hyperthreading so splitting the ulong into upper/lower halves did more harm than good for the calculation. But nonetheless, even the better implementation of parallel bitcount loses out to Kernighan when the input does not have a high bit density. I do have a VPS that runs on an Intel CPU, but apparently also an older one that doesn't support SSE4 (either that, or the hypervisor filtered that out for whatever reason, but I doubt this). So again we see the complex landscape of optimization: not only the optimality depend on input data, it also depends on the hardware. :-) T -- Век живи - век учись. А дураком помрёшь. |
Copyright © 1999-2021 by the D Language Foundation