June 05, 2020
On Fri, Jun 05, 2020 at 07:12:43PM +0000, ttk via Digitalmars-d wrote:
> On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> > So recently I needed to implement a function to count the number of 1 bits in a ulong inside a (very) hot inner loop.
> 
> This made me wonder how well a lookup table approach would compare to these logical methods (and it was an excuse to blow off work a bit to fiddle with D).

Hooray for having an excuse to play with D! ;-)


> The popcnt instruction is going to blow away all of these, of course, but it still seems like a worthwhile exercise, if only to get a feel for how purely in-register logic performance compares to cache access performance on modern "big" CPUs.
> 
> ttk@kirov:/home/ttk/prog/D$ ./popcount
> Starting tests, which take about 10 seconds each
> Naive:         5596740 iter/sec
> Parallel:      166352970 iter/sec
> Kernighan:     47668800 iter/sec
> Lookup 8-bit:  467161540 iter/sec
> Lookup 16-bit: 826179570 iter/sec
> 
> It surprised me a bit that the 16-bit lookup wasn't closer to 2.0x the performance of the 8-bit lookup, since both lookup tables fit well within the L1 cache.

You might be running into alignment issues. For one thing, why does LOOKUP_1 have 257 elements instead of 256? That last byte is never accessed. Similarly, LOOKUP_2 doesn't need 65537 elements; that last element is redundant.


> Source is here .. a bit large, since I manually unrolled loops 100x to
> minimize the impact of looping logic (particularly the now() call):
> 
> http://ciar.org/h/popcount.d

There's no need to manually unroll the code: just use static foreach, since the code is identical every iteration! :-P
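
For instance, something like this (a sketch, assuming the same `count`/`val`/`STEPPER` setup and a popcount routine like the popcount_naive() from your source):

	static foreach (_; 0 .. 100) {
	    // identical body stamped out 100 times at compile time
	    count = popcount_naive(val);
	    val += STEPPER;
	}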


> I ran this on an Intel Core i7-9750M @ 2.6GHz, with 192KB L1 / 1.5MB L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O popcount.d" on Slackware-current.

I don't trust performance results with dmd, so I tested it on `ldc2 -O3` instead:

	Starting tests, which take about 10 seconds each
	Naive:         25665230 iter/sec
	Parallel:      124692900 iter/sec
	Kernighan:     40680010 iter/sec
	Lookup 8-bit:  1532505390 iter/sec
	Lookup 16-bit: 1690195340 iter/sec

Something is up with the 16-bit lookup. Why is it only barely faster than the 8-bit lookup?

Disassembling shows that the code was *not* duplicated 100 times; apparently LDC's aggressive optimizer noticed that `count` is reassigned every unrolled iteration, so all but the last assignment are no-ops. You need to do something with `count` that isn't immediately killed by the next unrolled iteration; otherwise your painstakingly copy-n-pasted iterations will all be elided by the optimizer. :-P

OR'ing `count` into ACCUMULATOR after every unrolled iteration is probably what you need to do here.
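
Something along these lines, e.g. (a sketch of the unrolled body; XOR or any other cheap combining operation would do just as well):

	static foreach (_; 0 .. 100) {
	    count = popcount_naive(val);  // or whichever variant is under test
	    val += STEPPER;
	    ACCUMULATOR |= count;  // each result is now observable, so no
	                           // iteration can be elided
	}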


T

-- 
"640K ought to be enough" -- Bill G. (allegedly), 1984. "The Internet is not a primary goal for PC usage" -- Bill G., 1995. "Linux has no impact on Microsoft's strategy" -- Bill G., 1999.
June 05, 2020
On Fri, Jun 05, 2020 at 12:45:36PM -0700, H. S. Teoh via Digitalmars-d wrote: [...]
> You might be running into alignment issues. For one thing, why does LOOKUP_1 have 257 elements instead of 256? That last byte is never accessed. Similarly, LOOKUP_2 doesn't need 65537 elements; that last element is redundant.
[...]
> Disassembling shows that the code was *not* duplicated 100 times; apparently LDC's aggressive optimizer noticed that `count` is reassigned every unrolled iteration, so all but the last assignment are no-ops. You need to do something with `count` that isn't immediately killed by the next unrolled iteration; otherwise your painstakingly copy-n-pasted iterations will all be elided by the optimizer. :-P
> 
> OR'ing `count` into ACCUMULATOR after every unrolled iteration is probably what you need to do here.
[...]

So I made the above changes, and here are the new measurements:

	Starting tests, which take about 10 seconds each
	Naive:         25272530 iter/sec
	Parallel:      133259870 iter/sec
	Kernighan:     36447830 iter/sec
	Lookup 8-bit:  133837870 iter/sec
	Lookup 16-bit: 283343300 iter/sec

Now the 16-bit lookup makes a lot more sense! :-D  (I also confirmed via disassembly that the code is indeed unrolled 100 times, and not just elided.)

Interestingly, the parallel version performs almost as well as the 8-bit lookup.
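
For reference, the popcnt instruction mentioned upthread is reachable from D via core.bitop, so a hardware-backed variant could be dropped into the same harness (a sketch; with ldc2 you'd want -mcpu=native or similar so it actually lowers to POPCNT rather than a software fallback):

	import core.bitop : popcnt;

	static foreach (_; 0 .. 100) {
	    count = popcnt(val);  // single instruction on modern x86
	    val += STEPPER;
	    ACCUMULATOR ^= count;
	}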


Here's the fixed code (it's a lot shorter :-P):

------------------------------------------------------
import std.stdio;
import std.datetime.systime;

ubyte [256]   LOOKUP_1; // 8-bit lookup table
ubyte [65536] LOOKUP_2; // 16-bit lookup table
const ulong STEPPER = 0x110D0B0705030201;  // val increment step, to minimize cache effects
ulong ACCUMULATOR = 0;  // consumed in main() so the optimizer can't prove count is dead
double TEST_DURATION = 10.0;

// High-resolution time()-like, copied from VS_Common_D:
double now() {
    double tm = Clock.currStdTime();  // hnsecs since Jan 1, 1 AD
    return (tm / 10000000.0) - 62135596800.0;  // to seconds since the Unix epoch
}

ubyte popcount_naive(ulong val) {
    ubyte count;
    foreach (i; 0 .. 64) if (val & (1L << i)) count++;
    return count;
}

ulong microbench_naive() {
    ulong iter_count = 0;
    ulong val = 0;
    uint  count = 0;
    double elapsed;
    double t0 = now();
    while((elapsed = now() - t0) < TEST_DURATION) {
	// unrolling 100x should be enough to amortize overhead of calling now():
	static foreach (_; 0 .. 100) {
	    foreach (i; 0 .. 64) if (val & (1L << i)) count++;
	    val += STEPPER;
	    ACCUMULATOR ^= count;
	}
	iter_count += 100;
    }
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_parallel() {
    ulong iter_count = 0;
    ulong val = 0;
    uint  count = 0;
    uint  v1;
    uint  v2;
    double elapsed;
    double t0 = now();
    while((elapsed = now() - t0) < TEST_DURATION) {
	// unrolling 100x should be enough to amortize overhead of calling now():
	static foreach (_; 0 .. 100) {
	    // split the 64-bit value into two 32-bit halves:
	    v1 = cast(uint)(val >> 32);
	    v2 = cast(uint)val;  // cast already truncates to the low 32 bits
	    // SWAR reduction: fold adjacent 1-, 2-, 4-, 8-, then 16-bit fields
	    // into per-field counts of set bits:
	    v1 = v1 - ((v1 >> 1) & 0x5555_5555);
	    v2 = v2 - ((v2 >> 1) & 0x5555_5555);
	    v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
	    v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
	    v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
	    v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
	    v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
	    v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
	    v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
	    v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);
	    count += v1 + v2;  // the two half-counts sum to at most 64
	    val += STEPPER;
	    ACCUMULATOR ^= count;
	}

	iter_count += 100;
    }
    ACCUMULATOR += count;
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_kernighan() {
    ulong iter_count = 0;
    ulong val = 0;
    ulong tval = 0;
    uint  count = 0;
    double elapsed;
    double t0 = now();
    while((elapsed = now() - t0) < TEST_DURATION) {
	// unrolling 100x should be enough to amortize overhead of calling now():
	static foreach (_; 0 .. 100) {
	    // Kernighan: val &= val - 1 clears the lowest set bit each pass
	    for (count = 0; val; count++) val &= val - 1;
	    val = tval += STEPPER;
	    ACCUMULATOR ^= count;
	}

	iter_count += 100;
    }
    ACCUMULATOR += count;
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_lookup_8bit() {
    ulong iter_count = 0;
    ulong val = 0;
    ubyte *av = cast(ubyte*) &val;
    uint  count = 0;
    double elapsed;
    double t0 = now();
    while((elapsed = now() - t0) < TEST_DURATION) {
	// unrolling 100x should be enough to amortize overhead of calling now():
	// writeln("top: iter = ", iter_count, ", val = ", val, ", av = ", av, ", av0 = ", av[0], ", av1 = ", av[1], ", av2 = ", av[2], ", av3 = ", av[3], ", av4 = ", av[4], ", av5 = ", av[5], ", av6 = ", av[6], ", av7 = ", av[7]);
	static foreach (_; 0 .. 100) {
	    count = LOOKUP_1[av[0]] + LOOKUP_1[av[1]] + LOOKUP_1[av[2]] + LOOKUP_1[av[3]] + LOOKUP_1[av[4]] + LOOKUP_1[av[5]] + LOOKUP_1[av[6]] + LOOKUP_1[av[7]];
	    val += STEPPER;
	    ACCUMULATOR ^= count;
	}

	iter_count += 100;
    }
    ACCUMULATOR += count;
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

ulong microbench_lookup_16bit() {
    ulong iter_count = 0;
    ulong val = 0;
    ushort *av = cast(ushort*) &val;
    uint  count = 0;
    double elapsed;
    double t0 = now();
    while((elapsed = now() - t0) < TEST_DURATION) {
	// unrolling 100x should be enough to amortize overhead of calling now():
	static foreach (_; 0 .. 100) {
	    count = LOOKUP_2[av[0]] + LOOKUP_2[av[1]] + LOOKUP_2[av[2]] + LOOKUP_2[av[3]];
	    val += STEPPER;
	    ACCUMULATOR ^= count;
	}

	iter_count += 100;
    }
    ACCUMULATOR += count;
    return cast(ulong)(iter_count / elapsed);  // iterations per second
}

int main() {
    // initialize the lookup tables:
    foreach (i; 0 ..   256) LOOKUP_1[i] = popcount_naive(i);
    foreach (i; 0 .. 65536) LOOKUP_2[i] = popcount_naive(i);

    writeln("Starting tests, which take about ", TEST_DURATION, " seconds each");
    writeln("Naive:         ", microbench_naive(),        " iter/sec");
    writeln("Parallel:      ", microbench_parallel(),     " iter/sec");
    writeln("Kernighan:     ", microbench_kernighan(),    " iter/sec");
    writeln("Lookup 8-bit:  ", microbench_lookup_8bit(),  " iter/sec");
    writeln("Lookup 16-bit: ", microbench_lookup_16bit(), " iter/sec");

    // trick optimizer into thinking ACCUMULATOR, and thus count, is important:
    return cast(int)(ACCUMULATOR / STEPPER);
}
------------------------------------------------------


T

-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi