core.bitop.bt not faster than & ?

Dec 17, 2014

Trollgeir

Dec 17, 2014

Adam D. Ruppe

Dec 17, 2014

Trollgeir

Dec 17, 2014

H. S. Teoh

Dec 18, 2014

Dicebot

I've been doing some benchmarking the bt function (http://dlang.org/phobos/core_bitop.html) up against a regular bit & -operator, and I'm a bit confused over the timings. I'd expect the bt function to be up to 32 times faster as I thought it only compared two bits, and not the entire length of bits in the uint. The results I got indicated that there were no difference between that and just a regular & condition. In most cases, the & was actually a tad bit faster! Ex. bt(&element, 16) vs (element & 0b00000000000000001000000000000000) Can anyone confirm this? I used a simple std.datetime benchmark template iterating over a static array[100000], trying both random values for each entry and fixed ones.

On Wednesday, 17 December 2014 at 14:12:16 UTC, Trollgeir wrote: > I'd expect the bt function to be up to 32 times faster as I thought it only compared two bits, and not the entire length of bits in the uint. The processor doesn't work in terms of bits like that - it still needs to look at the whole integer. In fact, according to my (OLD) asm reference, the bt instruction is slower than the and instruction at the cpu level. I think it has to do a wee bit more work, translating the 16 into a mask then moving the result into the flag... then moving the flag back into a register to return the value. (That last step could probably be skipped if you do an if() on it and the compiler optimizes the branch, and the first step might be skipped too if it is a constant, since the compiler can rewrite the instruction. So given that, I'd expect what you saw: no difference when they are optimized to the same thing or when the CPU's stars align right, and & a bit faster when bt isn't optimized) bt() and friends are special instructions for specialized use cases. Probably useful for threading and stuff.

On Wednesday, 17 December 2014 at 14:58:13 UTC, Adam D. Ruppe wrote: > On Wednesday, 17 December 2014 at 14:12:16 UTC, Trollgeir wrote: >> I'd expect the bt function to be up to 32 times faster as I thought it only compared two bits, and not the entire length of bits in the uint. > > The processor doesn't work in terms of bits like that - it still needs to look at the whole integer. In fact, according to my (OLD) asm reference, the bt instruction is slower than the and instruction at the cpu level. > > I think it has to do a wee bit more work, translating the 16 into a mask then moving the result into the flag... then moving the flag back into a register to return the value. (That last step could probably be skipped if you do an if() on it and the compiler optimizes the branch, and the first step might be skipped too if it is a constant, since the compiler can rewrite the instruction. So given that, I'd expect what you saw: no difference when they are optimized to the same thing or when the CPU's stars align right, and & a bit faster when bt isn't optimized) > > bt() and friends are special instructions for specialized use cases. Probably useful for threading and stuff. Thanks for the explanation, I suspected it might work something like that. For my implementation - I have bits shifting to the right every update, and I want to check if it has reached certain markers. Hence, I felt it was really inefficient to check every single bit in the uint when I was only interested in some specific ones. Is an alternative (optimized) version of bt even possible?