October 18, 2011
On 18.10.2011 11:43, Manu wrote:
> On 18 October 2011 12:12, Don <nospam@nospam.com
> <mailto:nospam@nospam.com>> wrote:
>
>     You mean bsr and bsf.
>     Unfortunately, there are some big problems with them. What is bsr(0) ?
>
>
> True ;) .. but that's why the API needs to be defined and standardised.
> On PowerPC it returns 32 (or 64), and the x86 version returns 2 values,
> the position, and also a bool telling you if it was zero or not (useful
> for loop termination)

Even worse -- Intel says that the position value of bsr(0) is undefined. But AMD does define it, they say it's what was in the register before.


> I think all hardware that I've seen is easy to factor into the win32
> intrinsic api.

That would be nice. What do you think it should do for the zero case?
Note that on x86, one possibility is to do a bsr followed by a cmov, to get the PowerPC semantics.
October 19, 2011
Nicely spotted, I didn't realise the intel/amd distinction ;)

Unless I'm mistaken, it is possible for D to return 'out' parameters by
value right? (in additional return registers, no touching the stack?) ..
Assuming that's the case you would surely standardise something more like
the win32 intrinsic rather than one resembling the PPC opcode.
If the function returns a bool that the value was zero or not, then I think
it's fair to say the position is undefined (which supports the intel
assertion).
PPC's approach is more cleanly factored into the win32 model than the other
way around I think, in terms of allowing the optimiser to trim the unused
code. If the intrinsic generates implicit code to produce a bool from the
value, it will surely be trimmed by the optimiser if that result is not
used.

While cmov might work nicely (although I really don't trust that opcode anyway, an intrinsic like bsr shouldn't be producing a hidden branch) on x86 to produce the PPC result, I'm not sure other architectures would have such a simple solution. Again, I think the win32 approach is easier for all architectures to produce and for the optimiser to truncate if the calculated result is unused.

bool bsf/bsr(int value, out int position); // this assumes that position will cleanly return in a second return register...


On 18 October 2011 22:50, Don <nospam@nospam.com> wrote:

> On 18.10.2011 11:43, Manu wrote:
>
>> On 18 October 2011 12:12, Don <nospam@nospam.com <mailto:nospam@nospam.com>> wrote:
>>
>>    You mean bsr and bsf.
>>    Unfortunately, there are some big problems with them. What is bsr(0) ?
>>
>>
>> True ;) .. but that's why the API needs to be defined and standardised.
>> On PowerPC it returns 32 (or 64), and the x86 version returns 2 values,
>> the position, and also a bool telling you if it was zero or not (useful
>> for loop termination)
>>
>
> Even worse -- Intel says that the position value of bsr(0) is undefined. But AMD does define it, they say it's what was in the register before.
>
>
>
>  I think all hardware that I've seen is easy to factor into the win32
>> intrinsic api.
>>
>
> That would be nice. What do you think it should do for the zero case?
> Note that on x86, one possibility is to do a bsr followed by a cmov, to get
> the PowerPC semantics.
>


October 20, 2011
On 19.10.2011 10:13, Manu wrote:
> Nicely spotted, I didn't realise the intel/amd distinction ;)
>
> Unless I'm mistaken, it is possible for D to return 'out' parameters by
> value right? (in additional return registers, no touching the stack?) ..
> Assuming that's the case you would surely standardise something more
> like the win32 intrinsic rather than one resembling the PPC opcode.
> If the function returns a bool that the value was zero or not, then I
> think it's fair to say the position is undefined (which supports the
> intel assertion).
>
> PPC's approach is more cleanly factored into the win32 model than the
> other way around I think, in terms of allowing the optimiser to trim the
> unused code. If the intrinsic generates implicit code to produce a bool
> from the value, it will surely be trimmed by the optimiser if that
> result is not used.
>
> While cmov might work nicely (although I really don't trust that opcode
> anyway, an intrinsic like bsr shouldn't be producing a hidden branch) on
> x86 to produce the PPC result, I'm not sure other architectures would
> have such a simple solution.

Most other architectures that I know of, use lzcnt instead.

On AMD64 (not Intel) and on ARM, there's an LZCNT resp. CLZ instruction, which gives:
lzcnt(x) = x? 63-bsr(x) : 64;

Here's how it could be done:

RAX lzcnt(EBX)
{
  bsr RAX, RBX;
  cmovz RAX, -1
  xor RAX, 63;
}


> Again, I think the win32 approach is easier
> for all architectures to produce and for the optimiser to truncate if
> the calculated result is unused.
>
> bool bsf/bsr(int value, out int position); // this assumes that position
> will cleanly return in a second return register...

Seems to be equivalent to replacing the bsr with a comma expression:

(position = native_bsr(value), value == 0)

Do we really gain much by this?
The more painful signature of the function somewhat discourages users from calling it with a zero value, but the undefined position is still exposed. So the original problem of undefined behaviour remains.
There's maybe a performance improvement in the fairly rare case where there's a branch on zero value. Although theoretically, in existing code the optimizer could check for the sequence:
bsr dest, src
cmp src, 0   where only Z flag is required

and remove the cmp, so I don't think the performance aspect should be rated very highly.

It's a bit of a problem that AMD's bsf and bsr are so slow. They're really slow on Pentium 4 and Atom as well.

Interestingly AMD's lzcnt is faster than their bsr. But since Intel doesn't support it, it's pretty useless outside of inline asm.



I think we need to do a survey of as many architectures as possible, before we can decide what to do. As far as I know, bsr/bsf is unique to x86. If this is true, then bsf/bsr should probably be wrapped in version(x86), and discouraged from general use. A portable function (perhaps leadz, trailz) would need to provided as well, and recommended for general use.
1 2
Next ›   Last »