On 23 July 2013 17:05, bearophile <bearophileHUGS@lycos.com> wrote:

Manu:

// return [arb[0] - aib[0], arb[1] + aib[1]]; // this final line is tricky... it's not very portable.

// Maybe:
return select([-1, 0], arb-aib, arb+aib);
// Hopefully the x86 optimiser will generate the proper opcode. Or a
// bunch of other options; a multi-vector shuffle, shift, swizzle, interleave.
}

I think that would be better. It's more portable, and it eliminates the code
that implies a vector->float->vector cast sequence, which, I maintain,
should be syntactically discouraged at all costs.
You don't want to give people the bad idea that it's reasonable code to
write ;)

Apparently that select() uses __builtin_ia32_pblendvb128, which is an SSE4.1 instruction. At the moment I only have SSE3 :-(

It's probably better to use a shuffle or a shift for compatibility, since the selection predicate is constant anyway.
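
As a rough illustration of the shuffle approach, here is a minimal sketch in plain C intrinsics rather than the D API discussed above. It assumes arb and aib are two-lane double vectors (the thread does not state the element type), and the names vec2d and sub_add are made up for this example. Because the predicate [-1, 0] is constant, a single SSE2 shuffle can pick lane 0 from the difference and lane 1 from the sum, so no SSE4.1 blend is needed.

```c
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical container for a two-lane double vector (an assumption). */
typedef struct { double v[2]; } vec2d;

/* Computes { arb[0] - aib[0], arb[1] + aib[1] }. */
static vec2d sub_add(vec2d arb, vec2d aib)
{
    vec2d out;
#if defined(__SSE2__)
    __m128d a    = _mm_loadu_pd(arb.v);
    __m128d b    = _mm_loadu_pd(aib.v);
    __m128d diff = _mm_sub_pd(a, b);   /* { a0 - b0, a1 - b1 } */
    __m128d sum  = _mm_add_pd(a, b);   /* { a0 + b0, a1 + b1 } */
    /* Immediate 2 (binary 10): result lane 0 = diff lane 0,
       result lane 1 = sum lane 1. This is SHUFPD, an SSE2 opcode. */
    __m128d r    = _mm_shuffle_pd(diff, sum, 2);
    _mm_storeu_pd(out.v, r);
#else
    /* Scalar fallback for targets without SSE2. */
    out.v[0] = arb.v[0] - aib.v[0];
    out.v[1] = arb.v[1] + aib.v[1];
#endif
    return out;
}
```

Calling sub_add with arb = {1.0, 2.0} and aib = {3.0, 4.0} should give {-2.0, 6.0} on either path. Whether a D select() with a constant mask could be lowered to the same opcode is a question for the library and the compiler; this sketch only shows that the instruction exists below SSE4.1.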