On 19 July 2013 19:33, bearophile <bearophileHUGS@lycos.com> wrote:
Manu:

Interesting. Almost all his points are what we do already in D.
Always nice to see others come to the same conclusions :)

While trying to write a multiplication of two complex numbers using SSE3 with LDC2, I found seven or more bugs, which I will discuss elsewhere. But regarding the syntax: in nice code like this, D requires adding ".array" before all those subscripts (code adapted from Fog):


double2 complexMult(in double2 a, in double2 b) pure nothrow {
    double2 b_flip = [b.array[1], b.array[0]];
    double2 a_im = [a.array[1], a.array[1]];
    double2 a_re = [a.array[0], a.array[0]];
    double2 aib = a_im * b_flip;
    double2 arb = a_re * b;
    return [arb.array[0] - aib.array[0], arb.array[1] + aib.array[1]];
}


A line like this:

double2 b_flip = [b.array[1], b.array[0]];

becomes something like:

pshufd   $238,  %xmm1, %xmm3

Similarly, all the other lines become single instructions (except the last one, because LDC2 fails to use addsubpd).
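For reference, the lane arithmetic in the function above is the usual complex product, (a_re + i*a_im) * (b_re + i*b_im) = (a_re*b_re - a_im*b_im) + i*(a_re*b_im + a_im*b_re), with element 0 holding the real part and element 1 the imaginary part. A minimal scalar sketch to check the vector version against; the name complexMultScalar is made up for illustration and the code uses plain double[2] rather than core.simd:

```d
// Scalar reference for the SIMD routine: element 0 is the real part,
// element 1 the imaginary part. complexMultScalar is a hypothetical name.
double[2] complexMultScalar(in double[2] a, in double[2] b) pure nothrow @safe
{
    return [a[0] * b[0] - a[1] * b[1],   // real:      a_re*b_re - a_im*b_im
            a[0] * b[1] + a[1] * b[0]];  // imaginary: a_re*b_im + a_im*b_re
}

unittest
{
    // (1 + 2i) * (3 + 4i) = -5 + 10i
    immutable double[2] r = complexMultScalar([1.0, 2.0], [3.0, 4.0]);
    assert(r == [-5.0, 10.0]);
}
```

The unittest block runs when the file is built with dmd -unittest -main.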

I vaguely remember you saying that slow SIMD operations shouldn't have too short a syntax, to avoid giving an illusion of efficiency. But given that the CPU "often" executes such array subscripting and shuffling efficiently, isn't it nicer (and sufficient) to support a simpler syntax like this in D?

double2 complexMult(in double2 a, in double2 b) pure nothrow {
    double2 b_flip = [b[1], b[0]];
    double2 a_im = [a[1], a[1]];
    double2 a_re = [a[0], a[0]];
    double2 aib = a_im * b_flip;
    double2 arb = a_re * b;
    return [arb[0] - aib[0], arb[1] + aib[1]];
}

The point of eliminating the index operator is that it implies a vector->float cast.
You want to perform a shuffle (or swizzle), but you are only performing that operation incidentally.
What you're really doing is casting a bunch of vector components to floats and then rebuilding a vector, which LLVM can helpfully deal with.

I would suggest calling a spade a spade and using a swizzle function to perform a swizzle, instead of code like what you wrote.
Wouldn't this be better:

double2 complexMult(in double2 a, in double2 b) pure nothrow {
    double2 b_flip = b.yx; // or b.swizzle!"yx", if we don't want to include an opDispatch in the basic type
    double2 a_im = a.yy;
    double2 a_re = a.xx;
    double2 aib = a_im * b_flip;
    double2 arb = a_re * b;

//    return [arb[0] - aib[0], arb[1] + aib[1]]; // this final line is tricky... it's not very portable.

    // Maybe:
    return select([-1, 0], arb-aib, arb+aib);
    // Hopefully the x86 optimiser will generate the proper opcode. Otherwise there are a bunch of other options: a multi-vector shuffle, shift, swizzle or interleave.
}

I think that would be better. It's more portable, and it eliminates the code that implies a vector->float->vector cast sequence, which, I maintain, should be syntactically discouraged at all costs.
You don't want to be giving people bad ideas that it's reasonable code to write ;)
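
To make the swizzle!"yx" spelling concrete, here is a rough sketch of what such a helper could look like for double2. Everything in it is an assumption for illustration: core.simd has no swizzle function, and the placeholder body still goes through scalar lanes via .array, which is exactly what a real implementation would replace with a shuffle intrinsic. The final line also does not use select; it substitutes a multiply by a constant [-1, 1] vector so the example stays in vector operations, at the cost of an extra multiply.

```d
import core.simd;

// Hypothetical helper, not part of core.simd. The pattern is two letters
// from "xy", naming the source lane for each result lane.
double2 swizzle(string pattern)(in double2 v) pure nothrow
    if (pattern.length == 2
        && (pattern[0] == 'x' || pattern[0] == 'y')
        && (pattern[1] == 'x' || pattern[1] == 'y'))
{
    enum size_t i0 = pattern[0] == 'x' ? 0 : 1;
    enum size_t i1 = pattern[1] == 'x' ? 0 : 1;
    // Placeholder body: a real implementation would map the pattern to a
    // single shuffle instruction instead of touching scalar lanes.
    double2 r;
    r.array[0] = v.array[i0];
    r.array[1] = v.array[i1];
    return r;
}

double2 complexMult(in double2 a, in double2 b) pure nothrow
{
    double2 b_flip = b.swizzle!"yx";
    double2 a_im = a.swizzle!"yy";
    double2 a_re = a.swizzle!"xx";
    double2 aib = a_im * b_flip;
    double2 arb = a_re * b;
    // Assumption: negate lane 0 of aib and add, in place of select.
    double2 sign = [-1.0, 1.0];
    return arb + aib * sign;
}

unittest
{
    // (1 + 2i) * (3 + 4i) = -5 + 10i
    double2 a = [1.0, 2.0];
    double2 b = [3.0, 4.0];
    double2 r = complexMult(a, b);
    assert(r.array[0] == -5.0 && r.array[1] == 10.0);
}
```

The point of the sketch is the call sites: the caller's code never mentions a scalar lane, so the helper's body can be swapped for a proper per-platform shuffle without touching complexMult.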