SIMD ideas for Rust - D Programming Language Discussion Forum

July 18, 2013

SIMD ideas for Rust

Posted by bearophile

A small blog post about SIMD ideas for Rust language:

http://blog.aventine.se/post/55669497784/my-vision-for-rust-simd

Some info on one of the discussed intrinsics, _mm_mul_epu32:
http://msdn.microsoft.com/en-us/library/f3d9e6fk%28v=vs.90%29.aspx

Bye,
bearophile

July 18, 2013

Re: SIMD ideas for Rust

Posted by Manu
in reply to bearophile

Interesting. Almost all his points are what we do already in D. Always nice to see others come to the same conclusions :)


On 18 July 2013 11:15, bearophile <bearophileHUGS@lycos.com> wrote:

> A small blog post about SIMD ideas for Rust language:
>
> http://blog.aventine.se/post/55669497784/my-vision-for-rust-simd
>
> Some info on one of the discussed intrinsics, _mm_mul_epu32: http://msdn.microsoft.com/en-us/library/f3d9e6fk%28v=vs.90%29.aspx
>
> Bye,
> bearophile
>

July 18, 2013

Re: SIMD ideas for Rust

Posted by bearophile
in reply to Manu

Manu:

> Interesting. Almost all his points are what we do already in D.
> Always nice to see others come to the same conclusions :)

There is also some discussion here:
http://www.reddit.com/r/rust/comments/1igvye/vision_for_rust_simd/

Regarding SIMD, a few days ago I added some significant notes in this thread:
http://forum.dlang.org/thread/kpt0ja$1fk6$1@digitalmars.com?page=2

Bye,
bearophile

July 19, 2013

Re: SIMD ideas for Rust

Posted by bearophile
in reply to Manu

Manu:

> Interesting. Almost all his points are what we do already in D.
> Always nice to see others come to the same conclusions :)

While trying to write a multiplication of two complex numbers using SSE3 with LDC2 I found seven or more bugs, which I will discuss elsewhere. But regarding the syntax: in nice code like this, D requires adding ".array" before all those subscripts (code adapted from Fog):


double2 complexMult(in double2 a, in double2 b) pure nothrow {
    double2 b_flip = [b.array[1], b.array[0]];
    double2 a_im = [a.array[1], a.array[1]];
    double2 a_re = [a.array[0], a.array[0]];
    double2 aib = a_im * b_flip;
    double2 arb = a_re * b;
    return [arb.array[0] - aib.array[0], arb.array[1] + aib.array[1]];
}


A line like this:

double2 b_flip = [b.array[1], b.array[0]];

becomes something like:

pshufd   $238,  %xmm1, %xmm3

Similarly, all the other lines become single instructions (except the last one, because LDC2 fails to use addsubpd).
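
As a sanity check on the lane arithmetic above, here is a minimal sketch in plain D. It uses `double[2]` static arrays instead of `core.simd` vectors so that it runs on any target, which means it says nothing about the instructions a compiler would emit. The names `complexMultRef` and `complexMultLanes` are made up for this illustration, and lane 0 is assumed to hold the real part and lane 1 the imaginary part.

```d
// Scalar reference: (a[0] + a[1]*i) * (b[0] + b[1]*i).
double[2] complexMultRef(in double[2] a, in double[2] b) pure nothrow
{
    return [a[0] * b[0] - a[1] * b[1],
            a[0] * b[1] + a[1] * b[0]];
}

// The same product arranged as the lane steps of the SIMD version.
double[2] complexMultLanes(in double[2] a, in double[2] b) pure nothrow
{
    const double[2] bFlip = [b[1], b[0]]; // swap the two lanes of b
    const double[2] aIm   = [a[1], a[1]]; // broadcast the imaginary part of a
    const double[2] aRe   = [a[0], a[0]]; // broadcast the real part of a

    double[2] aib, arb;
    aib[] = aIm[] * bFlip[];              // element-wise products
    arb[] = aRe[] * b[];

    // Subtract in lane 0, add in lane 1 (the pattern ADDSUBPD implements).
    return [arb[0] - aib[0], arb[1] + aib[1]];
}

unittest
{
    const double[2] a = [1.0, 2.0];
    const double[2] b = [3.0, 4.0];
    // (1 + 2i) * (3 + 4i) = -5 + 10i
    assert(complexMultLanes(a, b) == complexMultRef(a, b));
}
```

Compiling with `-unittest -main` runs the check; both functions agree on the sample input.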

I vaguely remember you saying that slow SIMD operations shouldn't have too short a syntax, to avoid giving an illusion of efficiency. But given that the CPU "often" executes such array subscripting and shuffling efficiently, wouldn't it be nicer (and enough) to support a simpler syntax like this in D?

double2 complexMult(in double2 a, in double2 b) pure nothrow {
    double2 b_flip = [b[1], b[0]];
    double2 a_im = [a[1], a[1]];
    double2 a_re = [a[0], a[0]];
    double2 aib = a_im * b_flip;
    double2 arb = a_re * b;
    return [arb[0] - aib[0], arb[1] + aib[1]];
}


Bye,
bearophile

July 19, 2013

Re: SIMD ideas for Rust

Posted by Manu
in reply to bearophile

On 19 July 2013 19:33, bearophile <bearophileHUGS@lycos.com> wrote:

> Manu:
>
>> Interesting. Almost all his points are what we do already in D.
>> Always nice to see others come to the same conclusions :)
>
> While trying to write a multiplication of two complex numbers using SSE3 with LDC2 I found seven or more bugs, which I will discuss elsewhere. But regarding the syntax: in nice code like this, D requires adding ".array" before all those subscripts (code adapted from Fog):
>
>
> double2 complexMult(in double2 a, in double2 b) pure nothrow {
>     double2 b_flip = [b.array[1], b.array[0]];
>     double2 a_im = [a.array[1], a.array[1]];
>     double2 a_re = [a.array[0], a.array[0]];
>     double2 aib = a_im * b_flip;
>     double2 arb = a_re * b;
>     return [arb.array[0] - aib.array[0], arb.array[1] + aib.array[1]];
> }
>
>
> A line like this:
>
> double2 b_flip = [b.array[1], b.array[0]];
>
> becomes something like:
>
> pshufd   $238,  %xmm1, %xmm3
>
> Similarly, all the other lines become single instructions (except the last one, because LDC2 fails to use addsubpd).
>
> I vaguely remember you saying that slow SIMD operations shouldn't have too short a syntax, to avoid giving an illusion of efficiency. But given that the CPU "often" executes such array subscripting and shuffling efficiently, wouldn't it be nicer (and enough) to support a simpler syntax like this in D?
>
> double2 complexMult(in double2 a, in double2 b) pure nothrow {
>     double2 b_flip = [b[1], b[0]];
>     double2 a_im = [a[1], a[1]];
>     double2 a_re = [a[0], a[0]];
>     double2 aib = a_im * b_flip;
>     double2 arb = a_re * b;
>     return [arb[0] - aib[0], arb[1] + aib[1]];
> }
>

The point about eliminating the index operator is because it implies a
vector->float cast.
You want to perform a shuffle(/swizzle), but you are only really performing
the operation incidentally.
What you're really doing is casting a bunch of vector components to floats,
and then rebuilding a vector, and LLVM can helpfully deal with that.

I would suggest calling a spade a spade and using a swizzle function to
perform a swizzle, instead of code like what you wrote.
Wouldn't this be better:

double2 complexMult(in double2 a, in double2 b) pure nothrow {
    // or b.swizzle!"yx", if we don't want to include an opDispatch in the basic type
    double2 b_flip = b.yx;
    double2 a_im = a.yy;
    double2 a_re = a.xx;
    double2 aib = a_im * b_flip;
    double2 arb = a_re * b;

    // return [arb[0] - aib[0], arb[1] + aib[1]]; // this final line is tricky... it's not very portable.

    // Maybe:
    return select([-1, 0], arb-aib, arb+aib);
    // Hopefully the x86 optimiser will generate the proper opcode. Or a bunch
    // of other options; a multi-vector shuffle, shift, swizzle, interleave.
}

I think that would be better. More portable, and it eliminates the code
that implies a vector->float->vector cast sequence, which I maintain,
should be syntactically discouraged at all costs.
You don't want to be giving people bad ideas that it's reasonable code to
write ;)
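
To make the `swizzle!"yx"` spelling concrete, here is a rough sketch of how such a template could be declared and checked at compile time. It only models the interface on plain static arrays; a real implementation would lower to a shuffle instruction rather than copy lanes one by one. Both `laneIndex` and this `swizzle` are hypothetical names written for this illustration, not the API of any existing library.

```d
// Hypothetical helper: map a swizzle letter to a lane index (usable in CTFE).
size_t laneIndex(char c)
{
    switch (c)
    {
        case 'x': return 0;
        case 'y': return 1;
        case 'z': return 2;
        case 'w': return 3;
        default:  assert(0, "swizzle letters must be x, y, z or w");
    }
}

// Hypothetical swizzle!"..." modelled on static arrays.
// The pattern is validated at compile time, one letter per output lane.
T[pattern.length] swizzle(string pattern, T, size_t N)(in T[N] v)
{
    T[pattern.length] r;
    static foreach (i, c; pattern)
    {{
        enum lane = laneIndex(c);
        static assert(lane < N, "swizzle lane is out of range for this vector");
        r[i] = v[lane];
    }}
    return r;
}

unittest
{
    const double[2] b = [3.0, 4.0];
    assert(b.swizzle!"yx" == [4.0, 3.0]); // flip the two lanes
    assert(b.swizzle!"yy" == [4.0, 4.0]); // broadcast lane 1
}
```

With this shape, a pattern such as `"zx"` applied to a two-lane value is rejected by the `static assert` during compilation instead of failing at run time.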

July 19, 2013

Re: SIMD ideas for Rust

Posted by bearophile
in reply to Manu

Manu:

> What you're really doing is casting a bunch of vector components to floats,
> and then rebuilding a vector, and LLVM can helpfully deal with that.
>
> I would suggest calling a spade a spade and using a swizzle function to
> perform a swizzle, instead of code like what you wrote.
> Wouldn't this be better:
>
> double2 complexMult(in double2 a, in double2 b) pure nothrow {
>     // or b.swizzle!"yx", if we don't want to include an opDispatch in the basic type
>     double2 b_flip = b.yx;
>     double2 a_im = a.yy;
>     double2 a_re = a.xx;
>     double2 aib = a_im * b_flip;
>     double2 arb = a_re * b;

I see and you are right.

(If I turn the basic type into a struct containing a double2
aliased-this to the whole structure, the generated code becomes
awful).

A YMM register already contains 8 floats, and SIMD registers will
probably keep growing, maybe to 1024 bits. So swizzle item names
like x y z w will not suffice, and some more general naming scheme
is needed.


>     // return [arb[0] - aib[0], arb[1] + aib[1]]; // this final line is tricky... it's not very portable.
>
>     // Maybe:
>     return select([-1, 0], arb-aib, arb+aib);
>     // Hopefully the x86 optimiser will generate the proper opcode. Or a bunch
>     // of other options; a multi-vector shuffle, shift, swizzle, interleave.
> }
>
> I think that would be better. More portable, and it eliminates the code
> that implies a vector->float->vector cast sequence, which I maintain,
> should be syntactically discouraged at all costs.
> You don't want to be giving people bad ideas that it's reasonable code to
> write ;)

My experience in writing this kind of code is limited. I will try
your select to see what kind of code LDC2-LLVM generates.

Bye,
bearophile

July 20, 2013

Re: SIMD ideas for Rust

Posted by Manu
in reply to bearophile

On 20 July 2013 03:43, bearophile <bearophileHUGS@lycos.com> wrote:

> Manu:
>
>> What you're really doing is casting a bunch of vector components to floats,
>> and then rebuilding a vector, and LLVM can helpfully deal with that.
>>
>> I would suggest calling a spade a spade and using a swizzle function to
>> perform a swizzle, instead of code like what you wrote.
>> Wouldn't this be better:
>>
>> double2 complexMult(in double2 a, in double2 b) pure nothrow {
>>     // or b.swizzle!"yx", if we don't want to include an opDispatch in the basic type
>>     double2 b_flip = b.yx;
>>     double2 a_im = a.yy;
>>     double2 a_re = a.xx;
>>     double2 aib = a_im * b_flip;
>>     double2 arb = a_re * b;
>
> I see and you are right.
>
> (If I turn the basic type into a struct containing a double2 aliased-this to the whole structure, the generated code becomes awful).
>
> A YMM register already contains 8 floats, and SIMD registers will probably keep growing, maybe to 1024 bits. So swizzle item names like x y z w will not suffice, and some more general naming scheme is needed.

Swizzling bytes already has that problem. Hexadecimal swizzle strings work
nicely up to 16 elements, but past that, I'd probably require that the
template receive a tuple of ints.
These are trivial details. .xyzw are particularly useful for 2-4d vectors.
They can be removed for anything higher. The nicest/most preferred
interface can be decided with experience.
As yet there's not a lot of practical experience with >128bit registers,
and the sorts of patterns that appear frequently.
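
The two interfaces mentioned here (hexadecimal swizzle strings for up to 16 lanes, and a tuple of ints beyond that) could look roughly like the sketch below. It is written against plain static arrays purely to show the compile-time interface; `hexLane`, `swizzleHex` and `shuffle` are hypothetical names invented for this example.

```d
// Hypothetical helper: one hexadecimal digit selects one of up to 16 lanes.
size_t hexLane(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    assert(0, "swizzle digits must be hexadecimal");
}

// String form: usable while every lane index fits in one hex digit.
T[pattern.length] swizzleHex(string pattern, T, size_t N)(in T[N] v)
{
    T[pattern.length] r;
    static foreach (i, c; pattern)
    {{
        enum lane = hexLane(c);
        static assert(lane < N, "lane index out of range");
        r[i] = v[lane];
    }}
    return r;
}

// Integer-sequence form for wider vectors, e.g. shuffle!(3, 2, 1, 0)(v).
template shuffle(Lanes...)
{
    T[Lanes.length] shuffle(T, size_t N)(in T[N] v)
    {
        T[Lanes.length] r;
        static foreach (i, lane; Lanes)
        {
            static assert(lane >= 0 && lane < N, "lane index out of range");
            r[i] = v[lane];
        }
        return r;
    }
}

unittest
{
    ubyte[16] bytes;
    foreach (i, ref x; bytes) x = cast(ubyte) i;
    auto rev = swizzleHex!"fedcba9876543210"(bytes); // reverse 16 bytes
    assert(rev[0] == 15 && rev[15] == 0);

    int[4] q = [10, 20, 30, 40];
    assert(shuffle!(3, 2, 1, 0)(q) == [40, 30, 20, 10]);
}
```

The string form is compact for byte shuffles, while the integer form has no 16-lane ceiling because each index is a full compile-time integer.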

>>     // return [arb[0] - aib[0], arb[1] + aib[1]]; // this final line is tricky... it's not very portable.
>>
>>     // Maybe:
>>     return select([-1, 0], arb-aib, arb+aib);
>>     // Hopefully the x86 optimiser will generate the proper opcode. Or a bunch
>>     // of other options; a multi-vector shuffle, shift, swizzle, interleave.
>> }
>>
>> I think that would be better. More portable, and it eliminates the code
>> that implies a vector->float->vector cast sequence, which I maintain,
>> should be syntactically discouraged at all costs.
>> You don't want to be giving people bad ideas that it's reasonable code to
>> write ;)
>
> My experience in writing this kind of code is limited. I will try your select to see what kind of code LDC2-LLVM generates.
>

It probably won't be good because I haven't paid attention to how it
optimises on SSE yet.
You need to encourage the compiler to generate ADDSUBPD for SSE, and any
(or none) of the possible expressions may result in it choosing the proper
opcode.
I'm apprehensive to add a helper function for that operation, since it's
dreadfully SSE-specific. It's the sort of thing where you might rather
carefully make sure the standard API will reliably encourage the optimiser
to do it.
If you can find a pattern of operations that optimises to ADDSUBPD, I'm
interested to know what the sequence(/s) are.
If not, we'll consider an explicit function. It can be emulated within
reason on other architectures, but I think it would be better to work out a
different solution, i.e. perform 2 (or 4) side by side (stream
processing). That will work well on all architectures.
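
One way to read the "2 side by side" suggestion is a structure-of-arrays layout, where one vector holds the real parts of two complex numbers and another holds their imaginary parts. The sketch below assumes a target on which `core.simd` defines `double2` (for example x86_64); `Complex2` and `complexMult2` are hypothetical names for this illustration. With this layout the product needs only lane-wise multiplies, adds and subtracts, so no shuffle or ADDSUBPD is involved.

```d
import core.simd;

// Hypothetical layout: lane k of each vector belongs to the k-th complex number.
struct Complex2
{
    double2 re; // real parts of two complex numbers
    double2 im; // imaginary parts of the same two numbers
}

// Multiplies two pairs of complex numbers at once, lane by lane.
Complex2 complexMult2(Complex2 a, Complex2 b) pure nothrow @nogc
{
    Complex2 r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}

unittest
{
    double2 aRe = [1.0, 0.0], aIm = [2.0, 1.0]; // a = { 1+2i, i }
    double2 bRe = [3.0, 0.0], bIm = [4.0, 1.0]; // b = { 3+4i, i }
    auto p = complexMult2(Complex2(aRe, aIm), Complex2(bRe, bIm));
    assert(p.re.array == [-5.0, -1.0]); // real parts of -5+10i and i*i = -1
    assert(p.im.array == [10.0, 0.0]);
}
```

The trade-off is that data must already be stored (or rearranged) as separate real and imaginary streams, which fits batch processing better than a single isolated multiply.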

July 23, 2013

Re: SIMD ideas for Rust

Posted by bearophile
in reply to Manu

Manu:

>     // return [arb[0] - aib[0], arb[1] + aib[1]]; // this final line is tricky... it's not very portable.
>
>     // Maybe:
>     return select([-1, 0], arb-aib, arb+aib);
>     // Hopefully the x86 optimiser will generate the proper opcode. Or a bunch
>     // of other options; a multi-vector shuffle, shift, swizzle, interleave.
> }
>
> I think that would be better. More portable, and it eliminates the code
> that implies a vector->float->vector cast sequence, which I maintain,
> should be syntactically discouraged at all costs.
> You don't want to be giving people bad ideas that it's reasonable code to
> write ;)

Apparently that select() uses __builtin_ia32_pblendvb128, which is an SSE4.1 instruction. At the moment I have only SSE3 :-(

Bye,
bearophile

July 24, 2013

Re: SIMD ideas for Rust

Posted by Manu
in reply to bearophile

On 23 July 2013 17:05, bearophile <bearophileHUGS@lycos.com> wrote:

> Manu:
>
>>     // return [arb[0] - aib[0], arb[1] + aib[1]]; // this final line is tricky... it's not very portable.
>>
>>     // Maybe:
>>     return select([-1, 0], arb-aib, arb+aib);
>>     // Hopefully the x86 optimiser will generate the proper opcode. Or a bunch
>>     // of other options; a multi-vector shuffle, shift, swizzle, interleave.
>> }
>>
>> I think that would be better. More portable, and it eliminates the code
>> that implies a vector->float->vector cast sequence, which I maintain,
>> should be syntactically discouraged at all costs.
>> You don't want to be giving people bad ideas that it's reasonable code to
>> write ;)
>
> Apparently that select() uses __builtin_ia32_pblendvb128, which is an SSE4.1
> instruction. At the moment I have only SSE3 :-(
>

It's probably better to use a shuf, or a shift for compatibility, since the selection predicate is constant anyway.
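
Because the selection predicate is a compile-time constant, another blend-free option is to fold it into a constant sign vector. Note that this is a different technique from the shuffle or shift mentioned above: it spends one extra multiply, and it has not been checked here which instructions any particular compiler emits for it. The sketch assumes a target where `core.simd` provides `double2`; `subAdd` is a hypothetical name.

```d
import core.simd;

// Returns [arb[0] - aib[0], arb[1] + aib[1]] using only lane-wise
// multiply and add, so no SSE4.1 blend is required.
double2 subAdd(double2 arb, double2 aib) pure nothrow @nogc
{
    double2 signs = [-1.0, 1.0]; // constant: negate lane 0, keep lane 1
    return arb + aib * signs;
}

unittest
{
    double2 arb = [3.0, 4.0];
    double2 aib = [8.0, 6.0];
    double2 r = subAdd(arb, aib);
    assert(r.array == [-5.0, 10.0]); // 3 - 8 and 4 + 6
}
```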
