January 06, 2012
On 6 January 2012 11:04, Andrew Wiley <wiley.andrew.j@gmail.com> wrote:

> On Fri, Jan 6, 2012 at 2:43 AM, Walter Bright <newshound2@digitalmars.com> wrote:
> > On 1/5/2012 5:42 PM, Manu wrote:
> >>
> >> So I've been hassling about this for a while now, and Walter asked me to
> >> pitch
> >> an email detailing a minimal implementation with some initial thoughts.
> >
> >
> > Takeaways:
> >
> > 1. SIMD behavior is going to be very machine specific.
> >
> > 2. Even trying to do something with + is fraught with peril, as integer
> adds
> > with SIMD can be saturated or unsaturated.
> >
> > 3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D
> would
> > have to support internally maybe a 100 or more new operators.
> >
> > So some simplification is in order, perhaps a low level layer that is
> fairly
> > extensible for new instructions, and for which a library can be layered
> over
> > for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:
> >
> > Declare one new basic type:
> >
> >    __v128
> >
> > which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment.
> The
> > __ prefix signals that it is non-portable.
> >
> > Then, have:
> >
> >   import core.simd;
> >
> > which provides two functions:
> >
> >   __v128 simdop(operator, __v128 op1);
> >   __v128 simdop(operator, __v128 op1, __v128 op2);
> >
> > This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates
> its
> > operation, but I doubt that it would be worth anyone's while to use
> that.)
> >
> > The operators would be an enum listing of the SIMD opcodes,
> >
> >    PFACC, PFADD, PFCMPEQ, etc.
> >
> > For:
> >
> >    z = simdop(PFADD, x, y);
> >
> > the compiler would generate:
> >
> >    MOV z,x
> >    PFADD z,y
> >
>
> Would this tie SIMD support directly to x86/x86_64, or would it be possible to also support NEON on ARM (also 128-bit SIMD, see
>
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409g/index.html
> ) ?
> (Obviously not for DMD, but if the syntax wasn't directly tied to
> x86/64, GDC and LDC could support this)
> It seems like using a standard naming convention instead of directly
> referencing instructions could let the underlying SIMD instructions
> vary across platforms, but I don't know enough about the technologies
> to say whether NEON's capabilities match SSE closely enough that they
> could be handled the same way.
>

The underlying architectures are too different to try and map opcodes
across architectures.
__v128 should map to each architecture's native SIMD type, allowing the
compiler to express the hardware, but the opcodes would come from
architecture-specific opcodes available in each compiler.

As I keep suggesting, LIBRARIES would be created to supply types like float4, int4, etc., which may also use version() liberally behind the scenes to support all architectures, providing a common and efficient API for all architectures at this level.


January 06, 2012
Walter:

> One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:

What are the disadvantages of making it typeless?
If it is typeless how do you tell it to perform a 4 float sum instead of a 2 double sum?
Is this low level layer able to support AVX and AVX2 3-way comparison instructions too, and the fused multiplication-add instruction?

---------------

For Manu: LDC compiler has this too: http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

Bye,
bearophile
January 06, 2012
On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/5/2012 5:42 PM, Manu wrote:
>> So I've been hassling about this for a while now, and Walter asked me to pitch
>> an email detailing a minimal implementation with some initial thoughts.
>
> Takeaways:
>
> 1. SIMD behavior is going to be very machine specific.
>
> 2. Even trying to do something with + is fraught with peril, as integer adds with SIMD can be saturated or unsaturated.
>
> 3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D would have to support internally maybe a 100 or more new operators.
>
> So some simplification is in order, perhaps a low level layer that is fairly extensible for new instructions, and for which a library can be layered over for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:
>
> Declare one new basic type:
>
>      __v128
>
> which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.
>
> Then, have:
>
>     import core.simd;
>
> which provides two functions:
>
>     __v128 simdop(operator, __v128 op1);
>     __v128 simdop(operator, __v128 op1, __v128 op2);
>
> This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates its operation, but I doubt that it would be worth anyone's while to use that.)
>
> The operators would be an enum listing of the SIMD opcodes,
>
>      PFACC, PFADD, PFCMPEQ, etc.
>
> For:
>
>      z = simdop(PFADD, x, y);
>
> the compiler would generate:
>
>      MOV z,x
>      PFADD z,y
>
> The code generator knows enough about these instructions to do register assignments reasonably optimally.
>
> What do you think? It ain't beeyoootiful, but it's implementable in a reasonable amount of time, and it should make it possible to write tight & fast SIMD code without having to do it all in assembler.
>
> One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:
>
>      __vdouble2
>      __vfloat4
>      __vlong2
>      __vulong2
>      __vint4
>      __vuint4
>      __vshort8
>      __vushort8
>      __vbyte16
>      __vubyte16

Those could be typedefs, i.e. an alias this wrapper.
Still, simdop would not be typesafe.

As much as this proposal presents a viable solution,
why not spend the time extending inline asm instead?

void foo()
{
    __v128 a = loadss(1.0f);
    __v128 b = loadss(1.0f);
    a = addss(a, b);
}

__v128 loadss(float v)
{
    __v128 res; // allocates register
    asm
    {
        movss res, v[RBP];
    }
    return res; // return in XMM1 but inlineable return assignment
}

__v128 addss(__v128 a, __v128 b) // passed in XMM0, XMM1 but inlineable
{
    __v128 res = a;
    // asm prolog, allocates registers for every __v128 used within the asm
    asm
    {
        addss res, b;
    }
    // asm epilog, possibly restore spilled registers
    return res;
}

What would be needed?
 - Implement the asm allocation logic.
 - Functions containing asm statements should participate in inlining.
 - Determining inline cost of asm statements.

When used with typedefs for __vubyte16 et al., this would
allow a really clean and simple library implementation of intrinsics.
January 06, 2012
On 6 January 2012 14:54, bearophile <bearophileHUGS@lycos.com> wrote:

> Walter:
>
> > One caveat is it is typeless; a __v128 could be used as 4 packed ints or
> 2
> > packed doubles. One problem with making it typed is it'll add 10 more
> types to
> > the base compiler, instead of one. Maybe we should just bite the bullet
> and do
> > the types:
>
> What are the disadvantages of making it typeless?
> If it is typeless how do you tell it to perform a 4 float sum instead of a
> 2 double sum?
> Is this low level layer able to support AVX and AVX2 3-way comparison
> instructions too, and the fused multiplication-add instruction?
>

I don't believe there are any. I can see only advantages to implementing the typed versions in libraries.

To make it perform float4 math, or double2 math, you either write the pseudo-assembly you want directly or, more realistically, use the __float4 type supplied in the standard library, which will already associate all the float4-related functionality and try to map it across various architectures as efficiently as possible.

AVX needs a __v256 type in addition to the __v128 type already discussed;
this should be trivial to add. Again, the libraries take care of presenting
a nice API to the users.
The comparisons and multiply-sum you mention are just opcodes like any other
that may be used on the raw type, and will be wrapped up nicely in the
strongly typed libraries.


January 06, 2012
On Fri, 06 Jan 2012 13:56:58 +0100, Martin Nowak <dawg@dawgfoto.de> wrote:

> On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright <newshound2@digitalmars.com> wrote:
>
>> On 1/5/2012 5:42 PM, Manu wrote:
>>> So I've been hassling about this for a while now, and Walter asked me to pitch
>>> an email detailing a minimal implementation with some initial thoughts.
>>
>> Takeaways:
>>
>> 1. SIMD behavior is going to be very machine specific.
>>
>> 2. Even trying to do something with + is fraught with peril, as integer adds with SIMD can be saturated or unsaturated.
>>
>> 3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D would have to support internally maybe a 100 or more new operators.
>>
>> So some simplification is in order, perhaps a low level layer that is fairly extensible for new instructions, and for which a library can be layered over for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:
>>
>> Declare one new basic type:
>>
>>      __v128
>>
>> which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.
>>
>> Then, have:
>>
>>     import core.simd;
>>
>> which provides two functions:
>>
>>     __v128 simdop(operator, __v128 op1);
>>     __v128 simdop(operator, __v128 op1, __v128 op2);
>>
>> This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates its operation, but I doubt that it would be worth anyone's while to use that.)
>>
>> The operators would be an enum listing of the SIMD opcodes,
>>
>>      PFACC, PFADD, PFCMPEQ, etc.
>>
>> For:
>>
>>      z = simdop(PFADD, x, y);
>>
>> the compiler would generate:
>>
>>      MOV z,x
>>      PFADD z,y
>>
>> The code generator knows enough about these instructions to do register assignments reasonably optimally.
>>
>> What do you think? It ain't beeyoootiful, but it's implementable in a reasonable amount of time, and it should make writing tight & fast SIMD code without having to do it all in assembler.
>>
>> One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:
>>
>>      __vdouble2
>>      __vfloat4
>>      __vlong2
>>      __vulong2
>>      __vint4
>>      __vuint4
>>      __vshort8
>>      __vushort8
>>      __vbyte16
>>      __vubyte16
>
> Those could be typedefs, i.e. alias this wrapper.
> Still simdop would not be typesafe.
>
> As much as this proposal presents a viable solution,
> why not spending the time to extend inline asm.
>
> void foo()
> {
>      __v128 a = loadss(1.0f);
>      __v128 b = loadss(1.0f);
>      a = addss(a, b);
> }
>
> __v128 loadss(float v)
> {
>      __v128 res; // allocates register
>      asm
>      {
>          movss res, v[RBP];
>      }
>      return res; // return in XMM1 but inlineable return assignment
> }
>
> __v128 addss(__v128 a, __v128 b) // passed in XMM0, XMM1 but inlineable
> {
>      __v128 res = a;
>      // asm prolog, allocates registers for every __v128 used within the asm
>      asm
>      {
>          addss res, b;
>      }
>      // asm epilog, possibly restore spilled registers
>      return res;
> }
>
> What would be needed?
>   - Implement the asm allocation logic.
>   - Functions containing asm statements should participate in inlining.
>   - Determining inline cost of asm statements.
>
> When being used with typedefs for __vubyte16 et.al. this would
> allow a really clean and simple library implementation of intrinsics.

Also, addss is a pure function, which could be important for optimizing
out certain calls. Maybe we should allow asm blocks to be attributed pure.
January 06, 2012
On 6 January 2012 12:16, a <a@a.com> wrote:

> Walter Bright Wrote:
>
> > which provides two functions:
> >
> >     __v128 simdop(operator, __v128 op1);
> >     __v128 simdop(operator, __v128 op1, __v128 op2);
>
> You would also need functions that take an immediate too to support instructions such as shufps.
>
> > One caveat is it is typeless; a __v128 could be used as 4 packed ints or
> 2
> > packed doubles. One problem with making it typed is it'll add 10 more
> types to
> > the base compiler, instead of one. Maybe we should just bite the bullet
> and do
> > the types:
> >
> >      __vdouble2
> >      __vfloat4
> >      __vlong2
> >      __vulong2
> >      __vint4
> >      __vuint4
> >      __vshort8
> >      __vushort8
> >      __vbyte16
> >      __vubyte16
>
> I don't see it being typeless as a problem. The purpose of this is to expose hardware capabilities to D code and the vector registers are typeless, so why shouldn't vector type be "typeless" too? Types such as vfloat4 can be implemented in a library (which could also be made portable and have a nice API).
>

Hooray! I think we're on exactly the same page. That's refreshing :)

I think this __simdop( op, v1, v2, etc ) API is a bit of a bad idea...
there are too many permutations of arguments.
I know some PPC functions that receive FIVE arguments (2-3 regs, and 2-3
literals).
Why not just expose the opcodes as intrinsic functions directly, for
instance (maybe in std.simd.sse)?
__v128 __sse_mul_ss( __v128 v1, __v128 v2 );
__v128 __sse_mul_ps( __v128 v1, __v128 v2 );
__v128 __sse_madd_epi16( __v128 v1, __v128 v2, __v128 v3 ); // <- some have more args
__v128 __sse_shuffle_ps( __v128 v1, __v128 v2, immutable int i ); // <- some need literal ints
etc...

This works best for other architectures too I think, they expose their own
set of intrinsics, and some have rather different parameter layouts.
VMX for instance (perhaps in std.simd.vmx?):
__v128 __vmx_vmsum4fp( __v128 v1, __v128 v2, __v128 v3 );
__v128 __vmx_vpermwi( __v128 v1, immutable int i ); // <-- needs a literal
__v128 __vmx_vrlimi( __v128 v1, __v128 v2, immutable int mask, immutable int rot ); // <-- you really don't want to add your enum-style function for all these prototypes?
etc...

I have seen at least these argument lists:
( v1 )
( v1, v2 )
( v1, v2, v3 )
( v1, immutable int )
( v1, v2, immutable int )
( v1, v2, immutable int, immutable int )


January 06, 2012
On 6 January 2012 14:56, Martin Nowak <dawg@dawgfoto.de> wrote:

> On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright < newshound2@digitalmars.com> wrote:
>
>> One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:
>>
>>     __vdouble2
>>     __vfloat4
>>     __vlong2
>>     __vulong2
>>     __vint4
>>     __vuint4
>>     __vshort8
>>     __vushort8
>>     __vbyte16
>>     __vubyte16
>>
>
> Those could be typedefs, i.e. alias this wrapper.
> Still simdop would not be typesafe.
>

I think they should be well-defined structs with lots of type safety and sensible methods, not just a typedef of the typeless primitive.


> As much as this proposal presents a viable solution,
> why not spending the time to extend inline asm.
>

I think there are too many risky problems with the inline assembler (as
raised in my discussion about supporting pseudo registers in inline asm
blocks):
  * No way to allow the compiler to assign registers (pseudo registers).
  * Assembly blocks present problems for the optimiser; it's not reliable
that it can optimise around an inline asm block. How bad will it be when
trying to optimise around 100 small inlined functions, each containing its
own inline asm block?
  * D's inline assembly syntax has to be carefully translated to GCC's
inline asm format when using GCC, and this needs to be done
PER-ARCHITECTURE, which Iain should not be expected to do for all the
obscure architectures GCC supports.


> What would be needed?
>  - Implement the asm allocation logic.
>  - Functions containing asm statements should participate in inlining.
>  - Determining inline cost of asm statements.
>

I raised these points in my other thread; these are all far more complicated problems, I think, than exposing opcode intrinsics would be. Opcode intrinsics are almost certainly the way to go.

When being used with typedefs for __vubyte16 et.al. this would
> allow a really clean and simple library implementation of intrinsics.
>

The type safety you're imagining here might actually be annoying when
working with the raw type and opcodes...
Consider this common situation and the code that will be built around it:

__v128 vec = { floatX, floatY, floatZ, packedColour }; // pack some other useful data (e.g. an unsigned int colour) in W

If vec were strongly typed, I would now need to start casting all over the
place to use various float and uint opcodes on this value.
I think it's correct, when using SIMD at the raw level, to express the type
as it is: typeless. SIMD regs are in fact typeless regs; they only gain a
concept of type the moment you perform an opcode on them, and only for the
duration of that opcode.

You will get your strong type safety when you make use of the float4 types which will be created in the libs.


January 06, 2012
On 6 January 2012 08:22, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/5/2012 7:42 PM, Manu wrote:
>
>> Perhaps I misunderstand, I can't see the problem?
>> In the function preamble, you just align it... something like:
>>   mov reg, esp ; take a backup of the stack pointer
>>   and esp, -16 ; align it
>>
>> ... function
>>
>>   mov esp, reg ; restore the stack pointer
>>   ret 0
>>
>
> And now you cannot access the function's parameters anymore, because the stack offset for them is now variable rather than fixed.
>

Hehe, true, but not insurmountable. Scheduling parameter pops before you perform the alignment may solve that straight up; or else don't align esp itself, but store the vector to the stack through some other aligned reg copied from esp...

I just wrote some test functions using __m128 in VisualC; it seems to do
something in between the simplicity of my initial suggestion and my
refined idea above :)
If you have VisualC, check out what it does. It's very simple, looks pretty
good, and I'm sure it's optimal (MS have enough R&D money to assure this).

I can paste some disassemblies if you don't have VC...


January 06, 2012
Manu:

> To make it perform float4 math, or double2 match, you either write the pseudo assembly you want directly, but more realistically, you use the __float4 type supplied in the standard library, which will already associate all the float4 related functionality, and try and map it across various architectures as efficiently as possible.

I see. While you design, you need to think about the other features of D :-) Is it possible to mix CPU SIMD with D vector ops?

__float4[10] a, b, c;
c[] = a[] + b[];

Bye,
bearophile
January 06, 2012
Manu:

> I can paste some disassemblies if you don't have VC...

Pasting it is useful for all other people reading this thread too, like me.

Bye,
bearophile