Implementation of gcc SIMD builtins

Feb 10, 2011

Sorry to start a new thread on this, but I didn't want it to get lost in the middle of the previous comments. I have the start of a working implementation in gdc giving access to the __builtin_ia32_* functions. The way I did it so far manages to compile down to very tight SSE code. That's the good part.

The bad part is that there are a couple of limitations that engender slightly ugly code at the moment.

Issue 1:
In order for the compiler to actually recognize the builtins when you call them, I had to define a set of custom types that represent gcc's types with the vector_size attribute that get passed into the builtins. I couldn't use float[4] for V4SF as I (and Iain) had hoped, as I can't set the alignment, and it automatically generates calls to _d_array_init and _d_array_copy and such, rather than instead just staying in SSE registers. Let's just say the code was very non-optimal.

Instead, I create a struct declaration in gcc.builtins for all of the types expected by those builtins, and I name them to match what they would nominally contain: __v4sf would have 4 floats, __v32qi would have 32 bytes, __v2df would have 2 doubles, and so forth. Each struct has 16-byte alignment and the correct size. But here's the rub: they have no fields in them. I tried my darndest to add VarDeclarations to them, but the fact that the actual gcc tree type wasn't a struct, it would just ICE the compiler when instantiating any of those structs.

I'd like to fix this, so that you can literally access the contents of the struct like it had a float[4], or a double[2], or a byte[32], or whatever it should actually have; or instead it should give you an overloaded [] operator for direct indexing.

Issue 2:
The builtin structs I generate are *not* recognized by the frontend as having support for +, -, *, /, etc like the gcc vector_size types automatically do in C and C++. I might be able to add those and have them contain code to drop into the builtins. For now, you *must* use the builtin functions to perform operations on these types. I'm obviously aiming to use the builtin functions, myself (for now).

Quick example:

///// File VectorsMain.d /////
import gcc.builtins;
import mmintrins;
import std.stdio;
void main()
{
    __v4sf bv1;
    setvelem(bv1, 0, 1.0f);
    setvelem(bv1, 1, 2.0f);
    setvelem(bv1, 2, 3.0f);
    setvelem(bv1, 3, 0.0f);
    __v4sf bv2;
    setvelem(bv2, 0, 1.0f);
    setvelem(bv2, 1, 1.0f);
    setvelem(bv2, 2, 1.0f);
    setvelem(bv2, 3, 0.0f);
    __v4sf bv3 = _mm_add_ps(bv1, bv2);
    std.stdio.writefln("Result: (%s, %s, %s, %s)",
                       velem!float(bv3, 0),
                       velem!float(bv3, 1),
                       velem!float(bv3, 2),
                       velem!float(bv3, 3));
}
///// File mmintrins.d /////
module mmintrins;
import gcc.builtins;
T velem(T, VT)(VT vector, uint elem)
{
    return (cast(T*) &vector)[elem];
}
void setvelem(T, VT)(ref VT vector, uint elem, T value)
{
    (cast(T*) &vector)[elem] = value;
}
//pragma(set_attribute, _mm_add_ps, always_inline, artificial);
T _mm_add_ps(T)(const(T) v1, const(T) v2)
{
    return __builtin_ia32_addps(v1, v2);
}
///// End example /////

Note a few things: I made _mm_add_ps templated on vector type (I'll constrain it eventually to appropriate types), and this solves a couple of problems: cross-module inlining works as the other module gets the whole definition, and you can technically addps types other than v4sf.

Note the velem and setvelem methods are just to add a pretty face on the fact that the data of the struct is hidden, with no fields to access it. More checks are needed (at least in debug mode), and there will be some other handy things like _mm_set1_ps and _mm_set_ps to make rapid setup of vectors easier. I'll admit that this part is a bit ugly, but it works, and it generates excellent code. I compared the actual assembly generated to my own C++ code with the same intrinsics, and so far the D side is keeping up.

Please don't collectively throw up when you see this...fast vector ops are kind of a big deal for me, so be gentle. =) What do you all think?

-Mike
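
For illustration, here is a minimal sketch of how the checks and helpers mentioned above might look. It is not gdc's implementation: V4SF is a hypothetical stand-in for the field-less __v4sf struct, and the loop in _mm_add_ps substitutes for the __builtin_ia32_addps call so that the sketch compiles without gdc's builtins. The argument order of _mm_set_ps follows Intel's convention (highest lane first).

```d
// Hypothetical stand-in for the field-less builtin struct. gdc's real
// __v4sf is generated by the compiler; this only mimics size and alignment.
align(16) struct V4SF
{
    private ubyte[16] storage; // opaque bytes, no named float fields
}

// Read element `elem`; the assert is compiled out in release builds.
T velem(T, VT)(VT vector, uint elem)
{
    assert(elem < VT.sizeof / T.sizeof, "vector element index out of range");
    return (cast(T*) &vector)[elem];
}

// Write element `elem`, with the same debug-mode bounds check.
void setvelem(T, VT)(ref VT vector, uint elem, T value)
{
    assert(elem < VT.sizeof / T.sizeof, "vector element index out of range");
    (cast(T*) &vector)[elem] = value;
}

// Broadcast one value to all four lanes (modeled on Intel's _mm_set1_ps).
V4SF _mm_set1_ps(float x)
{
    V4SF v;
    foreach (i; 0 .. 4)
        setvelem(v, i, x);
    return v;
}

// Set four lanes; as in Intel's _mm_set_ps, e0 lands in element 0.
V4SF _mm_set_ps(float e3, float e2, float e1, float e0)
{
    V4SF v;
    setvelem(v, 0, e0);
    setvelem(v, 1, e1);
    setvelem(v, 2, e2);
    setvelem(v, 3, e3);
    return v;
}

// Lane-wise add, restricted to V4SF by a template constraint. The loop is
// a portable stand-in for the __builtin_ia32_addps call in the post.
T _mm_add_ps(T)(const(T) a, const(T) b) if (is(T == V4SF))
{
    T r;
    foreach (i; 0 .. 4)
        setvelem(r, i, velem!float(a, i) + velem!float(b, i));
    return r;
}
```

With these, the setup in main above would shrink to something like _mm_add_ps(_mm_set_ps(0.0f, 3.0f, 2.0f, 1.0f), _mm_set1_ps(1.0f)), and passing a non-vector type to _mm_add_ps is rejected at compile time by the constraint.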

== Quote from Mike Farnsworth (mike.farnsworth@gmail.com)'s article
> Sorry to start a new thread on this, but I didn't want it to get lost in
> the middle of the previous comments. I have the start of a working
> implementation in gdc giving access to the __builtin_ia32_* functions.
> The way I did it so far manages to compile down to very tight SSE code.
> That's the good part.
> The bad part is that there are a couple of limitations that engender
> slightly ugly code at the moment.
> Issue 1:
> In order for the compiler to actually recognize the builtins when you
> call them, I had to define a set of custom types that represent gcc's
> types with the vector_size attribute that get passed into the builtins.
> I couldn't use float[4] for V4SF as I (and Iain) had hoped, as I can't
> set the alignment, and it automatically generates calls to _d_array_init
> and _d_array_copy and such, rather than instead just staying in SSE
> registers. Let's just say the code was very non-optimal.

I didn't hope for anything, I'm not the crazy one using them. =)

> Instead, I create a struct declaration in gcc.builtins for all of the
> types expected by those builtins, and I name them to match what they
> would nominally contain: __v4sf would have 4 floats, __v32qi would have
> 32 bytes, __v2df would have 2 doubles, and so forth. Each struct has
> 16-byte alignment and the correct size. But here's the rub: they have
> no fields in them. I tried my darndest to add VarDeclarations to them,
> but the fact that the actual gcc tree type wasn't a struct, it would
> just ICE the compiler when instantiating any of those structs.
> I'd like to fix this, so that you can literally access the contents of
> the struct like it had a float[4], or a double[2], or a byte[32], or
> whatever it should actually have; or instead it should give you an
> overloaded [] operator for direct indexing.
> Issue 2:
> The builtin structs I generate are *not* recognized by the frontend as
> having support for +, -, *, /, etc like the gcc vector_size types
> automatically do in C and C++. I might be able to add those and have
> them contain code to drop into the builtins. For now, you *must* use
> the builtin functions to perform operations on these types. I'm
> obviously aiming to use the builtin functions, myself (for now).

Actually, the more I think about it, the more I feel a user-defined union would be better to scale the shortcomings of gcc attribute support in gdc. And trying to use whatever builtins gcc has to offer won't get you anywhere far anytime soon.

There's one or two ICEs when using arithmetic operations (+,-,/,*,=) for typedef'd types with vector attributes assigned to them. This has mostly been fixed in my local tree (with hopefully kind error message for invalid ops too), which will be pushed soon after the next dmd release merge.

> Quick example:
> ///// File VectorsMain.d /////
> import gcc.builtins;
> import mmintrins;
> import std.stdio;
> void main()
> {
>     __v4sf bv1;
>     setvelem(bv1, 0, 1.0f);
>     setvelem(bv1, 1, 2.0f);
>     setvelem(bv1, 2, 3.0f);
>     setvelem(bv1, 3, 0.0f);
>     __v4sf bv2;
>     setvelem(bv2, 0, 1.0f);
>     setvelem(bv2, 1, 1.0f);
>     setvelem(bv2, 2, 1.0f);
>     setvelem(bv2, 3, 0.0f);
>     __v4sf bv3 = _mm_add_ps(bv1, bv2);
>     std.stdio.writefln("Result: (%s, %s, %s, %s)",
>                        velem!float(bv3, 0),
>                        velem!float(bv3, 1),
>                        velem!float(bv3, 2),
>                        velem!float(bv3, 3));
> }
> ///// File mmintrins.d /////
> module mmintrins;
> import gcc.builtins;
> T velem(T, VT)(VT vector, uint elem)
> {
>     return (cast(T*) &vector)[elem];
> }
> void setvelem(T, VT)(ref VT vector, uint elem, T value)
> {
>     (cast(T*) &vector)[elem] = value;
> }
> //pragma(set_attribute, _mm_add_ps, always_inline, artificial);
> T _mm_add_ps(T)(const(T) v1, const(T) v2)
> {
>     return __builtin_ia32_addps(v1, v2);
> }
> ///// End example /////
> Note a few things: I made _mm_add_ps templated on vector type (I'll
> constrain it eventually to appropriate types), and this solves a couple
> of problems: cross-module inlining works as the other module gets the
> whole definition, and you can technically addps types other than v4sf.
> Note the velem and setvelem methods are just to add a pretty face on the
> fact that the data of the struct is hidden, with no fields to access it.
> More checks are needed (at least in debug mode), and there will be some
> other handy things like _mm_set1_ps and _mm_set_ps to make rapid setup
> of vectors easier. I'll admit that this part is a bit ugly, but it
> works, and it generates excellent code. I compared the actual assembly
> generated to my own C++ code with the same intrinsics, and so far the D
> side is keeping up.
> Please don't collectively throw up when you see this...fast vector ops
> are kind of a big deal for me, so be gentle. =) What do you all think?
> -Mike

I think I'm gonna throw up... :~)
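
To make the user-defined union idea above concrete, here is a minimal sketch in plain D. The names Vec4f, vec4 and add are hypothetical and not part of gdc, and nothing here claims the result stays in SSE registers; it only shows how a union gives both an indexable view and named fields over the same 16 bytes without relying on gcc's vector attributes.

```d
// Hypothetical user-defined union; the names are illustrative, not from gdc.
// align(16) requests the 16-byte alignment that SSE loads and stores prefer.
align(16) union Vec4f
{
    float[4] array;               // indexable view
    struct { float x, y, z, w; }  // named-field view over the same bytes
}

// Build a vector from four components through the named-field view.
Vec4f vec4(float x, float y, float z, float w)
{
    Vec4f v;
    v.x = x;
    v.y = y;
    v.z = z;
    v.w = w;
    return v;
}

// Lane-wise addition written as a D array operation on the array view.
Vec4f add(Vec4f a, Vec4f b)
{
    Vec4f r;
    r.array[] = a.array[] + b.array[];
    return r;
}
```

Because both views alias the same storage, writing through x, y, z, w and reading back through array (or the reverse) needs no accessor helpers.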

February 10, 2011

Re: Implementation of gcc SIMD builtins

Posted by Mike Farnsworth in reply to Iain Buclaw

Iain Buclaw Wrote:

> == Quote from Mike Farnsworth (mike.farnsworth@gmail.com)'s article
> > Sorry to start a new thread on this, but I didn't want it to get lost in
> > the middle of the previous comments.  I have the start of a working
> > implementation in gdc giving access to the __builtin_ia32_* functions.
> > The way I did it so far manages to compile down to very tight SSE code.
> >  That's the good part.
> > The bad part is that there are a couple of limitations that engender
> > slightly ugly code at the moment.
> > Issue 1:
> > In order for the compiler to actually recognize the builtins when you
> > call them, I had to define a set of custom types that represent gcc's
> > types with the vector_size attribute that get passed into the builtins.
> > I couldn't use float[4] for V4SF as I (and Iain) had hoped, as I can't
> > set the alignment, and it automatically generates calls to _d_array_init
> > and _d_array_copy and such, rather than instead just staying in SSE
> > registers.  Let's just say the code was very non-optimal.
> 
> I didn't hope for anything, I'm not the crazy one using them. =)
> 
> > Instead, I create a struct declaration in gcc.builtins for all of the
> > types expected by those builtins, and I name them to match what they
> > would nominally contain: __v4sf would have 4 floats, __v32qi would have
> > 32 bytes, __v2df would have 2 doubles, and so forth.  Each struct has
> > 16-byte alignment and the correct size.  But here's the rub: they have
> > no fields in them.  I tried my darndest to add VarDeclarations to them,
> > but the fact that the actual gcc tree type wasn't a struct, it would
> > just ICE the compiler when instantiating any of those structs.
> > I'd like to fix this, so that you can literally access the contents of
> > the struct like it had a float[4], or a double[2], or a byte[32], or
> > whatever it should actually have; or instead it should give you an
> > overloaded [] operator for direct indexing.
> > Issue 2:
> > The builtin structs I generate are *not* recognized by the frontend as
> > having support for +, -, *, /, etc like the gcc vector_size types
> > automatically do in C and C++.  I might be able to add those and have
> > them contain code to drop into the builtins.  For now, you *must* use
> > the builtin functions to perform operations on these types.  I'm
> > obviously aiming to use the builtin functions, myself (for now).
> 
> Actually, the more I think about it, the more I feel a user-defined union would be better to scale the shortcomings of gcc attribute support in gdc. And trying to use whatever builtins gcc has to offer won't get you anywhere far anytime soon.
> 
> There's one or two ICEs when using arithmetic operations (+,-,/,*,=) for typedef'd types with vector attributes assigned to them. This has mostly been fixed in my local tree (with hopefully kind error message for invalid ops too), which will be pushed soon after the next dmd release merge.

I tried to use the typedef'd types, but even with simple ops I got all sorts of weird compile errors as soon as I tried to pass them to the intrinsics. In other words, with the typedef'd types I can get the +, -, *, / operators to work but not the builtin functions, while with my builtin structs I can get the builtin functions to work but not the operators. I'm hoping at some point I can get the best of both worlds, but I still feel pretty lost in all the gdc code (although I'm slowly learning).

> > Quick example:
> > ///// File VectorsMain.d /////
> > import gcc.builtins;
> > import mmintrins;
> > import std.stdio;
> > void main()
> > {
> >     __v4sf bv1;
> >     setvelem(bv1, 0, 1.0f);
> >     setvelem(bv1, 1, 2.0f);
> >     setvelem(bv1, 2, 3.0f);
> >     setvelem(bv1, 3, 0.0f);
> >     __v4sf bv2;
> >     setvelem(bv2, 0, 1.0f);
> >     setvelem(bv2, 1, 1.0f);
> >     setvelem(bv2, 2, 1.0f);
> >     setvelem(bv2, 3, 0.0f);
> >     __v4sf bv3 = _mm_add_ps(bv1, bv2);
> >     std.stdio.writefln("Result: (%s, %s, %s, %s)",
> >                        velem!float(bv3, 0),
> >                        velem!float(bv3, 1),
> >                        velem!float(bv3, 2),
> >                        velem!float(bv3, 3));
> > }
> > ///// File mmintrins.d /////
> > module mmintrins;
> > import gcc.builtins;
> > T velem(T, VT)(VT vector, uint elem)
> > {
> >     return (cast(T*) &vector)[elem];
> > }
> > void setvelem(T, VT)(ref VT vector, uint elem, T value)
> > {
> >     (cast(T*) &vector)[elem] = value;
> > }
> > //pragma(set_attribute, _mm_add_ps, always_inline, artificial);
> > T _mm_add_ps(T)(const(T) v1, const(T) v2)
> > {
> >     return __builtin_ia32_addps(v1, v2);
> > }
> > ///// End example /////
> > Note a few things: I made _mm_add_ps templated on vector type (I'll
> > constrain it eventually to appropriate types), and this solves a couple
> > of problems: cross-module inlining works as the other module gets the
> > whole definition, and you can technically addps types other than v4sf.
> > Note the velem and setvelem methods are just to add a pretty face on the
> > fact that the data of the struct is hidden, with no fields to access it.
> >  More checks are needed (at least in debug mode), and there will be some
> > other handy things like _mm_set1_ps and _mm_set_ps to make rapid setup
> > of vectors easier.  I'll admit that this part is a bit ugly, but it
> > works, and it generates excellent code.  I compared the actual assembly
> > generated to my own C++ code with the same intrinsics, and so far the D
> > side is keeping up.
> > Please don't collectively throw up when you see this...fast vector ops
> > are kind of a big deal for me, so be gentle. =)  What do you all think?
> > -Mike
> 
> I think I'm gonna throw up... :~)

Well, keep in mind that I still have a few things to do in the near term that should help:

1) Add at minimum an overloaded [] op to allow direct indexing into the vector structs, which should make the velem/setvelem crap extraneous.

2) Add a bunch more intrinsics wrappers that follow the Intel standard, so the utility of this goes up.

3) Add some example wrapper structs that define all of the relevant operators, dot/cross products, etc.  These will (at least for typical usage) hide all of the ugliness and hopefully will compile down to very good code, while still being proper D types that are easy to understand and use.
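
As a rough idea of what items 1 and 3 could look like from the user's side, here is a sketch of a wrapper struct with direct indexing, lane-wise operators, and dot/cross products. Float4, dot and cross are hypothetical names, and the storage here is a plain float[4] so the sketch compiles anywhere; a gdc version would presumably wrap the builtin vector struct and forward to the builtins instead of looping.

```d
// Hypothetical wrapper; Float4, dot and cross are illustrative names.
// A plain float[4] stands in for the builtin vector struct.
align(16) struct Float4
{
    private float[4] data = 0.0f; // all lanes start at zero

    this(float x, float y, float z, float w = 0.0f)
    {
        data = [x, y, z, w];
    }

    // Direct indexing, which would make velem/setvelem unnecessary.
    float opIndex(size_t i) const { return data[i]; }
    float opIndexAssign(float value, size_t i) { return data[i] = value; }

    // Lane-wise +, -, *, / generated from a single template.
    Float4 opBinary(string op)(const Float4 rhs) const
        if (op == "+" || op == "-" || op == "*" || op == "/")
    {
        Float4 r;
        foreach (i; 0 .. 4)
            r.data[i] = mixin("data[i] " ~ op ~ " rhs.data[i]");
        return r;
    }
}

// Dot product over all four lanes.
float dot(const Float4 a, const Float4 b)
{
    float sum = 0.0f;
    foreach (i; 0 .. 4)
        sum += a[i] * b[i];
    return sum;
}

// 3D cross product on x, y, z; the w lane of the result is left at 0.
Float4 cross(const Float4 a, const Float4 b)
{
    return Float4(a[1] * b[2] - a[2] * b[1],
                  a[2] * b[0] - a[0] * b[2],
                  a[0] * b[1] - a[1] * b[0]);
}
```

Usage would then read like ordinary D: auto c = Float4(1, 2, 3) + Float4(1, 1, 1); followed by c[0] or dot(c, c), with no casts in user code.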

Hopefully then nobody will want to throw up anymore.

-Mike
