Thread overview: Re: SIMD support...
January 06, 2012
On 6 January 2012 01:42, Manu <turkeyman@gmail.com> wrote:
> So I've been hassling about this for a while now, and Walter asked me to pitch an email detailing a minimal implementation with some initial thoughts.
>
> The first thing I'd like to say is that a lot of people seem to have this
> idea that float[4] should be specialised as a candidate for simd
> optimisations somehow. It's obviously been discussed, and this general
> opinion seems to be shared by a good few people here.
> I've had a whole bunch of rants why I think this is wrong in other threads,
> so I won't repeat them here... and that said, I'll attempt to detail an
> approach based on explicit vector types.
>
> So, what do we need...? A language defined primitive vector type... that's all.
>
>
> -- What shall we call it? --
>
> Doesn't really matter... open to suggestions.
> VisualC calls it __m128, XBox360 calls it __vector4, GCC calls it 'vector
> float' (a name I particularly hate, not specifying any size, and trying to
> associate it with a specific type)
>
> I like v128, or something like that. I'll use that for the sake of this
> document. I think it is preferable to float4 for a few reasons:
>  * v128 says what the register intends to be, a general purpose 128bit
> register that may be used for a variety of simd operations that aren't
> necessarily type bound.
>  * float4 implies it is a specific 4 component float type, which is not what
> the raw type should be.
>  * If we use names like float4, it stands to reason that (u)int4, (u)short8,
> etc should also exist, and it also stands to reason that one might expect
> math operators and such to be defined...
>

It would be a fine start to support a float128 type. :)


With vectors, there are around 20 tied to the x86 architecture IIRC, whose base types cover the D equivalents of: byte, ubyte, short, ushort, int, uint, long, ulong, float and double.  For a consistent naming convention, much like 'c' is used to denote complex types, I think 'v' should be used to denote vector types.  For example, the types available on x86 will be:

64bits:
vfloat[2], vlong[1], vint[2], vshort[4], vbyte[8]

128bits:
vdouble[2], vfloat[4], vlong[2], vint[4], vshort[8], vbyte[16],
vulong[2], vuint[4], vushort[8], vubyte[16]

256bits:
vdouble[4], vfloat[8], vlong[4], vint[8], vshort[16], vbyte[32]


For portability, vectors should be defined with the following logic:

* vector size cannot be zero
* vector size must be a power of 2.
* if there is no hardware support for the vector type/size, then fall
back to static array type of same size.
* defining a vector without a size, ie: vint foo;  will default the
size to zero, which is an error.


That's all I can think of so far.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
January 06, 2012
On 6 January 2012 04:40, Iain Buclaw <ibuclaw@ubuntu.com> wrote:

> It would be a fine start to support a float128 type. :)
>
> With vectors, there are around 20 tied to the x86 architecture IIRC, whose base types cover the D equivalents of: byte, ubyte, short, ushort, int, uint, long, ulong, float and double.  For a consistent naming convention, much like 'c' is used to denote complex types, I think 'v' should be used to denote vector types.  For example, the types available on x86 will be:
>
> 64bits:
> vfloat[2], vlong[1], vint[2], vshort[4], vbyte[8]
>
> 128bits:
> vdouble[2], vfloat[4], vlong[2], vint[4], vshort[8], vbyte[16],
> vulong[2], vuint[4], vushort[8], vubyte[16]
>
> 256bits:
> vdouble[4], vfloat[8], vlong[4], vint[8], vshort[16], vbyte[32]
>
>
> For portability, vectors should be defined with the following logic:
>
> * vector size cannot be zero
> * vector size must be a power of 2.
> * if there is no hardware support for the vector type/size, then fall
> back to static array type of same size.
> * defining a vector without a size, ie: vint foo;  will default the
> size to zero, which is an error.
>
> That's all I can think of so far.


I'm confused, are you advocating using arrays of primitive types to express SIMD values again? Why are you using the square brackets in your example?

I think starting with float128 is wrong. SIMD registers aren't necessarily
floats. The default (language implemented) type should be typeless, and
allow libraries to build typed APIs on top of that.
The only thing the language needs to think about is a single typeless,
128bit, 16 byte aligned value.


January 06, 2012
Something about the way you are posting is breaking the threading of these message threads.

On 1/5/2012 6:40 PM, Iain Buclaw wrote:


January 06, 2012
I have to go to bed, so I'll leave these thoughts here... I apologise if I misunderstood your suggestion.

These comments lead me to suspect you're talking again about vectors as special cased arrays:

On 6 January 2012 04:40, Iain Buclaw <ibuclaw@ubuntu.com> wrote:

> * vector size cannot be zero
> * vector size must be a power of 2.
> * if there is no hardware support for the vector type/size, then fall
> back to static array type of same size.
> * defining a vector without a size ie: vint foo;  will default the
> size to zero, which will error.
>

The idea that you're talking about the possibility of arbitrary sizes suggests you must be proposing array syntax?

See, again, I think this is precisely what NOT to do.
There are so many problems with this it's not funny, not least of all that
a user who doesn't examine the disassembly would have no idea if their code
is actually working or not.

As I've said above, I think vectors as arrays is the worst idea possible:
  * How do I declare an ARRAY now that we've hijacked the syntax?
  * You can't index SIMD components! Exposing them with array syntax
encourages indexing, scalar access, usage in loops, etc. The API can't
permit that.
  * It's not clear that an array needs to be aligned any further than a
single component.
  * How do you do comparisons? What does your if statement look like? ... I
can tell you what people will do... if( v[0] > 0 && v[1] > 0 && v[1] < 10
&& v[2] > 0 ) ... This is not possible for the hardware, and shouldn't be
possible syntactically.
  * SIMD values can't interact with their scalar counterparts.. they live
in different register types, and there is no path between these registers
other than via memory... welcome to redundant memory access and LHS hazard
central.
  * You have absolutely no idea if the compiler is generating good code or
not... A more strict and formalised API will give confidence that it's
working correctly.

I promise you, as soon as you allow this, hidden costs of hitting memory
while transferring values back and forwards between SIMD and scalar
register types, and attempts at performing operations the hardware can not
actually perform will result in code that's slower than not using the SIMD
unit at all.
SIMD usage is an all or nothing thing, you can't sprinkle it in here and
there... the moment you start interacting with the scalar registers and
introducing redundant memory accesses, you've wasted more time in thrashing
the stack than you will gain from your failed attempt at optimisation.

SIMD needs to be implemented as a discrete type, which asserts its
concept as a unique type of register, just as primitive in its own right as
'int', and encourages proper use through the carefully crafted APIs provided.
What I described is the very best way to get the best performance and type
safety from a SIMD unit, and encourage users to use the hardware
correctly...
I promise... I've written so many SIMD vector libraries over the years now.

Here's a silly analogy...
You don't go casting back and forth between int and float liberally just
for the sake of it... You just don't do that. You know the cost, and other
issues involved and avoid it like the plague, only doing it in the
extremely rare case that you absolutely have to. And when you do, you do it
deliberately, and probably contemplate ways to avoid doing so before you
submit to it...
SIMD operations are no different, except the costs involved in
intercommunication are much bigger than int<->float. You need to have the
mentality that they're on separate planets. Exposing SIMD like it's an
array of some other primitive type completely goes against that sentiment,
and fosters the wrong idea of what SIMD math is to users. They'll try it
out, and wonder why it doesn't make their program any faster (probably
slower actually)...


January 06, 2012
On 1/5/2012 7:07 PM, Manu wrote:
> The only thing the language needs to think about is a single typeless, 128bit,
> 16 byte aligned value.

Currently, you can do this on 64 bit Linux:

  union V128
  {
    void[16] v;
    real dummy;
  }

as reals are aligned on 16 byte boundaries. I wonder how far that can be pushed.
January 06, 2012
On 6 January 2012 05:27, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/5/2012 7:07 PM, Manu wrote:
>
>> The only thing the language needs to think about is a single typeless,
>> 128bit,
>> 16 byte aligned value.
>>
>
> Currently, you can do this on 64 bit Linux:
>
>  union V128
>  {
>    void[16] v;
>    real dummy;
>  }
>
> as reals are aligned on 16 byte boundaries. I wonder how far that can be pushed.
>

You still haven't expressed the concept of the SIMD register anywhere in the language. The code gen needs to assign XMM regs, and schedule all appropriate loads/stores/etc...


January 06, 2012
On Fri, 06 Jan 2012 02:42:44 +0100, Manu <turkeyman@gmail.com> wrote:

> So I've been hassling about this for a while now, and Walter asked me to
> pitch an email detailing a minimal implementation with some initial
> thoughts.
>
> The first thing I'd like to say is that a lot of people seem to have this
> idea that float[4] should be specialised as a candidate for simd
> optimisations somehow. It's obviously been discussed, and this general
> opinion seems to be shared by a good few people here.
> I've had a whole bunch of rants why I think this is wrong in other threads,
> so I won't repeat them here... and that said, I'll attempt to detail an
> approach based on explicit vector types.
>
> So, what do we need...? A language defined primitive vector type... that's
> all.
>
>
> -- What shall we call it? --
>
> Doesn't really matter... open to suggestions.
> VisualC calls it __m128, XBox360 calls it __vector4, GCC calls it 'vector
> float' (a name I particularly hate, not specifying any size, and trying to
> associate it with a specific type)
>
> I like v128, or something like that. I'll use that for the sake of this
> document. I think it is preferable to float4 for a few reasons:
>  * v128 says what the register intends to be, a general purpose 128bit
> register that may be used for a variety of simd operations that aren't
> necessarily type bound.
>  * float4 implies it is a specific 4 component float type, which is not
> what the raw type should be.
>  * If we use names like float4, it stands to reason that (u)int4,
> (u)short8, etc should also exist, and it also stands to reason that one
> might expect math operators and such to be defined...
>
> I suggest initial language definition and implementation of something like
> v128, and then types like float4, (u)int4, etc, may be implemented in the
> std library with complex behaviour like casting mechanics, and basic math
> operators...
>
>
> -- Alignment --
>
> This type needs to be 16byte aligned. Unaligned loads/stores are very
> expensive, and also tend to produce extremely costly LHS hazards on most
> architectures when accessing vectors in arrays. If they are not aligned,
> they are useless... honestly.
>
Actually, unaligned loads/stores are free if you have a recent Core i5.
But then my processor has AVX support where loading/storing YMMs will
benefit from 32-byte alignment.
This will always be too system specific and volatile to make it a specialized type.

I also don't think that we can efficiently provide arbitrary alignment
for stack variables.
The performance penalty will kill your efforts.
Gcc doesn't do it either.

As a good alternative you should use a segmented stack (https://github.com/dsimcha/TempAlloc)
and adjust alignment to your needs.

Providing intrinsics should happen through library support.
Either through expression templates or with GPGPU in mind using
a DSL compiler for string mixins.

auto result = vectorize!q{
  auto v  = float4(a, b, c, d);
  auto v2 = float4(2 * a, 2.0, c - d, d + a);
  auto v3 = v * v2;
  auto v4 = __hadd(v3, v3);
  auto v5 = __hadd(v4, v4);
  return v5[0];
}(0.2, 0.2, 0.3, 0.4);

> ** Does this cause problems with class allocation? Are/can classes be
> allocated to an alignment as inherited from an aligned member? ... If not,
> this might be the bulk of the work.
>
> There is one other problem I know of that is only of concern on x86.
> In the C ABI, passing 16byte ALIGNED vectors by value is a problem,
> since x86 ALWAYS uses the stack to pass arguments, and has no way to align
> the stack.
> I wonder if D can get creative with its ABI here, passing vectors in
> registers, even though that's not conventional on x86... the C ABI was
> invented long before these hardware features.
> In lieu of that, x86 would (sadly) need to silently pass by const ref...
> and also do this in the case of register overflow.
>
> Every other architecture (including x64) is fine, since all other
> architectures pass in regs, and can align the stack as needed when
> overflowing the regs (since stack management is manual and not performed
> with special opcodes).
>
>
> -- What does this type do? --
>
> The primitive v128 type DOES nothing... it is a type that facilitates the
> compiler allocating SIMD registers, managing assignments, loads, and
> stores, and allow passing to/from functions BY VALUE in registers.
> Ie, the only valid operations would be:
>   v128 myVec = someStruct.vecMember; // and vice versa...
>   v128 result = someFunc(myVec); // and calling functions, passing by value.
>
> Nice bonus: This alone is enough to allow implementation of fast memcpy
> functions that copy 16 bytes at a time... ;)
>
>
> -- So, it does nothing... so what good is it? --
>
> Initially you could use this type in conjunction with inline asm, or
> architecture intrinsics to do useful stuff. This would be using the
> hardware totally raw, which is an important feature to have, but I imagine
> most of the good stuff would come from libraries built on top of this.
>
>
> -- Literal assignment --
>
> This is a hairy one. Endian issues appear in 2 layers here...
> Firstly, if you consider the vector to be 4 int's, the ints themselves may
> be little or big endian, but in addition, the outer layer (ie. the order of
> x,y,z,w) may also be in reverse order on some architectures... This makes a
> single 128bit hex literal hard to apply.
> I'll have a dig and try and confirm this, but I have a suspicion that VMX
> defines its components reverse to other architectures... (Note: not usually
> a problem in C, because vector code is sooo non-standard in C that this is
> ALWAYS ifdef-ed for each platform anyway, and the literal syntax and order
> can suit)
>
> For the primitive v128 type, I generally like the idea of using a huge
> 128bit hex literal.
>   v128 vec = 0x01234567_01234567_01234567_01234567; // yeah!! ;)
>
> Since the primitive v128 type is effectively typeless, it makes no sense to
> use syntax like this:
>   v128 myVec = { 1.0f, 2.0f, 3.0f, 4.0f }; // syntax like this should be
> reserved for use with a float4 type defined in a library somewhere.
>
> ... The problem is, this may not be linearly applicable to all hardware. If
> the order of the components match the endian, then it is fine...
> I suspect VMX orders the components reverse to match the fact the values
> are big endian, which would be good, but I need to check. And if not...
> then literals may need to get a lot more complicated :)
>
> Assignment of literals to the primitive type IS actually important, it's
> common to generate bit masks in these registers which are type-independent.
> I also guess libraries still need to leverage this primitive assignment
> functionality to assign their more complex literal expressions.
>
>
> -- Libraries --
>
> With this type, we can write some useful standard libraries. For a start,
> we can consider adding float4, int4, etc, and make them more intelligent...
> they would have basic maths operators defined, and probably implement type
> conversion when casting between types.
>
>   int4 intVec = floatVec; // perform a type conversion from float to int..
> or vice versa... (perhaps we make this require an explicit cast?)
>
>   v128 vec = floatVec; // implicit cast to the raw type always possible,
> and does no type casting, just a reinterpret
>   int4 intVec = vec; // conversely, the primitive type would implicitly
> assign to other types.
>   int4  intVec = (v128)floatVec; // piping through the primitive v128
> allows to easily perform a reinterpret between vector types, rather than
> the usual type conversion.
>
> There are also a truckload of other operations that would be fleshed out.
> For instance, strongly typed literal assignment, vector comparisons that
> can be used with if() (usually these allow you to test if ALL components,
> or if ANY components meet a given condition). Conventional logic operators
> can't be neatly applied to vectors. You need to do something like this:
>   if(std.simd.allGreater(v1, v2) && std.simd.anyLessOrEqual(v1, v3)) ...
>
> We can discuss the libraries at a later date, but it's possible that you
> might also want to make some advanced functions in the library that are
> only supported on particular architectures, std.simd.sse...,
> std.simd.vmx..., etc. which may be version()-ed.
>
>
> -- Exceptions, flags, and error conditions --
>
> SIMD units usually have their own control register for controlling various
> behaviours, most importantly NaN policy and exception semantics...
> I'm open to input here... what should be default behaviour?
> I'll bet the D community opt for strict NaNs, and throw by default... but
> it is actually VERY common to disable hardware exceptions when working with
> SIMD code:
>   * often precision is less important than speed when using SIMD, and some
> SIMD units perform faster when these features are disabled.
>   * most SIMD algorithms (at least in performance oriented code) are
> designed to tolerate '0,0,0,0' as the result of a divide by zero, or some
> other error condition.
>   * realtime physics tends to suffer error creep and freaky random
> explosions, and you can't have those crashing the program :) .. they're not
> really 'errors', they're expected behaviour, often producing 0,0,0,0 as a
> result, so they're easy to deal with.
>
> I presume it'll end up being NaNs and throw by default, but we do need some
> mechanism to change the SIMD unit flags for realtime use... A runtime
> function? Perhaps a compiler switch (C does this sort of thing a lot)?
>
> It's also worth noting that there are numerous SIMD units out there that
> DON'T follow strict ieee float rules, and don't support NaNs or hardware
> exceptions at all... others may simply set a divide-by-zero flag, but not
> actually trigger a hardware exception, requiring you to explicitly check
> the flag if you're interested.
> Will it be okay that the language's default behaviour of NaNs and throws is
> unsupported on such platforms? What are the implications of this?
>
>
> -- Future --
>
> AVX now exists, this is a 256 bit SIMD architecture. We simply add a v256
> type, everything else is precisely the same.
> I think this is perfectly reasonable... AVX is to SSE exactly as long is to
> int, or double is to float. They are different types with different
> register allocation and addressing semantics, and deserve a discrete type.
> As with v128, libraries may then be created to allow the types to interact.
>
> I know of 2 architectures that support 512bit (4x4 matrix) registers...
> same story; implement a primitive type, then using intrinsics, we can build
> interesting types in libraries.
>
> We may also consider a v64 type, which would map to older MMX registers on
> x86... there are also other architectures with 64bit 'vector' registers
> (nintendo wii for one), supporting a pair of floats, or 4 shorts, etc...
> Same general concept, but only 64 bits wide.
>
>
> -- Conclusion --
>
> I think that's about it for a start. I don't think it's particularly a lot
> of work, the potential trouble points are 16byte alignment, and literal
> expression. Potential issues relating to language guarantees of
> exception/error conditions...
> Go on, tear it apart!
>
> Discuss...
January 06, 2012
On 1/5/2012 7:32 PM, Manu wrote:
> You still haven't expressed the concept of the SIMD register anywhere in the
> language. The code gen needs to assign XMM regs, and schedule all appropriate
> loads/stores/etc...

I understand that, I was looking at proof of concept in making it a library type. I agree it wouldn't be very efficient.
January 06, 2012
On 1/5/2012 7:21 PM, Manu wrote:
> I have to go to bed, so I'll leave these thoughts here...

Ok, you convinced me. A fundamental type on the CPU ought to be a fundamental type in the language.
January 06, 2012
On Friday, 6 January 2012 at 03:17:55 UTC, Walter Bright wrote:
> Something about the way you are posting is breaking the threading of these message threads.

This is because the mailing list gateway assigns new Message IDs to posts forwarded from the mailing list to the newsgroup. Mailing list users don't see these IDs, so a post from a mailing list user replying to a post by another mailing list user will never appear as a "reply" to newsgroup users.

I have an idea of how to counter this in DFeed (subscribing DFeed to mailing lists, in addition to polling the newsgroup server, and saving both Message IDs for each post), but it'll still be broken for everything else tied to the newsgroup server.