January 06, 2012 SIMD support...
So I've been hassling about this for a while now, and Walter asked me to pitch an email detailing a minimal implementation with some initial thoughts.

The first thing I'd like to say is that a lot of people seem to have this idea that float[4] should be specialised as a candidate for simd optimisations somehow. It's obviously been discussed, and this general opinion seems to be shared by a good few people here. I've had a whole bunch of rants why I think this is wrong in other threads, so I won't repeat them here... and that said, I'll attempt to detail an approach based on explicit vector types.

So, what do we need...? A language defined primitive vector type... that's all.

-- What shall we call it? --

Doesn't really matter... open to suggestions. VisualC calls it __m128, XBox360 calls it __vector4, GCC calls it 'vector float' (a name I particularly hate, not specifying any size, and trying to associate it with a specific type).

I like v128, or something like that. I'll use that for the sake of this document. I think it is preferable to float4 for a few reasons:
 * v128 says what the register intends to be, a general purpose 128bit register that may be used for a variety of simd operations that aren't necessarily type bound.
 * float4 implies it is a specific 4 component float type, which is not what the raw type should be.
 * If we use names like float4, it stands to reason that (u)int4, (u)short8, etc should also exist, and it also stands to reason that one might expect math operators and such to be defined...

I suggest initial language definition and implementation of something like v128, and then types like float4, (u)int4, etc, may be implemented in the std library with complex behaviour like casting mechanics, and basic math operators...

-- Alignment --

This type needs to be 16byte aligned. Unaligned loads/stores are very expensive, and also tend to produce extremely costly LHS hazards on most architectures when accessing vectors in arrays.
If they are not aligned, they are useless... honestly.

** Does this cause problems with class allocation? Are/can classes be allocated to an alignment as inherited from an aligned member? ... If not, this might be the bulk of the work.

There is one other problem I know of that is only of concern on x86. In the C ABI, passing 16byte ALIGNED vectors by value is a problem, since x86 ALWAYS uses the stack to pass arguments, and has no way to align the stack. I wonder if D can get creative with its ABI here, passing vectors in registers, even though that's not conventional on x86... the C ABI was invented long before these hardware features. In lieu of that, x86 would (sadly) need to silently pass by const ref... and also do this in the case of register overflow. Every other architecture (including x64) is fine, since all other architectures pass in regs, and can align the stack as needed when overflowing the regs (since stack management is manual and not performed with special opcodes).

-- What does this type do? --

The primitive v128 type DOES nothing... it is a type that facilitates the compiler allocating SIMD registers, managing assignments, loads, and stores, and allows passing to/from functions BY VALUE in registers. Ie, the only valid operations would be:

v128 myVec = someStruct.vecMember; // and vice versa...
v128 result = someFunc(myVec); // and calling functions, passing by value.

Nice bonus: This alone is enough to allow implementation of fast memcpy functions that copy 16 bytes at a time... ;)

-- So, it does nothing... so what good is it? --

Initially you could use this type in conjunction with inline asm, or architecture intrinsics to do useful stuff. This would be using the hardware totally raw, which is an important feature to have, but I imagine most of the good stuff would come from libraries built on top of this.

-- Literal assignment --

This is a hairy one. Endian issues appear in 2 layers here...
Firstly, if you consider the vector to be 4 int's, the ints themselves may be little or big endian, but in addition, the outer layer (ie. the order of x,y,z,w) may also be in reverse order on some architectures... This makes a single 128bit hex literal hard to apply. I'll have a dig and try and confirm this, but I have a suspicion that VMX defines its components reverse to other architectures...

(Note: not usually a problem in C, because vector code is sooo non-standard in C that this is ALWAYS ifdef-ed for each platform anyway, and the literal syntax and order can suit)

For the primitive v128 type, I generally like the idea of using a huge 128bit hex literal:

v128 vec = 0x01234567_01234567_01234567_01234567; // yeah!! ;)

Since the primitive v128 type is effectively typeless, it makes no sense to use syntax like this:

v128 myVec = { 1.0f, 2.0f, 3.0f, 4.0f }; // syntax like this should be reserved for use with a float4 type defined in a library somewhere.

... The problem is, this may not be linearly applicable to all hardware. If the order of the components match the endian, then it is fine... I suspect VMX orders the components reverse to match the fact the values are big endian, which would be good, but I need to check. And if not... then literals may need to get a lot more complicated :)

Assignment of literals to the primitive type IS actually important; it's common to generate bit masks in these registers which are type-independent. I also guess libraries still need to leverage this primitive assignment functionality to assign their more complex literal expressions.

-- Libraries --

With this type, we can write some useful standard libraries. For a start, we can consider adding float4, int4, etc, and make them more intelligent... they would have basic maths operators defined, and probably implement type conversion when casting between types:

int4 intVec = floatVec; // perform a type conversion from float to int.. or vice versa...
(perhaps we make this require an explicit cast?)

v128 vec = floatVec; // implicit cast to the raw type always possible, and does no type casting, just a reinterpret
int4 intVec = vec; // conversely, the primitive type would implicitly assign to other types.
int4 intVec = (v128)floatVec; // piping through the primitive v128 allows you to easily perform a reinterpret between vector types, rather than the usual type conversion.

There are also a truckload of other operations that would be fleshed out. For instance, strongly typed literal assignment, and vector comparisons that can be used with if() (usually these allow you to test if ALL components, or if ANY components, meet a given condition). Conventional logic operators can't be neatly applied to vectors. You need to do something like this:

if(std.simd.allGreater(v1, v2) && std.simd.anyLessOrEqual(v1, v3)) ...

We can discuss the libraries at a later date, but it's possible that you might also want to make some advanced functions in the library that are only supported on particular architectures, std.simd.sse..., std.simd.vmx..., etc, which may be version()-ed.

-- Exceptions, flags, and error conditions --

SIMD units usually have their own control register for controlling various behaviours, most importantly NaN policy and exception semantics... I'm open to input here... what should be default behaviour? I'll bet the D community opt for strict NaNs, and throw by default... but it is actually VERY common to disable hardware exceptions when working with SIMD code:
 * often precision is less important than speed when using SIMD, and some SIMD units perform faster when these features are disabled.
 * most SIMD algorithms (at least in performance oriented code) are designed to tolerate '0,0,0,0' as the result of a divide by zero, or some other error condition.
 * realtime physics tends to suffer error creep and freaky random explosions, and you can't have those crashing the program :) ..
they're not really 'errors', they're expected behaviour, often producing 0,0,0,0 as a result, so they're easy to deal with.

I presume it'll end up being NaNs and throw by default, but we do need some mechanism to change the SIMD unit flags for realtime use... A runtime function? Perhaps a compiler switch (C does this sort of thing a lot)?

It's also worth noting that there are numerous SIMD units out there that DON'T follow strict ieee float rules, and don't support NaNs or hardware exceptions at all... others may simply set a divide-by-zero flag, but not actually trigger a hardware exception, requiring you to explicitly check the flag if you're interested. Will it be okay that the language's default behaviour of NaNs and throws is unsupported on such platforms? What are the implications of this?

-- Future --

AVX now exists; this is a 256-bit SIMD architecture. We simply add a v256 type, everything else is precisely the same. I think this is perfectly reasonable... AVX is to SSE exactly as long is to int, or double is to float. They are different types with different register allocation and addressing semantics, and deserve a discrete type. As with v128, libraries may then be created to allow the types to interact.

I know of 2 architectures that support 512-bit (4x4 matrix) registers... same story; implement a primitive type, then using intrinsics, we can build interesting types in libraries.

We may also consider a v64 type, which would map to older MMX registers on x86... there are also other architectures with 64-bit 'vector' registers (nintendo wii for one), supporting a pair of floats, or 4 shorts, etc... Same general concept, but only 64 bits wide.

-- Conclusion --

I think that's about it for a start. I don't think it's particularly a lot of work; the potential trouble points are 16byte alignment, and literal expression. Potential issues relating to language guarantees of exception/error conditions...

Go on, tear it apart! Discuss...
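[Editor's note] The "fast memcpy" bonus mentioned in the post can be sketched in portable C. The chunk16 struct below is a hypothetical stand-in for the proposed v128 type; a real implementation would use the compiler's builtin vector type or hardware load/store intrinsics, but the shape of the loop is the same:

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16-byte chunk standing in for the raw 128-bit register type. */
typedef struct { alignas(16) uint64_t q[2]; } chunk16;

/* Sketch of a memcpy that moves 16 bytes per step through vector-sized
   loads/stores, with a plain byte copy for the tail. Assumes both
   pointers honour the 16-byte alignment contract of the vector type. */
static void *memcpy16(void *dst, const void *src, size_t n) {
    chunk16 *d = dst;
    const chunk16 *s = src;
    size_t whole = n / 16;
    for (size_t i = 0; i < whole; i++)
        d[i] = s[i];                        /* one 16-byte move each */
    memcpy((uint8_t *)dst + whole * 16,     /* remaining 0..15 bytes */
           (const uint8_t *)src + whole * 16, n % 16);
    return dst;
}

/* Round-trip check doubling as a usage example. */
static int memcpy16_selftest(void) {
    alignas(16) uint8_t src[37], dst[37];
    for (int i = 0; i < 37; i++) { src[i] = (uint8_t)i; dst[i] = 0; }
    memcpy16(dst, src, sizeof src);
    return memcmp(dst, src, sizeof src) == 0;
}
```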
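[Editor's note] On the exceptions question: with hardware exceptions masked (the usual IEEE default that the post argues for in realtime code), a divide by zero quietly produces an infinity and execution continues, rather than trapping. A one-lane scalar sketch of that "errors are values, not traps" behaviour:

```c
#include <math.h>

/* A scalar model of masked divide-by-zero: the division yields an
   infinity (on some non-IEEE SIMD units, often 0,0,0,0 instead) and
   nothing throws -- the caller inspects the result if it cares. */
static float divide_lane(float a, float b) {
    volatile float d = b;   /* block constant folding of the division */
    return a / d;
}
```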
January 06, 2012 Re: SIMD support...
Posted in reply to Manu
On 1/5/2012 5:42 PM, Manu wrote:
> The first thing I'd like to say is that a lot of people seem to have this idea
> that float[4] should be specialised as a candidate for simd optimisations
> somehow. It's obviously been discussed, and this general opinion seems to be
> shared by a good few people here.
> I've had a whole bunch of rants why I think this is wrong in other threads, so I
> won't repeat them here...
If you could cut&paste them here, I would find it most helpful. I have some ideas on making that work, but I need to know everything wrong with it first.
January 06, 2012 Re: SIMD support...
Posted in reply to Manu
On 1/5/2012 5:42 PM, Manu wrote:
> So I've been hassling about this for a while now, and Walter asked me to pitch
> an email detailing a minimal implementation with some initial thoughts.
Another question:
Is this worth doing for 32 bit code? Or is anyone doing this doing it for 64 bit only?
The reason I ask is because 64 bit is 16 byte aligned, but aligning the stack in 32 bit code is inefficient for everything else.
January 06, 2012 Re: SIMD support...
Posted in reply to Manu
On 1/5/2012 5:42 PM, Manu wrote:
> -- Alignment --
>
> This type needs to be 16byte aligned. Unaligned loads/stores are very expensive,
> and also tend to produce extremely costly LHS hazards on most architectures when
> accessing vectors in arrays. If they are not aligned, they are useless... honestly.
>
> ** Does this cause problems with class allocation? Are/can classes be allocated
> to an alignment as inherited from an aligned member? ... If not, this might be
> the bulk of the work.
The only real issue with alignment is getting the stack aligned to 16 bytes. This is already true of 64 bit code gen, and 32 bit code gen for OS X.
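[Editor's note] The alignment contract under discussion is expressible today in C11, which may be a useful reference point. The union below is a rough model of the proposed v128 type, not a real D declaration: 16 typeless bytes, 16-byte aligned, reinterpretable as any lane layout. In D this would be a compiler builtin that lives in SIMD registers; a union only captures its size, alignment, and typelessness:

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical model of the raw v128 type: typeless 16-byte storage,
   16-byte aligned, viewable as any of the lane layouts the post lists. */
typedef union v128 {
    alignas(16) uint8_t bytes[16];
    uint32_t u32[4];
    uint64_t u64[2];
    float    f32[4];
    double   f64[2];
} v128;
```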
January 06, 2012 Re: SIMD support...
Posted in reply to Walter Bright
On 6 January 2012 04:12, Walter Bright <newshound2@digitalmars.com> wrote:

> If you could cut&paste them here, I would find it most helpful. I have some ideas on making that work, but I need to know everything wrong with it first.

On 5 January 2012 11:02, Manu <turkeyman@gmail.com> wrote:

> On 5 January 2012 02:42, bearophile <bearophileHUGS@lycos.com> wrote:
>
>> Think about future CPU evolution with SIMD registers 128, then 256, then 512, then 1024 bits long. In theory a good compiler is able to use them with no changes in the D code that uses vector operations.
>
> These are all fundamentally different types, like int and long... float and double... and I certainly want a keyword to identify each of them. Even if the compiler is trying to make auto vector optimisations, you can't deny programmers explicit control of the hardware when they want/need it.
> Look at x86 compilers, which have been TRYING to perform automatic SSE optimisations for 10 years, with basically no success... do you really think you can do better than all that work by Microsoft and GCC?
> In my experience, I've even run into a lot of VC's auto-SSE-ed code that is SLOWER than the original float code.
> Let's not even mention architectures that receive much less love than x86, and are arguably more important (ARM; slower, simpler processors with more demand to perform well, and not waste power).
> ... Vector ops and SIMD ops are different things. float[4] (or more realistically, float[3]) should NOT be a candidate for automatic SIMD implementation; likewise, simd_type should not have its components individually accessible. These are operations the hardware can not actually perform. So no syntax to worry about, just a type.
>
>> I think the good Hara will be able to implement those syntax fixes in a matter of just one day or very few days if a consensus is reached about what actually is to be fixed in D vector ops syntax.
>
>> Instead of discussing about *adding* something (register intrinsics) I suggest to discuss about what to fix about the *already present* vector op syntax. This is not a request to just you Manu, but to this whole newsgroup.
>
> And I think this is exactly the wrong approach. A vector is NOT an array of 4 (actually, usually 3) floats. It should not appear as one. This is an overly complicated and ultimately wrong way to engage this hardware. Imagine the complexity in the compiler to try and force float[4] operations into vector arithmetic vs adding a 'v128' type which actually does what people want anyway... What about when you actually WANT a float[4] array, and NOT a vector?
>
> SIMD units are not float units; they should not appear like an aggregation of float units. They have:
>  * Different error semantics, exception handling rules, sometimes different precision...
>  * Special alignment rules.
>  * Special literal expression/assignment.
>  * You can NOT access individual components at will.
>  * May be reinterpreted at any time as float[1], float[4], double[2], short[8], char[16], etc... (up to the architecture intrinsics)
>  * Can not be involved in conventional comparison logic (an array of floats would make you think they could)
>  *** Can NOT interact with the regular 'float' unit... Vectors as an array of floats certainly suggests that you can interact with scalar floats...
>
> I will use architecture intrinsics to operate on these regs, and put that nice and neatly behind a hardware vector type with version()'s for each architecture, and an API with a whole lot of sugar to make them nice and friendly to use.
>
> My argument is that even IF the compiler some day attempts to make vector optimisations to float[4] arrays, the raw hardware should be exposed first, and allow programmers to use it directly. This starts with a language defined (platform independent) v128 type.

... Other rants have been on IRC.
January 06, 2012 Re: SIMD support...
Posted in reply to Walter Bright
On 6 January 2012 04:16, Walter Bright <newshound2@digitalmars.com> wrote:
> On 1/5/2012 5:42 PM, Manu wrote:
>
>> -- Alignment --
>>
>> This type needs to be 16byte aligned. Unaligned loads/stores are very
>> expensive,
>> and also tend to produce extremely costly LHS hazards on most
>> architectures when
>> accessing vectors in arrays. If they are not aligned, they are useless...
>> honestly.
>>
>> ** Does this cause problems with class allocation? Are/can classes be
>> allocated
>> to an alignment as inherited from an aligned member? ... If not, this
>> might be
>> the bulk of the work.
>>
>
> The only real issue with alignment is getting the stack aligned to 16 bytes. This is already true of 64 bit code gen, and 32 bit code gen for OS X.
>
It's important for all implementations of simd units, x32, x64, and others. As said, if aligning the x32 stack is too much trouble, I suggest silently passing by const ref on x86.
Are you talking about for parameter passing, or for local variable
assignment on the stack?
For parameter passing, I understand the x32 problems with aligning the
arguments (I think it's possible to work around though), but there should
be no problem with aligning the stack for allocating local variables.
January 06, 2012 Re: SIMD support...
Posted in reply to Walter Bright
>
> The reason I ask is because 64 bit is 16 byte aligned, but aligning the stack in 32 bit code is inefficient for everything else.
>
Note: you only need to align the stack when a vector is actually stored on it by value. Probably very rare, more rare than you think.
January 06, 2012 Re: SIMD support...
Posted in reply to Manu
On 1/5/2012 6:25 PM, Manu wrote:
> Are you talking about for parameter passing, or for local variable assignment on
> the stack?
> For parameter passing, I understand the x32 problems with aligning the arguments
> (I think it's possible to work around though), but there should be no problem
> with aligning the stack for allocating local variables.
Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.
January 06, 2012 Re: SIMD support...
Posted in reply to Walter Bright
On 6 January 2012 05:22, Walter Bright <newshound2@digitalmars.com> wrote:
> On 1/5/2012 6:25 PM, Manu wrote:
>
>> Are you talking about for parameter passing, or for local variable
>> assignment on
>> the stack?
>> For parameter passing, I understand the x32 problems with aligning the
>> arguments
>> (I think it's possible to work around though), but there should be no
>> problem
>> with aligning the stack for allocating local variables.
>>
>
> Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.
>
Perhaps I misunderstand, I can't see the problem?
In the function preamble, you just align it... something like:
mov reg, esp ; take a backup of the stack pointer
and esp, -16 ; align it
... function
mov esp, reg ; restore the stack pointer
ret 0
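[Editor's note] For comparison with the hand-written prologue above: compilers that support over-aligned locals emit exactly this kind of "and esp, -16" (or rely on the 64-bit ABI's existing 16-byte stack guarantee) whenever a local demands 16-byte alignment. A small C check, with a hypothetical helper name, sketches the observable result:

```c
#include <stdalign.h>
#include <stdint.h>

/* The compiler must place this over-aligned local on a 16-byte
   boundary, aligning the stack frame as needed -- the same effect as
   the hand-written "and esp, -16" prologue. */
static int local_vector_slot_is_aligned(void) {
    alignas(16) unsigned char slot[16];   /* stand-in for a v128 local */
    return ((uintptr_t)slot % 16) == 0;
}
```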
January 06, 2012 Re: SIMD support...
On 6 January 2012 05:42, Manu <turkeyman@gmail.com> wrote:
> On 6 January 2012 05:22, Walter Bright <newshound2@digitalmars.com> wrote:
>
>> On 1/5/2012 6:25 PM, Manu wrote:
>>
>>> Are you talking about for parameter passing, or for local variable
>>> assignment on
>>> the stack?
>>> For parameter passing, I understand the x32 problems with aligning the
>>> arguments
>>> (I think it's possible to work around though), but there should be no
>>> problem
>>> with aligning the stack for allocating local variables.
>>>
>>
>> Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.
>>
>
> Perhaps I misunderstand, I can't see the problem?
> In the function preamble, you just align it... something like:
> mov reg, esp ; take a backup of the stack pointer
> and esp, -16 ; align it
>
> ... function
>
> mov esp, reg ; restore the stack pointer
> ret 0
>
That said, most of the time values used in smaller functions will only ever exist in regs, and won't ever be written to the stack... in this case, there's no problem.
Copyright © 1999-2021 by the D Language Foundation