January 29, 2007
Bill Baxter wrote:
> "Most CPUs today have *some* kind of SSE/Altivec type thing"
> 
> That may be, but I've heard that at least SSE is really not that suited to many calculations -- especially ones in graphics.  Something like you have to pack your data so that all the x components are together, and all y components together, and all z components together.  Rather than the way everyone normally stores these things as xyz, xyz.  Maybe Altivec, SSE2 and SSE3 fix that though.  At any rate I think maybe Intel's finally getting tired of being laughed at for their graphics performance so things are probably changing.
> 
> 

I have never heard of any SIMD architecture where vectors work that way.  On SSE, Altivec, or MMX, the components of a vector are always stored in contiguous memory.

In terms of graphics, this is pretty much optimal.  Most manipulations on vectors like rotations, normalization, cross product etc. require access to all components simultaneously.  I honestly don't know why you would want to split each of them into separate buffers...

Surely it is simpler to do something like this:

x y z w x y z w x y z w ...

vs.

x x x x ... y y y y ... z z z z ... w w w ...
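
In code the difference is just this (a quick sketch, with made-up names):

struct Vec4 { float x, y, z, w; }  // interleaved: one struct per vector
Vec4[] batch;                      // x y z w x y z w ...

struct PlanarBatch                 // planar: one buffer per component
{
	float[] x, y, z, w;        // x x x x ... y y y y ...
}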


> "Library vs Core"
> 
> I think there's really not much that you can ask from the core.  A small vector of 4 numbers can represent any number of things.  So I think your best hope for the core is to support some very basic operations on small vectors -- like component-wise +,-,*,/, and dot product -- to optimize those kind of expressions as best as possible, and leave everything else to libraries.  I guess that's pretty much how it works with HW shader languages.  Except they add swizzles to the set of primitive ops.
> 
> 

Yes, I think this is probably the best course of action.  Because vectors require such careful compiler integration, they must be placed in the core language.  On the other hand, most products (dot, cross, perp, outer) should live in a library, simply because there are too many types of products and operators for a language to reasonably support them all.  At any rate, once you have the basics you can quickly build up the others.  Here is an example of a cross product:

float3 cross(float3 a, float3 b)
{
	// expands to (a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x)
	return a.yzx * b.zxy - a.zxy * b.yzx;
}

Implementing most products and vector operations is easy once you have a simple component-wise vector library.
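
For instance, a dot product in the same hypothetical notation (assuming only component-wise * and component access):

float dot(float3 a, float3 b)
{
	float3 t = a * b;        // component-wise multiply
	return t.x + t.y + t.z;  // horizontal sum
}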


> "Getting it in the standard library"
> 
> I agree, though, that lo-D math is common enough that it should be included in a standard library.  I wonder if the Tango developers would be willing to include a vector math class...or if they already have one in there.
> 

It may eventually get in.  However, it would be far better if the language simply supported vectors in the core spec.


-Mik
January 29, 2007
Joel C. Salomon wrote:
> As I understand it, D’s inline assembler would be the tool to use for this in a library implementation.  I don’t think the complex types use SIMD, so the vectors can be the only things using those registers.
> 
>

I can tell you right now that this won't work.  I have tried using the inline assembler with a vector class, and the speedup was barely noticeable.  You can see the results here:  http://assertfalse.com

Here are just a few of the things that become a problem for a library implementation:

1. Function calls

	Inline assembler can not be inlined.  Period.  The compiler has to treat inline assembler as a sort of black box, which takes inputs one way and returns results another.  It can not poke around in there and change your hand-tuned opcodes in order to pass arguments more efficiently.  Nor can it change the way you allocate registers so you don't accidentally trash the local frame.  It can't manipulate where you put the result, such that it can be used immediately by the next block of code.  Therefore any asm vector class will have a lot of wasteful function calls which quickly add up:


a = b + c * d;

becomes:

a = b.opAdd(c.opMul(d));


2. Register allocation

	This point is related to 1.  Most SIMD architectures have many registers, and a good compiler can easily use that to optimize stuff like parameter passing and function returns.  This is totally impossible for a library to do, since it has no knowledge of the contents of any registers as it executes.

3. Data alignment

	This is a big problem for libraries.  Most vector architectures require properly aligned data.  D only provides facilities for aligning members within a struct, not for enforcing any kind of global alignment.  To fix this in D, we will need the compiler's help.  This will allow us to pack vectors in a function such that they are properly aligned within each local call frame.
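
For illustration, this is roughly all D gives us today (a sketch only; the comments mark exactly the gap described above):

struct Vec4
{
	align(16) float[4] v;  // aligns the member within the struct layout
}

void f()
{
	Vec4 a;  // but nothing guarantees this call frame puts 'a' on a
	         // 16-byte boundary, so an aligned movaps may still fault
}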

-Mik
January 29, 2007
Mikola Lysenko wrote:
> Joel C. Salomon wrote:
>> As I understand it, D’s inline assembler would be the tool to use for this in a library implementation.  I don’t think the complex types use SIMD, so the vectors can be the only things using those registers.
>>
>>
> 
> I can tell you right now that this won't work.  I have tried using the inline assembler with a vector class, and the speedup was barely noticeable.  You can see the results here:  http://assertfalse.com
> 
> Here are just a few of the things that become a problem for a library implementation:
> 
> 1. Function calls
> 
>     Inline assembler can not be inlined.  Period.  The compiler has to treat inline assembler as a sort of black box, which takes inputs one way and returns results another.  It can not poke around in there and change your hand-tuned opcodes in order to pass arguments more efficiently.  Nor can it change the way you allocate registers so you don't accidentally trash the local frame.  It can't manipulate where you put the result, such that it can be used immediately by the next block of code.  Therefore any asm vector class will have a lot of wasteful function calls which quickly add up:
> 
> 
> a = b + c * d;
> 
> becomes:
> 
> a = b.opAdd(c.opMul(d));
> 
> 
> 2. Register allocation
> 
>     This point is related to 1.  Most SIMD architectures have many registers, and a good compiler can easily use that to optimize stuff like parameter passing and function returns.  This is totally impossible for a library to do, since it has no knowledge of the contents of any registers as it executes.

Can GCC-like extended assembler (recently implemented in GDC: http://dgcc.sourceforge.net/gdc/manual.html) help for these first two points?
It allows you to let the compiler allocate registers. That should fix point two.
You can also tell the compiler where to put variables and where you're going to put any results. That means your asm doesn't necessarily need to access memory to do anything useful. If the compiler sees that it doesn't, inlining the function should be possible, I think.
It won't fix all asm, of course, but it might make it possible to write inlinable asm.
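
Going by the manual, it would look something like this (untested, and I may well have the constraint syntax slightly wrong):

uint invert(uint v)
{
	uint result;
	// "=r" lets the compiler pick any general register; "0" ties the
	// input to the same register as output operand 0
	asm { "notl %[res]" : [res] "=r" (result) : "0" (v); }
	return result;
}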

I do think it needs different syntax. Strings? AT&T asm syntax? Bah. But the idea itself is a good one, I think.
January 29, 2007
Frits van Bommel wrote:
> Can GCC-like extended assembler (recently implemented in GDC: http://dgcc.sourceforge.net/gdc/manual.html) help for these first two points?
> It allows you to let the compiler allocate registers. That should fix point two.
> You can also tell the compiler where to put variables and where you're going to put any results. That means your asm doesn't necessarily need to access memory to do anything useful. If the compiler sees that it doesn't, inlining the function should be possible, I think.
> It won't fix all asm, of course, but it might make it possible to write inlinable asm.
> 


For regular functions, I completely agree.  The system should work at least as well as compiler inlining (for certain cases).

Unfortunately vector assembler is not the same.  At the very minimum the implementation needs to be aware of the vector registers on the system, and it needs to be able to pass parameters/returns in those registers. Otherwise, you still have to use costly instructions like movups, and you still pay the same basic bookkeeping costs.
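
For example, even a trivial vector add ends up with this pattern today (a DMD-style inline asm sketch; a, b and r point to arrays of four floats):

void add4(float* a, float* b, float* r)
{
	asm
	{
		mov EAX, a;
		mov ECX, b;
		movups XMM0, [EAX];  // unaligned loads, since no parameters
		movups XMM1, [ECX];  //   arrive in XMM registers
		addps XMM0, XMM1;    // the one instruction we actually wanted
		mov EDX, r;
		movups [EDX], XMM0;  // and straight back out to memory
	}
}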

It just seems like it would be simpler to make small vectors primitive types and be done with it.

-Mik
January 29, 2007
Mikola Lysenko wrote:
> It just seems like it would be simpler to make small vectors primitive types and be done with it.

If you could write up a short proposal on exactly what operations should be supported, alignment, etc., that would be most helpful.
January 29, 2007
Reiner Pope wrote:

> struct MyVector
> {
>     template opFakeMember(char[] ident)

I wouldn't call it "opFakeMember", since those members are not really fakes.  How about "opVirt[ual]Member" or "opAttr[ibute]"?

Wolfgang Draxinger
-- 
E-Mail address works, Jabber: hexarith@jabber.org, ICQ: 134682867

January 29, 2007
Mikola Lysenko wrote:
> Bill Baxter wrote:
>> "Most CPUs today have *some* kind of SSE/Altivec type thing"
>>
>> That may be, but I've heard that at least SSE is really not that suited to many calculations -- especially ones in graphics.  Something like you have to pack your data so that all the x components are together, and all y components together, and all z components together.  Rather than the way everyone normally stores these things as xyz, xyz.  Maybe Altivec, SSE2 and SSE3 fix that though.  At any rate I think maybe Intel's finally getting tired of being laughed at for their graphics performance so things are probably changing.
>>
>>
> 
> I have never heard of any SIMD architecture where vectors work that way.  On SSE, Altivec, or MMX, the components of a vector are always stored in contiguous memory.

Ok.  Well, I've never used any of these MMX/SSE/Altivec things myself, so it was just hearsay.  But the source was someone I know in the graphics group at Intel.  I must have just misunderstood his gripe, in that case.

> In terms of graphics, this is pretty much optimal.  Most manipulations on vectors like rotations, normalization, cross product etc. require access to all components simultaneously.  I honestly don't know why you would want to split each of them into separate buffers...
> 
> Surely it is simpler to do something like this:
> 
> x y z w x y z w x y z w ...
> 
> vs.
> 
> x x x x ... y y y y ... z z z z ... w w w ...


Yep, I agree, but I thought that was exactly the gist of what this friend of mine was griping about.  As I understood it at the time, he was complaining that the CPU instructions are good at planar layout x x x x y y y y ... but not interleaved x y x y x y.

If that's not the case, then great.

--bb
January 30, 2007
Bill Baxter wrote:
> Mikola Lysenko wrote:
> 
>> Bill Baxter wrote:
>>
>>> "Most CPUs today have *some* kind of SSE/Altivec type thing"
>>>
>>> That may be, but I've heard that at least SSE is really not that suited to many calculations -- especially ones in graphics.  Something like you have to pack your data so that all the x components are together, and all y components together, and all z components together.  Rather than the way everyone normally stores these things as xyz, xyz.  Maybe Altivec, SSE2 and SSE3 fix that though.  At any rate I think maybe Intel's finally getting tired of being laughed at for their graphics performance so things are probably changing.
>>>
>>>
>>
>> I have never heard of any SIMD architecture where vectors work that way.  On SSE, Altivec, or MMX, the components of a vector are always stored in contiguous memory.
> 
> 
> Ok.  Well, I've never used any of these MMX/SSE/Altivec things myself, so it was just hearsay.  But the source was someone I know in the graphics group at Intel.  I must have just misunderstood his gripe, in that case.
> 
>> In terms of graphics, this is pretty much optimal.  Most manipulations on vectors like rotations, normalization, cross product etc. require access to all components simultaneously.  I honestly don't know why you would want to split each of them into separate buffers...
>>
>> Surely it is simpler to do something like this:
>>
>> x y z w x y z w x y z w ...
>>
>> vs.
>>
>> x x x x ... y y y y ... z z z z ... w w w ...
> 
> 
> 
> Yep, I agree, but I thought that was exactly the gist of what this friend of mine was griping about.  As I understood it at the time, he was complaining that the CPU instructions are good at planar layout x x x x y y y y ... but not interleaved x y x y x y.
> 
> If that's not the case, then great.
> 
> --bb

Seems it's great.

It doesn't really matter what the underlying data is.  An SSE instruction will add four 32-bit floats in parallel, never mind whether the floats are x x x x or x y z w.  What meaning the floats have is up to the programmer.

Of course, channelwise operations will be faster in planar (e.g. add 24 to all red values, without spending time on the other channels), while pixelwise operations will be faster in interleaved (e.g. alpha blending) - these facts don't have much to do with SIMD.
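
To make that concrete with the hypothetical float4 type from earlier in the thread (assuming scalar broadcast on += and *):

void brightenReds(float4[] reds)  // planar: r r r r | r r r r | ...
{
	foreach (inout r; reds)
		r += 24;  // one vector add brightens four pixels at a time
}

float4 blend(float4 src, float4 dst, float a)  // interleaved: one r g b a pixel
{
	return src * a + dst * (1 - a);  // same instructions, new meaning
}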

Maybe the guy from Intel wanted to help planar pixelwise operations (some mechanism to ease dereferencing 3-4 different places at once) or interleaved channelwise operations (operating on only every fourth float in an array without having to do 4 mov/adds to fill a 128-bit register).
January 30, 2007
Joel C. Salomon wrote:

>> Effective vector code needs correct data alignment, instruction scheduling and register use.  Each of these issues is most effectively handled in the compiler/code gen stage, and therefore suggests that at the very least the compiler implementation ought to be aware of the vector type in some way.  By applying the "D Builtin Rationale," it is easy to see that vectors meet all the required criteria.
> 
> As I understand it, D’s inline assembler would be the tool to use for this in a library implementation.  I don’t think the complex types use SIMD, so the vectors can be the only things using those registers.

SIMD instructions can also be useful for operations on complex numbers, so I don't think that's a safe assumption to make.
http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/pentium4/optimization/66717.htm

--bb
January 30, 2007
Chad J wrote:
> Bill Baxter wrote:
>> Mikola Lysenko wrote:
>>
>>> Bill Baxter wrote:
>>>
>> Yep, I agree, but I thought that was exactly the gist of what this friend of mine was griping about.  As I understood it at the time, he was complaining that the CPU instructions are good at planar layout x x x x y y y y ... but not interleaved x y x y x y.
>>
>> If that's not the case, then great.
>>
>> --bb
> 
> Seems it's great.
> 
> It doesn't really matter what the underlying data is.  An SSE instruction will add four 32-bit floats in parallel, never mind whether the floats are x x x x or x y z w.  What meaning the floats have is up to the programmer.
> 
> Of course, channelwise operations will be faster in planar (e.g. add 24 to all red values, without spending time on the other channels), while pixelwise operations will be faster in interleaved (e.g. alpha blending) - these facts don't have much to do with SIMD.
> 
> Maybe the guy from Intel wanted to help planar pixelwise operations (some mechanism to ease dereferencing 3-4 different places at once) or interleaved channelwise operations (operating on only every fourth float in an array without having to do 4 mov/adds to fill a 128-bit register).

That could be.  I seem to remember now the specific thing we were talking about was transforming a batch of vectors.  Is there a good way to do that with SSE stuff?  I.e. for a 4x4 matrix with rows M1,M2,M3,M4 you want to do something like:

  foreach(i,v; vector_batch)
     out[i] = [dot(M1,v),dot(M2,v),dot(M3,v),dot(M4,v)];

Maybe it had to do with not being able to operate 'horizontally'.  E.g. to do a dot product you can multiply x y z w by a b c d easily, but then you need the sum of those.  Apparently SSE3 has some instructions to help with this case.  You can add x+y and z+w in one step.
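
So presumably a full dot product would go something like this (a complete guess on my part from the instruction descriptions -- untested, and assuming 16-byte-aligned inputs):

float dot4(float* a, float* b)
{
	float r;
	asm
	{
		mov EAX, a;
		mov ECX, b;
		movaps XMM0, [EAX];
		mulps XMM0, [ECX];  // (a0*b0, a1*b1, a2*b2, a3*b3)
		haddps XMM0, XMM0;  // (a0b0+a1b1, a2b2+a3b3, repeated)
		haddps XMM0, XMM0;  // full sum in every lane
		movss r, XMM0;
	}
	return r;
}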

By the way, are there any good tutorials on programming with SIMD (specifically for Intel/AMD)?  Every time I've looked, I've come up with pretty much nothing.  Googling for "SSE tutorial" doesn't turn up much.

As far as making use of SIMD goes (in C++), I ran across this project that looks very promising, but have yet to give it a real try:
http://www.pixelglow.com/macstl/

--bb