January 29, 2007
Re: seeding the pot for 2.0 features [small vectors]
Bill Baxter wrote:
> "Most CPUs today have *some* kind of SSE/Altivec type thing"
> 
> That may be, but I've heard that at least SSE is really not that suited 
> to many calculations -- especially ones in graphics.  Something like you 
> have to pack your data so that all the x components are together, and 
> all y components together, and all z components together.  Rather than 
> the way everyone normally stores these things as xyz, xyz.  Maybe 
> Altivec, SSE2 and SSE3 fix that though.  At any rate I think maybe 
> Intel's finally getting tired of being laughed at for their graphics 
> performance so things are probably changing.
> 
> 

I have never heard of any SIMD architecture where vectors work that 
way.  On SSE, Altivec or MMX the components for the vectors are always 
stored in contiguous memory.

In terms of graphics, this is pretty much optimal.  Most manipulations 
on vectors like rotations, normalization, cross product etc. require 
access to all components simultaneously.  I honestly don't know why you 
would want to split each of them into separate buffers...

Surely it is simpler to do something like this:

x y z w x y z w x y z w ...

vs.

x x x x ... y y y y ... z z z z ... w w w ...


> "Library vs Core"
> 
> I think there's really not much that you can ask from the core.  A small 
> vector of 4 numbers can represent any number of things.  So I think your 
> best hope for the core is to support some very basic operations on small 
> vectors -- like component-wise +,-,*,/, and dot product -- to optimize 
> those kind of expressions as best as possible, and leave everything else 
> to libraries.  I guess that's pretty much how it works with HW shader 
> languages.  Except they add swizzles to the set of primitive ops.
> 
> 

Yes, I think this is probably the best course of action.  Because of the 
nature of vectors, and the fact that they require such careful compiler 
integration, they must be placed in the core language.  On the other 
hand, most products like dot, cross, perp or outer should be in a 
library.  The reason for this is that there are simply too many types of 
products and operators for a language to reasonably support them all. 
At any rate, once you have the basics you can quickly build up the 
others.  Here is an example of a cross product:

float3 cross(float3 a, float3 b)
{
	return a.yzx * b.zxy - a.zxy * b.yzx;
}

Implementing most products and vector operations is easy once you have a 
simple component-wise vector library.


> "Getting it in the standard library"
> 
> I agree, though, that lo-D math is common enough that it should be 
> included in a standard library.  I wonder if the Tango developers would 
> be willing to include a vector math class...or if they already have one 
> in there.
> 

It may eventually get in.  However, it would be far better if the 
language simply supported them in the core spec.


-Mik
January 29, 2007
Re: seeding the pot for 2.0 features [small vectors]
Joel C. Salomon wrote:
> As I understand it, D’s inline assembler would be the tool to use for 
> this in a library implementation.  I don’t think the complex types use 
> SIMD, so the vectors can be the only things using those registers.
> 
>

I can tell you right now that this won't work.  I have tried using the 
inline assembler with a vector class and the speedup was barely 
noticeable.  You can see the results here:  http://assertfalse.com

Here are just a few of the things that become a problem for a library 
implementation:

1. Function calls

	Inline assembler can not be inlined.  Period.  The compiler has to think 
of inline assembler as a sort of black box, which takes inputs one way 
and returns them another way.  It can not poke around in there and 
change your hand-tuned opcodes in order to pass arguments more 
efficiently.  Nor can it change the way you allocate registers so 
you don't accidentally trash the local frame.  It can't manipulate where 
you put the result, such that it can be used immediately by the next 
block of code.  Therefore any asm vector class will have a lot of 
wasteful function calls which quickly add up:


a = b + c * d;

becomes:

a = b.opAdd(c.opMul(d));


2. Register allocation

	This point is related to 1.  Most SIMD architectures have many 
registers, and a good compiler can easily use that to optimize stuff 
like parameter passing and function returns.  This is totally impossible 
for a library to do, since it has no knowledge of the contents of any 
registers as it executes.

3. Data alignment

	This is a big problem for libraries.  Most vector architectures require 
properly aligned data.  D only provides facilities for aligning 
attributes within a struct, not according to any type of global system 
alignment.  To fix this in D, we will need the compiler's help.  This 
will allow us to pack vectors in a function such that they are properly 
aligned within each local call frame.

-Mik
January 29, 2007
Re: seeding the pot for 2.0 features [small vectors]
Mikola Lysenko wrote:
> Joel C. Salomon wrote:
>> As I understand it, D’s inline assembler would be the tool to use for 
>> this in a library implementation.  I don’t think the complex types use 
>> SIMD, so the vectors can be the only things using those registers.
>>
>>
> 
> I can tell you right now that this won't work.  I have tried using the 
> inline assembler with a vector class and the speedup was at barely 
> noticeable.  You can see the results here:  http://assertfalse.com
> 
> Here are just a few of the things that become a problem for a library 
> implementation:
> 
> 1. Function calls
> 
>     Inline assembler can not be inlined.  Period.  The compiler has to 
> think of inline assembler as a sort of black box, which takes inputs one 
> way and returns them another way.  It can not poke around in there and 
> change your hand-tuned opcodes in order to pass arguments more 
> efficiently.  Nor can it change the way you allocate registers so 
> you don't accidentally trash the local frame.  It can't manipulate where 
> you put the result, such that it can be used immediately by the next 
> block of code.  Therefore any asm vector class will have a lot of 
> wasteful function calls which quickly add up:
> 
> 
> a = b + c * d;
> 
> becomes:
> 
> a = b.opAdd(c.opMul(d));
> 
> 
> 2. Register allocation
> 
>     This point is related to 1.  Most SIMD architectures have many 
> registers, and a good compiler can easily use that to optimize stuff 
> like parameter passing and function returns.  This is totally impossible 
> for a library to do, since it has no knowledge of the contents of any 
> registers as it executes.

Can GCC-like extended assembler (recently implemented in GDC: 
http://dgcc.sourceforge.net/gdc/manual.html) help for these first two 
points?
It allows you to let the compiler allocate registers. That should fix 
point two.
You can also tell the compiler where to put variables and where you're 
going to put any results. That means your asm doesn't necessarily need 
to access memory to do anything useful.  If the compiler sees it 
doesn't, inlining the function should be possible, I think.
It won't fix all asm, of course, but it might make it possible to write 
inlinable asm.

I do think it needs different syntax. Strings? AT&T asm syntax? Bah. But 
the idea itself is a good one, I think.
January 29, 2007
Re: seeding the pot for 2.0 features [small vectors]
Frits van Bommel wrote:
> Can GCC-like extended assembler (recently implemented in GDC: 
> http://dgcc.sourceforge.net/gdc/manual.html) help for these first two 
> points?
> It allows you to let the compiler allocate registers. That should fix 
> point two.
> You can also tell the compiler where to put variables and where you're 
> going to put any results. That means your asm doesn't necessarily need 
> to access memory to do anything useful.  If the compiler sees it 
> doesn't, inlining the function should be possible, I think.
> It won't fix all asm, of course, but it might make it possible to write 
> inlinable asm.
> 


For regular functions, I completely agree.  The system should work at 
least as well as compiler inlining (for certain cases).

Unfortunately vector assembler is not the same.  At the very minimum the 
implementation needs to be aware of the vector registers on the system, 
and it needs to be able to pass parameters/returns in those registers. 
Otherwise, you still have to use costly instructions like movups, and you 
still pay the same basic bookkeeping costs.

It just seems like it would be simpler to make small vectors primitive 
types and be done with it.

-Mik
January 29, 2007
Re: seeding the pot for 2.0 features [small vectors]
Mikola Lysenko wrote:
> It just seems like it would be simpler to make small vectors primitive 
> types and be done with it.

If you could write up a short proposal on exactly what operations should 
be supported, alignment, etc., that would be most helpful.
January 29, 2007
Re: seeding the pot for 2.0 features
Reiner Pope wrote:

> struct MyVector
> {
>     template opFakeMember(char[] ident)

I wouldn't call it "opFakeMember", since those members are not
really fakes.  How about "opVirt[ual]Member" or "opAttr[ibute]"?

Wolfgang Draxinger
-- 
E-Mail address works, Jabber: hexarith@jabber.org, ICQ: 134682867
January 29, 2007
Re: seeding the pot for 2.0 features [small vectors]
Mikola Lysenko wrote:
> Bill Baxter wrote:
>> "Most CPUs today have *some* kind of SSE/Altivec type thing"
>>
>> That may be, but I've heard that at least SSE is really not that 
>> suited to many calculations -- especially ones in graphics.  Something 
>> like you have to pack your data so that all the x components are 
>> together, and all y components together, and all z components 
>> together.  Rather than the way everyone normally stores these things 
>> as xyz, xyz.  Maybe Altivec, SSE2 and SSE3 fix that though.  At any 
>> rate I think maybe Intel's finally getting tired of being laughed at 
>> for their graphics performance so things are probably changing.
>>
>>
> 
> I have never heard of any SIMD architecture where vectors work that 
> way.  On SSE, Altivec or MMX the components for the vectors are always 
> stored in contiguous memory.

Ok.  Well, I've never used any of these MMX/SSE/Altivec things myself, 
so it was just hearsay.  But the source was someone I know in the 
graphics group at Intel.  I must have just misunderstood his gripe, in 
that case.

> In terms of graphics, this is pretty much optimal.  Most manipulations 
> on vectors like rotations, normalization, cross product etc. require 
> access to all components simultaneously.  I honestly don't know why you 
> would want to split each of them into separate buffers...
> 
> Surely it is simpler to do something like this:
> 
> x y z w x y z w x y z w ...
> 
> vs.
> 
> x x x x ... y y y y ... z z z z ... w w w ...


Yep, I agree, but I thought that was exactly the gist of what this 
friend of mine was griping about.  As I understood it at the time, he 
was complaining that the CPU instructions are good at planar layout x x 
x x y y y y ... but not interleaved x y x y x y.

If that's not the case, then great.

--bb
January 30, 2007
Re: seeding the pot for 2.0 features [small vectors]
Bill Baxter wrote:
> Mikola Lysenko wrote:
> 
>> Bill Baxter wrote:
>>
>>> "Most CPUs today have *some* kind of SSE/Altivec type thing"
>>>
>>> That may be, but I've heard that at least SSE is really not that 
>>> suited to many calculations -- especially ones in graphics.  
>>> Something like you have to pack your data so that all the x 
>>> components are together, and all y components together, and all z 
>>> components together.  Rather than the way everyone normally stores 
>>> these things as xyz, xyz.  Maybe Altivec, SSE2 and SSE3 fix that 
>>> though.  At any rate I think maybe Intel's finally getting tired of 
>>> being laughed at for their graphics performance so things are 
>>> probably changing.
>>>
>>>
>>
>> I have never heard of any SIMD architecture where vectors work that 
>> way.  On SSE, Altivec or MMX the components for the vectors are always 
>> stored in contiguous memory.
> 
> 
> Ok.  Well, I've never used any of these MMX/SSE/Altivec things myself, 
> so it was just heresay.  But the source was someone I know in the 
> graphics group at Intel.  I must have just misunderstood his gripe, in 
> that case.
> 
>> In terms of graphics, this is pretty much optimal.  Most manipulations 
>> on vectors like rotations, normalization, cross product etc. require 
>> access to all components simultaneously.  I honestly don't know why 
>> you would want to split each of them into separate buffers...
>>
>> Surely it is simpler to do something like this:
>>
>> x y z w x y z w x y z w ...
>>
>> vs.
>>
>> x x x x ... y y y y ... z z z z ... w w w ...
> 
> 
> 
> Yep, I agree, but I thought that was exactly the gist of what this 
> friend of mine was griping about.  As I understood it at the time, he 
> was complaining that the CPU instructions are good at planar layout x x 
> x x y y y y ... but not interleaved x y x y x y.
> 
> If that's not the case, then great.
> 
> --bb

Seems it's great.

It doesn't really matter what the underlying data is.  An SSE 
instruction will add four 32-bit floats in parallel, never mind whether 
the floats are x x x x or x y z w.  What meaning the floats have is up 
to the programmer.

Of course, channelwise operations will be faster in planar (EX: add 24 
to all red values, don't spend time on the other channels), while 
pixelwise operations will be faster in interleaved (EX: alpha blending) 
- these facts don't have much to do with SIMD.

Maybe the guy from Intel wanted to help planar pixelwise operations 
(some mechanism to ease dereferencing 3-4 different places at once) or 
help interleaved channelwise operations (operating on only every fourth 
float in an array without having to do 4 mov/adds to fill a 128-bit 
register).
January 30, 2007
Re: seeding the pot for 2.0 features [small vectors]
Joel C. Salomon wrote:

>> Effective vector code needs correct data alignment, instruction 
>> scheduling and register use.  Each of these issues is most effectively 
>> handled in the compiler/code gen stage, and therefore suggests that at 
>> the very least the compiler implementation ought to be aware of the 
>> vector type in some way.  By applying the "D Builtin Rationale," it is 
>> easy to see that vectors meet all the required criteria.
> 
> As I understand it, D’s inline assembler would be the tool to use for 
> this in a library implementation.  I don’t think the complex types use 
> SIMD, so the vectors can be the only things using those registers.

SIMD instructions can also be useful for complex operations, so I don't 
think that's a safe assumption to make.
http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/pentium4/optimization/66717.htm

--bb
January 30, 2007
Re: seeding the pot for 2.0 features [small vectors]
Chad J wrote:
> Bill Baxter wrote:
>> Mikola Lysenko wrote:
>>
>>> Bill Baxter wrote:
>>>
>> Yep, I agree, but I thought that was exactly the gist of what this 
>> friend of mine was griping about.  As I understood it at the time, he 
>> was complaining that the CPU instructions are good at planar layout x 
>> x x x y y y y ... but not interleaved x y x y x y.
>>
>> If that's not the case, then great.
>>
>> --bb
> 
> Seems it's great.
> 
> It doesn't really matter what the underlying data is.  An SSE 
> instruction will add four 32-bit floats in parallel, never mind whether 
> the floats are x x x x or x y z w.  What meaning the floats have is up 
> to the programmer.
> 
> Of course, channelwise operations will be faster in planar (EX: add 24 
> to all red values, don't spend time on the other channels), while 
> pixelwise operations will be faster in interleaved (EX: alpha blending) 
> - these facts don't have much to do with SIMD.
> 
> Maybe the guy from intel wanted to help planar pixelwise operations 
> (some mechanism to help the need to dereference 3-4 different places at 
> once) or help interleaved channelwise operations (only operate on every 
> fourth float in an array without having to do 4 mov/adds to fill a 128 
> bit register).

That could be.  I seem to remember now the specific thing we were 
talking about was transforming a batch of vectors.  Is there a good way 
to do that with SSE stuff?  I.e. for a 4x4 matrix with rows M1,M2,M3,M4 
you want to do something like:

  foreach(i,v; vector_batch)
     out[i] = [dot(M1,v),dot(M2,v),dot(M3,v),dot(M4,v)];

Maybe it had to do with not being able to operate 'horizontally'.  E.g. 
to do a dot product you can multiply x y z w times a b c d easily, but 
then you need the sum of those.  Apparently SSE3 has some instructions 
to help this case some.  You can add  x+y and z+w in one step.

By the way, are there any good tutorials on programming with SIMD 
(specifically for Intel/AMD)?  Every time I've looked I've come up with 
pretty much nothing.  Googling for "SSE tutorial" doesn't result in much.

As far as making use of SIMD goes (in C++), I ran across this project 
that looks very promising, but have yet to give it a real try:
http://www.pixelglow.com/macstl/

--bb