January 30, 2007
Chad J wrote:
> Bill Baxter wrote:
>> Mikola Lysenko wrote:
>>
>>> Bill Baxter wrote:
>>>
>>>> "Most CPUs today have *some* kind of SSE/Altivec type thing"
>>>>
>>>> That may be, but I've heard that at least SSE is really not that suited to many calculations -- especially ones in graphics.  Something like: you have to pack your data so that all the x components are together, all the y components together, and all the z components together, rather than the way everyone normally stores these things, as xyz, xyz.  Maybe Altivec, SSE2 and SSE3 fix that, though.  At any rate, I think maybe Intel's finally getting tired of being laughed at for their graphics performance, so things are probably changing.
>>>>
>>>>
>>>
>>> I have never heard of any SIMD architecture where vectors work that way.  On SSE, Altivec or MMX, the components of a vector are always stored in contiguous memory.
>>
>>
>> Ok.  Well, I've never used any of these MMX/SSE/Altivec things myself, so it was just hearsay.  But the source was someone I know in the graphics group at Intel.  I must have just misunderstood his gripe, in that case.
>>
>>> In terms of graphics, this is pretty much optimal.  Most manipulations on vectors like rotations, normalization, cross product etc. require access to all components simultaneously.  I honestly don't know why you would want to split each of them into separate buffers...
>>>
>>> Surely it is simpler to do something like this:
>>>
>>> x y z w x y z w x y z w ...
>>>
>>> vs.
>>>
>>> x x x x ... y y y y ... z z z z ... w w w ...
>>
>>
>>
>> Yep, I agree, but I thought that was exactly the gist of what this friend of mine was griping about.  As I understood it at the time, he was complaining that the CPU instructions are good at planar layout x x x x y y y y ... but not interleaved x y x y x y.
>>
>> If that's not the case, then great.
>>
>> --bb
> 
> Seems it's great.
> 
> It doesn't really matter what the underlying data is.  An SSE instruction will add four 32-bit floats in parallel, never mind whether the floats are x x x x or x y z w.  What meaning the floats have is up to the programmer.
> 
> Of course, channelwise operations will be faster in planar (e.g. add 24 to all red values without spending time on the other channels), while pixelwise operations will be faster in interleaved (e.g. alpha blending) - these facts don't have much to do with SIMD.
> 
> Maybe the guy from Intel wanted to help planar pixelwise operations (some mechanism to avoid having to dereference 3-4 different places at once) or interleaved channelwise operations (operating on only every fourth float in an array without doing 4 mov/adds to fill a 128-bit register).
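
To make the channelwise/planar case concrete, here's a minimal C sketch using the SSE intrinsics from xmmintrin.h.  The function and buffer names are made up for illustration, and it assumes the channel length is a multiple of 4 and the buffer is 16-byte aligned:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Planar layout: the red channel is one contiguous float array,
       so "add 24 to all red values" processes 4 pixels per add and
       never touches the other channels. */
    void brighten_red(float *red, int n)
    {
        __m128 k = _mm_set1_ps(24.0f);            /* (24, 24, 24, 24) */
        for (int i = 0; i < n; i += 4) {
            __m128 r = _mm_load_ps(red + i);      /* load 4 red values */
            _mm_store_ps(red + i, _mm_add_ps(r, k));
        }
    }

With an interleaved x y z w buffer the same instruction still adds four floats at a time, but they'd be one pixel's four channels rather than four pixels' reds -- which is exactly the distinction drawn above.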


Sorry to keep harping on this, but here's an article that basically says exactly what my friend was saying.
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2350

From the article:
"""
The hadd and hsub instructions are horizontal additions and horizontal subtractions. These allow faster processing of data stored "horizontally" in (for example) vertex arrays. Here is a 4-element array of vertex structures.

    x1 y1 z1 w1 | x2 y2 z2 w2 | x3 y3 z3 w3 | x4 y4 z4 w4

SSE and SSE2 are organized such that performance is better when processing vertical data, or structures that contain arrays; for example, a vertex structure with 4-element arrays for each component:

    x1 x2 x3 x4
    y1 y2 y3 y4
    z1 z2 z3 z4
    w1 w2 w3 w4

Generally, the preferred organizational method for vertices is the former. Under SSE2, the compiler (or very unfortunate programmer) would have to reorganize the data during processing.
"""

The article is talking about how hadd and hsub in SSE3 help to correct the situation, but SSE3 isn't yet nearly as ubiquitous as SSE/SSE2, I would imagine.
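
For instance (and this is just a sketch on my part, for a compiler with -msse3 or equivalent), hadd is what lets you take a dot product on an interleaved xyzw vertex without transposing into the planar form first:

    #include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */

    /* Dot product of two interleaved (x y z w) vertices. */
    float dot4(const float *a, const float *b)
    {
        __m128 m = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
        m = _mm_hadd_ps(m, m);    /* (x+y, z+w, x+y, z+w) */
        m = _mm_hadd_ps(m, m);    /* x+y+z+w in every slot */
        return _mm_cvtss_f32(m);  /* take the low float */
    }

On plain SSE/SSE2 you'd need a shuffle dance to get that horizontal sum, or the planar layout the article describes.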

--bb
January 30, 2007
Bill Baxter wrote:
> Chad J wrote:
> 
>> Bill Baxter wrote:
>>
>>> Mikola Lysenko wrote:
>>>
>>>> Bill Baxter wrote:
>>>>
>>>>> "Most CPUs today have *some* kind of SSE/Altivec type thing"
>>>>>
>>>>> That may be, but I've heard that at least SSE is really not that suited to many calculations -- especially ones in graphics.  Something like: you have to pack your data so that all the x components are together, all the y components together, and all the z components together, rather than the way everyone normally stores these things, as xyz, xyz.  Maybe Altivec, SSE2 and SSE3 fix that, though.  At any rate, I think maybe Intel's finally getting tired of being laughed at for their graphics performance, so things are probably changing.
>>>>>
>>>>>
>>>>
>>>> I have never heard of any SIMD architecture where vectors work that way.  On SSE, Altivec or MMX, the components of a vector are always stored in contiguous memory.
>>>
>>>
>>>
>>> Ok.  Well, I've never used any of these MMX/SSE/Altivec things myself, so it was just hearsay.  But the source was someone I know in the graphics group at Intel.  I must have just misunderstood his gripe, in that case.
>>>
>>>> In terms of graphics, this is pretty much optimal.  Most manipulations on vectors like rotations, normalization, cross product etc. require access to all components simultaneously.  I honestly don't know why you would want to split each of them into separate buffers...
>>>>
>>>> Surely it is simpler to do something like this:
>>>>
>>>> x y z w x y z w x y z w ...
>>>>
>>>> vs.
>>>>
>>>> x x x x ... y y y y ... z z z z ... w w w ...
>>>
>>>
>>>
>>>
>>> Yep, I agree, but I thought that was exactly the gist of what this friend of mine was griping about.  As I understood it at the time, he was complaining that the CPU instructions are good at planar layout x x x x y y y y ... but not interleaved x y x y x y.
>>>
>>> If that's not the case, then great.
>>>
>>> --bb
>>
>>
>> Seems it's great.
>>
>> It doesn't really matter what the underlying data is.  An SSE instruction will add four 32-bit floats in parallel, never mind whether the floats are x x x x or x y z w.  What meaning the floats have is up to the programmer.
>>
>> Of course, channelwise operations will be faster in planar (e.g. add 24 to all red values without spending time on the other channels), while pixelwise operations will be faster in interleaved (e.g. alpha blending) - these facts don't have much to do with SIMD.
>>
>> Maybe the guy from Intel wanted to help planar pixelwise operations (some mechanism to avoid having to dereference 3-4 different places at once) or interleaved channelwise operations (operating on only every fourth float in an array without doing 4 mov/adds to fill a 128-bit register).
> 
> 
> 
> Sorry to keep harping on this, but here's an article that basically says exactly what my friend was saying.
> http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2350
> 
>  From the article:
> """
> The hadd and hsub instructions are horizontal additions and horizontal subtractions. These allow faster processing of data stored "horizontally" in (for example) vertex arrays. Here is a 4-element array of vertex structures.
> 
>     x1 y1 z1 w1 | x2 y2 z2 w2 | x3 y3 z3 w3 | x4 y4 z4 w4
> 
> SSE and SSE2 are organized such that performance is better when processing vertical data, or structures that contain arrays; for example, a vertex structure with 4-element arrays for each component:
> 
>     x1 x2 x3 x4
>     y1 y2 y3 y4
>     z1 z2 z3 z4
>     w1 w2 w3 w4
> 
> Generally, the preferred organizational method for vertices is the former. Under SSE2, the compiler (or very unfortunate programmer) would have to reorganize the data during processing.
> """
> 
> The article is talking about how hadd and hsub in SSE3 help to correct the situation, but SSE3 isn't yet nearly as ubiquitous as SSE/SSE2, I would imagine.
> 
> --bb

That makes a lot of sense.

I remember running into trouble finding material on SSE as well.  I never really got past looking at what all of the instructions do, or maybe implementing an algorithm or two.  I would have needed the SSE2 instructions to do the integer stuff that I wanted to do, and I don't think I even had those on my old computer when I was doing this stuff :/  For my purposes, MMX was much easier to use and find resources for.

You'll probably have better luck searching for "SSE Instruction Set" and just messing around with the instructions (probably what I'd do).  There should also be some (probably meager) Intel documentation and comments on SSE.

Here are some pages I found:
http://softpixel.com/~cwright/programming/simd/sse.php
http://www.cpuid.com/sse.php
http://www.hayestechnologies.com/en/techsimd.htm
January 31, 2007
Mikola Lysenko wrote:
>     Inline assembler can not be inlined.  Period.  The compiler has to think of inline assembler as a sort of black box, which takes inputs one way and returns them another way.  It can not poke around in there and change your hand-tuned opcodes in order to pass arguments more efficiently.  Nor can it change the way you allocate registers so you don't accidentally trash the local frame.  It can't manipulate where you put the result, such that it can be used immediately by the next block of code.  Therefore any asm vector class will have a lot of wasteful function calls which quickly add up:

Sounds like what’s needed is in-line intermediate code rather than in-line assembly — or, at least some way to tell the compiler “do this operation on some block of eight SIMD registers; I don’t particularly care which.”

One way to do this could be with something like C’s ‘register’ keyword.  Not sure on D’s inline assembler syntax, but a possible solution could look like:
	register int a;
	asm{
		addl	0x32, $a
	}
where the compiler substitutes the allocated register (say ECX) for $a.  With a little extra syntax to say what sort of register is required, we get all the power your vector library needs without adding yet another tacked-on feature.  Plus, while you get your vectors, I can get my quaternions and matrix manipulation for the same price.
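
For what it’s worth, GCC’s extended asm already comes close to this (I’m sketching from memory here): a constraint like "r" makes the compiler pick the register and substitute it for you, much like the $a above:

	int a = 0;
	__asm__("addl $0x32, %0" : "+r"(a));	/* "+r": GCC picks a register,
						   loads a into it beforehand,
						   stores it back afterwards */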

--Joel
January 31, 2007
Joel C. Salomon wrote:
> Mikola Lysenko wrote:
>>     Inline assembler can not be inlined.  Period.  The compiler has to think of inline assembler as a sort of black box, which takes inputs one way and returns them another way.  It can not poke around in there and change your hand-tuned opcodes in order to pass arguments more efficiently.  Nor can it change the way you allocate registers so you don't accidentally trash the local frame.  It can't manipulate where you put the result, such that it can be used immediately by the next block of code.  Therefore any asm vector class will have a lot of wasteful function calls which quickly add up:
> 
> Sounds like what’s needed is in-line intermediate code rather than in-line assembly — or, at least some way to tell the compiler “do this operation on some block of eight SIMD registers; I don’t particularly care which.”

GCC (and recent GDC) already has such a mechanism; they call it 'extended assembler'.
The basic idea is that you can specify several substitutions for registers: you can have them pre-filled by GCC with specific expressions, tell GCC to put certain registers into variables afterwards, and specify what other registers (and optionally "memory") may be changed.
(That last part could perhaps be omitted if the compiler did some extra work to analyze the asm?)

The GCC version has pretty ugly syntax (IMHO), and I have no idea if they support SIMD registers with it, but I've always liked the idea itself.
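
As a rough illustration (untested, so treat it as a sketch, and the function name is made up), SIMD registers can at least be named in the clobber list:

    /* Add two float[4]s through GCC extended asm.  %0-%2 get filled
       with whatever registers GCC picked for the three pointers; the
       clobber list warns it that xmm0, xmm1 and memory are modified. */
    void add4(float *dst, const float *a, const float *b)
    {
        __asm__ volatile (
            "movups (%1), %%xmm0\n\t"
            "movups (%2), %%xmm1\n\t"
            "addps  %%xmm1, %%xmm0\n\t"
            "movups %%xmm0, (%0)"
            : /* no outputs */
            : "r"(dst), "r"(a), "r"(b)
            : "xmm0", "xmm1", "memory");
    }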

But I've already mentioned much of this in a previous post in this thread.