January 30, 2007
Re: seeding the pot for 2.0 features [small vectors]
Chad J wrote:
> Bill Baxter wrote:
>> Mikola Lysenko wrote:
>>
>>> Bill Baxter wrote:
>>>
>>>> "Most CPUs today have *some* kind of SSE/Altivec type thing"
>>>>
>>>> That may be, but I've heard that at least SSE is really not that 
>>>> suited to many calculations -- especially ones in graphics.  
>>>> Something like you have to pack your data so that all the x 
>>>> components are together, and all y components together, and all z 
>>>> components together.  Rather than the way everyone normally stores 
>>>> these things as xyz, xyz.  Maybe Altivec, SSE2 and SSE3 fix that 
>>>> though.  At any rate I think maybe Intel's finally getting tired of 
>>>> being laughed at for their graphics performance so things are 
>>>> probably changing.
>>>>
>>>>
>>>
>>> I have never heard of any SIMD architecture where vectors work that 
>>> way.  On SSE, Altivec or MMX the components for the vectors are 
>>> always stored in contiguous memory.
>>
>>
>> Ok.  Well, I've never used any of these MMX/SSE/Altivec things myself, 
>> so it was just hearsay.  But the source was someone I know in the 
>> graphics group at Intel.  I must have just misunderstood his gripe, in 
>> that case.
>>
>>> In terms of graphics, this is pretty much optimal.  Most 
>>> manipulations on vectors like rotations, normalization, cross product 
>>> etc. require access to all components simultaneously.  I honestly 
>>> don't know why you would want to split each of them into separate 
>>> buffers...
>>>
>>> Surely it is simpler to do something like this:
>>>
>>> x y z w x y z w x y z w ...
>>>
>>> vs.
>>>
>>> x x x x ... y y y y ... z z z z ... w w w ...
>>
>>
>>
>> Yep, I agree, but I thought that was exactly the gist of what this 
>> friend of mine was griping about.  As I understood it at the time, he 
>> was complaining that the CPU instructions are good at planar layout x 
>> x x x y y y y ... but not interleaved x y x y x y.
>>
>> If that's not the case, then great.
>>
>> --bb
> 
> Seems it's great.
> 
> It doesn't really matter what the underlying data is.  An SSE 
> instruction will add four 32-bit floats in parallel, nevermind whether 
> the floats are x x x x or x y z w.  What meaning the floats have is up 
> to the programmer.
> 
> Of course, channelwise operations will be faster in planar (EX: add 24 
> to all red values, don't spend time on the other channels), while 
> pixelwise operations will be faster in interleaved (EX: alpha blending) 
> - these facts don't have much to do with SIMD.
> 
> Maybe the guy from Intel wanted to help planar pixelwise operations 
> (some mechanism to help the need to dereference 3-4 different places at 
> once) or help interleaved channelwise operations (only operate on every 
> fourth float in an array without having to do 4 mov/adds to fill a 128 
> bit register).


Sorry to keep harping on this, but here's an article that basically says 
exactly what my friend was saying.
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2350

From the article:
"""
The hadd and hsub instructions are horizontal additions and horizontal 
subtractions. These allow faster processing of data stored 
"horizontally" in (for example) vertex arrays. Here is a 4-element array 
of vertex structures.

    x1 y1 z1 w1 | x2 y2 z2 w2 | x3 y3 z3 w3 | x4 y4 z4 w4

SSE and SSE2 are organized such that performance is better when 
processing vertical data, or structures that contain arrays; for 
example, a vertex structure with 4-element arrays for each component:

    x1 x2 x3 x4
    y1 y2 y3 y4
    z1 z2 z3 z4
    w1 w2 w3 w4

Generally, the preferred organizational method for vertices is the 
former. Under SSE2, the compiler (or very unfortunate programmer) would 
have to reorganize the data during processing.
"""

The article talks about how hadd and hsub in SSE3 help to correct the 
situation, but I would imagine SSE3 isn't yet nearly as ubiquitous as 
SSE/SSE2.

--bb
January 30, 2007
Re: seeding the pot for 2.0 features [small vectors]
Bill Baxter wrote:
> Chad J wrote:
> 
>> Bill Baxter wrote:
>>
>>> Mikola Lysenko wrote:
>>>
>>>> Bill Baxter wrote:
>>>>
>>>>> "Most CPUs today have *some* kind of SSE/Altivec type thing"
>>>>>
>>>>> That may be, but I've heard that at least SSE is really not that 
>>>>> suited to many calculations -- especially ones in graphics.  
>>>>> Something like you have to pack your data so that all the x 
>>>>> components are together, and all y components together, and all z 
>>>>> components together.  Rather than the way everyone normally stores 
>>>>> these things as xyz, xyz.  Maybe Altivec, SSE2 and SSE3 fix that 
>>>>> though.  At any rate I think maybe Intel's finally getting tired of 
>>>>> being laughed at for their graphics performance so things are 
>>>>> probably changing.
>>>>>
>>>>>
>>>>
>>>> I have never heard of any SIMD architecture where vectors work that 
>>>> way.  On SSE, Altivec or MMX the components for the vectors are 
>>>> always stored in contiguous memory.
>>>
>>>
>>>
>>> Ok.  Well, I've never used any of these MMX/SSE/Altivec things 
>>> myself, so it was just hearsay.  But the source was someone I know in 
>>> the graphics group at Intel.  I must have just misunderstood his 
>>> gripe, in that case.
>>>
>>>> In terms of graphics, this is pretty much optimal.  Most 
>>>> manipulations on vectors like rotations, normalization, cross 
>>>> product etc. require access to all components simultaneously.  I 
>>>> honestly don't know why you would want to split each of them into 
>>>> separate buffers...
>>>>
>>>> Surely it is simpler to do something like this:
>>>>
>>>> x y z w x y z w x y z w ...
>>>>
>>>> vs.
>>>>
>>>> x x x x ... y y y y ... z z z z ... w w w ...
>>>
>>>
>>>
>>>
>>> Yep, I agree, but I thought that was exactly the gist of what this 
>>> friend of mine was griping about.  As I understood it at the time, he 
>>> was complaining that the CPU instructions are good at planar layout x 
>>> x x x y y y y ... but not interleaved x y x y x y.
>>>
>>> If that's not the case, then great.
>>>
>>> --bb
>>
>>
>> Seems it's great.
>>
>> It doesn't really matter what the underlying data is.  An SSE 
>> instruction will add four 32-bit floats in parallel, nevermind whether 
>> the floats are x x x x or x y z w.  What meaning the floats have is up 
>> to the programmer.
>>
>> Of course, channelwise operations will be faster in planar (EX: add 24 
>> to all red values, don't spend time on the other channels), while 
>> pixelwise operations will be faster in interleaved (EX: alpha 
>> blending) - these facts don't have much to do with SIMD.
>>
>> Maybe the guy from Intel wanted to help planar pixelwise operations 
>> (some mechanism to help the need to dereference 3-4 different places 
>> at once) or help interleaved channelwise operations (only operate on 
>> every fourth float in an array without having to do 4 mov/adds to fill 
>> a 128 bit register).
> 
> 
> 
> Sorry to keep harping on this, but here's an article that basically says 
> exactly what my friend was saying.
> http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2350
> 
> From the article:
> """
> The hadd and hsub instructions are horizontal additions and horizontal 
> subtractions. These allow faster processing of data stored 
> "horizontally" in (for example) vertex arrays. Here is a 4-element array 
> of vertex structures.
> 
>     x1 y1 z1 w1 | x2 y2 z2 w2 | x3 y3 z3 w3 | x4 y4 z4 w4
> 
> SSE and SSE2 are organized such that performance is better when 
> processing vertical data, or structures that contain arrays; for 
> example, a vertex structure with 4-element arrays for each component:
> 
>     x1 x2 x3 x4
>     y1 y2 y3 y4
>     z1 z2 z3 z4
>     w1 w2 w3 w4
> 
> Generally, the preferred organizational method for vertices is the 
> former. Under SSE2, the compiler (or very unfortunate programmer) would 
> have to reorganize the data during processing.
> """
> 
> The article talks about how hadd and hsub in SSE3 help to correct the 
> situation, but I would imagine SSE3 isn't yet nearly as ubiquitous as 
> SSE/SSE2.
> 
> --bb

That makes a lot of sense.

I remember running into trouble finding material on SSE as well.  I 
never really got past looking at what all of the instructions do, or 
maybe implementing an algorithm or two.  I would have needed the SSE2 
instructions to do the integer stuff I wanted to do, and I don't think I 
even had those on my old computer at the time :/  For my purposes, MMX 
was much easier to use and find resources for.

You'll probably have better luck searching for "SSE Instruction Set" and 
just messing around with the instructions (probably what I'd do).  There 
should also be some (probably meager) Intel documentation and comments 
on SSE.

Here are some pages I found:
http://softpixel.com/~cwright/programming/simd/sse.php
http://www.cpuid.com/sse.php
http://www.hayestechnologies.com/en/techsimd.htm
January 31, 2007
Re: seeding the pot for 2.0 features [small vectors]
Mikola Lysenko wrote:
>     Inline assembler cannot be inlined.  Period.  The compiler has to 
> think of inline assembler as a sort of black box, which takes inputs one 
> way and returns them another way.  It cannot poke around in there and 
> change your hand-tuned opcodes in order to pass arguments more 
> efficiently.  Nor can it change the way you allocate registers so 
> you don't accidentally trash the local frame.  It can't manipulate where 
> you put the result, such that it can be used immediately by the next 
> block of code.  Therefore any asm vector class will have a lot of 
> wasteful function calls which quickly add up:

Sounds like what’s needed is in-line intermediate code rather than 
in-line assembly — or, at least some way to tell the compiler “do this 
operation on some block of eight SIMD registers; I don’t particularly 
care which.”

One way to do this could be with something like C’s ‘register’ keyword. 
 I’m not sure of D’s inline assembler syntax, but a possible solution 
could look like:
	register int a;
	asm{
		addl	0x32, $a
	}
where the compiler substitutes the allocated register (say ECX) for $a. 
 With a little extra syntax to say what sort of register is required, 
we get all the power your vector library needs without adding yet 
another tacked-on feature.  Plus, while you get your vectors, I can get 
my quaternions and matrix manipulation for the same price.

--Joel
January 31, 2007
Re: seeding the pot for 2.0 features [small vectors]
Joel C. Salomon wrote:
> Mikola Lysenko wrote:
>>     Inline assembler cannot be inlined.  Period.  The compiler has to 
>> think of inline assembler as a sort of black box, which takes inputs 
>> one way and returns them another way.  It cannot poke around in there 
>> and change your hand-tuned opcodes in order to pass arguments more 
>> efficiently.  Nor can it change the way you allocate registers 
>> so you don't accidentally trash the local frame.  It can't manipulate 
>> where you put the result, such that it can be used immediately by the 
>> next block of code.  Therefore any asm vector class will have a lot of 
>> wasteful function calls which quickly add up:
> 
> Sounds like what’s needed is in-line intermediate code rather than 
> in-line assembly — or, at least some way to tell the compiler “do this 
> operation on some block of eight SIMD registers; I don’t particularly 
> care which.”

GCC (and recent GDC) already has such a mechanism; they call it 
'extended assembler'.
The basic idea is that you can specify several substitutions for 
registers.  You can have them pre-filled by GCC with specific 
expressions, tell GCC to put certain registers into variables 
afterwards, and specify what other registers (and optionally "memory") 
may be clobbered.
(That last part could perhaps be omitted if the compiler did some extra 
work to analyze the asm?)

The GCC version has pretty ugly syntax (IMHO), and I have no idea if 
they support SIMD registers with it, but I've always liked the idea itself.

But I've already mentioned much of this in a previous post in this thread.