Thread overview
MMX/SSE/SIMD without using assembly
Jan 15, 2006
Chad J
Jan 15, 2006
Knud Sørensen
Jan 16, 2006
Chad J
Jan 16, 2006
James Dunne
Jan 17, 2006
Chad J
Jan 22, 2006
mclysenk
Jan 23, 2006
Robert.AtkinsonNO
Jan 26, 2006
Brian Chapman
January 15, 2006
I am hoping to find a way to use SIMD instructions in my programs to make them faster without rolling out assembly code each and every time.  About a week ago I started working on some functions that would do MMX operations to arrays of data.  So far so good, but now I wonder if someone has done this already.  I couldn't find anything in dsource or wiki4D.  Maybe there is something like this in C that could easily be used in D?

Then I began thinking beyond MMX, since MMX is very old and apparently slated for removal in the new x86 64 bit processors (in native 64 bit mode).  That brings me to SSE, SSE2, and maybe SSE3.  Now for these to be very fast, the data they operate on must be aligned to 16-byte boundaries in memory.  I found other posts on this forum where people had problems doing that in D, and they were either unresolved or there was no follow up post.  So I was wondering if there was a good (fast) way to make sure the data in arrays is aligned to a 16-byte boundary?

Also, I'm not sure if making a struct like so will work:

align(16) struct someName
{
	...
}

Does that determine how the elements are packed or does that ensure that the struct's address is a multiple of 16?
January 15, 2006

hi Chad.

You should take a look at the vectorization suggestion on http://all-technology.com/eigenpolls/dwishlist/

Also do a search for vectorization on the news archive.

I think Walter is planning this for 2.0.

Knud

On Sun, 15 Jan 2006 02:38:07 -0500, Chad J wrote:

> I am hoping to find a way to use SIMD instructions in my programs to
> make them faster without rolling out assembly code each and every time.
>   About a week ago I started working on some functions that would do MMX
> operations to arrays of data.  So far so good, but now I wonder if
> someone has done this already.  I couldn't find anything in dsource or
> wiki4D.  Maybe there is something like this in C that could easily be
> used in D?
> 
> Then I began thinking beyond MMX, since MMX is very old and apparently slated for removal in the new x86 64 bit processors (in native 64 bit mode).  That brings me to SSE, SSE2, and maybe SSE3.  Now for these to be very fast, the data they operate on must be aligned to 16-byte boundaries in memory.  I found other posts on this forum where people had problems doing that in D, and they were either unresolved or there was no follow up post.  So I was wondering if there was a good (fast) way to make sure the data in arrays is aligned to a 16-byte boundary?
> 
> Also, I'm not sure if making a struct like so will work:
> 
> align(16) struct someName
> {
> 	...
> }
> 
> Does that determine how the elements are packed or does that ensure that the struct's address is a multiple of 16?

January 16, 2006
Knud Sørensen wrote:
> 
> hi Chad.
> 
> You should take a look at the vectorization suggestion on.
> http://all-technology.com/eigenpolls/dwishlist/
> 
> Also do a search for vectorization on the news archive.
> 
> I think Walter is planing this for 2.0.
> 
> Knud
> 

Right now I am trying to do this by simply writing functions that do the basics.
I have the following functions mostly working; they all use MMX if available:

void padds( ubyte[] lvalue, ubyte[] rvalue )
void padds( ubyte[] lvalue, ubyte rvalue )
void padds( ushort[] lvalue, ushort[] rvalue )
void padds( ushort[] lvalue, ushort rvalue )
void padds( byte[] lvalue, byte[] rvalue )
void padds( byte[] lvalue, byte rvalue )
void padds( short[] lvalue, short[] rvalue )
void padds( short[] lvalue, short rvalue )
	
void padd( ubyte[] lvalue, ubyte[] rvalue )
void padd( ubyte[] lvalue, ubyte rvalue )
void padd( byte[] lvalue, byte[] rvalue )
void padd( byte[] lvalue, byte rvalue )

padds does saturated addition on the array lvalue, choosing signed or unsigned based on type.
padd does unsaturated addition (wraps on overflow or underflow) on the array lvalue.  Signedness doesn't matter here, so the signed function casts to unsigned and calls the unsigned function.

If rvalue is an array, it adds every element of rvalue onto the corresponding element in lvalue.  If rvalue is not an array, it adds rvalue onto every element in lvalue.  I'm probably going to get rid of non-array rvalues and make it use a global 64-bit variable instead, so that you can do stuff like darken an image without touching the alpha channel.
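
To make the difference between the two behaviours concrete, here is a minimal scalar sketch for the ubyte[] case.  It is not the MMX implementation, only a plain-D reference that such an implementation could be checked against.  The names paddsRef and paddRef are made up for this illustration.

```d
/// Saturating add: sums above ubyte.max clamp to 255 instead of wrapping.
void paddsRef(ubyte[] lvalue, const(ubyte)[] rvalue)
{
    assert(lvalue.length == rvalue.length);
    foreach (i, ref x; lvalue)
    {
        // ubyte + ubyte is promoted to int, so the sum cannot overflow here.
        int sum = x + rvalue[i];
        x = cast(ubyte)(sum > ubyte.max ? ubyte.max : sum);
    }
}

/// Wrapping add: the result is the sum modulo 256.
void paddRef(ubyte[] lvalue, const(ubyte)[] rvalue)
{
    assert(lvalue.length == rvalue.length);
    foreach (i, ref x; lvalue)
        x = cast(ubyte)(x + rvalue[i]);
}

void main()
{
    ubyte[] sat  = [250, 10, 128];
    ubyte[] wrap = [250, 10, 128];
    ubyte[] rhs  = [10, 20, 128];

    paddsRef(sat, rhs);
    paddRef(wrap, rhs);

    ubyte[] expectedSat  = [255, 30, 255];
    ubyte[] expectedWrap = [4, 30, 0];
    assert(sat == expectedSat);
    assert(wrap == expectedWrap);
}
```

The signed variants would clamp to byte.min and byte.max in the same way.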

OK, so would an implementation like the one I was thinking of even be worth it, or would it just be replaced in a year or so and not help much in the meantime?
January 16, 2006
Chad J wrote:
> Knud Sørensen wrote:
> 
>>
>> hi Chad.
>>
>> You should take a look at the vectorization suggestion on.
>> http://all-technology.com/eigenpolls/dwishlist/
>>
>> Also do a search for vectorization on the news archive.
>>
>> I think Walter is planing this for 2.0.
>>
>> Knud
>>
> 
> Right now I am trying to do this by simply writing functions that do the basics.
> I have the following functions mostly working, they all use MMX if available:
> 
> void padds( ubyte[] lvalue, ubyte[] rvalue )
> void padds( ubyte[] lvalue, ubyte rvalue )
> void padds( ushort[] lvalue, ushort[] rvalue )
> void padds( ushort[] lvalue, ushort rvalue )
> void padds( byte[] lvalue, byte[] rvalue )
> void padds( byte[] lvalue, byte rvalue )
> void padds( short[] lvalue, short[] rvalue )
> void padds( short[] lvalue, short rvalue )
> 
> void padd( ubyte[] lvalue, ubyte[] rvalue )
> void padd( ubyte[] lvalue, ubyte rvalue )
> void padd( byte[] lvalue, byte[] rvalue )
> void padd( byte[] lvalue, byte rvalue )
> 
> padds does saturated addition on the array lvalue, choosing signed or unsigned based on type.
> padd does unsaturated addition (wraps on overflow or underflow) on the array lvalue.  Signage doesn't matter so the signed function casts to unsigned and calls the unsigned function.
> 
> If rvalue is an array, it adds every element of rvalue onto the corresponding element in lvalue.  If rvalue is not an array, it adds rvalue onto every element in lvalue.  I'm probably going to get rid of non-array rvalues and make it use a global 64-bit variable instead, so that you can do stuff like darken an image without touching the alpha channel.
> 
> OK so would an implementation like I was thinking of even be worth it, or would it just be replaced in a year or so and not help much in the mean time?

I would consider dropping MMX support, since the AMD64 architecture plans to deprecate it.  As of now, it can't be used in long-mode and is considered legacy (similar to 3DNow! although it seems they don't want to admit it).  Go for SSE2/3, 64-bit media, or 128-bit media instructions instead - there're a lot of 'em.

Then again, if you're following Intel it's best to read up on it yourself and ignore this post. =P
January 17, 2006
> 
> I would consider dropping MMX support, since the AMD64 architecture plans to deprecate it.  As of now, it can't be used in long-mode and is considered legacy (similar to 3DNow! although it seems they don't want to admit it).  Go for SSE2/3, 64-bit media, or 128-bit media instructions instead - there're a lot of 'em.
> 
> Then again, if you're following Intel it's best to read up on it yourself and ignore this post. =P

Well, one reason why I'd like to do MMX is for legacy support.  I have an AMD 2600+ and the utilities I run say it has no SSE2 support.  The CPU is kinda old, but not THAT old, so I wouldn't be surprised if these types of CPUs are around for another 5 years.

Anyhow, I am more worried about duplication of effort and whether people would actually use this or not.
January 22, 2006
In article <dqcu4p$29j9$1@digitaldaemon.com>, Chad J says...
>
>I am hoping to find a way to use SIMD instructions in my programs to make them faster without rolling out assembly code each and every time.
>  About a week ago I started working on some functions that would do MMX
>operations to arrays of data.  So far so good, but now I wonder if someone has done this already.  I couldn't find anything in dsource or wiki4D.  Maybe there is something like this in C that could easily be used in D?
>

That depends on what you want to do.  Usually SSE/MMX is exposed through compiler intrinsics (see the Intel compiler and Visual Studio).  For scientific calculations, there are libraries of preimplemented mathematical routines using vector optimizations, like BLAS or LINPACK.  For games, most programmers just roll their own using inline assembler or compiler intrinsics.  Hobby projects do not usually need the extra speed from SSE optimizations.

One thing I have proposed, as have many others, is vectorization at the language level.  Such a feature would allow efficient and portable algorithms that take full advantage of modern SIMD hardware, at very low cost to the programmer.

>That brings me to SSE, SSE2, and maybe SSE3.  Now for these to be very fast, the data they operate on must be aligned to 16-byte boundaries in memory.  I found other posts on this forum where people had problems doing that in D, and they were either unresolved or there was no follow up post.  So I was wondering if there was a good (fast) way to make sure the data in arrays is aligned to a 16-byte boundary?
>

This is a problem, and to my knowledge, it is unresolved. What you could do is overallocate the memory by 15 bytes, then shift its starting address so that it always begins on the correct boundary.  Ideally, the linker should be able to position static objects on the correct boundary, but I have no idea how to make it happen.
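
The over-allocation trick described above can be sketched roughly as follows.  This is only an illustration under a few assumptions: the helper name allocAligned16 is made up, the element type is fixed to float, and the memory comes from the garbage collector, which is assumed to keep the block alive through the interior pointer held by the returned slice.

```d
/// Returns a slice of `count` floats whose first element lies on a
/// 16-byte boundary.  Hypothetical helper, not a library function.
float[] allocAligned16(size_t count)
{
    enum size_t alignment = 16;

    // Over-allocate by alignment - 1 bytes so an aligned start always fits.
    auto raw = new ubyte[count * float.sizeof + alignment - 1];

    // Distance from the raw start to the next multiple of 16 (0 if already aligned).
    auto addr = cast(size_t) raw.ptr;
    auto offset = (alignment - addr % alignment) % alignment;

    // Reinterpret the aligned region as floats.
    return (cast(float*)(raw.ptr + offset))[0 .. count];
}

void main()
{
    auto v = allocAligned16(100);
    assert(v.length == 100);
    assert((cast(size_t) v.ptr) % 16 == 0);
    v[] = 1.0f; // the whole slice is usable
}
```

Note that appending to the returned slice may reallocate it, and the new block would not necessarily keep the alignment.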

>Also, I'm not sure if making a struct like so will work:
>
>align(16) struct someName
>{
>	...
>}
>
>Does that determine how the elements are packed or does that ensure that the struct's address is a multiple of 16?

That won't do what you want; instead, each member of the struct will be aligned on a 16-byte boundary relative to the address of the struct.
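
For comparison, here is a small sketch of how the same question can be checked at compile time.  It assumes a present-day D compiler, where (as far as the current language specification describes it) an align attribute placed outside a struct sets the alignment of the struct itself rather than of its members.  That is different from the compiler behaviour described in this thread, so treat it as an assumption to verify against your own compiler.  The struct name Vec4 is made up.

```d
// Assumption: present-day D semantics, not those of the 2006 compilers.
align(16) struct Vec4
{
    float[4] v;
}

// The attribute outside the struct sets the alignment of the struct itself.
static assert(Vec4.alignof == 16);

// The member is not pushed to a 16-byte offset; it starts at offset 0.
static assert(Vec4.v.offsetof == 0);

// Four floats, no padding added.
static assert(Vec4.sizeof == 16);

void main()
{
    // All of the checks above run at compile time; nothing to do here.
}
```

If any of the static asserts fails on a given compiler, the over-allocation approach remains the fallback.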


January 23, 2006
align(X) is still broken in the compiler for anything greater than 2.  I'm assuming that, in the grand scheme of things, there's a far more important (and wide-reaching) feature list that Walter is working on before he fixes this.

In article <dqv19a$bjq$1@digitaldaemon.com>, mclysenk@mtu.edu says...
>
>In article <dqcu4p$29j9$1@digitaldaemon.com>, Chad J says...
>>
>>I am hoping to find a way to use SIMD instructions in my programs to make them faster without rolling out assembly code each and every time.
>>  About a week ago I started working on some functions that would do MMX
>>operations to arrays of data.  So far so good, but now I wonder if someone has done this already.  I couldn't find anything in dsource or wiki4D.  Maybe there is something like this in C that could easily be used in D?
>>
>
>That depends on what you want to do.  Usually SSE/MMX is implemented as a compiler intrinsic, (see the Intel compiler and Visual Studios).  For scientific calculations, there are libraries of preimplemented mathematical routines using vector optimizations; like BLAS or LINPACK.  For games, most programmers just roll their own using inline assembler or compiler intrinsics. Hobby projects do not usually need the extra speed from SSE optimizations.
>
>One thing I have proposed as have many others, is vectorization at the language level.  Such a feature would allow efficient and portable algorithms that take full advantage of modern SIMD hardware, at very low cost to the programmer.
>
>>That brings me to SSE, SSE2, and maybe SSE3.  Now for these to be very fast, the data they operate on must be aligned to 16-byte boundaries in memory.  I found other posts on this forum where people had problems doing that in D, and they were either unresolved or there was no follow up post.  So I was wondering if there was a good (fast) way to make sure the data in arrays is aligned to a 16-byte boundary?
>>
>
>This is a problem, and to my knowledge, it is unresolved. What you could do is overallocate the memory by 15 bytes, then shift its starting address so that it always begins on the correct boundary.  Ideally, the linker should be able to position static objects on the correct boundary, but I have no idea how to make it happen.
>
>>Also, I'm not sure if making a struct like so will work:
>>
>>align(16) struct someName
>>{
>>	...
>>}
>>
>>Does that determine how the elements are packed or does that ensure that the struct's address is a multiple of 16?
>
>That won't do what you want, instead each member of the struct will be aligned on a 16-byte boundary relative to the address of struct.
>
>


January 26, 2006
On 2006-01-21 22:22:02 -0600, mclysenk@mtu.edu said:
> 
> That depends on what you want to do.  Usually SSE/MMX is implemented as a
> compiler intrinsic, (see the Intel compiler and Visual Studios).  For scientific
> calculations, there are libraries of preimplemented mathematical routines using
> vector optimizations; like BLAS or LINPACK.  For games, most programmers just
> roll their own using inline assembler or compiler intrinsics. Hobby projects do
> not usually need the extra speed from SSE optimizations.
> 
> One thing I have proposed as have many others, is vectorization at the language
> level.  Such a feature would allow efficient and portable algorithms that take
> full advantage of modern SIMD hardware, at very low cost to the programmer.


Personally, at the very least, I'd just like to have intrinsics. If it's going to be a big undertaking for some kind of vectorization vs. a simple intrinsic interface, I'd rather take the latter and not wait another six years and 150+ release iterations for the big 2-point-0. Know what I mean?