Thread overview: Vector Optimizations
  Jan 27, 2006  Kyle Furlong
  Jan 27, 2006  James Dunne
  Jan 27, 2006  Kyle Furlong
  Jan 27, 2006  James Dunne
January 27, 2006
Well, my templated Vector is nearly complete! What remains are some complex-algebra quirks, implementing fast algorithms for cross products of vectors of different dimensions, and optimizing the various operations.

In this vein, what is the best way to get SIMD instructions (SSE, SSE2, etc.) emitted by the dmd compiler? Any other comments, links, or suggestions on my Vector struct or on vector optimization in D are welcome.

P.S. - The sources are attached, but the latest Velocity source can be found at http://svn.dsource.org/projects/velocity/trunk/source/


January 27, 2006
Kyle Furlong wrote:
> Well, my templated Vector is nearly complete! What remains are some complex-algebra quirks, implementing fast algorithms for cross products of vectors of different dimensions, and optimizing the various operations.
> 
> In this vein, what is the best way to get SIMD instructions (SSE, SSE2, etc.) emitted by the dmd compiler? Any other comments, links, or suggestions on my Vector struct or on vector optimization in D are welcome.
> 
> [snippy]

I'm afraid that with this sort of approach the best kind of SIMD optimization you'd get would be single operations.  Unless the compiler can perform ridiculous amounts of optimization, you won't get *much* benefit.  You will get some, and it might be enough for you.  But obviously hand-written SIMD instructions in assembly language are going to yield far superior results.

True support for SIMD instructions should be a function of the compiler, because it needs to know how to align the data in memory, which registers are available for use, and how it can streamline vectorized operations.  If the compiler doesn't optimize your code much and leaves the method calls in, you'll end up with exactly that: calling a function to copy a vector into a SIMD register, executing the SIMD instruction, copying the vector out of the register onto the stack, and returning that value to the caller somehow.  All in all, extremely suboptimal IMHO.
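To make that concrete, here is a rough sketch of what such a per-call wrapper ends up looking like, assuming a 32-bit x86 build and DMD's inline assembler (the function name and layout are made up for illustration):

    float[4] addps4(float[4] a, float[4] b)
    {
        float[4] result;
        asm
        {
            lea EAX, a;
            movups XMM0, [EAX];   // copy vector a from the stack into an SSE register
            lea EAX, b;
            movups XMM1, [EAX];   // copy vector b into another register
            addps XMM0, XMM1;     // the actual work: four float adds in one instruction
            lea EAX, result;
            movups [EAX], XMM0;   // copy the result back out to the stack
        }
        return result;
    }

Unless the compiler inlines this and removes the redundant loads and stores between calls, the single addps is surrounded by call overhead and memory traffic.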

So, in summary, it seems the OO methodology doesn't apply well to vectorization.

-- 
Regards,
James Dunne
January 27, 2006
James Dunne wrote:
> Kyle Furlong wrote:
>> Well, my templated Vector is nearly complete! What remains are some complex-algebra quirks, implementing fast algorithms for cross products of vectors of different dimensions, and optimizing the various operations.
>>
>> In this vein, what is the best way to get SIMD instructions (SSE, SSE2, etc.) emitted by the dmd compiler? Any other comments, links, or suggestions on my Vector struct or on vector optimization in D are welcome.
>>
>> [snippy]
> 
> I'm afraid that with this sort of approach the best kind of SIMD optimization you'd get would be single operations.  Unless the compiler can perform ridiculous amounts of optimization, you won't get *much* benefit.  You will get some, and it might be enough for you.  But obviously hand-written SIMD instructions in assembly language are going to yield far superior results.
> 
> True support for SIMD instructions should be a function of the compiler, because it needs to know how to align the data in memory, which registers are available for use, and how it can streamline vectorized operations.  If the compiler doesn't optimize your code much and leaves the method calls in, you'll end up with exactly that: calling a function to copy a vector into a SIMD register, executing the SIMD instruction, copying the vector out of the register onto the stack, and returning that value to the caller somehow.  All in all, extremely suboptimal IMHO.
> 
> So, in summary, it seems the OO methodology doesn't apply well to vectorization.
> 

Basically I want to utilize all the capabilities of the processor for each operation. For example, most of the operations at the moment run a for loop across all the elements of the wrapped array of whatevers and apply the operation. I'm asking how to optimize this to use the SIMD instructions of the various processors. I tried to unroll the loops using the Duff's device Walter put together with templates, but I couldn't figure out how to get it to work with array indexing.
January 27, 2006
Kyle Furlong wrote:
> James Dunne wrote:
> 
>> Kyle Furlong wrote:
>>
>>> Well, my templated Vector is nearly complete! What remains are some complex-algebra quirks, implementing fast algorithms for cross products of vectors of different dimensions, and optimizing the various operations.
>>>
>>> In this vein, what is the best way to get SIMD instructions (SSE, SSE2, etc.) emitted by the dmd compiler? Any other comments, links, or suggestions on my Vector struct or on vector optimization in D are welcome.
>>>
>>> [snippy]
>>
>>
>> I'm afraid that with this sort of approach the best kind of SIMD optimization you'd get would be single operations.  Unless the compiler can perform ridiculous amounts of optimization, you won't get *much* benefit.  You will get some, and it might be enough for you.  But obviously hand-written SIMD instructions in assembly language are going to yield far superior results.
>>
>> True support for SIMD instructions should be a function of the compiler, because it needs to know how to align the data in memory, which registers are available for use, and how it can streamline vectorized operations.  If the compiler doesn't optimize your code much and leaves the method calls in, you'll end up with exactly that: calling a function to copy a vector into a SIMD register, executing the SIMD instruction, copying the vector out of the register onto the stack, and returning that value to the caller somehow.  All in all, extremely suboptimal IMHO.
>>
>> So, in summary, it seems the OO methodology doesn't apply well to vectorization.
>>
> 
> Basically I want to utilize all the capabilities of the processor for each operation. For example, most of the operations at the moment run a for loop across all the elements of the wrapped array of whatevers and apply the operation. I'm asking how to optimize this to use the SIMD instructions of the various processors. I tried to unroll the loops using the Duff's device Walter put together with templates, but I couldn't figure out how to get it to work with array indexing.

1) You must know the capabilities of the processor your library will be running on.  This can be determined at runtime with the CPUID instruction on x86 machines: you can check flags that indicate support for SSE, SSE2, SSE3, MMX, 3DNow!, etc.  (See the sketch after this list.)

2) You must guarantee that the data you are working with is aligned in memory per the requirements of the specific SIMD capabilities you're targeting.  This can be done with custom class allocators.  Do structs have this ability too?

3) SIMD instruction sets usually have very fixed capabilities, such as working with either a vector of four 32-bit floats or two 64-bit doubles.  These hardware registers have fixed size limitations.  Continuing to allow n-dimensional vectors via templates will have you writing so much special-case code for SIMD support that the templating just won't be worth it.  Besides, most mathematics dealing with physical laws uses vectors of dimension 4 or less.  Also, you might have to ditch the complex-number support.

If you really want SIMD support, there isn't any way around these requirements.
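Here is a minimal sketch of points 1) and 2), assuming a recent DMD and a 32-bit x86 build (the names are made up; newer D runtimes also expose these feature flags through core.cpuid):

    // 1) Feature detection with the CPUID instruction via the inline assembler.
    bool cpuHasSSE2()
    {
        uint features;
        asm
        {
            push EBX;          // CPUID clobbers EBX; preserve it
            mov EAX, 1;        // leaf 1: processor info and feature bits
            cpuid;
            mov features, EDX;
            pop EBX;
        }
        return (features & (1 << 26)) != 0;   // EDX bit 25 = SSE, bit 26 = SSE2
    }

    // 2) and 3) A fixed-size vector whose storage is 16-byte aligned,
    // matching what an SSE register actually holds: four 32-bit floats.
    align(16) struct Vec4
    {
        float[4] data;
    }

The feature check would normally run once at startup, and the result used to choose between a SIMD code path and a plain scalar fallback.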

As for loop unrolling, you can simply hand-unroll your loops.  The basic concept is to do as much work as you can within one loop iteration while simultaneously cutting down the number of iterations.  Loops mean branches, and branches are bad for pipelines.

In fact, I'm quite sure a generic templating approach to loop unrolling would be great!
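As a rough sketch of that templating approach (written against present-day D syntax rather than the 2006 compiler; the names are made up), a recursive template can emit one statement per element, each with a compile-time constant index, so the runtime loop and its branch disappear entirely:

    struct Vec(T, size_t N)
    {
        T[N] data;
    }

    // Expands to a.data[0] += b.data[0]; a.data[1] += b.data[1]; ... at compile time.
    void addUnrolled(T, size_t N, size_t i = 0)(ref Vec!(T, N) a, const ref Vec!(T, N) b)
    {
        static if (i < N)
        {
            a.data[i] += b.data[i];
            addUnrolled!(T, N, i + 1)(a, b);
        }
    }

    unittest
    {
        Vec!(float, 4) x, y;
        x.data = [1.0f, 2.0f, 3.0f, 4.0f];
        y.data = [10.0f, 20.0f, 30.0f, 40.0f];
        addUnrolled(x, y);
        assert(x.data == [11.0f, 22.0f, 33.0f, 44.0f]);
    }

The same shape works for any element-wise operation: the recursion bottoms out once the index reaches N, and every subscript the compiler sees is a constant.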

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O M--@ V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e h>--->++ r+++ y+++
------END GEEK CODE BLOCK------

James Dunne