SIMD support... (page 13)

On 12.01.2012 23:10, Peter Alexander wrote:
> On 12/01/12 8:13 PM, Norbert Nemec wrote:
>> Considering these hardware details of the SSE architecture alone, I fear
>> that portable low-level support for SIMD is very hard to achieve. If you
>> want to offer access to the raw power of each architecture, it might be
>> simpler to have machine-specific language extensions for SIMD and leave
>> the portability for a wrapper library with a common front-end and
>> various back-ends for the different architectures.
>
> You are right, but don't forget that the same is true for instructions
> already in the language. For example, (1 << x) is a very slow operation
> on PPUs (it's micro-coded).
>
> It's simply not possible to be portable and achieve maximum performance
> for any language features, not just vectors. Algorithms must be tuned
> for specific architectures in version statements. However, you can get a
> decent baseline by providing the lowest common denominator in
> functionality. This v128 type (or whatever it will be called) does that.

Actually, my essential message is: The single v128 is too simplistic for the SSE architecture. You actually need different types because the compiler needs to know what type is stored in any given register to be able to move it around.

On 13 January 2012 08:34, Norbert Nemec <Norbert@nemec-online.de> wrote:
> On 12.01.2012 23:10, Peter Alexander wrote:
>> On 12/01/12 8:13 PM, Norbert Nemec wrote:
>>> Considering these hardware details of the SSE architecture alone, I fear
>>> that portable low-level support for SIMD is very hard to achieve. If you
>>> want to offer access to the raw power of each architecture, it might be
>>> simpler to have machine-specific language extensions for SIMD and leave
>>> the portability for a wrapper library with a common front-end and
>>> various back-ends for the different architectures.
>>
>> You are right, but don't forget that the same is true for instructions
>> already in the language. For example, (1 << x) is a very slow operation
>> on PPUs (it's micro-coded).
>>
>> It's simply not possible to be portable and achieve maximum performance
>> for any language features, not just vectors. Algorithms must be tuned
>> for specific architectures in version statements. However, you can get a
>> decent baseline by providing the lowest common denominator in
>> functionality. This v128 type (or whatever it will be called) does that.
>
> Actually, my essential message is: The single v128 is too simplistic for
> the SSE architecture. You actually need different types because the
> compiler needs to know what type is stored in any given register to be
> able to move it around.

This has already been concluded some days back; the language has a suite of types, just like GCC.

MS has three types, __m128, __m128i and __m128d (float, int, double).

Six if you count AVX's 256 forms.

On 1/7/2012 6:54 PM, Peter Alexander wrote:
> On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
> I agree with Manu that we should just have a single type like __m128 in
> MSVC. The other types and their conversions should be solvable in a
> library with something like strong typedefs.

January 15, 2012

Re: SIMD support...

Posted by Sean Cavanaugh
in reply to Manu

On 1/6/2012 9:44 AM, Manu wrote:
> On 6 January 2012 17:01, Russel Winder <russel@russel.org.uk> wrote:
> As said, I think these questions are way outside the scope of SIMD
> vector libraries ;)
> Although this is a fundamental piece of the puzzle, since GPGPU is no
> use without SIMD type expression... but I think everything we've
> discussed here so far will map perfectly to GPGPU.

I don't think you are in any danger, as the GPGPU instructions are more flexible than their CPU SIMD counterparts. GPU hardware natively works with float2 and float3 extremely well. GPUs have VLIW instructions that can effectively add a huge number of instruction modifiers to their instructions (things like built-in saturates to the 0..1 range on variable argument _reads_, arbitrary swizzle on read and write, write masks that leave partial data untouched, etc., all in one clock).

The CPU SIMD stuff is simplistic by comparison. A good bang for the buck would be to have some basic set of operators (* / + - < > == != <= >= and especially ?:, the ternary operator), and versions of 'any' and 'all' from HLSL for dynamic branching, that work at the very least for integer, float, and double types.

Bit shifting is useful (especially when manipulating floats for transcendental functions; working with half-precision FP16 types requires a lot of it), but should be restricted to integer types. Having dedicated signed and unsigned right shifts would be pretty nice too (since about 95% of my right shifts end up needing to be of the zero-extended variety even though I had to cast to 'vector integers').
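To make the signed/unsigned distinction concrete, here is a minimal C++ sketch, assuming an x86 or x86-64 target with SSE2. The helper names (shr_logical, shr_arithmetic, float_exponent_bits, lane0) are invented for this illustration and are not part of any proposal in the thread; only the _mm_* intrinsics are real.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (assumes an x86/x86-64 target)
#include <cstdint>

// Logical right shift: vacated high bits are filled with zeros
// (the "zero-extended variety").
inline __m128i shr_logical(__m128i v, int count) {
    return _mm_srl_epi32(v, _mm_cvtsi32_si128(count));
}

// Arithmetic right shift: vacated high bits are copies of the sign bit.
inline __m128i shr_arithmetic(__m128i v, int count) {
    return _mm_sra_epi32(v, _mm_cvtsi32_si128(count));
}

// Typical float bit manipulation: reinterpret four floats as integers
// and pull out the 8-bit biased exponent of each lane.
inline __m128i float_exponent_bits(__m128 f) {
    __m128i bits = _mm_castps_si128(f);  // reinterpret bits, no conversion
    return _mm_and_si128(_mm_srli_epi32(bits, 23), _mm_set1_epi32(0xFF));
}

// Read the lowest 32-bit lane as a scalar.
inline std::int32_t lane0(__m128i v) {
    return _mm_cvtsi128_si32(v);
}
```

With all lanes set to -16, shr_arithmetic by 2 gives -4 per lane while shr_logical by 2 gives 0x3FFFFFFC, which is why a type-directed or explicitly named pair of shifts matters.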

On 1/14/2012 9:58 PM, Sean Cavanaugh wrote:
> MS has three types, __m128, __m128i and __m128d (float, int, double)
>
> Six if you count AVX's 256 forms.
>
> On 1/7/2012 6:54 PM, Peter Alexander wrote:
>> On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
>> I agree with Manu that we should just have a single type like __m128 in
>> MSVC. The other types and their conversions should be solvable in a
>> library with something like strong typedefs.

The trouble with MS's scheme is, given the following:

    __m128i v;
    v += 2;

Can't tell what to do. With D,

    int4 v;
    v += 2;

it's clear (add 2 to each of the 4 ints).
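The same idea can be approximated in C++ by wrapping the untyped register in a struct that fixes the lane type. This is only a sketch under assumptions: the int4 struct below is a hypothetical wrapper written for this example (it is not D's int4 and not an MSVC type), and it assumes an SSE2-capable x86 target.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics (assumes x86/x86-64)

// Hypothetical wrapper: the element type (four 32-bit ints) is part of
// the C++ type, so "+= 2" has exactly one sensible meaning.
struct int4 {
    __m128i v;

    explicit int4(__m128i raw) : v(raw) {}
    int4(int a, int b, int c, int d) : v(_mm_setr_epi32(a, b, c, d)) {}

    // Broadcast the scalar to all four lanes, then add lane-wise.
    int4& operator+=(int s) {
        v = _mm_add_epi32(v, _mm_set1_epi32(s));
        return *this;
    }

    // Copy lane i (0..3) out through an aligned temporary.
    int lane(int i) const {
        alignas(16) int out[4];
        _mm_store_si128(reinterpret_cast<__m128i*>(out), v);
        return out[i];
    }
};
```

A bare __m128i could equally hold 16 bytes, 8 shorts or 2 longs, so the wrapper has to choose _mm_add_epi32 on the user's behalf; a built-in int4 lets the compiler make that choice from the type.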

On 1/6/2012 7:58 PM, Manu wrote:
> On 7 January 2012 03:46, Vladimir Panteleev <vladimir@thecybershadow.net> wrote:
>
> I've never seen a memcpy on any console system I've ever worked on that
> takes advantage of its large registers... writing a fast memcpy is
> usually one of the first things we do when we get a new platform ;)

Plus memcpy is optimized for reading and writing to cached virtual memory, so you need several other variants to write to write-combined or uncached memory efficiently and whatnot.
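One common shape for such a variant on x86 is a copy loop built on non-temporal (streaming) stores, which bypass the cache on the write side. The sketch below is an assumption about what such a routine might look like, not code from anyone in this thread: stream_copy16 is a made-up name, it requires a 16-byte-aligned destination and a byte count that is a multiple of 16, and it targets SSE2.

```cpp
#include <emmintrin.h>  // SSE2 (assumes x86/x86-64)
#include <cstddef>

// Hypothetical helper: copy 'bytes' bytes with non-temporal stores.
// Assumptions: dst is 16-byte aligned, bytes is a multiple of 16,
// and the two ranges do not overlap. src may be unaligned.
inline void stream_copy16(void* dst, const void* src, std::size_t bytes) {
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < bytes / 16; ++i) {
        __m128i chunk = _mm_loadu_si128(s + i);  // unaligned load is allowed
        _mm_stream_si128(d + i, chunk);          // store that bypasses the cache
    }
    _mm_sfence();  // make the streamed stores globally visible before returning
}
```

A production routine would also handle unaligned heads and tails and would pick a strategy per memory type; this only shows why a single cached-memory memcpy is not enough.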

On 1/15/2012 12:09 AM, Walter Bright wrote:
> On 1/14/2012 9:58 PM, Sean Cavanaugh wrote:
>> MS has three types, __m128, __m128i and __m128d (float, int, double)
>>
>> Six if you count AVX's 256 forms.
>>
>> On 1/7/2012 6:54 PM, Peter Alexander wrote:
>>> On 7/01/12 9:28 PM, Andrei Alexandrescu wrote:
>>> I agree with Manu that we should just have a single type like __m128 in
>>> MSVC. The other types and their conversions should be solvable in a
>>> library with something like strong typedefs.
>
> The trouble with MS's scheme, is given the following:
>
>     __m128i v;
>     v += 2;
>
> Can't tell what to do. With D,
>
>     int4 v;
>     v += 2;
>
> it's clear (add 2 to each of the 4 ints).

Working with their intrinsics in their raw form for real code is pure insanity :) You need to wrap it all with a good math library (even if 90% of the library is the intrinsics wrapped into __forceinline functions), so you can start having sensible operator overloads, and so you can write code that is readable.

    if (any4(a > b))
    {
        // do stuff
    }

is way way way better than (pseudocode)

    if (_mm_movemask_ps(_mm_cmpgt_ps(a, b)) != 0)
    {
    }

and (if the ternary operator were overridable in C++)

    float4 foo = (a > b) ? c : d;

would be better than

    float4 mask = _mm_cmpgt_ps(a, b);
    float4 foo = _mm_or_ps(_mm_and_ps(mask, c), _mm_andnot_ps(mask, d));
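For readers who want to see what such a wrapper layer might look like, here is a minimal C++ sketch assuming an SSE-capable x86 target. The float4 struct and the any4, all4, select and lane helpers are illustrative names chosen for this example, not an existing library; since C++ cannot overload ?:, select stands in for the ternary form.

```cpp
#include <xmmintrin.h>  // SSE float intrinsics (assumes x86/x86-64)

// Hypothetical thin wrapper around a 4 x float register.
struct float4 {
    __m128 v;
    float4(__m128 raw) : v(raw) {}
    float4(float a, float b, float c, float d) : v(_mm_setr_ps(a, b, c, d)) {}
};

// Lane-wise a > b: each lane becomes all ones if true, all zeros if false.
inline float4 operator>(float4 a, float4 b) {
    return _mm_cmpgt_ps(a.v, b.v);
}

// True if at least one lane of the comparison mask is set.
inline bool any4(float4 mask) { return _mm_movemask_ps(mask.v) != 0; }

// True only if all four lanes of the comparison mask are set.
inline bool all4(float4 mask) { return _mm_movemask_ps(mask.v) == 0x0F; }

// Per-lane "mask ? c : d", the stand-in for an overloaded ternary.
inline float4 select(float4 mask, float4 c, float4 d) {
    return _mm_or_ps(_mm_and_ps(mask.v, c.v), _mm_andnot_ps(mask.v, d.v));
}

// Copy lane i (0..3) out through an aligned temporary.
inline float lane(float4 x, int i) {
    alignas(16) float out[4];
    _mm_store_ps(out, x.v);
    return out[i];
}
```

Usage then reads close to the wished-for form: if (any4(a > b)) { ... } and float4 foo = select(a > b, c, d);. Note that _mm_andnot_ps(mask, d) computes (~mask) & d, so the mask must be the first argument.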

On 1/13/2012 7:38 AM, Manu wrote:
> On 13 January 2012 08:34, Norbert Nemec <Norbert@nemec-online.de> wrote:
>
> This has already been concluded some days back; the language has a suite
> of types, just like GCC.

So I would definitely like to help out on the SIMD stuff in some way, as I have a lot of experience using SIMD math to speed up the games I work on. I've got a vectorized set of transcendental functions for float and double (currently in the form of MSVC++ intrinsics) that would be a good start if anyone is interested. Beyond that I just want to help 'make it right' because it's a topic I care a lot about, and it is my personal biggest gripe with the language at the moment.

I also have experience with VMX. As the two are not exactly the same, it definitely would help to avoid making the code too Intel-centric (though typically VMX is the more flexible design, as it can do dynamic shuffling based on the contents of the vector registers, etc.).
