On 11/2/2012 3:50 AM, Jens Mueller wrote:
> Okay. For me they look the same. Can you elaborate, please? Assume I
> want to add two float vectors which is common in both games and
> scientific computing. The only difference is in games their length is
> usually 3 or 4 whereas in scientific computing they are of arbitrary
> length. Why do I need intrinsics to support the game setting?

Another excellent question.

Most languages have taken the "auto-vectorization" approach of reverse engineering loops to turn them into high level constructs, and then compiling the code into special SIMD instructions.

How to do this is explained in detail in the (rare) book "The Software Vectorization Handbook" by Bik, which I fortunately was able to obtain a copy of.

This struck me as a terrible approach, however. It just seemed stupid to try to teach the compiler to reverse engineer low level code into high level code. A better design would be to start with high level code. Hence, the appearance of D vector operations.
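For readers who have not seen them, here is a minimal sketch of a D vector (array) operation applied to the float-vector addition from the question. The function name addVectors is made up for illustration; only the `[]` array-operation syntax comes from the language itself.

```d
import std.stdio;

// Hypothetical helper: element-wise sum of two equal-length float arrays.
float[] addVectors(const float[] a, const float[] b)
{
    assert(a.length == b.length, "operands must have the same length");
    auto sum = new float[a.length];

    // The array operation: no explicit loop, so the compiler starts
    // from the high-level intent rather than reverse engineering it.
    sum[] = a[] + b[];
    return sum;
}

void main()
{
    float[] a = [1.0f, 2.0f, 3.0f, 4.0f, 5.0f];
    float[] b = [10.0f, 20.0f, 30.0f, 40.0f, 50.0f];
    writeln(addVectors(a, b));
}
```

The same line works for length 3 or length 3 million, which is exactly why it is general purpose.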

The trouble with D vector operations, however, is that they are too general purpose. The SIMD instructions are very quirky, and it's easy to unwittingly and silently cause the compiler to generate terribly slow code. The reasons for that are the alignment requirements, coupled with the SIMD instructions not being orthogonal: some operations work for some types and not for others, in a way that is unintuitive unless you're carefully reading the SIMD specs.

Just saying align(16) isn't good enough, as the vector ops work on slices and those slices aren't always aligned. So each one has to check alignment at runtime, which is murder on performance.
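To make the slice problem concrete, here is a rough sketch of the kind of alignment test involved. The helpers isAligned16 and alignedStart are invented for this example and are not part of any library; the point is only that two slices of the same buffer can differ in alignment, so the answer is not known at compile time.

```d
import std.stdio;

// Hypothetical helper: true when the slice starts on a 16-byte boundary.
bool isAligned16(const(float)[] s)
{
    return (cast(size_t) s.ptr & 15) == 0;
}

// Hypothetical helper: index of the first 16-byte-aligned element in buf.
size_t alignedStart(const(float)[] buf)
{
    immutable misalign = cast(size_t) buf.ptr & 15;
    return misalign ? (16 - misalign) / float.sizeof : 0;
}

void main()
{
    auto buf = new float[12];
    buf[] = 0;

    immutable s = alignedStart(buf);
    float[] aligned = buf[s .. s + 4];     // starts on a 16-byte boundary
    float[] shifted = buf[s + 1 .. s + 5]; // same data, one float (4 bytes) later

    writeln(isAligned16(aligned)); // true
    writeln(isAligned16(shifted)); // false
}
```

Both slices have type float[], so a vector op receiving either one cannot tell them apart without a test like this at runtime.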

If a particular vector op for a particular type has no SIMD support, then the compiler has to generate workaround code. This can also have terrible performance consequences.

So the user writes vector code, benchmarks it, finds zero improvement, and the reasons why will be elusive to anyone but an expert SIMD programmer.

(Auto-vectorizing technology has similar issues, pretty much meaning you won't get fast code out of it unless you've got a habit of examining the assembler output and tweaking as necessary.)

Enter Manu, who has a lot of experience making SIMD work for games. His proposal was:

1. Have native SIMD types. This will guarantee alignment, and will guarantee a compile time error for SIMD types that are not supported by the CPU.

2. Have the compiler issue an error for SIMD operations that are not supported by the CPU, rather than silently generating inefficient workaround code.

3. There are all kinds of weird but highly useful SIMD instructions that don't have a straightforward representation in high level code, such as saturated arithmetic. Manu's answer was to expose these instructions via intrinsics, so the user can string them together, be sure that they will generate real SIMD instructions, while the compiler can deal with register allocation.
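As a rough illustration of the three points above, here is a sketch using the core.simd module. It assumes an x86-64 target where float4 and ubyte16 are available. The function names add4 and addSaturated are made up for the example, and the `__simd` intrinsic with `XMM.PADDUSB` is specific to DMD (guarded by `version (D_SIMD)`); other compilers expose intrinsics differently.

```d
import core.simd;

// Hypothetical helper: add two native 4-float vectors.
// float4 is a native SIMD type, so alignment is guaranteed by the type.
float4 add4(float4 a, float4 b)
{
    return a + b;
}

version (D_SIMD)
{
    // Hypothetical helper, DMD only: unsigned saturated byte add.
    // Saturated arithmetic has no plain operator, so it is reached
    // through an intrinsic; register allocation is left to the compiler.
    ubyte16 addSaturated(ubyte16 a, ubyte16 b)
    {
        return cast(ubyte16) __simd(XMM.PADDUSB, a, b);
    }
}

void main()
{
    float4 a = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 b = [10.0f, 20.0f, 30.0f, 40.0f];
    float4 sum = add4(a, b);

    // .array exposes the lanes as a static array for inspection.
    assert(sum.array == [11.0f, 22.0f, 33.0f, 44.0f]);
}
```

For the length-3 or length-4 vectors common in games, a type like float4 fits the data directly, which is where this differs from the arbitrary-length slice case.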

Well, I wouldn't claim any credit for the approach ;) .. I think this is the standard for maximum performance, and also very well understood.

But the thing that excites me most is the potential quality of libraries that can be built on top. D has so much potential to build on this SIMD foundation, with its templates being able to intelligently handle far more context-specific situations.

What we do already in other languages will be far more convenient, more portable, and possibly even produce better code in D. And the biggest bonus, it will be readable! :)

I think it's a low risk investment, and it doesn't prohibit higher level support in the future.

This approach works, is inlineable, generates code as good as hand-built assembler, and is usable by regular programmers.

I won't say there aren't better approaches, but this one we know works.

Aye, and it's relatively unintrusive too. Some new types and a few intrinsics, then build useful libraries on top. It shouldn't have complex side effects, and it offers something that is sorely missing from the language today.