June 29, 2013 Re: SIMD on Windows
Posted in reply to Jonathan Dunlap

See Manu's talk and google how to use it. If you don't know what you're doing, you are unlikely to see performance improvements. I'm not even sure whether you're benchmarking SIMD performance or function call overhead there.
June 29, 2013 Re: SIMD on Windows
Posted in reply to jerro

I've updated the project with your suggestions at http://dpaste.dzfl.pl/fce2d93b but still get the same performance. Vectors are defined in the benchmark function body, there is no function calling overhead, etc. Some of my comments below:

> First of all, calcSIMD and calcScalar are virtual functions so they can't be inlined, which prevents any further optimization.

From the dlang docs: member functions which are private or package are never virtual, and hence cannot be overridden.

> So my guess is that the first four multiplications and the second four multiplications in calcScalar are done in parallel. ... The reason it's faster is that gdc replaces multiplication by 2 with addition and omits multiplication by 1.

I've changed the multipliers 2 and 1 to 2.1 and 1.01 respectively. Still no performance difference between the two for me.
June 29, 2013 Re: SIMD on Windows
Posted in reply to Jonathan Dunlap

On 29 June 2013 18:57, Jonathan Dunlap <jadit2@gmail.com> wrote:
> I've updated the project with your suggestions at http://dpaste.dzfl.pl/fce2d93b but still get the same performance. Vectors are defined in the benchmark function body, there is no function calling overhead, etc. Some of my comments below:
>
>> First of all, calcSIMD and calcScalar are virtual functions so they can't be inlined, which prevents any further optimization.
>
> From the dlang docs: member functions which are private or package are never virtual, and hence cannot be overridden.
>
>> So my guess is that the first four multiplications and the second four multiplications in calcScalar are done in parallel. ... The reason it's faster is that gdc replaces multiplication by 2 with addition and omits multiplication by 1.
>
> I've changed the multipliers 2 and 1 to 2.1 and 1.01 respectively. Still no performance difference between the two for me.
s/class/struct/
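(A minimal sketch of what that substitution buys, assuming the benchmark state lives in the type; the field names below are hypothetical, not taken from the dpaste code:)

import core.simd;

struct Calc
{
    float4 s0 = 1, i0 = 2;  // hypothetical stand-ins for the benchmark state

    // Struct member functions are never virtual, so the compiler is
    // free to inline this into the benchmark loop.
    void calcSIMD()
    {
        s0 = s0 * i0;
    }
}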
--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
June 29, 2013 Re: SIMD on Windows
Posted in reply to Jonathan Dunlap

On Saturday, 29 June 2013 at 17:57:20 UTC, Jonathan Dunlap wrote:
> I've updated the project with your suggestions at http://dpaste.dzfl.pl/fce2d93b but still get the same performance. Vectors are defined in the benchmark function body, there is no function calling overhead, etc. Some of my comments below:
>
>> First of all, calcSIMD and calcScalar are virtual functions so they can't be inlined, which prevents any further optimization.
>
> From the dlang docs: member functions which are private or package are never virtual, and hence cannot be overridden.
>
>> So my guess is that the first four multiplications and the second four multiplications in calcScalar are done in parallel. ... The reason it's faster is that gdc replaces multiplication by 2 with addition and omits multiplication by 1.
>
> I've changed the multipliers 2 and 1 to 2.1 and 1.01 respectively. Still no performance difference between the two for me.
The multipliers 2 and 1 were the reason why the scalar code performed a little better than the SIMD code when compiled with GDC. The main reason why the scalar code isn't much slower than the SIMD code is instruction-level parallelism. Because the first four operations in calcScalar are independent (none of them depends on the result of any of the other three), modern x86-64 processors can execute them in parallel. Because of that, the speed of your program is limited by instruction latency, not throughput. That's why it doesn't really make a difference that the scalar version does four times as many operations.
You can also take advantage of instruction-level parallelism when using SIMD. For example, I get about the same number of iterations per second for the following two functions (when using GDC):
import gcc.attribute;

@attribute("forceinline") void calcSIMD1()
{
    // Four independent chains: s0..s3 don't depend on one another,
    // so their multiplies can overlap in the pipeline.
    s0 = s0 * i0;
    s0 = s0 * d0;
    s1 = s1 * i1;
    s1 = s1 * d1;
    s2 = s2 * i2;
    s2 = s2 * d2;
    s3 = s3 * i3;
    s3 = s3 * d3;
}

@attribute("forceinline") void calcSIMD2()
{
    // A single chain: each multiply has to wait for the previous one.
    s0 = s0 * i0;
    s0 = s0 * d0;
}
By the way, if performance is very important to you, you should try GDC (or LDC, but I don't think LDC is currently fully usable on Windows).
June 29, 2013 Re: SIMD on Windows
Posted in reply to Jonathan Dunlap

> From the dlang docs: member functions which are private or package are never virtual, and hence cannot be overridden.
The call to calcScalar compiles to this:

mov    rax, QWORD PTR [r12]        # load the vtable pointer
rex.W call QWORD PTR [rax+0x40]    # indirect call through a vtable slot
so I think the implementation doesn't conform to the spec in this case.
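(For comparison, a hedged sketch: marking the method final should turn that indirect vtable call into a direct, inlinable one. The names below are placeholders, not the thread's actual code:)

class Bench
{
    float s0 = 1, i0 = 2;  // placeholder state

    // final: no override is possible, so the call no longer needs the
    // vtable indirection shown above and can be inlined.
    final void calcScalar()
    {
        s0 = s0 * i0;
    }
}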
June 29, 2013 Re: SIMD on Windows
Posted in reply to jerro

>> modern x86-64 processors can execute them in parallel. Because of that, the speed of your program is limited by instruction latency, not throughput.

It seems like auto-vectorization to SIMD code may be the ideal strategy (as in, e.g., Java), since the conditions for getting any performance improvement appear to be very particular and situational... which is something the compiler may be best suited to handle. Thoughts?
June 29, 2013 Re: SIMD on Windows
Posted in reply to Jonathan Dunlap

> It seems like auto-vectorization to SIMD code may be the ideal strategy (as in, e.g., Java), since the conditions for getting any performance improvement appear to be very particular and situational... which is something the compiler may be best suited to handle. Thoughts?

The thing is that using SIMD efficiently often requires you to organize your data and your algorithm differently, which is something the compiler can't do for you. Another problem is that the compiler doesn't know how often different code paths will be executed, so it can't know how to use SIMD in the best way (that could be solved with profile-guided optimization, though). Alignment restrictions are another thing that can cause problems. For those reasons auto-vectorization only works in the simplest of cases (see the sketch below). But if you want auto-vectorization, GDC and LDC already do it.

I recommend watching Manu's talk (as Kiith-Sa has already suggested): http://youtube.com/watch?v=q_39RnxtkgM
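(A minimal sketch of such a "simplest of cases" loop that GDC's and LDC's auto-vectorizers tend to handle at -O3; the function and names are illustrative, not from the thread's code:)

// A flat loop with unit stride, no branches, and iterations that are
// independent of each other: the classic auto-vectorization candidate.
void scale(float[] dst, const(float)[] src, float k)
{
    assert(dst.length == src.length);
    foreach (i; 0 .. dst.length)
        dst[i] = src[i] * k;
}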
June 29, 2013 Re: SIMD on Windows
Posted in reply to jerro

I did watch Manu's talk a few days ago; it inspired me to start this project. With the updates in http://dpaste.dzfl.pl/fce2d93b, I'm still a bit clueless as to why there is almost zero performance difference... considering that it seems like an ideal setup to benefit from SIMD. I feel that if I can't see gains here, I shouldn't bother using SIMD in practice, where sometimes non-ideal operations must be done.
June 29, 2013 Re: SIMD on Windows
Posted in reply to Rainer Schuetze

> versioning on Win32/Win64 no longer works.

Why? Or what exactly no longer works? Details, please.
June 30, 2013 Re: SIMD on Windows
Posted in reply to Jonathan Dunlap

You should probably watch my talk again ;)
Note the points I make towards the end, where I claim that "almost everyone who tries to use SIMD will see the same or slower performance, and the reason is they have simply revealed other bottlenecks". I also made the point that "only by strictly applying ALL of the points I demonstrated will you see significant performance improvement".
The problem with your code is that it doesn't do any real work. Your operations are all dependent on the result of the previous operation. The scalar operations have a shorter latency than the SIMD operations, and they all execute in parallel.

This is exactly the pathological worst-case comparison that basically everyone new to SIMD tries to write, and then wonders why it's slow.

I guess I should have demonstrated this point more clearly in my talk. It was very rushed (actually, the script was basically improvised on the spot), sorry about that!

There's not enough code in those loops. You're basically profiling loop iteration performance and the latency of a float opcode vs a SIMD opcode... not any significant work.
You should see a big difference if you unroll the loop 4-8 times (or more for such a short loop, depending on the CPU).
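(A hedged sketch of that unrolling with four independent float4 accumulators; the names are hypothetical, not taken from the dpaste code:)

import core.simd;

// Four independent accumulators per iteration: the chains don't depend
// on one another, so their multiply latencies can overlap.
void mulUnrolled(ref float4 s0, ref float4 s1,
                 ref float4 s2, ref float4 s3,
                 float4 m, size_t n)
{
    foreach (_; 0 .. n)
    {
        s0 = s0 * m;
        s1 = s1 * m;
        s2 = s2 * m;
        s3 = s3 * m;
    }
}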
I also made the point that you should always avoid doing SIMD profiling on x86, and especially on x64, since it is both the most forgiving architecture (it yields the smallest wins of any arch) and the hardest to predict; the performance difference you see will almost certainly not be the same on someone else's chip.
Look again at my points about latency, reducing the overall pipeline length (demonstrated with the addition sequence), and unrolling the loops.
On 30 June 2013 06:34, Jonathan Dunlap <jadit2@gmail.com> wrote:
> I did watch Manu's talk a few days ago; it inspired me to start this project. With the updates in http://dpaste.dzfl.pl/fce2d93b, I'm still a bit clueless as to why there is almost zero performance difference... considering that it seems like an ideal setup to benefit from SIMD. I feel that if I can't see gains here, I shouldn't bother using SIMD in practice, where sometimes non-ideal operations must be done.