You should probably watch my talk again ;)
Most of the points I make come towards the end, where I make the claim: "almost everyone who tries to use SIMD will see the same or slower performance, and the reason is that they have simply revealed other bottlenecks."
I also made the point that "only by strictly applying ALL of the points I demonstrated will you see significant performance improvement".
The problem with your code is that it doesn't do any real work: every operation depends on the result of the previous one, so you're timing a serial dependency chain, i.e. pure instruction latency. The scalar operations have shorter latency than the SIMD operations, and a modern out-of-order core will overlap whatever independent scalar work there is in parallel anyway.
This is exactly the pathological worst-case comparison that basically everyone new to SIMD writes first, and then wonders why it's slow.
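To make that concrete, here's a minimal sketch (my own code, not yours) of the shape of loop I mean:

```c
// Pathological micro-benchmark: every iteration depends on the previous
// result, so the CPU can never overlap the adds. You end up timing the
// latency of a single add instruction, where scalar wins (or ties).
float sum_serial(const float *a, int n) {
    float acc = 0.0f;           // one accumulator = one long dependency chain
    for (int i = 0; i < n; ++i)
        acc += a[i];            // each add must wait for the previous add
    return acc;
}
```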
I guess I should have demonstrated this point more clearly in my talk. It was very rushed (the script was basically improvised on the spot), sorry about that!
There's not enough code in those loops. You're basically profiling loop-iteration overhead and the latency of a float opcode vs. a SIMD opcode... not any significant work.
You should see a big difference if you unroll the loop 4-8 times (or more for such a short loop, depending on the CPU).
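Something like this, as a rough sketch in SSE intrinsics (my names, not from your code; it assumes n is a multiple of 16 and the data is 16-byte aligned, just to keep it short):

```c
#include <immintrin.h>

// Unrolled version: four independent SSE accumulators hide the add
// latency, so the pipeline stays full instead of stalling on one chain.
float sum_unrolled(const float *a, int n) {
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_load_ps(a + i));      // four independent
        acc1 = _mm_add_ps(acc1, _mm_load_ps(a + i + 4));  // chains that the
        acc2 = _mm_add_ps(acc2, _mm_load_ps(a + i + 8));  // core can execute
        acc3 = _mm_add_ps(acc3, _mm_load_ps(a + i + 12)); // in parallel
    }
    // Combine the partial sums at the end (the only serial part).
    __m128 acc = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    float out[4];
    _mm_storeu_ps(out, acc);
    return out[0] + out[1] + out[2] + out[3];
}
```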
I also made the point that you should avoid doing SIMD profiling on x86, and especially on x64, since it is both the most forgiving architecture (you'll see the smallest wins of any arch) and the hardest to predict; the performance difference you see will almost certainly not be the same on someone else's chip.
Look again at my points about latency, reducing the overall pipeline (dependency chain) length (demonstrated with the addition sequence), and unrolling the loops.
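To spell out the addition-sequence point with a toy example of my own (not the one from the talk):

```c
// (((a+b)+c)+d) is a chain three adds deep; ((a+b)+(c+d)) is only two
// deep, because the two inner adds are independent and can run in parallel.
float sum4_serial(float a, float b, float c, float d) { return ((a + b) + c) + d; }
float sum4_paired(float a, float b, float c, float d) { return (a + b) + (c + d); }
```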