It doesn't look like you account for alignment.
This is basically non-portable (I doubt unaligned loads in this context are faster than performing scalar operations), and possibly inefficient on x86 too.
Making it account for arbitrary alignment will be awkward, but it might be possible to do efficiently.
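
To illustrate what I mean by accounting for alignment, here is a rough sketch in C rather than D. It is not taken from your post: the name dot_any_alignment and the overall layout are my own assumptions. It peels scalar elements until the first array reaches a 16-byte boundary, then uses aligned loads for that array, and falls back to an unaligned load for the second array only when the two arrays are misaligned relative to each other. On targets without SSE the vector part is compiled out and everything runs as scalar code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define DOT_HAVE_SSE 1
#endif

/* Hypothetical helper (not from the original post): dot product of two
 * float arrays that may start at any float-aligned address. */
float dot_any_alignment(const float *a, const float *b, size_t n)
{
    assert((a != NULL && b != NULL) || n == 0);

    float sum = 0.0f;
    size_t i = 0;

    /* Scalar head: step forward until a + i sits on a 16-byte boundary. */
    while (i < n && ((uintptr_t)(a + i) & 15u) != 0) {
        sum += a[i] * b[i];
        ++i;
    }

#ifdef DOT_HAVE_SSE
    __m128 acc = _mm_setzero_ps();
    if (((uintptr_t)(b + i) & 15u) == 0) {
        /* Both arrays are aligned here: aligned loads on both sides. */
        for (; i + 4 <= n; i += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i),
                                             _mm_load_ps(b + i)));
    } else {
        /* The awkward case: only a is aligned, so b needs an unaligned load. */
        for (; i + 4 <= n; i += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i),
                                             _mm_loadu_ps(b + i)));
    }
    /* Horizontal sum of the four accumulator lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    sum += (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif

    /* Scalar tail: fewer than 4 elements left (or everything, without SSE). */
    for (; i < n; ++i)
        sum += a[i] * b[i];

    return sum;
}
```

The second branch is where portability suffers. On hardware with no unaligned load instruction, that loop would need a different strategy, for example two aligned loads of b combined with a shuffle or permute. Whether any of this beats plain scalar code is something I would measure rather than assume.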


On 18 August 2013 04:50, Ilya Yaroshenko <ilyayaroshenko@gmail.com> wrote:
http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html

Ilya