SIMD implementation of dot-product. Benchmarks
August 17, 2013 (Ilya Yaroshenko)
http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html

Ilya



August 17, 2013 (John Colvin)
On Saturday, 17 August 2013 at 18:50:15 UTC, Ilya Yaroshenko wrote:
> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html
>
> Ilya

Nice, that's a good speedup.

BTW: -march=native automatically implies -mtune=native
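
For example, the flags might be combined like this (a hypothetical invocation, just to illustrate; not taken from the post):

	gdc -O3 -frelease -march=native dot_product.d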
August 17, 2013 (Ilya Yaroshenko)
> BTW: -march=native automatically implies -mtune=native

Thanks, I'll remove mtune)
August 17, 2013 (John Colvin)
On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko wrote:
>> BTW: -march=native automatically implies -mtune=native
>
> Thanks, I'll remove mtune)

It would be really interesting if you could try writing the same code in C, both a scalar version and a version using GCC's vector intrinsics, to allow us to compare performance and identify areas for D to improve.
August 17, 2013 (bearophile)
Ilya Yaroshenko:
> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html

From the blog post:

> Compile the fast_math code separately from the rest of the program and then link it. This is an easy solution. However, it is a step back to C.<
> Introduce a @fast_math attribute. This is hard to realize, but I hope it will be done in future compilers.<

One solution is to copy a feature of Lisp, that is, to offer an annotation that specifies different compilation switches for individual functions.

It has been present in GNU C for some time too:
http://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html#Function-Specific-Option-Pragmas

Bye,
bearophile
August 18, 2013 (Manu)
It doesn't look like you account for alignment.
This is basically non-portable (I doubt unaligned loads in this context are
faster than performing scalar operations), and possibly inefficient on x86
too.
Making it account for potentially random alignment will be awkward, but it
might be possible to do efficiently.


On 18 August 2013 04:50, Ilya Yaroshenko <ilyayaroshenko@gmail.com> wrote:

> http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html
>
> Ilya


August 18, 2013 (Ilya Yaroshenko)
On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:
> It doesn't look like you account for alignment.
> This is basically non-portable (I doubt unaligned loads in this context are
> faster than performing scalar operations), and possibly inefficient on x86
> too.

dotProduct uses unaligned loads (__builtin_ia32_loadups256, __builtin_ia32_loadupd256) and is up to 21 times faster than the trivial scalar version.

Why are unaligned loads non-portable and inefficient?



> Making it account for potentially random alignment will be awkward, but it
> might be possible to do efficiently.

Did you mean using unaligned loads, or preparing the data for aligned loads at the beginning of the function?
August 18, 2013 (Ilya Yaroshenko)
On Saturday, 17 August 2013 at 19:38:52 UTC, John Colvin wrote:
> On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko wrote:
>>> BTW: -march=native automatically implies -mtune=native
>>
>> Thanks, I'll remove mtune)
>
> It would be really interesting if you could try writing the same code in C, both a scalar version and a version using GCC's vector intrinsics, to allow us to compare performance and identify areas for D to improve.

I am lazy )

I have looked at the assembler code:

float, scalar (main loop):
.L191:
	vmovss	xmm1, DWORD PTR [rsi+rax*4]
	vfmadd231ss	xmm0, xmm1, DWORD PTR [rcx+rax*4]
	add	rax, 1
	cmp	rax, rdi
	jne	.L191
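
For reference, a scalar D loop of roughly this shape (a hypothetical sketch, not the exact benchmark source) is what compiles down to the single vfmadd231ss above:

	float dotScalar(in float[] a, in float[] b)
	{
		float s = 0;
		foreach (i; 0 .. a.length)
			s += a[i] * b[i];   // fused into one vfmadd231ss per iteration
		return s;
	}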


float, vector (main loop):
.L2448:
	vmovups	ymm5, YMMWORD PTR [rax]
	sub	rax, -128
	sub	r11, -128
	vmovups	ymm4, YMMWORD PTR [r11-128]
	vmovups	ymm6, YMMWORD PTR [rax-96]
	vmovups	ymm7, YMMWORD PTR [r11-96]
	vfmadd231ps	ymm3, ymm5, ymm4
	vmovups	ymm8, YMMWORD PTR [rax-64]
	vmovups	ymm9, YMMWORD PTR [r11-64]
	vfmadd231ps	ymm0, ymm6, ymm7
	vmovups	ymm10, YMMWORD PTR [rax-32]
	vmovups	ymm11, YMMWORD PTR [r11-32]
	cmp	rdi, rax
	vfmadd231ps	ymm2, ymm8, ymm9
	vfmadd231ps	ymm1, ymm10, ymm11
	ja	.L2448

float, vector (full):
	https://gist.github.com/9il/6258443


It is pretty well optimized)
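
In core.simd terms, the unrolled loop corresponds to roughly the following (an untested sketch; dotVec is a hypothetical name, it assumes GDC with AVX and FMA enabled, and, unlike the vmovups loads in the listing, the pointer-cast loads here require 32-byte-aligned data):

	import core.simd;

	float dotVec(const(float)* a, const(float)* b, size_t n)
	{
		// Four independent accumulators hide the FMA latency,
		// matching ymm0..ymm3 in the listing above.
		float8 s0 = 0, s1 = 0, s2 = 0, s3 = 0;
		size_t i = 0;
		for (; i + 32 <= n; i += 32)
		{
			s0 += *cast(const(float8)*)(a + i     ) * *cast(const(float8)*)(b + i     );
			s1 += *cast(const(float8)*)(a + i +  8) * *cast(const(float8)*)(b + i +  8);
			s2 += *cast(const(float8)*)(a + i + 16) * *cast(const(float8)*)(b + i + 16);
			s3 += *cast(const(float8)*)(a + i + 24) * *cast(const(float8)*)(b + i + 24);
		}
		float8 s = (s0 + s1) + (s2 + s3);
		float r = 0;
		foreach (x; s.array)    // horizontal sum
			r += x;
		for (; i < n; ++i)      // scalar tail
			r += a[i] * b[i];
		return r;
	}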


____
Best regards

Ilya

August 18, 2013 (Manu)
On 18 August 2013 14:39, Ilya Yaroshenko <ilyayaroshenko@gmail.com> wrote:

> On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:
>
>> It doesn't look like you account for alignment.
>> This is basically non-portable (I doubt unaligned loads in this context
>> are
>> faster than performing scalar operations), and possibly inefficient on x86
>> too.
>>
>
> dotProduct uses unaligned loads (__builtin_ia32_loadups256, __builtin_ia32_loadupd256) and is up to 21 times faster than the trivial scalar version.
>
> Why are unaligned loads non-portable and inefficient?


x86 is the only arch that can perform an unaligned load. And even on x86 (many implementations) it's not very efficient.


>> Making it account for potentially random alignment will be awkward, but it
>> might be possible to do efficiently.
>>
>
> Did you mean using unaligned loads, or preparing the data for aligned loads at the beginning of the function?
>

I mean to only use aligned loads, in whatever way that happens to work out. The hard case is when the 2 arrays have different start offsets.

Otherwise you need to wrap your code in a version(x86) block.
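
A minimal sketch of the head-peeling idea (untested; dot is a hypothetical name, and note that both loads are only aligned when a and b share the same misalignment, which is exactly the hard case above):

	import core.simd;

	float dot(const(float)[] a, const(float)[] b)
	{
		assert(a.length == b.length);
		float r = 0;
		size_t i = 0;
		// Peel scalar iterations until a.ptr + i is 16-byte aligned.
		while (i < a.length && (cast(size_t)(a.ptr + i) & 15) != 0)
		{
			r += a[i] * b[i];
			++i;
		}
		// Vector loads, aligned with respect to a; the load from b is
		// only aligned if b started with the same offset as a.
		float4 acc = 0;
		for (; i + 4 <= a.length; i += 4)
			acc += *cast(const(float4)*)(a.ptr + i)
			     * *cast(const(float4)*)(b.ptr + i);
		foreach (x; acc.array)     // horizontal sum
			r += x;
		for (; i < a.length; ++i)  // scalar tail
			r += a[i] * b[i];
		return r;
	}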


August 18, 2013 (Ilya Yaroshenko)
On Sunday, 18 August 2013 at 05:07:12 UTC, Manu wrote:
> On 18 August 2013 14:39, Ilya Yaroshenko <ilyayaroshenko@gmail.com> wrote:
>
>> On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:
>>
>>> It doesn't look like you account for alignment.
>>> This is basically non-portable (I doubt unaligned loads in this context
>>> are
>>> faster than performing scalar operations), and possibly inefficient on x86
>>> too.
>>>
>>
>> dotProduct uses unaligned loads (__builtin_ia32_loadups256,
>> __builtin_ia32_loadupd256) and is up to 21 times faster than the trivial
>> scalar version.
>>
>> Why are unaligned loads non-portable and inefficient?
>
>
> x86 is the only arch that can perform an unaligned load. And even on x86
> (many implementations) it's not very efficient.

:(

>
>
>>> Making it account for potentially random alignment will be awkward, but it
>>> might be possible to do efficiently.
>>>
>>
>> Did you mean using unaligned loads, or preparing the data for aligned
>> loads at the beginning of the function?
>>
>
> I mean to only use aligned loads, in whatever way that happens to work out.
> The hard case is when the 2 arrays have different start offsets.
>
> Otherwise you need to wrap your code in a version(x86) block.

Thanks!
