DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures by Manu Evans
June 19, 2013
Apologies for the delay, we're moving and things are a bit hectic.

reddit: http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/

twitter: https://twitter.com/D_Programming/status/347433981928693760

hackernews: https://news.ycombinator.com/item?id=5907624

facebook: https://www.facebook.com/dlang.org/posts/659747567372261

youtube: http://youtube.com/watch?v=q_39RnxtkgM


Andrei
June 20, 2013
Andrei Alexandrescu:

> http://youtube.com/watch?v=q_39RnxtkgM

Very nice.

- - - - - - - - - - - - - - - - - - -

Slide 3:

> In practise, say we have iterative code like this:
> 
> int data[100];
> 
> for(int i = 0; i < data.length; ++i) {
>   data[i] += 10; }

For code like that in D we have vector ops:

int[100] data;
data[] += 10;


Regarding vector ops: currently they are implemented with handwritten asm that uses SIMD where possible. Once std.simd is in good shape I think the array ops can be rewritten (and their missing parts completed) using a higher-level style of coding.

- - - - - - - - - - - - - - - - - - -

Slide 22:

> Comparisons:
> Full suite of comparisons. Can produce bit-masks, or boolean 'any'/'all' logic.

Maybe a little compiler support (for the syntax) would help here.

- - - - - - - - - - - - - - - - - - -

Slide 26:

> Always pass vectors by value.

Unfortunately it seems a bad idea to give a warning if you pass one of those by reference.

- - - - - - - - - - - - - - - - - - -

Slide 27:

> 3. Use ‘leaf’ functions where possible.

I am not sure how useful it would be to enforce leaf functions with a @leaf annotation.

- - - - - - - - - - - - - - - - - - -

Slide 32:

> Experiment with prefetching?

Are D intrinsics offering instructions to perform prefetching?

- - - - - - - - - - - - - - - - - - -

LDC2 supports SIMD on Windows32 too.

So for this code:


void main() {
    alias double2 = __vector(double[2]);
    auto a = new double[200];
    auto b = cast(double2[])a;
    double2 tens = [10.0, 10.0];
    b[] += tens;
}


LDC2 compiles it to:

	movl	$200, 4(%esp)
	movl	$__D11TypeInfo_Ad6__initZ, (%esp)
	calll	__d_newarrayiT
	movl	%edx, %esi
	movl	%eax, (%esp)
	movl	$16, 8(%esp)
	movl	$8, 4(%esp)
	calll	__d_array_cast_len
	testl	%eax, %eax
	je	LBB0_3
	movapd	LCPI0_0, %xmm0
	.align	16, 0x90
LBB0_2:
	movapd	(%esi), %xmm1
	addpd	%xmm0, %xmm1
	movapd	%xmm1, (%esi)
	addl	$16, %esi
	decl	%eax
	jne	LBB0_2
LBB0_3:
	xorl	%eax, %eax
	addl	$12, %esp
	popl	%esi
	ret


It uses addpd, which works on two doubles at the same time.

- - - - - - - - - - - - - - - - - - -

The Reddit thread contains a link to this page, a compiler for a C variant from Intel that's optimized for SIMD:
http://ispc.github.io/

Some of the syntax of ispc:

- - - - - -

The first of these statements is cif, indicating an if statement that is expected to be coherent. The usage of cif in code is just the same as if:

cif (x < y) {
    ...
} else {
    ...
}

cif provides a hint to the compiler that you expect that most of the executing SPMD programs will all have the same result for the if condition.

Along similar lines, cfor, cdo, and cwhile check to see if all program instances are running at the start of each loop iteration; if so, they can run a specialized code path that has been optimized for the "all on" execution mask case.

- - - - - -

foreach_tiled(y = y0 ... y1, x = 0 ... w,
              u = 0 ... nsubsamples, v = 0 ... nsubsamples) {
    float du = (float)u * invSamples, dv = (float)v * invSamples;
    ...
}

- - - - - -

I'll take a better look at ispc.

Bye,
bearophile
June 20, 2013
On 20 June 2013 21:58, bearophile <bearophileHUGS@lycos.com> wrote:

> Andrei Alexandrescu:
>
>  http://youtube.com/watch?v=q_39RnxtkgM
>>
>
> Very nice.
>
> - - - - - - - - - - - - - - - - - - -
>
> Slide 3:
>
>  In practise, say we have iterative code like this:
>>
>> int data[100];
>>
>> for(int i = 0; i < data.length; ++i) {
>>   data[i] += 10; }
>>
>
> For code like that in D we have vector ops:
>
> int[100] data;
> data[] += 10;
>
>
> Regarding vector ops: currently they are implemented with handwritten asm that uses SIMD where possible. Once std.simd is in good shape I think the array ops can be rewritten (and their missing parts completed) using a higher-level style of coding.
>

I was trying to illustrate a process, not so much comment on D array syntax.
The problem with auto-SIMD applied to array operations is that D doesn't assert
that arrays are aligned, nor that they are multiples of 'N' elements wide, which
means the compiler loses the opportunity to make a lot of assumptions that make
the biggest performance difference.
They must be aligned, and multiples of N elements. By using explicit SIMD
types, you're forced to adhere to those rules as a programmer, and the
compiler can optimise properly.
You take on the responsibility to handle mis-alignment and stragglers as
the programmer, and can perhaps make less conservative choices.
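
For illustration, a rough sketch of that idea (the function and its contract are invented for the example; only core.simd's float4 is assumed): taking a float4[] slice instead of a float[] slice bakes both rules, 16-byte alignment and a multiple of four elements, into the type, so the loop needs no straggler handling at all.

import core.simd;

// Sketch: by accepting float4[] the caller is forced to provide data that
// is 16-byte aligned and a multiple of 4 floats, so the whole loop is a
// straight run of aligned SIMD adds with no head/tail cases.
void addTenAligned(float4[] data)
{
    float4 tens = [10.0f, 10.0f, 10.0f, 10.0f];
    foreach (ref v; data)
        v += tens;
}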

- - - - - - - - - - - - - - - - - - -
>
> Slide 22:
>
>  Comparisons:
>> Full suite of comparisons. Can produce bit-masks, or boolean 'any'/'all' logic.
>>
>
> Maybe a little compiler support (for the syntax) would help here.
>

Well, each is a valid comparison in different situations. I'm not sure how syntax could clearly select the one you want.

- - - - - - - - - - - - - - - - - - -
>
> Slide 26:
>
>  Always pass vectors by value.
>>
>
> Unfortunately it seems a bad idea to give a warning if you pass one of those by reference.
>

And I don't think it should. Passing by ref isn't 'wrong', you just shouldn't do it if you care about performance.
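
To make the trade-off concrete, here is a hedged sketch (function names are invented; only core.simd's float4 is assumed) contrasting the two signatures:

import core.simd;

// Passed by value: on ABIs with SIMD argument registers the vectors can
// stay in registers across the call, with no trip through memory.
float4 scaleByValue(float4 v, float4 s)
{
    return v * s;
}

// Passed by ref: the caller must have an address to pass, which typically
// forces the vector to be spilled to the stack around the call.
void scaleByRef(ref float4 v, float4 s)
{
    v = v * s;
}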

- - - - - - - - - - - - - - - - - - -
>
> Slide 27:
>
>  3. Use ‘leaf’ functions where possible.
>>
>
> I am not sure how useful it would be to enforce leaf functions with a @leaf annotation.
>

I don't think it would be useful. It should only be considered a general
rule when people are very specifically considering performance above all
else.
It's just a very important detail to be aware of when optimising your code,
particularly so when you're dealing with maths code (often involving simd).
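
As a hedged sketch of why this matters (both functions are invented for the example; only core.simd's float4 is assumed), a call to a function the compiler can't see into forces vector values out of registers around the call site:

import core.simd;

// Leaf: calls nothing, so arguments and temporaries can live entirely in
// SIMD registers with no spills around call boundaries.
float4 muladd(float4 a, float4 b, float4 c)
{
    return a * b + c;
}

// Non-leaf: 'transform' is an opaque external function (declaration only,
// for illustration), so the compiler must assume SIMD registers are
// clobbered across the call and will typically spill b and c to memory.
float4 transform(float4 v);

float4 muladdNonLeaf(float4 a, float4 b, float4 c)
{
    return transform(a) * b + c;
}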

- - - - - - - - - - - - - - - - - - -
>
> Slide 32:
>
>  Experiment with prefetching?
>>
>
> Are D intrinsics offering instructions to perform prefetching?
>

Well, GCC does at least. If you're worried about performance at this level, you're probably already using GCC :)

- - - - - - - - - - - - - - - - - - -
>
> LDC2 supports SIMD on Windows32 too.
>
> So for this code:
>
>
> void main() {
>     alias double2 = __vector(double[2]);
>     auto a = new double[200];
>     auto b = cast(double2[])a;
>     double2 tens = [10.0, 10.0];
>     b[] += tens;
> }
>
>
> LDC2 compiles it to:
>
>         movl    $200, 4(%esp)
>         movl    $__D11TypeInfo_Ad6__initZ, (%esp)
>         calll   __d_newarrayiT
>         movl    %edx, %esi
>         movl    %eax, (%esp)
>         movl    $16, 8(%esp)
>         movl    $8, 4(%esp)
>         calll   __d_array_cast_len
>         testl   %eax, %eax
>         je      LBB0_3
>         movapd  LCPI0_0, %xmm0
>         .align  16, 0x90
> LBB0_2:
>         movapd  (%esi), %xmm1
>         addpd   %xmm0, %xmm1
>         movapd  %xmm1, (%esi)
>         addl    $16, %esi
>         decl    %eax
>         jne     LBB0_2
> LBB0_3:
>         xorl    %eax, %eax
>         addl    $12, %esp
>         popl    %esi
>         ret
>
>
> It uses addpd, which works on two doubles at the same time.
>

Sure... did I say this wasn't supported somewhere? Sorry if I gave that impression.

- - - - - - - - - - - - - - - - - - -
>
> The Reddit thread contains a link to this page, a compiler for a C variant
> from Intel that's optimized for SIMD:
> http://ispc.github.io/
>
> Some of the syntax of ispc:
>
> - - - - - -
>
> The first of these statements is cif, indicating an if statement that is expected to be coherent. The usage of cif in code is just the same as if:
>
> cif (x < y) {
>     ...
> } else {
>     ...
> }
>
> cif provides a hint to the compiler that you expect that most of the executing SPMD programs will all have the same result for the if condition.
>
> Along similar lines, cfor, cdo, and cwhile check to see if all program instances are running at the start of each loop iteration; if so, they can run a specialized code path that has been optimized for the "all on" execution mask case.
>

This is interesting. I didn't know about this.


June 20, 2013
Manu:

> They must be aligned, and multiples of N elements.

The D GC currently allocates them 16-byte aligned (but if you slice the array you can lose some alignment). On some new CPUs the penalty for misalignment is small.

You often have "n" values, where n is variable. If n is large enough and you are using D vector ops, the handling of the head and tail doesn't waste too much time. If you have very few values it's much better to use the SIMD code.
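
For illustration, a rough sketch of what that head/tail handling looks like when written by hand (the function is invented; only core.simd's float4 is assumed):

import core.simd;

// Sketch of the pattern: scalar head until the pointer is 16-byte aligned,
// aligned SIMD body, scalar tail for the remaining 0..3 elements.
void addTen(float[] data)
{
    size_t i = 0;

    // Head: peel off elements until the address is 16-byte aligned.
    while (i < data.length && (cast(size_t)&data[i] & 15) != 0)
        data[i++] += 10;

    // Body: whole float4 chunks, aligned loads and stores.
    float4 tens = [10.0f, 10.0f, 10.0f, 10.0f];
    for (; i + 4 <= data.length; i += 4)
    {
        float4 v = *cast(float4*)&data[i];
        v += tens;
        *cast(float4*)&data[i] = v;
    }

    // Tail: whatever doesn't fill a full vector.
    for (; i < data.length; ++i)
        data[i] += 10;
}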


> Well, each are valid comparisons in different situations. I'm not sure how syntax could clearly select the one you want.

Maybe later we'll look for some syntax sugar for this.


>> Are D intrinsics offering instructions to perform prefetching?
>
> Well, GCC does at least. If you're worried about performance at this level, you're probably already using GCC :)

I think D SIMD programmers will expect something functionally like __builtin_prefetch to be available in D too:
http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fprefetch-3396
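
For illustration, a rough sketch of what prefetching can look like today through compiler-specific modules. The helper, the lookahead distance, and the exact module/symbol signatures are assumptions based on GDC's gcc.builtins and LDC's ldc.intrinsics, not something guaranteed by the language:

// Sketch only: assumes GDC exposes __builtin_prefetch via gcc.builtins and
// LDC exposes llvm_prefetch via ldc.intrinsics; adjust to your compiler.
double sumWithPrefetch(const(double)[] data)
{
    double total = 0;
    enum lookahead = 16;  // how many elements ahead to prefetch (tune this)

    foreach (i; 0 .. data.length)
    {
        if (i + lookahead < data.length)
        {
            version (GNU)
            {
                import gcc.builtins : __builtin_prefetch;
                __builtin_prefetch(&data[i + lookahead]);  // read, default locality
            }
            else version (LDC)
            {
                import ldc.intrinsics : llvm_prefetch;
                // read (0), high temporal locality (3), data cache (1)
                llvm_prefetch(cast(void*)&data[i + lookahead], 0, 3, 1);
            }
        }
        total += data[i];
    }
    return total;
}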

Thank you,
bye,
bearophile
June 20, 2013
On 20.06.2013 13:58, bearophile wrote:
> 
> The Reddit thread contains a link to this page, a compiler for a C variant from Intel that's optimized for SIMD: http://ispc.github.io/
> 

Since you mention that, I developed a similar compiler/language in parallel to Intel at the time. The main differences were an implicit approach to handling the main loop (it didn't expose the SIMD target through explicit indices like ISPC did at the beginning) and the ability to target GPUs in addition to outputting SIMD code. It was primarily used for high-performance image processing in a commercial application that needed a safe CPU fallback path.

Although I didn't have the time to implement many comprehensive optimization techniques (apart from some basic ones and from what LLVM provides) the results were quite impressive, depending of course on how well a program lends itself to the SPMD->SIMD transformation. There are some benchmarks at the end of the thesis:

http://outerproduct.org/research/msc-thesis-slurp.pdf

Unfortunately, at this point it is purely academic because the copyright to the source code has been left with my former employer and now Google. It's a bit of a pity, as it filled a certain niche that nothing else did (fortunately this has become less of an issue with the ubiquitous distribution of shader-capable GPUs and improving drivers from a certain GPU vendor).
June 21, 2013
On 21 June 2013 00:03, bearophile <bearophileHUGS@lycos.com> wrote:

> Manu:
>
>
>  They must be aligned, and multiples of N elements.
>>
>
> The D GC currently allocates them 16-bytes aligned (but if you slice the array you can lose some alignment). On some new CPUs the penalty for misalignment is small.
>

Yes, the GC allocates 16-byte aligned memory; this is good. It's critical, actually. But if the data types themselves weren't aligned, then the allocation alignment would be lost as soon as they were used in structs.

You'll notice I made a point of focusing on _portable_ simd. It's true,
some new chips can deal with it at virtually no additional cost, but they
lose nothing by aligning their data regardless, and you can run on anything.
I hope that people write libraries that can run well on anything, not just
their architecture of choice. The guidelines I presented, if followed, will
give you good performance on all architectures.
They're not even very inconvenient.

If your point is about auto-vectorisation being much simpler without the alignment restrictions, this is true. But again, I'm talking about portable and RELIABLE implementations, that is, the programmer should know that SIMD was used effectively, and not have to hope the optimiser was able to do a good job. Make these guidelines second nature, and you'll foster a habit of writing portable code even if you don't intend to do so personally. Someone somewhere may want to use your library...

> You often have "n" values, where n is variable. If n is large enough and you are using D vector ops, the handling of the head and tail doesn't waste too much time. If you have very few values it's much better to use the SIMD code.


See my later slides about branch predictability. When you need to handle
stragglers on the head or tail, then you've introduced 2 sources of
unpredictability (and also bloated your code).
If the arrays are very long, this may be okay as you say, but if they're
not it becomes significant.

But there is a new issue that appears: if the output array is not the same as the input array, then you have a new mis-alignment problem where the bases of the two arrays might not share the same alignment, and you can't do a SIMD load from one and store to the other without a series of corrective shifts and merges, which will effectively result in code similar to my unaligned load demonstration.

So the case where this is reliable is:
 * long data array
 * output array is the same as the input array (overwrites the input?)

I don't consider that reliable, and I don't think special-case awareness of those criteria is any easier than carefully/deliberately using SIMD in the first place.

>> Well, each is a valid comparison in different situations. I'm not sure how syntax could clearly select the one you want.
>>
>
> Maybe later we'll look for some syntax sugar for this.
>

I'm definitely curious... but I'm not sure it's necessary.

>>> Are D intrinsics offering instructions to perform prefetching?
>>
>> Well, GCC does at least. If you're worried about performance at this level, you're probably already using GCC :)
>>
>
> I think D SIMD programmers will expect something functionally like
> __builtin_prefetch to be available in D too:
> http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fprefetch-3396


Yup, I toyed with the idea of adding it to std.simd, but I didn't think it fit there.


June 21, 2013
On Wed, 19 Jun 2013 15:25:29 -0400
Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> 
> reddit: http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/
> 

A bit late, but torrents/links up: http://semitwist.com/download/misc/dconf2013/


June 21, 2013
On 6/21/13 12:38 AM, Nick Sabalausky wrote:
> On Wed, 19 Jun 2013 15:25:29 -0400
> Andrei Alexandrescu<SeeWebsiteForEmail@erdani.org>  wrote:
>>
>> reddit:
>> http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/
>>
>
> A bit late, but torrents/links up:
> http://semitwist.com/download/misc/dconf2013/

Thanks for this work. I'll be late with torrents for the last two talks until I get to a broadband connection.

Andrei


June 23, 2013
Manu:

> This is interesting. I didn't know about this.

An important thing here is: what semantics are present in that language that are missing in D (and that would be useful for the optimizer)? Is it possible/worthwhile to add them?

Bye,
bearophile
June 24, 2013
On 6/19/13 12:25 PM, Andrei Alexandrescu wrote:
> Apologies for the delay, we're moving and things are a bit hectic.
>
> reddit:
> http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/
>
>
> twitter: https://twitter.com/D_Programming/status/347433981928693760
>
> hackernews: https://news.ycombinator.com/item?id=5907624
>
> facebook: https://www.facebook.com/dlang.org/posts/659747567372261
>
> youtube: http://youtube.com/watch?v=q_39RnxtkgM
>
>
> Andrei

Now available in HD: https://archive.org/details/dconf2013-day03-talk05


Andrei