Thread overview | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
June 19, 2013 DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures by Manu Evans | ||||
---|---|---|---|---|
| ||||
Apologies for the delay, we're moving and things are a bit hectic. reddit: http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/ twitter: https://twitter.com/D_Programming/status/347433981928693760 hackernews: https://news.ycombinator.com/item?id=5907624 facebook: https://www.facebook.com/dlang.org/posts/659747567372261 youtube: http://youtube.com/watch?v=q_39RnxtkgM Andrei |
June 20, 2013 Re: DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures by Manu Evans | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu: > http://youtube.com/watch?v=q_39RnxtkgM Very nice. - - - - - - - - - - - - - - - - - - - Slide 3: > In practise, say we have iterative code like this: > > int data[100]; > > for(int i = 0; i < data.length; ++i) { > data[i] += 10; } For code like that in D we have vector ops: int[100] data; data[] += 10; Regarding vector ops: currently they are written with handwritten asm that uses SIMD where possible. Once std.simd is in good shape I think the array ops can be rewritten (and completed in their missing parts) using a higher level style of coding. - - - - - - - - - - - - - - - - - - - Slide 22: > Comparisons: > Full suite of comparisons Can produce bit-masks, or boolean 'any'/'all' logic. Maybe a little of compiler support (for the syntax) will help here. - - - - - - - - - - - - - - - - - - - Slide 26: > Always pass vectors by value. Unfortunately it seems a bad idea to give a warning if you pass one of those by reference. - - - - - - - - - - - - - - - - - - - Slide 27: > 3. Use ‘leaf’ functions where possible. I am not sure how much good it is to enforce leaf functions with a @leaf annotation. - - - - - - - - - - - - - - - - - - - Slide 32: > Experiment with prefetching? Are D intrinsics offering instructions to perform prefetching? - - - - - - - - - - - - - - - - - - - LDC2 is supports SIMD on Windows32 too. So for this code: void main() { alias double2 = __vector(double[2]); auto a = new double[200]; auto b = cast(double2[])a; double2 tens = [10.0, 10.0]; b[] += tens; } LDC2 compiles it to: movl $200, 4(%esp) movl $__D11TypeInfo_Ad6__initZ, (%esp) calll __d_newarrayiT movl %edx, %esi movl %eax, (%esp) movl $16, 8(%esp) movl $8, 4(%esp) calll __d_array_cast_len testl %eax, %eax je LBB0_3 movapd LCPI0_0, %xmm0 .align 16, 0x90 LBB0_2: movapd (%esi), %xmm1 addpd %xmm0, %xmm1 movapd %xmm1, (%esi) addl $16, %esi decl %eax jne LBB0_2 LBB0_3: xorl %eax, %eax addl $12, %esp popl %esi ret It uses addpd that works with two doubles at the same time. - - - - - - - - - - - - - - - - - - - The Reddit thread contains a link to this page, a compiler for a C variant from Intel that's optimized for SIMD: http://ispc.github.io/ Some of the syntax of ispc: - - - - - - The first of these statements is cif, indicating an if statement that is expected to be coherent. The usage of cif in code is just the same as if: cif (x < y) { ... } else { ... } cif provides a hint to the compiler that you expect that most of the executing SPMD programs will all have the same result for the if condition. Along similar lines, cfor, cdo, and cwhile check to see if all program instances are running at the start of each loop iteration; if so, they can run a specialized code path that has been optimized for the "all on" execution mask case. - - - - - - foreach_tiled(y = y0 ... y1, x = 0 ... w, u = 0 ... nsubsamples, v = 0 ... nsubsamples) { float du = (float)u * invSamples, dv = (float)v * invSamples; - - - - - - I'll take a better look at ispc. Bye, bearophile |
June 20, 2013 Re: DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures by Manu Evans | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile Attachments:
| On 20 June 2013 21:58, bearophile <bearophileHUGS@lycos.com> wrote: > Andrei Alexandrescu: > > http://youtube.com/watch?v=q_**39RnxtkgM<http://youtube.com/watch?v=q_39RnxtkgM> >> > > Very nice. > > - - - - - - - - - - - - - - - - - - - > > Slide 3: > > In practise, say we have iterative code like this: >> >> int data[100]; >> >> for(int i = 0; i < data.length; ++i) { >> data[i] += 10; } >> > > For code like that in D we have vector ops: > > int[100] data; > data[] += 10; > > > Regarding vector ops: currently they are written with handwritten asm that uses SIMD where possible. Once std.simd is in good shape I think the array ops can be rewritten (and completed in their missing parts) using a higher level style of coding. > I was trying to illustrate a process. Not so much a comment on D array syntax. The problem with auto-simd applied to array operations, is D doesn't assert that arrays are aligned. Nor are they multiples of 'N' elements wide, which means they lose the opportunity to make a lot of assumptions that make the biggest performance difference. They must be aligned, and multiples of N elements. By using explicit SIMD types, you're forced to adhere to those rules as a programmer, and the compiler can optimise properly. You take on the responsibility to handle mis-alignment and stragglers as the programmer, and perhaps make less conservative choices. - - - - - - - - - - - - - - - - - - - > > Slide 22: > > Comparisons: >> Full suite of comparisons Can produce bit-masks, or boolean 'any'/'all' logic. >> > > Maybe a little of compiler support (for the syntax) will help here. > Well, each are valid comparisons in different situations. I'm not sure how syntax could clearly select the one you want. - - - - - - - - - - - - - - - - - - - > > Slide 26: > > Always pass vectors by value. >> > > Unfortunately it seems a bad idea to give a warning if you pass one of those by reference. > And I don't think it should. Passing by ref isn't 'wrong', you just shouldn't do it if you care about performance. - - - - - - - - - - - - - - - - - - - > > Slide 27: > > 3. Use ‘leaf’ functions where possible. >> > > I am not sure how much good it is to enforce leaf functions with a @leaf annotation. > I don't think it would be useful. It should only be considered a general rule when people are very specifically considering performance above all else. It's just a very important detail to be aware of when optimising your code, particularly so when you're dealing with maths code (often involving simd). - - - - - - - - - - - - - - - - - - - > > Slide 32: > > Experiment with prefetching? >> > > Are D intrinsics offering instructions to perform prefetching? > Well, GCC does at least. If you're worried about performance at this level, you're probably already using GCC :) - - - - - - - - - - - - - - - - - - - > > LDC2 is supports SIMD on Windows32 too. > > So for this code: > > > void main() { > alias double2 = __vector(double[2]); > auto a = new double[200]; > auto b = cast(double2[])a; > double2 tens = [10.0, 10.0]; > b[] += tens; > } > > > LDC2 compiles it to: > > movl $200, 4(%esp) > movl $__D11TypeInfo_Ad6__initZ, (%esp) > calll __d_newarrayiT > movl %edx, %esi > movl %eax, (%esp) > movl $16, 8(%esp) > movl $8, 4(%esp) > calll __d_array_cast_len > testl %eax, %eax > je LBB0_3 > movapd LCPI0_0, %xmm0 > .align 16, 0x90 > LBB0_2: > movapd (%esi), %xmm1 > addpd %xmm0, %xmm1 > movapd %xmm1, (%esi) > addl $16, %esi > decl %eax > jne LBB0_2 > LBB0_3: > xorl %eax, %eax > addl $12, %esp > popl %esi > ret > > > It uses addpd that works with two doubles at the same time. > Sure... did I say this wasn't supported somewhere? Sorry if I gave that impression. - - - - - - - - - - - - - - - - - - - > > The Reddit thread contains a link to this page, a compiler for a C variant > from Intel that's optimized for SIMD: > http://ispc.github.io/ > > Some of the syntax of ispc: > > - - - - - - > > The first of these statements is cif, indicating an if statement that is expected to be coherent. The usage of cif in code is just the same as if: > > cif (x < y) { > ... > } else { > ... > } > > cif provides a hint to the compiler that you expect that most of the executing SPMD programs will all have the same result for the if condition. > > Along similar lines, cfor, cdo, and cwhile check to see if all program instances are running at the start of each loop iteration; if so, they can run a specialized code path that has been optimized for the "all on" execution mask case. > This is interesting. I didn't know about this. |
June 20, 2013 Re: DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures by Manu Evans | ||||
---|---|---|---|---|
| ||||
Posted in reply to Manu | Manu: > They must be aligned, and multiples of N elements. The D GC currently allocates them 16-bytes aligned (but if you slice the array you can lose some alignment). On some new CPUs the penalty for misalignment is small. You often have "n" values, where n is variable. If n is large enough and you are using D vector ops, the handling of the head and tail doesn't waste too much time. If you have very few values it's much better to use the SIMD code. > Well, each are valid comparisons in different situations. I'm not sure how syntax could clearly select the one you want. Maybe later we'll look for some syntax sugar for this. >> Are D intrinsics offering instructions to perform prefetching? > > Well, GCC does at least. If you're worried about performance at this level, you're probably already using GCC :) I think D SIMD programmers will expect something functionally like __builtin_prefetch to be available in D too: http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fprefetch-3396 Thank you, bye, bearophile |
June 20, 2013 Re: DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures by Manu Evans | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | Am 20.06.2013 13:58, schrieb bearophile: > > The Reddit thread contains a link to this page, a compiler for a C variant from Intel that's optimized for SIMD: http://ispc.github.io/ > Since you mention that, I developed a similar compiler/language in parallel to Intel at the time. The main differences were an implicit approach to handling the main loop, it didn't expose the SIMD target through explicit indices like ISPC did at the beginning, and could target GPUs in addition to outputting SIMD code. It was primarily used for high performance image processing on a commercial application that needed a safe CPU fallback path. Although I didn't have the time to implement many comprehensive optimization techniques (apart from some basic ones and from what LLVM provides) the results were quite impressive, depending of course on how well a program lends itself to the SPMD->SIMD transformation. There are some benchmarks at the end of the thesis: http://outerproduct.org/research/msc-thesis-slurp.pdf Unfortunately, at this point it is purely academical because the copyright to the source code has been left with my former employer and now Google. It's a bit of a pity as it was filling a certain niche that nobody else did (fortunately this has become less of an issue with the ubiquitous distribution of shader capable GPUs and improving drivers from a certain GPU vendor). |
June 21, 2013 Re: DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures by Manu Evans | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile Attachments:
| On 21 June 2013 00:03, bearophile <bearophileHUGS@lycos.com> wrote: > Manu: > > > They must be aligned, and multiples of N elements. >> > > The D GC currently allocates them 16-bytes aligned (but if you slice the array you can lose some alignment). On some new CPUs the penalty for misalignment is small. > Yes, the GC allocates 16byte aligned memory, this is good. It's critical actually. But if the data types themselves weren't aligned, then the alloc alignment would be lost as soon as they were used in struct's. You'll notice I made a point of focusing on _portable_ simd. It's true, some new chips can deal with it at virtually no additional cost, but they lose nothing by aligning their data regardless, and you can run on anything. I hope that people write libraries that can run well on anything, not just their architecture of choice. The guidelines I presented, if followed, will give you good performance on all architectures. They're not even very inconvenient. If your point is about auto-vectorisation being much simpler without the alignment restrictions, this is true. But again, I'm talking about portable and RELIABLE implementations, that is, the programmer should know that SIMD was used effectively, and not have to hope the optimiser was able to do a good job. Make these guidelines second nature, and you'll foster a habit of writing portable code even if you don't intend to do so personally. Someone somewhere may want to use your library... You often have "n" values, where n is variable. If n is large enough and > you are using D vector ops, the handling of the head and tail doesn't waste too much time. If you have very few values it's much better to use the SIMD code. See my later slides about branch predictability. When you need to handle stragglers on the head or tail, then you've introduced 2 sources of unpredictability (and also bloated your code). If the arrays are very long, this may be okay as you say, but if they're not it becomes significant. But there is an new issue that appears; if the output array is not the same as the input array, then you have a new mis-alignment where the bases of the 2 arrays might not share the same alignment, and you can't do a simd load from one and store to the other without a series of corrective shifts and merges, which will effectively result in similar code to my un-aligned load demonstration. So the case where this is reliable is: * long data array * output array is the same as the input array (overwrites the input?) I don't consider that reliable, and I don't think special-cases awareness of those criteria is any easier than carefully/deliberately using SIMD in the first place. Well, each are valid comparisons in different situations. I'm not sure how >> syntax could clearly select the one you want. >> > > Maybe later we'll look for some syntax sugar for this. > I'm definitely curious... but i'm not sure it's necessary. Are D intrinsics offering instructions to perform prefetching? >>> >> >> Well, GCC does at least. If you're worried about performance at this level, you're probably already using GCC :) >> > > I think D SIMD programmers will expect something functionally like > __builtin_prefetch to be available in D too: > http://gcc.gnu.org/onlinedocs/**gcc/Other-Builtins.html#index-** > g_t_005f_005fbuiltin_**005fprefetch-3396<http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fprefetch-3396> Yup, I toyed with the idea of adding it to std.simd, but I didn't think it fit there. |
June 21, 2013 Re: DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures by Manu Evans | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Wed, 19 Jun 2013 15:25:29 -0400 Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote: > > reddit: http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/ > A bit late, but torrents/links up: http://semitwist.com/download/misc/dconf2013/ |
June 21, 2013 Re: DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures by Manu Evans | ||||
---|---|---|---|---|
| ||||
Posted in reply to Nick Sabalausky | On 6/21/13 12:38 AM, Nick Sabalausky wrote:
> On Wed, 19 Jun 2013 15:25:29 -0400
> Andrei Alexandrescu<SeeWebsiteForEmail@erdani.org> wrote:
>>
>> reddit:
>> http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/
>>
>
> A bit late, but torrents/links up:
> http://semitwist.com/download/misc/dconf2013/
Thanks for this work. I'll be late with torrents for the last two talks until I get to a broadband connection.
Andrei
|
June 23, 2013 Re: DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures by Manu Evans | ||||
---|---|---|---|---|
| ||||
Posted in reply to Manu | Manu:
> This is interesting. I didn't know about this.
An important thing here is: what's the semantics present in that language that is missing in D (and that is useful for the optimizer)? Is it possible/worth to add it?
Bye,
bearophile
|
June 24, 2013 Re: DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures by Manu Evans | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On 6/19/13 12:25 PM, Andrei Alexandrescu wrote: > Apologies for the delay, we're moving and things are a bit hectic. > > reddit: > http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/ > > > twitter: https://twitter.com/D_Programming/status/347433981928693760 > > hackernews: https://news.ycombinator.com/item?id=5907624 > > facebook: https://www.facebook.com/dlang.org/posts/659747567372261 > > youtube: http://youtube.com/watch?v=q_39RnxtkgM > > > Andrei Now available in HD: https://archive.org/details/dconf2013-day03-talk05 Andrei |
Copyright © 1999-2021 by the D Language Foundation