On 21 June 2013 00:03, bearophile <bearophileHUGS@lycos.com> wrote:

Manu:

They must be aligned, and multiples of N elements.

The D GC currently allocates them 16-byte aligned (but if you slice the array you can lose some alignment). On some new CPUs the penalty for misalignment is small.

Yes, the GC allocates 16-byte aligned memory, and this is good. It's critical, actually. But if the data types themselves weren't aligned, the allocation alignment would be lost as soon as they were used in structs.
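To illustrate why the alignment has to belong to the type and not just to the allocation, here is a rough C11 sketch (C only because the same idea is easy to show with GCC). The names `float4`, `particle` and `is_aligned16` are made up for this example and are not from std.simd or any D library.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-float vector type: the 16-byte alignment is part of the type. */
typedef struct {
    alignas(16) float v[4];
} float4;

/* A struct that mixes a small scalar with a vector member. */
typedef struct {
    char   tag;  /* 1 byte, then padding inserted by the compiler */
    float4 pos;  /* placed at a 16-byte offset because float4 demands it */
} particle;

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}
```

Because `float4` carries its own alignment, the compiler pads `particle` so that `pos` stays on a 16-byte boundary wherever the struct is stored. Without the `alignas`, `pos` would typically land at offset 4, and an aligned allocation of the struct would not help.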

You'll notice I made a point of focusing on _portable_ SIMD. It's true, some new chips can deal with it at virtually no additional cost, but they lose nothing by aligning their data regardless, and you can run on anything.

I hope that people write libraries that can run well on anything, not just their architecture of choice. The guidelines I presented, if followed, will give you good performance on all architectures.

They're not even very inconvenient.

If your point is about auto-vectorisation being much simpler without the alignment restrictions, this is true. But again, I'm talking about portable and RELIABLE implementations, that is, the programmer should know that SIMD was used effectively, and not have to hope the optimiser was able to do a good job. Make these guidelines second nature, and you'll foster a habit of writing portable code even if you don't intend to do so personally. Someone somewhere may want to use your library...

You often have "n" values, where n is variable. If n is large enough and you are using D vector ops, the handling of the head and tail doesn't waste too much time. If you have very few values it's much better to use the SIMD code.

See my later slides about branch predictability. When you need to handle stragglers on the head or tail, then you've introduced 2 sources of unpredictability (and also bloated your code).

If the arrays are very long, this may be okay as you say, but if they're not it becomes significant.
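As a rough sketch of what handling the stragglers looks like, here is a C version of an in-place scale with a scalar head, a 4-wide body and a scalar tail. The function names are hypothetical, and the unrolled body only stands in for a real aligned SIMD operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of leading floats to process one at a time
   before p reaches a 16-byte boundary (assumes p is at least 4-byte aligned). */
size_t head_count(const float *p, size_t n) {
    size_t mis  = (uintptr_t)p & 15u;  /* bytes past the last 16-byte boundary */
    size_t head = mis ? (16u - mis) / sizeof(float) : 0;
    return head < n ? head : n;
}

/* Scale an array in place: scalar head, 4-wide body, scalar tail. */
void scale_in_place(float *a, size_t n, float k) {
    size_t i = 0;
    size_t head = head_count(a, n);
    for (; i < head; ++i)        /* data-dependent trip count: 0 to 3 */
        a[i] *= k;
    for (; i + 4 <= n; i += 4) { /* aligned body, stands in for one SIMD multiply */
        a[i]     *= k;
        a[i + 1] *= k;
        a[i + 2] *= k;
        a[i + 3] *= k;
    }
    for (; i < n; ++i)           /* data-dependent trip count: 0 to 3 */
        a[i] *= k;
}
```

The head and tail loops are the two sources of unpredictability: their trip counts depend on the pointer and the length, so for short arrays they make up a large share of the work.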

But a new issue appears: if the output array is not the same as the input array, the bases of the two arrays might not share the same alignment, so you can't do a SIMD load from one and store to the other without a series of corrective shifts and merges, which effectively results in code similar to my unaligned load demonstration.

So the case where this is reliable is:

* long data array

* output array is the same as the input array (overwrites the input?)

I don't consider that reliable, and I don't think special-case awareness of those criteria is any easier than carefully and deliberately using SIMD in the first place.
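The input/output mismatch described above can be tested with one line of pointer arithmetic. This is only a sketch in C, and `same_phase16` is a made-up name: it reports whether two pointers sit at the same offset within a 16-byte block, which is the condition for a single scalar head to bring both onto a boundary together.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when src and dst have the same offset within a 16-byte block,
   so skipping the same number of bytes aligns both of them at once. */
int same_phase16(const void *src, const void *dst) {
    return (((uintptr_t)src ^ (uintptr_t)dst) & 15u) == 0;
}
```

When this returns 0, aligning the loads leaves the stores misaligned (or the other way round), and that is where the corrective shifts and merges come in.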

Well, each are valid comparisons in different situations. I'm not sure how syntax could clearly select the one you want.

Maybe later we'll look for some syntax sugar for this.

I'm definitely curious... but I'm not sure it's necessary.

Are D intrinsics offering instructions to perform prefetching?

Well, GCC does at least. If you're worried about performance at this level, you're probably already using GCC :)

I think D SIMD programmers will expect something functionally like __builtin_prefetch to be available in D too:
http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fprefetch-3396

Yup, I toyed with the idea of adding it to std.simd, but I didn't think it fit there.
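For reference, a minimal sketch of how the GCC builtin is typically used from C. The prefetch distance of 16 floats and the name `sum_with_prefetch` are assumptions for illustration only; the right distance depends on the target CPU and has to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed tuning value: how far ahead to prefetch, in elements
   (16 floats = 64 bytes, a common cache line size). */
enum { PREFETCH_AHEAD = 16 };

/* Sum an array, hinting the cache about data that will be read soon. */
float sum_with_prefetch(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__)
        /* One hint per 16 elements; args: address, 0 = read, 3 = high temporal locality. */
        if ((i % PREFETCH_AHEAD) == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
#endif
        s += a[i];
    }
    return s;
}
```

The prefetch is only a hint and does not change the result; on a compiler without the builtin, the `#if` simply drops it.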