The XMM registers I am using are efficient when you feed them memory from arrays aligned to 16 bytes, which is the alignment the D GC produces. But the YMM registers used by the AVX/AVX2 instructions prefer an alignment of 32 bytes. And the Intel Xeon Phi (MIC) has 512-bit ZMM registers that are efficient when the arrays are aligned to 64 bytes.
When I am not using SIMD code and I want a small array of small elements, like an array of 10 ushorts, aligning it to 16 bytes wastes space (although it helps the GC reduce fragmentation).
So I have written a small enhancement request, where I suggest that arrays for YMM registers could be allocated with an alignment of 32 bytes:
http://d.puremagic.com/issues/show_bug.cgi?id=10826
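Until something like that exists, one possible workaround is to over-allocate and slice the block at the first suitably aligned address. This is only a rough sketch: alignedArray is a hypothetical helper name, not something in druntime or Phobos, and it assumes the element type contains no GC-managed pointers (the backing ubyte block is not scanned) and that you do not append to the returned slice (a reallocation can lose the alignment).

```d
/// Hypothetical helper (not part of Phobos or druntime): returns a slice of
/// n elements whose first element lies on an `alignment`-byte boundary.
/// Assumes T holds no GC pointers, because the backing block is ubyte[].
T[] alignedArray(T)(size_t n, size_t alignment = 32)
{
    assert(alignment > 0 && (alignment & (alignment - 1)) == 0,
           "alignment must be a power of two");
    // Over-allocate so an aligned start always fits inside the block.
    auto raw = new ubyte[n * T.sizeof + alignment - 1];
    immutable addr = cast(size_t) raw.ptr;
    // Number of bytes to skip to reach the next multiple of `alignment`.
    immutable offset = (alignment - addr % alignment) % alignment;
    // Reinterpret the aligned byte range as an array of T.
    return cast(T[]) raw[offset .. offset + n * T.sizeof];
}

unittest
{
    auto b = alignedArray!int(128, 32);
    assert(b.length == 128);
    assert(cast(size_t) b.ptr % 32 == 0);
}
```

The slice keeps the whole block alive because the D GC honors interior pointers, at the cost of up to alignment - 1 wasted bytes per array.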
Having the array alignment in the D type system could be useful. To stay backward-compatible you also need a generic "unknown" alignment (like a void* for alignments), so that arrays of any alignment can be assigned to it. It could be denoted with '0'.
Some rough ideas:
import core.simd: double2, double4;
auto a = new int[10];
static assert(__traits(alignment, a) == 16);
auto b = new int[128]<32>;
static assert(__traits(alignment, b) == 32);
auto c1 = new double2[128];
auto c2 = new double4[64];
static assert(__traits(alignment, c1) == 16);
static assert(__traits(alignment, c2) == 32);
void foo1(int[]<32> a) {
    // Uses YMM registers to modify a
    // ...
}

void foo2(int[] a)
if (__traits(alignment, a) == 32) {
    // Uses YMM registers to modify a
    // ...
}

void foo3(size_t N)(int[]<N> a) {
    static if (N >= 32) {
        // Uses YMM registers to modify a
        // ...
    } else {
        // ...
    }
}
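
For comparison, here is a rough sketch of what can be done in current D without any type system support: test the address at run time and choose a code path. The names isAligned and chooseKernel are made up for this illustration, and the returned strings only stand in for the real XMM/YMM/scalar code paths.

```d
/// Hypothetical helper: true if the first element of `a` lies on an
/// `alignment`-byte boundary.
bool isAligned(T)(const(T)[] a, size_t alignment)
{
    return cast(size_t) a.ptr % alignment == 0;
}

/// Run-time counterpart of foo3: picks a code path from the actual address.
/// The strings are placeholders for the real kernels.
string chooseKernel(const(int)[] a)
{
    if (isAligned(a, 32))
        return "ymm";    // aligned 256-bit loads/stores would be usable here
    else if (isAligned(a, 16))
        return "xmm";    // aligned 128-bit loads/stores would be usable here
    else
        return "scalar"; // fall back to unaligned or scalar code
}

unittest
{
    auto a = new int[32];
    // Find the first element that sits on a 32-byte boundary.
    size_t i = 0;
    while (cast(size_t)(a.ptr + i) % 32 != 0)
        ++i;
    assert(chooseKernel(a[i .. $]) == "ymm");
    assert(chooseKernel(a[i + 4 .. $]) == "xmm");    // 16 bytes further on
    assert(chooseKernel(a[i + 1 .. $]) == "scalar"); // 4 bytes further on
}
```

The cost of this approach is a branch on every call and no compile-time guarantee, which is what putting the alignment in the type would avoid.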