Thread overview
Pointer alignments in the type system?
Jun 24, 2013
bearophile
Jul 12, 2013
bearophile
Aug 15, 2013
bearophile
Aug 15, 2013
bearophile
Aug 16, 2013
Manu
Aug 16, 2013
Manu
Aug 16, 2013
bearophile
June 24, 2013
Lately I am playing a little with SIMD in D, and with the work in progress module std.simd, and I have added several related bug reports to Bugzilla.

This program compiles with no errors nor warnings with ldc2 on Windows32 bit:


import core.simd;

__gshared immutable int[16]
    a = [ 1,  2,  3,  4,  5,  6,  7,  8, 9, 10, 11, 12, 13, 14, 15, 16],
    b = [16, 15, 14, 13, 12, 11, 10,  9, 8,  7,  6,  5,  4,  3,  2,  1];
__gshared int[16] c;

void main() {
    auto va = cast(int4[])a;
    auto vb = cast(int4[])b;
    auto vc = cast(int4[])c;
    vc[0] = va[0] + vb[0];
}


ldc2 generates this main:

__Dmain:
    subl    $12, %esp
    movl    $16, 8(%esp)
    movl    $4, 4(%esp)
    movl    $16, (%esp)
    calll   __d_array_cast_len
    movdqa  __D5test51ayG16i, %xmm0
    paddd   __D5test51byG16i, %xmm0
    movdqa  %xmm0, __D5test51cG16i
    xorl    %eax, %eax
    addl    $12, %esp
    ret


It uses the instruction movdqa, that assumes a,b and c to be aligned to 16 bytes. But I think there is no guarantee they are.

This is the LL code generated using the -output-ll switch of ldc2 (it's a kind of nearly universal bytecode for llvm):

@_D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16]
@_D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1]
@_D5test51cG16i = global [16 x i32] zeroinitializer


If I add a "align(16)" annotation to a, b, c it adds the align 16 annotation in the LL code too:

@_D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16], align 16
@_D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1], align 16
@_D5test51cG16i = global [16 x i32] zeroinitializer, align 16


Instead of array casts if I cast the pointers the result is similar, but there is no call to __d_array_cast_len (this is a raw cast, so I don't expect much help from the type system here...):

    auto va = cast(int4*)a.ptr;
    auto vb = cast(int4*)b.ptr;
    auto vc = cast(int4*)c.ptr;


I'd like to receive an alignment warning in those cast(int4[]), or the compiler should not use movdqa in such case. This means that maybe __d_array_cast_len should keep and transmit the alignment of its input pointer to the pointer in the output array. And maybe this means the D front-end should keep an alignment information for each pointer (or for pointers that will be used in situations where the alignment is important), integrating it in its type system (and perhaps perform alignment inference like the purity inference done for function templates).

The alignment of pointers is important for the CPU, so maybe the type system of a system language should keep track of them, and enforce the correctness of the alignments. Maybe this could be done with no further annotation burden for the programmer. A potential problem is how to mix this alignment inference with the needs of separate compilation. I think the align() annotations suffice for that.

Bye,
bearophile
July 12, 2013
> This is the LL code generated using the -output-ll switch of ldc2 (it's a kind of nearly universal bytecode for llvm):
>
> @_D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16]
> @_D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1]
> @_D5test51cG16i = global [16 x i32] zeroinitializer
>
>
> If I add a "align(16)" annotation to a, b, c it adds the align 16 annotation in the LL code too:
>
> @_D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16], align 16
> @_D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1], align 16
> @_D5test51cG16i = global [16 x i32] zeroinitializer, align 16

I didn't think to also take a look at the asm produced. Even if I don't add align(16) in the code, and even if the LLVM IR doesn't have an alignment annotation, the data in the asm has such alignment:


    .data
    .globl  __D4test1ayG16i
    .align  16
__D4test1ayG16i:
    .long   1
    .long   2
    .long   3
    .long   4
    .long   5
    .long   6
    .long   7
    .long   8
    .long   9
    .long   10
    .long   11
    .long   12
    .long   13
    .long   14
    .long   15
    .long   16

    .globl  __D4test1byG16i
    .align  16
__D4test1byG16i:
    .long   16
    .long   15
    .long   14
    .long   13
    .long   12
    .long   11
    .long   10
    .long   9
    .long   8
    .long   7
    .long   6
    .long   5
    .long   4
    .long   3
    .long   2
    .long   1


Regarding adding the alignment to the D type system, to assure sliced data is aligned, a possible first step is to support:

void foo(align(16) double[] a1,
         align(16) double[] a2) {}


That annotation is dynamically tested at entry point. So it's similar to this:

void foo(double[] a1, double[] a2)
in {
    assert(isAligned(a1, 16) && isAligned(a2, 16));
} body {}


But inside the body of foo() the compiler is allowed to use aligned SIMD instructions (in theory this should be true for the version with asserts too).

A more elegant (and maybe better) solution is to verify the alignment statically. This means foo() accepts only 16-aligned arrays. Array cast(), new array, dup, idup, and little else is statically known to produce/return aligned arrays. And for the other slices you need something like makeAligned!N() that returns a statically known alignment.

Bye,
bearophile
August 15, 2013
The XMM registers I am using are efficient when you feed them memory from arrays aligned to 16 bytes, as the D GC produces. But the YMM registers used by the AVX/AVX2 instructions prefer an alignment of 32 bytes. And the Intel Xeon Phi (MIC) has XMM registers that are efficient when the arrays are aligned to 64 bytes.

When I am not using SIMD code, and I want a small array of little elements, like an array of 10 ushorts, having it aligned to 16 bytes is a waste of space (despite helps the GC reduce the fragmentation).

So I have written a small enhancement request, where I suggest that arrays for YMM registers could be allocated with an alignment of 32 bytes:
http://d.puremagic.com/issues/show_bug.cgi?id=10826

Having the array alignments in the D type system could be useful. To be backward-compatible you also need a generic unknown alignment (like a void* for alignments), so you can assign arrays of any alignment to it, it could be denoted with '0'.

Some rough ideas:


import core.simd: double2, double4;
auto a = new int[10];
static assert(__traits(alignment, a) == 16);
auto b = new int[128]<32>;
static assert(__traits(alignment, b) == 32);
auto c1 = new double2[128];
auto c2 = new double4[64];
static assert(__traits(alignment, c1) == 16);
static assert(__traits(alignment, c2) == 32);

void foo1(int[]<32> a) {
    // Uses YMM registers to modify a
    // ...
}

void foo2(int[] a)
if (__traits(alignment, a) == 32) {
    // Uses YMM registers to modify a
    // ...
}

void foo3(size_t N)(int[]<N> a) {
    static if (N >= 32) {
        // Uses YMM registers to modify a
        // ...
    } else {
        // ...
    }
}


Bye,
bearophile
August 15, 2013
> And the Intel Xeon Phi (MIC) has XMM registers that are efficient when the arrays are aligned to 64 bytes.

Sorry, I meant to say ZMM registers.

Bye,
bearophile
August 16, 2013
On 24 June 2013 22:03, bearophile <bearophileHUGS@lycos.com> wrote:

> Lately I am playing a little with SIMD in D, and with the work in progress module std.simd, and I have added several related bug reports to Bugzilla.
>

Good man! :)

This program compiles with no errors nor warnings with ldc2 on Windows32
> bit:
>
>
> import core.simd;
>
> __gshared immutable int[16]
>     a = [ 1,  2,  3,  4,  5,  6,  7,  8, 9, 10, 11, 12, 13, 14, 15, 16],
>     b = [16, 15, 14, 13, 12, 11, 10,  9, 8,  7,  6,  5,  4,  3,  2,  1];
> __gshared int[16] c;
>
> void main() {
>     auto va = cast(int4[])a;
>     auto vb = cast(int4[])b;
>     auto vc = cast(int4[])c;
>     vc[0] = va[0] + vb[0];
> }
>
>
> ldc2 generates this main:
>
> __Dmain:
>     subl    $12, %esp
>     movl    $16, 8(%esp)
>     movl    $4, 4(%esp)
>     movl    $16, (%esp)
>     calll   __d_array_cast_len
>     movdqa  __D5test51ayG16i, %xmm0
>     paddd   __D5test51byG16i, %xmm0
>     movdqa  %xmm0, __D5test51cG16i
>     xorl    %eax, %eax
>     addl    $12, %esp
>     ret
>
>
> It uses the instruction movdqa, that assumes a,b and c to be aligned to 16 bytes. But I think there is no guarantee they are.
>

Those int[16]'s aren't specified as aligned, but I bet LLVM has an optimisation which opportunistically aligns them given some criteria, and then generates code accordingly.

This is the LL code generated using the -output-ll switch of ldc2 (it's a
> kind of nearly universal bytecode for llvm):
>
> @_D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, i32 5,
> i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15,
> i32 16]
> @_D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 13, i32
> 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2,
> i32 1]
> @_D5test51cG16i = global [16 x i32] zeroinitializer
>
>
> If I add a "align(16)" annotation to a, b, c it adds the align 16
> annotation in the LL code too:
>

And so it should, you have now made the alignment explicit. It's not an optimisation anymore, it's an explicit request.

@_D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, i32 5,
> i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15,
> i32 16], align 16
> @_D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 13, i32
> 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2,
> i32 1], align 16
> @_D5test51cG16i = global [16 x i32] zeroinitializer, align 16
>
>
> Instead of array casts if I cast the pointers the result is similar, but there is no call to __d_array_cast_len (this is a raw cast, so I don't expect much help from the type system here...):
>
>     auto va = cast(int4*)a.ptr;
>     auto vb = cast(int4*)b.ptr;
>     auto vc = cast(int4*)c.ptr;
>
>
> I'd like to receive an alignment warning in those cast(int4[]), or the compiler should not use movdqa in such case.


I suspect that LLVM scheduling a movdqa is permitted only because it has
hard evidence during codegen that it is safe to do so.
Try the same thing with separate compilation across link units, see if the
presumption of alignment goes away?

I'm not sure about the warning, it doesn't really sound right to me. Maybe
it would be useful... or maybe it would be annoying.
A cast is a cast, it is a deliberate act of reinterpretation... the user
needs to take some responsibility in this case.

Although on the other hand, a cast TO a __vector() (or any aligned type)
pointer or array is absolutely useless unless the source pointer is aligned
correctly. And in *some* circumstances, the compiler could statically prove
this... but in many cases, it probably couldn't.
What happens if you do:
  int[16] a = [ 1,  2,  3,  4,  5,  6,  7,  8, 9, 10, 11, 12, 13, 14, 15,
16];
  auto va = cast(int4[])a[1..$]; // ?? .. the offset breaks the alignment.
(and now it's too short)

The compiler doesn't only need to consider the alignment of the source pointer, but also the offset at any given time. This is probably impossible.

This means that maybe __d_array_cast_len should keep and transmit the
> alignment of its input pointer to the pointer in the output array. And maybe this means the D front-end should keep an alignment information for each pointer (or for pointers that will be used in situations where the alignment is important), integrating it in its type system (and perhaps perform alignment inference like the purity inference done for function templates).
>
> The alignment of pointers is important for the CPU, so maybe the type system of a system language should keep track of them, and enforce the correctness of the alignments. Maybe this could be done with no further annotation burden for the programmer. A potential problem is how to mix this alignment inference with the needs of separate compilation. I think the align() annotations suffice for that.
>

Yes, kinda like I was saying above. It might be fine if you are explicit with align() statements, but whenever you are sharing pointers between link units, how can you ever retain the information the compiler needs?

I think it's a mistake to make it work in some cases and not others. So for this reason, I maintain that the user just needs to take responsibility for their casts...


August 16, 2013
On 16 August 2013 04:39, bearophile <bearophileHUGS@lycos.com> wrote:

> The XMM registers I am using are efficient when you feed them memory from arrays aligned to 16 bytes, as the D GC produces. But the YMM registers used by the AVX/AVX2 instructions prefer an alignment of 32 bytes. And the Intel Xeon Phi (MIC) has XMM registers that are efficient when the arrays are aligned to 64 bytes.
>
> When I am not using SIMD code, and I want a small array of little elements, like an array of 10 ushorts, having it aligned to 16 bytes is a waste of space (despite helps the GC reduce the fragmentation).
>

I'd argue that if you're alloc-ing many instances of 10 short's
independently on the heap, then you're probably doing it wrong.
If you're only doing it once or twice, then it's not a significant waste of
memory as you say.

So I have written a small enhancement request, where I suggest that arrays
> for YMM registers could be allocated with an alignment of 32 bytes: http://d.puremagic.com/issues/**show_bug.cgi?id=10826<http://d.puremagic.com/issues/show_bug.cgi?id=10826>


This shouldn't be an enhancement request, it should be the rule. __vector()'s should be intrinsically aligned to their sizeof. If they are not, it should be a bug.

Having the array alignments in the D type system could be useful. To be
> backward-compatible you also need a generic unknown alignment (like a void* for alignments), so you can assign arrays of any alignment to it, it could be denoted with '0'.
>
> Some rough ideas:
>
>
> import core.simd: double2, double4;
> auto a = new int[10];
> static assert(__traits(alignment, a) == 16);
> auto b = new int[128]<32>;
> static assert(__traits(alignment, b) == 32);
> auto c1 = new double2[128];
> auto c2 = new double4[64];
> static assert(__traits(alignment, c1) == 16);
> static assert(__traits(alignment, c2) == 32);
>
> void foo1(int[]<32> a) {
>     // Uses YMM registers to modify a
>     // ...
> }
>
> void foo2(int[] a)
> if (__traits(alignment, a) == 32) {
>     // Uses YMM registers to modify a
>     // ...
> }
>
> void foo3(size_t N)(int[]<N> a) {
>     static if (N >= 32) {
>         // Uses YMM registers to modify a
>         // ...
>     } else {
>         // ...
>     }
> }
>

The thing is, a/b/c1/c2 is really just: struct { size_t length; T *ptr; } Those symbol names just refer to the dynamic array struct... It doesn't make a lot of sense to query the alignment of those symbols. __traits(alignment, s) would == sizeof(size_t) every time.

I've thought about this sort of thing many times before... but I'm not convinced. I still think it falls over with the possibility of slicing, and separate compilation.

The thing you're asking for is alignment of the base of the array, not
alignment of elements within the array or alignment of the dynamic array
structure its self.
Alignment of the base of the array isn't really expressible as part of the
type. It's just a request to the allocator.
I use a mallocAligned() function in C. Sadly, I think this is one of the
mistakes of a discreet 'new' operator, which has a syntax that doesn't lend
its self to arbitrary parameters.


August 16, 2013
Manu:

> I'm not sure about the warning, it doesn't really sound right to me. Maybe
> it would be useful... or maybe it would be annoying.
> A cast is a cast, it is a deliberate act of reinterpretation... the user
> needs to take some responsibility in this case.

OK.


> What happens if you do:
>   int[16] a = [ 1,  2,  3,  4,  5,  6,  7,  8, 9, 10, 11, 12, 13, 14, 15,
> 16];
>   auto va = cast(int4[])a[1..$]; // ?? .. the offset breaks the alignment.
> (and now it's too short)

If you use a[16..$] or a[32..$] then the compiler knows statically that the slice has kept the 16 bytes alignment.
If you use a[i..$] where i is a run-time value, then the compiler assumes the worst, that a is aligned to 4 bytes.
A kind of cast could be used to force the compiler assume a certain pointer/array has a certain alignment (better languages could allow you to write a proof of such alignment, but this is not possible in D, unless you add something like the Microsoft Z3 solver inside the compiler).


> Yes, kinda like I was saying above. It might be fine if you are explicit
> with align() statements, but whenever you are sharing pointers between link
> units, how can you ever retain the information the compiler needs?

The title of this thread is "Pointer alignments in the type system", that means that the static information of the alignment is kept with the type of the pointers/arrays (including dynamic arrays). In D you can use int[3] across link units because the length is part of the static information of the array.

To make things back-ward compatible you need a "don't know" alignment, that is like the "void*" type for pointers, or maybe for the minimum alignment you could just use the natural alignment of the type.

See some my successive posts for some examples.


> I think it's a mistake to make it work in some cases and not others.

I agree. My proposal is to make them always work (unless you destroy the static information with a cast or a runtime-valued slicing). Runtime functions like "idup" etc are supposed to keep and "transmit" the alignment information.


> So for
> this reason, I maintain that the user just needs to take responsibility for their casts...

In D we have a good static type system, unlike for example in Python2. One of the points of a static type system is to keep some static information, to use it to enforce certain invariants statically, and avoid some bugs and offer performance. SIMD programming works more efficiently with certain alignments. In my opinion it's not too much hard to retain such alignment information and avoid certain bugs statically, and produce fast code. Having a static type system and not using it is a waste.

This means this a will has a standard aliment of 16:
auto a = new int[128];
static assert(__traits(alignment, a) == 16);

this b has a statically knwn alignment of 4:
auto b = a[i .. $];
static assert(__traits(alignment, b) == 4);
auto b2 = cast(align(16))b;
static assert(__traits(alignment, b2) == 16);

This c has a statically known alignment of 16:
auto c = a.dup;
static assert(__traits(alignment, c) == 16);

auto d = new align(32) int[128]; // A possible syntax.
static assert(__traits(alignment, d) == 32);
// Other possible syntax:
auto d2 = new int[128] align(32);
auto d3 = new int[128]<32>;

auto e = d[64..$];
static assert(__traits(alignment, e) == 32);

align(32) int[128] f;
static assert(__traits(alignment, f) == 32);

Bye,
bearophile