January 06, 2012 Re: SIMD support...
Posted in reply to Walter Bright | On Fri, 06 Jan 2012 04:22:41 +0100, Walter Bright <newshound2@digitalmars.com> wrote:
> On 1/5/2012 6:25 PM, Manu wrote:
>> Are you talking about for parameter passing, or for local variable assignment on
>> the stack?
>> For parameter passing, I understand the x32 problems with aligning the arguments
>> (I think it's possible to work around though), but there should be no problem
>> with aligning the stack for allocating local variables.
>
> Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.
Extending
push RBP;
mov RBP, RSP;
sub RSP, localStackSize;
to
push RBP;
// new
mov RAX, RSP;
and RAX, localAlignMask;
sub RSP, RAX;
// end new
mov RBP, RSP;
sub RSP, localStackSize;
should do the trick.
This would require using the biggest align attribute of all stack variables (minus one) for localAlignMask. The alignment also needs to be rounded up to a power of 2 if it isn't already.
------------
RBP + 0 int a;
RBP + 4 int b;
24 byte padding
RBP + 32 align(32) struct float8 { float[8] v; } s;
------------
January 06, 2012 Re: SIMD support...
Posted in reply to Manu | On 1/5/2012 7:42 PM, Manu wrote:
> Perhaps I misunderstand, I can't see the problem?
> In the function preamble, you just align it... something like:
> mov reg, esp ; take a backup of the stack pointer
> and esp, -16 ; align it
>
> ... function
>
> mov esp, reg ; restore the stack pointer
> ret 0
And now you cannot access the function's parameters anymore, because the stack offset for them is now variable rather than fixed.
January 06, 2012 Re: SIMD support...
Posted in reply to Manu | On 1/5/2012 5:42 PM, Manu wrote:
> So I've been hassling about this for a while now, and Walter asked me to pitch
> an email detailing a minimal implementation with some initial thoughts.
Takeaways:
1. SIMD behavior is going to be very machine specific.
2. Even trying to do something with + is fraught with peril, as integer adds with SIMD can be saturated or unsaturated.
3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D would have to support internally maybe 100 or more new operators.
So some simplification is in order, perhaps a low level layer that is fairly extensible for new instructions, and for which a library can be layered over for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:
Declare one new basic type:
__v128
which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.
Then, have:
import core.simd;
which provides two functions:
__v128 simdop(operator, __v128 op1);
__v128 simdop(operator, __v128 op1, __v128 op2);
This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates its operation, but I doubt that it would be worth anyone's while to use that.)
The operators would be an enum listing of the SIMD opcodes,
PFACC, PFADD, PFCMPEQ, etc.
For:
z = simdop(PFADD, x, y);
the compiler would generate:
MOV z,x
PFADD z,y
The code generator knows enough about these instructions to do register assignments reasonably optimally.
What do you think? It ain't beeyoootiful, but it's implementable in a reasonable amount of time, and it should make it possible to write tight & fast SIMD code without having to do it all in assembler.
One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:
__vdouble2
__vfloat4
__vlong2
__vulong2
__vint4
__vuint4
__vshort8
__vushort8
__vbyte16
__vubyte16
January 06, 2012 Re: SIMD support...
Posted in reply to Walter Bright | On Fri, 06 Jan 2012 07:22:55 +0100, Walter Bright <newshound2@digitalmars.com> wrote:
> On 1/5/2012 7:42 PM, Manu wrote:
>> Perhaps I misunderstand, I can't see the problem?
>> In the function preamble, you just align it... something like:
>> mov reg, esp ; take a backup of the stack pointer
>> and esp, -16 ; align it
>>
>> ... function
>>
>> mov esp, reg ; restore the stack pointer
>> ret 0
>
> And now you cannot access the function's parameters anymore, because the stack offset for them is now variable rather than fixed.
Aah, I knew there was something that wouldn't work.
One could possibly change from RBP-relative addressing to RSP-relative addressing for the inner variables. But that would fail with alloca.
So this won't work without a second frame register, will it?
@manu: Instead of using the RegionAllocator, you could write an aligning allocator using alloca memory. This will be about the closest you get to that magic compiler alignment.
January 06, 2012 Re: SIMD support...
Posted in reply to Walter Bright | On Fri, Jan 6, 2012 at 2:43 AM, Walter Bright <newshound2@digitalmars.com> wrote:
> For:
>
> z = simdop(PFADD, x, y);
>
> the compiler would generate:
>
> MOV z,x
> PFADD z,y
Would this tie SIMD support directly to x86/x86_64, or would it be possible to also support NEON on ARM (also 128 bit SIMD, see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409g/index.html )?
(Obviously not for DMD, but if the syntax wasn't directly tied to x86/64, GDC and LDC could support this.)
It seems like using a standard naming convention instead of directly referencing instructions could let the underlying SIMD instructions vary across platforms, but I don't know enough about the technologies to say whether NEON's capabilities match SSE closely enough that they could be handled the same way.
January 06, 2012 Re: SIMD support...
Posted in reply to Andrew Wiley | > Would this tie SIMD support directly to x86/x86_64, or would it
> possible to also support NEON on ARM (also 128 bit SIMD, see
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409g/index.html
> ) ?
> (Obviously not for DMD, but if the syntax wasn't directly tied to
> x86/64, GDC and LDC could support this)
> It seems like using a standard naming convention instead of directly
> referencing instructions could let the underlying SIMD instructions
> vary across platforms, but I don't know enough about the technologies
> to say whether NEON's capabilities match SSE closely enough that they
> could be handled the same way.
For NEON you would need at least a function with a signature:
__v128 simdop(operator, __v128 op1, __v128 op2, __v128 op3);
since many NEON instructions operate on three registers.
January 06, 2012 Re: SIMD support...
Posted in reply to a | a Wrote:
> > Would this tie SIMD support directly to x86/x86_64, or would it
> > possible to also support NEON on ARM (also 128 bit SIMD, see
> > http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409g/index.html
> > ) ?
> > (Obviously not for DMD, but if the syntax wasn't directly tied to
> > x86/64, GDC and LDC could support this)
> > It seems like using a standard naming convention instead of directly
> > referencing instructions could let the underlying SIMD instructions
> > vary across platforms, but I don't know enough about the technologies
> > to say whether NEON's capabilities match SSE closely enough that they
> > could be handled the same way.
>
> For NEON you would need at least a function with a signature:
>
> __v128 simdop(operator, __v128 op1, __v128 op2, __v128 op3);
>
> since many NEON instructions operate on three registers.
Disregard that, I wasn't paying attention to the return type. What Walter proposed can already handle three-operand NEON instructions.
January 06, 2012 Re: SIMD support...
Posted in reply to Walter Bright | Walter Bright Wrote:
> which provides two functions:
>
> __v128 simdop(operator, __v128 op1);
> __v128 simdop(operator, __v128 op1, __v128 op2);
You would also need functions that take an immediate to support instructions such as shufps.
> One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:
>
> __vdouble2
> __vfloat4
> __vlong2
> __vulong2
> __vint4
> __vuint4
> __vshort8
> __vushort8
> __vbyte16
> __vubyte16
I don't see it being typeless as a problem. The purpose of this is to expose hardware capabilities to D code, and the vector registers are typeless, so why shouldn't the vector type be "typeless" too? Types such as vfloat4 can be implemented in a library (which could also be made portable and have a nice API).
January 06, 2012 Re: SIMD support...
Posted in reply to Walter Bright | Hi, just bringing into the discussion how Mono does it:
http://tirania.org/blog/archive/2008/Nov-03.html
Also have a look at pages 44-53 of the presentation slides.
-- Paulo
January 06, 2012 Re: SIMD support...
Posted in reply to Walter Bright
| On 6 January 2012 10:43, Walter Bright <newshound2@digitalmars.com> wrote:
> On 1/5/2012 5:42 PM, Manu wrote:
>
>> So I've been hassling about this for a while now, and Walter asked me to
>> pitch
>> an email detailing a minimal implementation with some initial thoughts.
>>
>
> Takeaways:
>
> 1. SIMD behavior is going to be very machine specific.
>
> 2. Even trying to do something with + is fraught with peril, as integer adds with SIMD can be saturated or unsaturated.
>
> 3. Trying to build all the details about how each of the various adds and other ops work into the compiler/optimizer is a large undertaking. D would have to support internally maybe a 100 or more new operators.
>
> So some simplification is in order, perhaps a low level layer that is fairly extensible for new instructions, and for which a library can be layered over for a more presentable interface. A half-formed idea of mine is, taking a cue from yours:
>
> Declare one new basic type:
>
> __v128
>
> which represents the 16 byte aligned 128 bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.
>
> Then, have:
>
> import core.simd;
>
> which provides two functions:
>
> __v128 simdop(operator, __v128 op1);
> __v128 simdop(operator, __v128 op1, __v128 op2);
>
> This will be a function built in to the compiler, at least for the x86. (Other architectures can provide an implementation of it that simulates its operation, but I doubt that it would be worth anyone's while to use that.)
>
> The operators would be an enum listing of the SIMD opcodes,
>
> PFACC, PFADD, PFCMPEQ, etc.
>
> For:
>
> z = simdop(PFADD, x, y);
>
> the compiler would generate:
>
> MOV z,x
> PFADD z,y
>
> The code generator knows enough about these instructions to do register assignments reasonably optimally.
>
> What do you think? It ain't beeyoootiful, but it's implementable in a reasonable amount of time, and it should make writing tight & fast SIMD code without having to do it all in assembler.
>
> One caveat is it is typeless; a __v128 could be used as 4 packed ints or 2 packed doubles. One problem with making it typed is it'll add 10 more types to the base compiler, instead of one. Maybe we should just bite the bullet and do the types:
>
> __vdouble2
> __vfloat4
> __vlong2
> __vulong2
> __vint4
> __vuint4
> __vshort8
> __vushort8
> __vbyte16
> __vubyte16
>
Sounds good to me. Though I think __v128 should definitely be typeless, allowing all those other types to be implemented in libraries. Why wouldn't you leave that volume of work to libraries?
All those types and related complications shouldn't be code in the language. There's a reason Microsoft chose to only expose __m128 as an intrinsic; the rest you build yourself.
Also, the libraries for typed vectors can (/will) attempt to support multiple architectures using version()s behind the scenes.
Copyright © 1999-2021 by the D Language Foundation