Thread overview
[dmd-internals] Shift support for vector types (or: is vector type a first class type?)
Apr 01, 2013
Kai Nacke
Apr 01, 2013
Walter Bright
Apr 02, 2013
Kai Nacke
Apr 02, 2013
Walter Bright
Apr 03, 2013
Kai Nacke
Apr 03, 2013
Walter Bright
Apr 03, 2013
Don Clugston
April 01, 2013
Hi!

I am trying to write a generic vectorized version of SHA1. While doing so, I noticed that only some operations are allowed on vector types.

For the SHA1 algorithm I need to implement a 'rotate left'. I would like to write something like this:

    uint4 w = ...;
    uint4 v = (w << 1) | (w >> 31);

which is not allowed by DMD.
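
For reference, the intended per-lane operation can be modelled in plain scalar D. This is only a sketch of the semantics: `rotl` and `rotlLanes` are hypothetical names, a `uint[4]` array stands in for `uint4`, and nothing here says what a compiler would emit for a real vector type.

```d
// Scalar model of a 32-bit rotate left.
// The count is masked so that a rotate by 0 or 32 stays well defined.
uint rotl(uint x, uint n) pure nothrow @safe
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

// Applies the same rotate to each of the four lanes.
// A plain static array is used here as a stand-in for uint4 (assumption).
uint[4] rotlLanes(const uint[4] w, uint n) pure nothrow @safe
{
    uint[4] r;
    foreach (i, x; w)
        r[i] = rotl(x, n);
    return r;
}
```

With `n = 1` this matches the `(w << 1) | (w >> 31)` expression above, lane by lane.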

Is this by design, or is it simply not implemented because the backend is not capable of generating code for it? TDPL says nothing about vector types. My understanding of the language reference on the web (http://dlang.org/simd.html) is that the supported operators are CPU architecture dependent.

I would really like to see more support for vector operations in the language, e.g. for shifting. What is the view of the language designers? Is the vector type a first-class type or just an architecture-dependent (maybe vendor-dependent) type with limited usability?

Because LLVM treats the vector type as a first-class type, supporting more operators is easy with LDC. See my pull request for shift operators here: https://github.com/ldc-developers/ldc/pull/321

Regards
Kai
_______________________________________________
dmd-internals mailing list
dmd-internals@puremagic.com
http://lists.puremagic.com/mailman/listinfo/dmd-internals

March 31, 2013
On 3/31/2013 6:56 PM, Kai Nacke wrote:
> Hi!
>
> I am trying to write a generic vectorized version of SHA1. While doing so, I noticed that only some operations are allowed on vector types.
>
> For the SHA1 algorithm I need to implement a 'rotate left'. I would like to write something like this:
>
>     uint4 w = ...;
>     uint4 v = (w << 1) | (w >> 31);
>
> which is not allowed by DMD.
>
> Is this by design, or is it simply not implemented because the backend is not capable of generating code for it? TDPL says nothing about vector types. My understanding of the language reference on the web (http://dlang.org/simd.html) is that the supported operators are CPU architecture dependent.
>
> I would really like to see more support for vector operations in the language, e.g. for shifting. What is the view of the language designers? Is the vector type a first-class type or just an architecture-dependent (maybe vendor-dependent) type with limited usability?
>
> Because LLVM treats the vector type as a first-class type, supporting more operators is easy with LDC. See my pull request for shift operators here: https://github.com/ldc-developers/ldc/pull/321

The idea is if a vector operation is not supported by the underlying hardware, then dmd won't allow it. It specifically does not generate "workaround" code like gcc does. The reason for this is the workaround code is terribly, terribly slow (because moving data between the ALU and the SIMD unit is awful), and having the compiler silently insert it leaves the programmer mystified why he is getting execrable performance.

The vector design philosophy in D is if you write SIMD code, and it compiles, you can be confident it will execute in the SIMD unit of your particular target processor. You won't have to dump the assembler output to be sure.

April 02, 2013
On 01.04.2013 04:31, Walter Bright wrote:
> The idea is if a vector operation is not supported by the underlying hardware, then dmd won't allow it. It specifically does not generate "workaround" code like gcc does. The reason for this is the workaround code is terribly, terribly slow (because moving code between the ALU and the SIMD unit is awful), and having the compiler silently insert it leaves the programmer mystified why he is getting execrable performance.

Shifting a vector left by a single scalar e.g. v << 2 is then a missing operation. It is supported by the PSLLW/D/Q instruction. Same for shifting right. This is good news for my implementation. :-)

> The vector design philosophy in D is if you write SIMD code, and it compiles, you can be confident it will execute in the SIMD unit of your particular target processor. You won't have to dump the assembler output to be sure.

Would it be legal for a D compiler to generate "workaround" code? Otherwise the language changes depending on the target. Consider again the left shift: on an Intel CPU only v << n (v: vector; n: scalar) is valid. In contrast, Altivec allows v << w (v, w: vector). Then the same source may or may not compile depending on the target (with an error message saying 'incompatible types'). As a user of a cross compiler I would be very surprised by this behavior. I really have Linux/PPC64 in mind but do most development on Windows...
(It feels a bit like ++ is only supported if the underlying hardware has an INC instruction...)
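
To make the two shift forms concrete, here is a lane-wise model in plain scalar D. It is only a sketch of the semantics being discussed: the function names are hypothetical, `uint[4]` stands in for a vector type, and shift counts are assumed to be below 32.

```d
// Model of `v << n`: one scalar count shared by every lane.
uint[4] shiftLeftUniform(uint[4] v, uint n) pure nothrow @safe
{
    assert(n < 32); // assumption: count is in range for a 32-bit lane
    uint[4] r;
    foreach (i, x; v)
        r[i] = x << n;
    return r;
}

// Model of `v << w`: each lane is shifted by its own count.
uint[4] shiftLeftPerLane(uint[4] v, uint[4] w) pure nothrow @safe
{
    uint[4] r;
    foreach (i, x; v)
    {
        assert(w[i] < 32); // assumption: every count is in range
        r[i] = x << w[i];
    }
    return r;
}
```

The first form needs only one count for the whole vector; the second needs a separate count per lane, which is the form that differs between targets in the example above.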

Regards
Kai

April 01, 2013
On 4/1/2013 10:13 PM, Kai Nacke wrote:
> On 01.04.2013 04:31, Walter Bright wrote:
>> The idea is if a vector operation is not supported by the underlying hardware, then dmd won't allow it. It specifically does not generate "workaround" code like gcc does. The reason for this is the workaround code is terribly, terribly slow (because moving code between the ALU and the SIMD unit is awful), and having the compiler silently insert it leaves the programmer mystified why he is getting execrable performance.
>
> Shifting a vector left by a single scalar e.g. v << 2 is then a missing operation. It is supported by the PSLLW/D/Q instruction. Same for shifting right. This is good news for my implementation. :-)

You can file a bugzilla for that one.

>
>> The vector design philosophy in D is if you write SIMD code, and it compiles, you can be confident it will execute in the SIMD unit of your particular target processor. You won't have to dump the assembler output to be sure.
>
> Would it be legal for a D compiler to generate "workaround" code?

No.

> Otherwise the language changes depending on the target.

That's correct.

> Consider again the left shift: on an Intel CPU only v << n (v: vector; n: scalar) is valid. In contrast, Altivec allows v << w (v, w: vector). Then the same source may or may not compile depending on the target (with an error message saying 'incompatible types'). As a user of a cross compiler I would be very surprised by this behavior.

The bigger surprise would be the silent and unpredictable execrably bad performance. The only reason to write SIMD code is for performance, and the compiler ought to give an error when it cannot deliver SIMD performance.

The workaround code can be 100x slower. This is a big deal.

> I really have Linux/PPC64 in mind but do most development on Windows...
> (It feels a bit like ++ is only supported if the underlying hardware has an INC instruction...)

That's a different issue, since the workaround code is just as fast.


April 03, 2013
On 02.04.2013 08:57, Walter Bright wrote:
>
> On 4/1/2013 10:13 PM, Kai Nacke wrote:
>> On 01.04.2013 04:31, Walter Bright wrote:
>>> The idea is if a vector operation is not supported by the underlying hardware, then dmd won't allow it. It specifically does not generate "workaround" code like gcc does. The reason for this is the workaround code is terribly, terribly slow (because moving code between the ALU and the SIMD unit is awful), and having the compiler silently insert it leaves the programmer mystified why he is getting execrable performance.
>>
>> Shifting a vector left by a single scalar e.g. v << 2 is then a missing operation. It is supported by the PSLLW/D/Q instruction. Same for shifting right. This is good news for my implementation.
>
> You can file a bugzilla for that one.

Done. It's Bugzilla 9860.

>>> The vector design philosophy in D is if you write SIMD code, and it compiles, you can be confident it will execute in the SIMD unit of your particular target processor. You won't have to dump the assembler output to be sure.
>>
>> Would it be legal for a D compiler to generate "workaround" code?
>
> No.
>
>> Otherwise the language changes depending on the target.
>
> That's correct.
>
>> Consider again the left shift: on an Intel CPU only v << n (v: vector; n: scalar) is valid. In contrast, Altivec allows v << w (v, w: vector). Then the same source may or may not compile depending on the target (with an error message saying 'incompatible types'). As a user of a cross compiler I would be very surprised by this behavior.
>
> The bigger surprise would be the silent and unpredictable execrably bad performance. The only reason to write SIMD code is for performance, and the compiler ought to give an error when it cannot deliver SIMD performance.
>
> The workaround code can be 100x slower. This is a big deal.

While I understand your argument, I still feel a bit uncomfortable with it. It creates a situation in which you can't tell me whether a D program will compile by reading the source unless I tell you the target architecture. I think this is something really new.
The current situation is that conditional compilation is used if interfaces etc. are not globally available. This principle is now broken by the "invisible" rules which determine the availability of vector operations.

My approach would be to define the following: if D_SIMD is defined, then only the optimal vector operations are available. This ensures your goal of generating code with optimal performance. If D_SIMD is not defined but a vendor-specific SIMD implementation is available, then the rules of that implementation hold (which may include generation of "workaround" code). This has the advantage of being explicit:

     version(D_SIMD)
     {
         uint4 w = ...;
         uint4 v = w << 1;
     }
     else version(XYZ_SIMD)
     {
         uint4 w = ..., x = ...;
         // Not allowed by DMD; only fast on AltiVec
         uint4 v = x << w;
     }

Or am I missing something?

>> I really have Linux/PPC64 in mind but do most development on Windows... (It feels a bit like ++ is only supported if the underlying hardware has an INC instruction...)
>
> That's a different issue, since the workaround code is just as fast.

2nd try: core.bitop.popcnt is a "workaround" for a missing popcnt instruction. LDC provides an intrinsic for popcnt but this is lowered to the "workaround" code if the popcnt instruction is not available. If we apply the same rules then this is verboten.
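
As an illustration of what such software "workaround" code can look like, here is the classic bit-parallel population count in D. This is a sketch of the general technique only; `popcountSoftware` is a hypothetical name and this is not claimed to be the code in druntime or what LDC lowers its intrinsic to.

```d
// Counts the set bits of a 32-bit value without a POPCNT instruction.
uint popcountSoftware(uint x) pure nothrow @safe
{
    x = x - ((x >> 1) & 0x5555_5555);                 // 2-bit partial sums
    x = (x & 0x3333_3333) + ((x >> 2) & 0x3333_3333); // 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F_0F0F;                 // 8-bit partial sums
    return (x * 0x0101_0101) >> 24;                   // add the four bytes
}
```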

Regards
Kai


April 02, 2013
On 4/2/2013 10:58 PM, Kai Nacke wrote:
>
> While I understand your argument, I still feel a bit uncomfortable with
> it. It creates a situation in which you can't tell me whether a D program
> will compile by reading the source unless I tell you the target architecture.
> I think this is something really new.
> The current situation is that conditional compilation is used if interfaces
> etc. are not globally available. This principle is now broken by the
> "invisible" rules which determine the availability of vector operations.

Performant SIMD code is simply not portable between architectures. The programmer writing SIMD code ought to be guaranteed he's getting SIMD code, not workaround code that is 100x slower. I view a compiler error as being far more visible than silently generating unacceptably slow code.


> My approach would be to define the following: if D_SIMD is defined, then only the optimal vector operations are available. This ensures your goal of generating code with optimal performance. If D_SIMD is not defined but a vendor-specific SIMD implementation is available, then the rules of that implementation hold (which may include generation of "workaround" code). This has the advantage of being explicit:
>
>     version(D_SIMD)
>     {
>         uint4 w = ...;
>         uint4 v = w << 1;
>     }
>     else version(XYZ_SIMD)
>     {
>         uint4 w = ..., x = ...;
>         // Not allowed by DMD; only fast on AltiVec
>         uint4 v = x << w;
>     }
>
> Or am I missing something?

All the programmer really needs to do is use a version statement on the architecture for the SIMD code for that architecture, and then have a default with the workaround code. The point here will be that he *knowingly* selects the slow workaround code. This is critical for a systems programming language where programmers writing SIMD code are not always experts at dumping the compiler output to see what was generated.
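
A minimal sketch of that pattern in D, under some assumptions: `add4` is a hypothetical helper, `D_SIMD` is used as the version identifier standing in for the architecture check, and a float addition is used because it is a vector operation that SSE targets support.

```d
// Adds two groups of four floats.
float[4] add4(float[4] a, float[4] b)
{
    version (D_SIMD)
    {
        import core.simd : float4;

        // Copy the inputs into vector variables lane by lane.
        float4 va, vb;
        foreach (i; 0 .. 4)
        {
            va.array[i] = a[i];
            vb.array[i] = b[i];
        }
        float4 sum = va + vb; // vector add; rejected if the target lacks it
        return sum.array;
    }
    else
    {
        // Default branch: the scalar code is selected knowingly.
        float[4] r;
        foreach (i; 0 .. 4)
            r[i] = a[i] + b[i];
        return r;
    }
}
```

The slow path is visible in the source, so nobody has to inspect the generated assembly to find out which one was used.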

> 2nd try: core.bitop.popcnt is a "workaround" for a missing popcnt instruction. LDC provides an intrinsic for popcnt but this is lowered to the "workaround" code if the popcnt instruction is not available. If we apply the same rules then this is verboten.

The workaround code for popcnt isn't 100x slower.



April 03, 2013
On 3 April 2013 08:39, Walter Bright <walter@digitalmars.com> wrote:
>> 2nd try: core.bitop.popcnt is a "workaround" for a missing popcnt instruction. LDC provides an intrinsic for popcnt but this is lowered to the "workaround" code if the popcnt instruction is not available. If we apply the same rules then this is verboten.
>
> The workaround code for popcnt isn't 100x slower.

Actually it is, and we should probably do something about that. (The "workaround" code is the original; it dates from a time before Intel added Popcount to their instruction set!)