On 02.04.2013 08:57, Walter Bright wrote:
On 4/1/2013 10:13 PM, Kai Nacke wrote:
On 01.04.2013 04:31, Walter Bright wrote:
On 3/31/2013 6:56 PM, Kai Nacke wrote:
Hi!
I am trying to write a generic vectorized version of SHA1.
While doing so, I noticed that only some operations are
allowed on vector types.
For the SHA1 algorithm I need to implement a 'rotate
left'. I would like to write something like this:
    uint4 w = ...;
    uint4 v = (w << 1) | (w >> 31);
which is not allowed by DMD.
Is this by design or simply not implemented because the
backend is not capable of generating code for it? The TDPL
says nothing about vector types. My understanding of the
language reference on the web (http://dlang.org/simd.html)
is that the supported operators are CPU architecture
dependent.
I would really like to see more support for vector operations in
the language, e.g. for shifting. What is the view of the
language designers? Is the vector type a first-class type
or just an architecture (maybe vendor) dependent type with
limited usability?
Because LLVM treats the vector type as a first-class type,
supporting more operators is easy with LDC. See my pull
request for shift operators here: https://github.com/ldc-developers/ldc/pull/321
The idea is if a vector operation is not supported by the
underlying hardware, then dmd won't allow it. It
specifically does not generate "workaround" code like gcc
does. The reason for this is the workaround code is
terribly, terribly slow (because moving data between the ALU
and the SIMD unit is awful), and having the compiler
silently insert it leaves the programmer mystified why he is
getting execrable performance.
Shifting a vector left by a single scalar, e.g. v << 2, is
then a missing operation. It is supported by the PSLLW/D/Q
instructions. The same goes for shifting right. This is good news
for my implementation.
You can file a bugzilla for that one.
Done. It's bugzilla 9860.
The vector design philosophy in D is if you write SIMD code, and it
compiles, you can be confident it will execute in the SIMD
unit of your particular target processor. You won't have to
dump the assembler output to be sure.
Would it be legal for a D compiler to generate "workaround"
code?
No.
Otherwise the language changes depending on the target.
That's correct.
Consider again the left shift: on an Intel CPU only v << n
(v: vector; n: scalar) is valid. In contrast, AltiVec allows v << w
(v, w: vector). Then the same source may or may not compile
depending on the target (with an error message saying
'incompatible types'). As a user of a cross compiler I would
be very surprised by this behavior.
The bigger surprise would be the silent and unpredictable
execrably bad performance. The only reason to write SIMD code is
for performance, and the compiler ought to give an error when it
cannot deliver SIMD performance.
The workaround code can be 100x slower. This is a big deal.
While I understand your argument, I still feel a bit
uncomfortable with it. It creates a situation in which you can't
tell me whether a D program will compile by reading the source if I
don't tell you the target architecture. I think this is something
really new.
The current situation is that conditional compilation is used if
interfaces etc. are not globally available. This principle is now
broken by the "invisible" rules which determine the availability
of vector operations.
My approach would be to define the following: if D_SIMD is defined
then only the optimal vector operations are available. This
ensures your goal of generating code with optimal performance. If
D_SIMD is not defined but a vendor-specific SIMD implementation is
available then the rules of this implementation hold (which may
include generation of "workaround" code). This has the advantage
of being explicit:
    version(D_SIMD)
    {
        uint4 w = ...;
        uint4 v = w << 1;
    }
    else version(XYZ_SIMD)
    {
        uint4 w = ..., x = ...;
        // Not allowed by DMD; only fast on AltiVec
        uint4 v = x << w;
    }
Or am I missing something?
I really have Linux/PPC64 in mind but do most development on Windows...
(It feels a bit like ++ is only supported if the underlying
hardware has an INC instruction...)
That's a different issue, since the workaround code is just as
fast.
2nd try: core.bitop.popcnt is a "workaround" for a missing popcnt
instruction. LDC provides an intrinsic for popcnt but this is
lowered to the "workaround" code if the popcnt instruction is not
available. If we apply the same rules then this is verboten.
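To make the comparison concrete, a software bit count of the kind meant by "workaround" code might look roughly like the following. This is only a minimal sketch: softPopcnt is a hypothetical name, and the code is not taken from core.bitop or from LDC's lowering. It uses the well-known parallel bit-summing technique with plain integer arithmetic.

```d
import core.bitop : popcnt;   // the library routine mentioned above

// Hypothetical helper (not the core.bitop or LDC source): counts set bits
// with ordinary integer operations, so no POPCNT instruction is required.
uint softPopcnt(uint x) pure nothrow @nogc @safe
{
    x = x - ((x >> 1) & 0x5555_5555);                 // 2-bit partial sums
    x = (x & 0x3333_3333) + ((x >> 2) & 0x3333_3333); // 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F_0F0F;                 // 8-bit partial sums
    return (x * 0x0101_0101) >> 24;                   // add the four bytes
}

void main()
{
    // The fallback agrees with the library routine on a few sample values.
    foreach (uint v; [0u, 1u, 0xFFu, 0x8000_0001u, uint.max])
        assert(softPopcnt(v) == popcnt(v));
}
```

Unlike the vector case, this fallback stays entirely in the integer unit; how its cost compares with a hardware popcnt depends on the target.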
Regards
Kai