October 24, 2012
> Simple example:
>   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { return _madd(a, b, c); }

It may be useful to have a way to define compound operators for other things (although you can already write expression templates), but this is an optimization that the compiler back end can do. If you compile this code:

float4 foo(float4 a, float4 b, float4 c){ return a * b + c; }

With gdc with flags -O2 -fma, you get:

0000000000000000 <_D3tmp3fooFNhG4fNhG4fNhG4fZNhG4f>:
   0:   c4 e2 69 98 c1          vfmadd132ps xmm0,xmm2,xmm1
   5:   c3                      ret
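
For completeness: the expression-template route mentioned above can catch the same pattern in library code today. Here is a minimal sketch; 'Vec', 'Mul' and 'madd' are invented names for illustration, and 'madd' is plain arithmetic standing in for a real FMA intrinsic:

import core.simd;

// Stand-in for a fused multiply-add; a real version would call the
// target's FMA intrinsic rather than plain arithmetic.
float4 madd(float4 a, float4 b, float4 c) { return a * b + c; }

// Proxy that captures an unevaluated product.
struct Mul
{
    float4 a, b;

    // 'x * y + z' lands here, so the add is fused into one madd call.
    float4 opBinary(string op : "+")(float4 c) { return madd(a, b, c); }
}

// Wrapper whose '*' returns the proxy instead of evaluating immediately.
struct Vec
{
    float4 v;
    Mul opBinary(string op : "*")(Vec rhs) { return Mul(v, rhs.v); }
}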

October 24, 2012
On Wednesday, 24 October 2012 at 12:47:38 UTC, bearophile wrote:
> Paulo Pinto:
>
>> Actually, I have yet to see any language that has SIMD as part of the language standard, rather than as an extension where each vendor does it its own way.
>
> D is, or is going to be, one such language :-)
>
> Bye,
> bearophile

Is it so?

From what I can understand, the SIMD calls depend on which D compiler is being used.

That doesn't look like part of the language standard to me. :(

--
Paulo
October 24, 2012
Manu wrote:
> One thing I can think of that would really improve simd (and not only simd)
> would be a way to define compound operators.
> If the library could detect/hook sequences of operations and implement them
> more efficiently as a compound, that would make some very powerful
> optimisations available.
>
> Simple example:
>   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { return _madd(a, b, c); }
>   T opCompound(string seq)(T a, T b, T c) if(seq == "+ *") { return _madd(b, c, a); }

I thought about that before, and it might be nice to have that level of control in the language, but ultimately, like jerro said, I think it's better suited to the compiler's back-end optimization. Unfortunately I don't think more complex patterns, such as matrix multiplication, are found and optimized by GCC/LLVM... I could be wrong, but these are areas where my hand-tuned code always outperforms basic math code.

I think having that in the back-end makes a lot of sense, because your code is easier to read and understand, without sacrificing performance. Plus, it would be difficult to map a sequence as complex as matrix multiplication to a single compound operator. That being said, I do think something similar would be useful in general:

  struct Vector
  {
      ...

      static float distance(Vector a, Vector b) {...}
      static float distanceSquared(Vector a, Vector b) {...}

      float opSequence(string[] funcs)(Vector a, Vector b)
        if (funcs[0] == "^^2" &&
            funcs[1] == "Vector.distance")
      {
          return distanceSquared(a, b);
      }
  }

  void main()
  {
      auto a = Vector.random( ... );
      auto b = Vector.random( ... );

      // Below is turned into a 'distanceSquared()' call
      float disSq = Vector.distance(a, b) ^^ 2;
  }

Since 'distance()' requires a square root, and squaring the result cancels it, this pseudo-code could avoid the operation entirely by calling 'distanceSquared()', even if the programmer is a noob and doesn't know to do it explicitly.
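
For reference, the transformation that hook performs is the same one experienced programmers apply by hand. A small self-contained sketch of the manual idiom (the 'Vector' layout and the 'closerThan' helper are illustrative, not part of the proposal above):

import std.math : sqrt;

struct Vector
{
    float x, y, z;

    static float distanceSquared(Vector a, Vector b)
    {
        float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return dx * dx + dy * dy + dz * dz;
    }

    // The square root the hook tries to avoid.
    static float distance(Vector a, Vector b)
    {
        return sqrt(distanceSquared(a, b));
    }
}

// Squared distances preserve ordering, so comparisons never need the sqrt.
bool closerThan(Vector a, Vector b, float radius)
{
    return Vector.distanceSquared(a, b) < radius * radius;
}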
October 24, 2012
On 24 October 2012 16:00, bearophile <bearophileHUGS@lycos.com> wrote:

> Manu:
>
>
>  D already has what's required to do some fairly nice (by comparison) simd
>> stuff with good supporting libraries.
>>
>
> After reading that paper I am not sure you are right. See how their language manages masks by itself. This is from page 3:
>
>
> // vector length of context = 1; current_mask = T
> int block[4] v = <0,3,4,1>;
> int block[4] w = 3; // <3,3,3,3> via broadcast
> bool block[4] m = v < w; // <T,F,F,T>
> ++v; // <1,4,5,2>
> if (m) {
>     // vector length of context = 4; current_mask = m
>     v += 2; // <3,4,5,4>
> } else {
>     // vector length of context = 4; current_mask = ~m
>     v += 3; // <3,7,8,4>
> }
> // vector length of context = 1; current_mask = T
>

I agree that the 'if' is kinda neat, but it's probably not out of the question
for future extension. All the other stuff here is possible.
That said, it's not necessarily optimal either, just conveniently written.
The compiler would have to do some serious magic to optimise that;
flattening both sides of the if into parallel expressions, and then
applying the mask to combine...
I'm personally not in favour of SIMD constructs that are anything less than
optimal (but I appreciate I'm probably in the minority here).


> (The simple benchmarks of the paper show a 5-15% performance loss compared
> to handwritten SIMD code.)
>

Right, as I suspected.


October 24, 2012
On 24 October 2012 18:12, jerro <a@a.com> wrote:

>> Simple example:
>>   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { return _madd(a, b, c); }
>>
>
> It may be useful to have a way to define compound operators for other things (although you can already write expression templates), but this is an optimization that the compiler back end can do. If you compile this code:
>
> float4 foo(float4 a, float4 b, float4 c){ return a * b + c; }
>
> With gdc with flags -O2 -fma, you get:
>
> 0000000000000000 <_D3tmp3fooFNhG4fNhG4fNhG4fZNhG4f>:
>    0:   c4 e2 69 98 c1          vfmadd132ps xmm0,xmm2,xmm1
>    5:   c3                      ret
>

Right, I suspected GDC might do that, but it was just an example. You can
extend that to many more complicated scenarios.
What does it do on less mature architectures like MIPS, PPC, ARM?


October 24, 2012
Manu:

> The compiler would have to do some serious magic to optimise that;
> flattening both sides of the if into parallel expressions, and then applying the mask to combine...

I think it's a small amount of magic.

The simple features shown in that paper are fully focused on SIMD programming, so they don't introduce anything that's clearly inefficient.


> I'm personally not in favour of SIMD constructs that are anything less than
> optimal (but I appreciate I'm probably in the minority here).
>
>
>> (The simple benchmarks of the paper show a 5-15% performance loss compared
>> to handwritten SIMD code.)
>>
>
> Right, as I suspected.

15% is a very small performance loss if the programmer's alternative is writing scalar code that is 2 or 3 times slower :-)

The SIMD programmers who can't stand a 1% loss of performance use the intrinsics manually (or write in asm) and ignore everything else.

A much larger population of system programmers wish to use modern CPUs efficiently, but they don't have the time (or the skill; this is why their programs are so often buggy) for assembly-level programming. Currently they use smart numerical C++ libraries, use modern Fortran versions, and/or write scalar C/C++ (or Fortran) code, add "restrict" annotations, and look at the produced asm, hoping the modern compiler back-ends will vectorize it. This is not good enough, and it's far from a 15% loss.

This paper shows a third way, making this kind of programming simpler and approachable for a wider audience, with a small performance loss compared to handwritten code. This is what language designers have been doing for 60+ years :-)

Bye,
bearophile
October 24, 2012
On 25 October 2012 01:00, bearophile <bearophileHUGS@lycos.com> wrote:

> 15% is a very small performance loss if the programmer's alternative is writing scalar code that is 2 or 3 times slower :-)
>
> The SIMD programmers who can't stand a 1% loss of performance use the intrinsics manually (or write in asm) and ignore everything else.
>
> A much larger population of system programmers wish to use modern CPUs efficiently, but they don't have the time (or the skill; this is why their programs are so often buggy) for assembly-level programming. Currently they use smart numerical C++ libraries, use modern Fortran versions, and/or write scalar C/C++ (or Fortran) code, add "restrict" annotations, and look at the produced asm, hoping the modern compiler back-ends will vectorize it. This is not good enough, and it's far from a 15% loss.
>
> This paper shows a third way, making this kind of programming simpler and approachable for a wider audience, with a small performance loss compared to handwritten code. This is what language designers have been doing for 60+ years :-)
>

I don't disagree with you, it is fairly cool!
I can't imagine D adopting those sorts of language features any time
soon, but it's probably possible.
I guess the keys are defining the bool vector concept, and some tech to
flatten both sides of a vector if statement, but that's far from simple...
Particularly so if someone puts some unrelated code in those if blocks.
Chances are it offers too much freedom that wouldn't be well used or
understood by the average programmer, and that still leaves you in a
similar land of only being particularly worthwhile in the hands of a fairly
advanced/competent user.
The main error that most people make is thinking SIMD code is faster by
nature. Truth is, in the hands of someone who doesn't know precisely what
they're doing, SIMD code is almost always slower.
There are some cool new expressions offered here, fairly convenient
(although easy[er?] to write in other ways too), but I don't think it would
likely change that fundamental premise for the average programmer beyond
some very simple parallel constructs that the compiler can easily get right.
I'd certainly love to see it, but is it realistic that someone would take
the time to do all of that any time soon when the benefits
are controversial? It may even open the possibility for unskilled people
to write far worse code.

Let's consider your example above for instance, I would rewrite (given
existing syntax):

// vector length of context = 1; current_mask = T
int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast
uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
v += int4(1); // [1,4,5,2]

// the if block is trivially rewritten:
int4 trueSide = v + int4(2);
int4 falseSide = v + int4(3);
v = select(m, trueSide, falseSide); // [3,7,8,4]


Or the whole thing further simplified:
int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast

// one convenient function does the comparison and select accordingly
v = selectLess(v, w, v + int4(1 + 2), v + int4(1 + 3)); // combines the prior few lines

I actually find this more convenient. I also find the 'if' syntax you demonstrate rather deceptive and possibly misleading: 'if' suggests a branch, whereas the construct you demonstrate will evaluate both sides every time. Inexperienced programmers may not really grasp that. Evaluating the true side and the false side inline and then performing the select serially is more honest; it's actually what the computer will do, and I don't really see it being particularly less convenient either.
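
For what it's worth, the 'select' used above is easy to build from a comparison mask. A sketch, assuming the mask holds all-ones or all-zeros per lane as SIMD comparisons produce (whether the bitwise vector ops compile depends on the compiler and target):

import core.simd;

// Lane-wise blend: where m is all-ones take t, where all-zeros take f.
int4 select(uint4 m, int4 t, int4 f)
{
    return (t & cast(int4)m) | (f & ~cast(int4)m);
}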


October 24, 2012
On 24 October 2012 23:46, Manu <turkeyman@gmail.com> wrote:
> Let's consider your example above for instance, I would rewrite (given
> existing syntax):
>
> // vector length of context = 1; current_mask = T
> int4 v = [0,3,4,1];
> int4 w = 3; // [3,3,3,3] via broadcast
> uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
> v += int4(1); // [1,4,5,2]
>
> // the if block is trivially rewritten:
> int4 trueSide = v + int4(2);
> int4 falseSide = v + int4(3);
> v = select(m, trueSide, falseSide); // [3,7,8,4]
>
>

This should work....

int4 trueSide = v + 2;
int4 falseSide = v + 3;
....


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
October 24, 2012
On 25 October 2012 02:01, Iain Buclaw <ibuclaw@ubuntu.com> wrote:

> On 24 October 2012 23:46, Manu <turkeyman@gmail.com> wrote:
>
> > Let's consider your example above for instance, I would rewrite (given
> > existing syntax):
> >
> > // vector length of context = 1; current_mask = T
> > int4 v = [0,3,4,1];
> > int4 w = 3; // [3,3,3,3] via broadcast
> > uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
> > v += int4(1); // [1,4,5,2]
> >
> > // the if block is trivially rewritten:
> > int4 trueSide = v + int4(2);
> > int4 falseSide = v + int4(3);
> > v = select(m, trueSide, falseSide); // [3,7,8,4]
> >
> >
>
> This should work....
>
> int4 trueSide = v + 2;
> int4 falseSide = v + 3;
>

Probably, just wasn't sure.


October 24, 2012
On 25 October 2012 00:16, Manu <turkeyman@gmail.com> wrote:
> On 25 October 2012 02:01, Iain Buclaw <ibuclaw@ubuntu.com> wrote:
>>
>> On 24 October 2012 23:46, Manu <turkeyman@gmail.com> wrote:
>>
>> > Let's consider your example above for instance, I would rewrite (given
>> > existing syntax):
>> >
>> > // vector length of context = 1; current_mask = T
>> > int4 v = [0,3,4,1];
>> > int4 w = 3; // [3,3,3,3] via broadcast
>> > uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
>> > v += int4(1); // [1,4,5,2]
>> >
>> > // the if block is trivially rewritten:
>> > int4 trueSide = v + int4(2);
>> > int4 falseSide = v + int4(3);
>> > v = select(m, trueSide, falseSide); // [3,7,8,4]
>> >
>> >
>>
>> This should work....
>>
>> int4 trueSide = v + 2;
>> int4 falseSide = v + 3;
>
>
> Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D array operations support. :-)
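
A quick illustration of that parity, assuming a compiler and target where 'int4' is supported:

import core.simd;

void example()
{
    int4 v = [0, 3, 4, 1];
    v += 2;                  // scalar broadcast on a vector

    int[4] a = [0, 3, 4, 1];
    a[] += 2;                // the equivalent D array operation
}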


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';