Thread overview
Vector operations optimization.
Mar 22, 2012
Comrad
Mar 22, 2012
Trass3r
Mar 23, 2012
Comrad
Mar 23, 2012
James Miller
Mar 23, 2012
Trass3r
Mar 23, 2012
Comrad
Mar 23, 2012
Dmitry Olshansky
Mar 23, 2012
Comrad
March 22, 2012
I'd like to try D in computational physics. One of the most appealing features of D is its implementation of arrays, but to be really usable it has to work FAST.
Here, at http://dlang.org/arrays.html, it is stated that:

"Implementation note: many of the more common vector operations are expected to take advantage of any vector math instructions available on the target computer."

What is the status at the moment? Which compiler and which compiler flags should I use to achieve maximum performance?

March 22, 2012
> What is the status at the moment? Which compiler and which compiler flags should I use to achieve maximum performance?

In general gdc or ldc. Not sure how good vectorization is though, esp. auto-vectorization.
On the other hand the so called vector operations like a[] = b[] + c[]; are lowered to hand-written SSE assembly even in dmd.
March 23, 2012
On Thursday, 22 March 2012 at 10:43:35 UTC, Trass3r wrote:
>> What is the status at the moment? Which compiler and which compiler flags should I use to achieve maximum performance?
>
> In general gdc or ldc. Not sure how good vectorization is though, esp. auto-vectorization.
> On the other hand the so called vector operations like a[] = b[] + c[]; are lowered to hand-written SSE assembly even in dmd.

I had such a snippet to test:

  import std.stdio;
  void main()
  {
    double[2] a=[1.,0.];
    double[2] a1=[1.,0.];
    double[2] a2=[1.,0.];
    double[2] a3=[0.,0.];
    foreach(i;0..1000000000)
      a3[]+=a[]+a1[]*a2[];
    writeln(a3);
  }

And I compared with the following d code:

  import std.stdio;
  void main()
  {
    double[2] a=[1.,0.];
    double[2] a1=[1.,0.];
    double[2] a2=[1.,0.];
    double[2] a3=[0.,0.];
    foreach(i;0..1000000000)
    {
      a3[0]+=a[0]+a1[0]*a2[0];
      a3[1]+=a[1]+a1[1]*a2[1];
    }
    writeln(a3);
  }

And with the following c code:

  #include <stdio.h>
  int main()
  {
    double a[2]={1.,0.};
    double a1[2]={1.,0.};
    double a2[2]={1.,0.};
    double a3[2]={0.,0.}; /* the accumulator must start at zero */
    unsigned i;
    for(i=0;i<1000000000;++i)
    {
      a3[0]+=a[0]+a1[0]*a2[0];
      a3[1]+=a[1]+a1[1]*a2[1];
    }
    printf("%f %f\n",a3[0],a3[1]);
    return 0;
  }

I compiled the last one with gcc and the two previous ones with dmd and ldc. The C code with -O2 was the fastest, and as fast as the D code without slicing compiled with ldc. The D code with slicing was 3 times slower (ldc compiler). I tried different optimization flags, but that didn't help. Maybe I used the wrong ones. Can someone comment on this?
March 23, 2012
On 23 March 2012 18:57, Comrad <comrad.karlovich@googlemail.com> wrote:
> I compiled the last one with gcc and the two previous ones with dmd and ldc.
> The C code with -O2 was the fastest, and as fast as the D code without
> slicing compiled with ldc. The D code with slicing was 3 times slower (ldc
> compiler). I tried different optimization flags, but that didn't help. Maybe
> I used the wrong ones. Can someone comment on this?

The flags you want are -O, -inline -release.

If you don't have those, then that might explain some of the slow down on slicing, since -release drops a ton of runtime checks.

Otherwise, I'm not sure why it's so much slower; the druntime array ops are written using SIMD instructions where available, so it should be fast.

--
James Miller
March 23, 2012
On 23.03.2012 9:57, Comrad wrote:
> I had such a snippet to test:
>
> import std.stdio;
> void main()
> {
>   double[2] a=[1.,0.];
>   double[2] a1=[1.,0.];
>   double[2] a2=[1.,0.];
>   double[2] a3=[0.,0.];

Here is the culprit: the array ops [] are tuned for arbitrarily long(!) arrays; they are not a single plain SIMD SSE op. They are handcrafted loops(!) of SSE ops, cool and fast for arrays in general, but not for fixed pairs/trios/etc. I believe this might change in the future, if the compiler is able to deduce that the size is fixed and use more optimal code for small sizes.
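
To make the point about arbitrary-length arrays concrete, here is a minimal sketch (not from the thread) that applies the same expression to long dynamic arrays, the case the array ops are described as being tuned for. The helper name accumulate and the length n are made up for illustration, and no timing is claimed.

```d
import std.stdio;

// Hypothetical helper: the expression from the benchmark, applied to
// slices of any (equal) length instead of fixed pairs.
void accumulate(double[] acc, double[] a, double[] a1, double[] a2)
{
    acc[] += a[] + a1[] * a2[];
}

void main()
{
    enum n = 1_000_000; // arbitrary length, chosen only for illustration

    auto a  = new double[](n);
    auto a1 = new double[](n);
    auto a2 = new double[](n);
    auto a3 = new double[](n);

    // double.init is NaN in D, so set every element explicitly.
    a[]  = 1.0;
    a1[] = 2.0;
    a2[] = 3.0;
    a3[] = 0.0;

    // One array op processes all n elements in a single call.
    accumulate(a3, a, a1, a2);

    writeln(a3[0], " ", a3[$ - 1]);
}
```

With arrays this long, any per-call overhead of the array op is spread over a million elements, whereas in the double[2] benchmark it is paid for every pair.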

>   foreach(i;0..1000000000)
>     a3[]+=a[]+a1[]*a2[];
>   writeln(a3);
> }
>


-- 
Dmitry Olshansky
March 23, 2012
> The flags you want are -O, -inline -release.
>
> If you don't have those, then that might explain some of the slow down
> on slicing, since -release drops a ton of runtime checks.

-noboundscheck option can also speed up things.
March 23, 2012
On Friday, 23 March 2012 at 10:48:55 UTC, Dmitry Olshansky wrote:
> Here is the culprit: the array ops [] are tuned for arbitrarily long(!) arrays; they are not a single plain SIMD SSE op. They are handcrafted loops(!) of SSE ops, cool and fast for arrays in general, but not for fixed pairs/trios/etc. I believe this might change in the future, if the compiler is able to deduce that the size is fixed and use more optimal code for small sizes.
>

So currently no such optimization exists in any D compiler?
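
For reference, one way to express a fixed pair of doubles as a single SIMD value is the double2 type from core.simd. This is only a sketch and nobody in the thread proposed it. It assumes a compiler and target where core.simd vector types are available (for example x86_64 with SSE2); the helper name step and the small iteration count are made up for illustration.

```d
import core.simd; // double2: two doubles held in one 128-bit vector
import std.stdio;

// Hypothetical helper: one iteration of the benchmark expression on
// explicit vector types rather than on array slices.
double2 step(double2 acc, double2 a, double2 a1, double2 a2)
{
    return acc + (a + a1 * a2);
}

void main()
{
    double2 a  = [1.0, 0.0];
    double2 a1 = [1.0, 0.0];
    double2 a2 = [1.0, 0.0];
    double2 a3 = [0.0, 0.0];

    // Far fewer iterations than the original benchmark, to keep the run short.
    foreach (i; 0 .. 1_000)
        a3 = step(a3, a, a1, a2);

    // .array exposes the vector as a static array for printing.
    writeln(a3.array);
}
```

Whether this is faster than the element-wise version would have to be measured per compiler; the sketch only shows the syntax.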

March 23, 2012
On Friday, 23 March 2012 at 11:20:59 UTC, Trass3r wrote:
>> The flags you want are -O, -inline -release.
>>
>> If you don't have those, then that might explain some of the slow down
>> on slicing, since -release drops a ton of runtime checks.
>
> -noboundscheck option can also speed up things.

dmd is very slow anyway; ldc2 was better, but still not fast enough.