August 10, 2008
Walter Bright wrote:
> This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster?
> 
> http://www.digitalmars.com/d/1.0/changelog.html
> http://ftp.digitalmars.com/dmd.1.034.zip
> 
> http://www.digitalmars.com/d/2.0/changelog.html
> http://ftp.digitalmars.com/dmd.2.018.zip

I intend to contribute some asm routines, but have been working on bigint operations (both Tango and Phobos) for the past couple of weeks.
August 10, 2008
bearophile wrote:
> Walter Bright:
>> If this happens, then it's worth verifying that the asm code is
>> actually being run by inserting a printf in it.
> 
> I presume I'll have to recompile Phobos for that.

Not really, it's easier to just copy that particular function out of the
library and paste it into your test module, that way it's easier to
experiment with.

>>> And I haven't yet seen SSE2 asm in my compiled programs :-)
>> The dmd compiler doesn't generate SSE2 instructions. But the
>> routines in internal\array*.d do.
> 
> I know. I was talking about the parts of the code that, for example,
> add the arrays; according to the Phobos source code they use SSE2,
> but in the final code that is produced they are absent.

I don't know what you mean. The SSE2 instructions are in internal/arrayint.d, and they do get compiled in.
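
A minimal sketch of the "copy the function into your test module and add a printf" check suggested above, assuming a current D2 compiler. `localSliceAdd` is a hypothetical stand-in written in plain D; it is not the real routine from internal/arrayint.d, which contains inline asm. The point is only the pattern: a trace line at the top of the local copy shows whether that code path runs.

```d
import core.stdc.stdio : printf;

// Hypothetical local stand-in for a routine copied out of the library.
// The name and body are illustrative only; the real routine uses inline asm.
int[] localSliceAdd(int[] a, const(int)[] b, const(int)[] c)
{
    // Trace line: if this never prints, this code path is not being run.
    printf("localSliceAdd called, length = %u\n", cast(uint) a.length);
    assert(a.length == b.length && b.length == c.length);
    foreach (i; 0 .. a.length)
        a[i] = b[i] + c[i];
    return a;
}

void main()
{
    auto a = new int[4];
    int[] b = [1, 2, 3, 4];
    int[] c = [10, 20, 30, 40];
    localSliceAdd(a, b, c);
    assert(a == [11, 22, 33, 44]);
}
```

Once the trace confirms the path is taken, remove the printf before timing anything, since the call itself would distort a benchmark.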
August 10, 2008
Don wrote:
> I intend to contribute some asm routines, but have been working on bigint operations (both Tango and Phobos) for the past couple of weeks.

Cool!
August 11, 2008
dsimcha wrote:
> == Quote from bearophile (bearophileHUGS@lycos.com)'s article
>> First benchmark, just D against itself, not used GCC yet, the results show that
>> vector ops are generally slower, but maybe there's some bug/problem in my
>> benchmark (note it needs just Phobos!), not tested on Linux yet:
> 
> I see at least part of the problem.  When you use such huge arrays, it ends up
> being more a test of your memory bandwidth than of the vector ops.  Three arrays
> of 80000 ints comes to a total of about 960k.  This is not going to fit in any L1
> cache for a long time.

Yes. The solution to that is to check for huge array sizes and use a different routine (one that uses prefetching) in that case. Actually, the most important place to do that is memcpy / array slice assignment, but I'm not sure it does. I think it just does a movsd.

So I think this is still a useful case to benchmark, though it's not the most important one.
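
The working-set arithmetic quoted above, and the idea of switching routines on array size, can be sketched as follows. This is an illustration only: the 32 KB figure is an assumed L1 data cache size, and `workingSetBytes` and `fitsInL1` are hypothetical helpers, not anything in Phobos.

```d
// Assumed L1 data cache size; real hardware varies, so treat this as a placeholder.
enum size_t l1Bytes = 32 * 1024;

// Total bytes touched by `arrays` arrays of `length` elements of type T.
size_t workingSetBytes(T)(size_t length, size_t arrays)
{
    return length * arrays * T.sizeof;
}

// A size check a library could use to pick between a plain routine and
// a (hypothetical) prefetching routine for large inputs.
bool fitsInL1(T)(size_t length, size_t arrays)
{
    return workingSetBytes!T(length, arrays) <= l1Bytes;
}

void main()
{
    // Three int arrays of 80000 elements: 3 * 80000 * 4 = 960000 bytes.
    assert(workingSetBytes!int(80_000, 3) == 960_000);
    assert(!fitsInL1!int(80_000, 3));

    // Three int arrays of 2000 elements: 3 * 2000 * 4 = 24000 bytes.
    assert(fitsInL1!int(2_000, 3));
}
```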
August 12, 2008
"Walter Bright" <newshound1@digitalmars.com> wrote in message news:g7na5s$qg0$1@digitalmars.com...
> bearophile wrote:
>> Walter Bright:
>>> If this happens, then it's worth verifying that the asm code is
>>> actually being run by inserting a printf in it.
>>
>> I presume I'll have to recompile Phobos for that.
>
> Not really, it's easier to just copy that particular function out of the
> library and paste it into your test module, that way it's easier to
> experiment with.
>
>>>> And I haven't yet seen SSE2 asm in my compiled programs :-)
>>> The dmd compiler doesn't generate SSE2 instructions. But the
>>> routines in internal\array*.d do.
>>
>> I know. I was talking about the parts of the code that, for example,
>> add the arrays; according to the Phobos source code they use SSE2,
>> but in the final code that is produced they are absent.
>
> I don't know what you mean. The SSE2 instructions are in internal/arrayint.d, and they do get compiled in.

The SSE2 is being used, but what would be nice would be the same code that Burton used for his benchmarks. Is that available?

Thanks,

- Dave

import std.stdio, std.date, std.conv;

void main(string[] args)
{
   if(args.length < 3)
   {
       writefln("usage: ",args[0]," <array size> <iterations>");
       return;
   }
   auto ASIZE = toInt(args[1]);
   auto ITERS = toInt(args[2]);
   writefln("Array Size = ",ASIZE,", Iterations = ",ITERS);
   int[] ia, ib, ic;
   ia = new int[ASIZE];
   ib = new int[ASIZE];
   ic = new int[ASIZE];
   ib[] = ic[] = 10;
   double[] da, db, dc;
   da = new double[ASIZE];
   db = new double[ASIZE];
   dc = new double[ASIZE];
   db[] = dc[] = 10.0;

   {
   ia[] = 0;
   int sum = 0;
   d_time s = getUTCtime();
   for(size_t i = 0; i < ITERS; i++)
   {
       sum += aops!(int)(ia,ib,ic);
   }
   d_time e = getUTCtime();
   writefln("intaops: ",(e - s) / 1000.0," secs, sum = ",sum);
   }

   {
   ia[] = 0;
   int sum = 0;
   d_time s = getUTCtime();
   for(size_t i = 0; i < ITERS; i++)
   {
       sum += loop!(int)(ia,ib,ic);
   }
   d_time e = getUTCtime();
   writefln("intloop: ",(e - s) / 1000.0," secs, sum = ",sum);
   }

   {
   da[] = 0.0;
   double sum = 0.0;
   d_time s = getUTCtime();
   for(size_t i = 0; i < ITERS; i++)
   {
       sum += aops!(double)(da,db,dc);
   }
   d_time e = getUTCtime();
   writefln("dfpaops: ",(e - s) / 1000.0," secs, sum = ",sum);
   }

   {
   da[] = 0.0;
   double sum = 0.0;
   d_time s = getUTCtime();
   for(size_t i = 0; i < ITERS; i++)
   {
       sum += loop!(double)(da,db,dc);
   }
   d_time e = getUTCtime();
   writefln("dfploop: ",(e - s) / 1000.0," secs, sum = ",sum);
   }
}

T aops(T)(T[] a, T[] b, T[] c)
{
   a[] = b[] + c[];
   return a[$-1];
}

T loop(T)(T[] a, T[] b, T[] c)
{
   foreach(i, inout val; a) val = b[i] + c[i];
   return a[$-1];
}

C:\Zz>dmd -O -inline -release top.d

C:\Zz>top 4000 100000
Array Size = 4000, Iterations = 100000
intaops: 0.204 secs, sum = 2000000
intloop: 0.515 secs, sum = 2000000
dfpaops: 0.625 secs, sum = 2e+06
dfploop: 0.563 secs, sum = 2e+06
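
For anyone rerunning this today: the listing above uses std.date and `inout` in foreach, which are no longer available in current compilers. Below is a reduced sketch of the same int comparison, assuming a current D2 compiler with `std.datetime.stopwatch`. `timeMsecs` is a helper introduced here for illustration; the array size matches the run above, while the iteration count is kept small and arbitrary. No timings are claimed for it.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Array-operation version, same shape as the original aops.
T aops(T)(T[] a, const(T)[] b, const(T)[] c)
{
    a[] = b[] + c[];
    return a[$ - 1];
}

// Explicit-loop version; `ref` replaces the D1-era `inout`.
T loop(T)(T[] a, const(T)[] b, const(T)[] c)
{
    foreach (i, ref val; a)
        val = b[i] + c[i];
    return a[$ - 1];
}

// Hypothetical helper: calls fn `iters` times, accumulates its result
// into `sum`, and returns the elapsed time in milliseconds.
long timeMsecs(alias fn, T)(T[] a, T[] b, T[] c, size_t iters, ref T sum)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. iters)
        sum += fn(a, b, c);
    sw.stop();
    return sw.peek.total!"msecs";
}

void main()
{
    enum size = 4000;
    enum iters = 1000;
    auto ia = new int[size];
    auto ib = new int[size];
    auto ic = new int[size];
    ib[] = 10;
    ic[] = 10;

    int sumOps, sumLoop;
    auto msOps = timeMsecs!(aops!int)(ia, ib, ic, iters, sumOps);
    ia[] = 0;
    auto msLoop = timeMsecs!(loop!int)(ia, ib, ic, iters, sumLoop);

    writefln("intaops: %s ms, sum = %s", msOps, sumOps);
    writefln("intloop: %s ms, sum = %s", msLoop, sumLoop);
    assert(sumOps == sumLoop); // both paths must compute the same result
}
```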

August 12, 2008
"Dave" <Dave_member@pathlink.com> wrote in message news:g7qr3h$2l6$1@digitalmars.com...
>
> "Walter Bright" <newshound1@digitalmars.com> wrote in message news:g7na5s$qg0$1@digitalmars.com...
>> bearophile wrote:
>>> Walter Bright:
>>>> If this happens, then it's worth verifying that the asm code is actually being run by inserting a printf in it.
>>>
>>> I presume I'll have to recompile Phobos for that.
>>
>> Not really, it's easier to just copy that particular function out of the library and paste it into your test module, that way it's easier to experiment with.
>>
>>>>> And I haven't yet seen SSE2 asm in my compiled programs :-)
>>>> The dmd compiler doesn't generate SSE2 instructions. But the routines in internal\array*.d do.
>>>
>>> I know. I was talking about the parts of the code that, for example, add the arrays; according to the Phobos source code they use SSE2, but in the final code that is produced they are absent.
>>
>> I don't know what you mean. The SSE2 instructions are in internal/arrayint.d, and they do get compiled in.
>
> The SSE2 is being used, but what would be nice would be the same code that Burton used for his benchmarks. Is that available?
>
> Thanks,
>
> - Dave
>

Before:

>
> C:\Zz>top 4000 100000
> Array Size = 4000, Iterations = 100000
> intaops: 0.204 secs, sum = 2000000
> intloop: 0.515 secs, sum = 2000000
> dfpaops: 0.625 secs, sum = 2e+06
> dfploop: 0.563 secs, sum = 2e+06
>

After adding an aligned case for _arraySliceSliceAddSliceAssign_d:

C:\Zz>top 4000 100000
Array Size = 4000, Iterations = 100000
intaops: 0.212 secs, sum = 2000000
intloop: 0.525 secs, sum = 2000000
dfpaops: 0.438 secs, sum = 2e+06
dfploop: 0.557 secs, sum = 2e+06
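
The post does not show the change itself, so the following is only a sketch of what an "aligned case" check can look like, assuming a current D2 compiler. SSE2 aligned loads and stores (movapd, movdqa) require 16-byte-aligned addresses, so a routine can test the three pointers once and branch to an aligned path. `isAligned16` and `canUseAlignedPath` are hypothetical names, not the ones used in Phobos.

```d
// True when the first element of the slice sits on a 16-byte boundary,
// which is what SSE2 aligned moves require.
bool isAligned16(T)(const(T)[] s)
{
    return (cast(size_t) s.ptr & 15) == 0;
}

// Hypothetical dispatch test: take the aligned path only if the
// destination and both sources are all 16-byte aligned.
bool canUseAlignedPath(T)(const(T)[] a, const(T)[] b, const(T)[] c)
{
    return isAligned16(a) && isAligned16(b) && isAligned16(c);
}

void main()
{
    auto buf = new double[16];

    // Find the first 16-byte-aligned element in the buffer.
    size_t i = 0;
    while ((cast(size_t)(buf.ptr + i) & 15) != 0)
        ++i;

    auto aligned = buf[i .. i + 4];
    auto shifted = buf[i + 1 .. i + 5]; // starts 8 bytes past a 16-byte boundary

    assert(isAligned16(aligned));
    assert(!isAligned16(shifted));
    assert(!canUseAlignedPath(aligned, aligned, shifted));
}
```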

;---

SiSoftware Sandra

Processor
Model : Intel(R) Core(TM)2 CPU          6700  @ 2.66GHz

Processor Cache(s)
Internal Data Cache : 32kB, Synchronous, Write-Thru, 8-way set, 64 byte line size
Internal Instruction Cache : 32kB, Synchronous, Write-Back, 8-way set, 64 byte line size
L2 On-board Cache : 4MB, ECC, Synchronous, ATC, 16-way set, 64 byte line size, 2 threads sharing
L2 Cache Multiplier : 1/1x  (2667MHz)


August 14, 2008
Walter Bright wrote:
> This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster?
> 
> http://www.digitalmars.com/d/1.0/changelog.html
> http://ftp.digitalmars.com/dmd.1.034.zip
> 
> http://www.digitalmars.com/d/2.0/changelog.html
> http://ftp.digitalmars.com/dmd.2.018.zip

My tests indicate that array operations also support ^ and ^=, but that's not listed in the spec. Not the first time that D's been better than advertised. <g>
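
A small illustration of the ^ and ^= array operations mentioned above, written for a current D2 compiler (whether it behaves identically on the 1.034 / 2.018 releases was not checked here). `xorSlices` is a helper name introduced only for this example.

```d
// Element-wise XOR of two equal-length slices into a new array.
int[] xorSlices(const(int)[] b, const(int)[] c)
{
    auto a = new int[b.length];
    a[] = b[] ^ c[];
    return a;
}

void main()
{
    int[] b = [0b1100, 0b1010, 0xFF];
    int[] c = [0b1010, 0b1010, 0x0F];

    auto a = xorSlices(b, c);
    assert(a == [0b0110, 0, 0xF0]);

    // XOR-assign with the same operand undoes the first XOR.
    a[] ^= c[];
    assert(a == b);
}
```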