March 04, 2013
On 4 March 2013 14:50, J <private@private-dont-email-dont-spam.com> wrote:

> On Monday, 4 March 2013 at 04:22:01 UTC, bearophile wrote:
>
>> So this should be better:
>>
>> http://codepad.org/B5b4uyBM
>>
>> Bye,
>> bearophile
>>
>
> @bearophile: Thank you!  Unfortunately the http://codepad.org/B5b4uyBM code runs a bit *slower* than the original D code. Yikes!
>
> $  gdmd -O -inline -release -noboundscheck -m64 bear.d -ofdbear
> $ time ./dbear
> -1015380632 859379360 -367726792 -1548829944
>
> real    2m36.971s
> user    2m36.910s
> sys 0m0.030s
> $ time ./dbear
> -1015380632 859379360 -367726792 -1548829944
>
> real    2m34.425s
> user    2m34.370s
> sys 0m0.020s
>
>
> @John Colvin: here is the disassembly of mmult() in both languages. Unfortunately I'm not literate in x86_64 assembly.  Perhaps the problem is obvious to you?  All I can really tell is that the g++ version is shorter.
>
> The memory allocation, when timed separately (comment out mmult), is less than 60 msec for either version, so I don't think it's a memory issue, although it could be a caching issue since the matrix layouts are different.
>

Using dynamic arrays of dynamic arrays that way is pretty poor form regardless of the language.

You should really use a single-dimensional array:
  int matrix[SIZE*SIZE];
And index via:
  matrix[y*SIZE+x]

(You can pretty this up in various ways, if this is too unsightly)
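
For example, here is a minimal sketch of one way to pretty it up in D (the FlatMatrix name and layout are just illustrative, not something from this thread):

struct FlatMatrix
{
    size_t rows, cols;
    int[] data;                          // one contiguous allocation

    this(size_t rows, size_t cols)
    {
        this.rows = rows;
        this.cols = cols;
        data = new int[](rows * cols);   // ints are zero-initialised
    }

    // m[y, x] maps to the flat index y*cols + x
    ref int opIndex(size_t y, size_t x)
    {
        return data[y * cols + x];
    }
}

// usage:  auto m = FlatMatrix(SIZE, SIZE);  m[3, 2] = 42;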


On 4 March 2013 14:02, John Colvin <john.loughran.colvin@gmail.com> wrote:

>     int[][] m = new int[][](rows, cols);
>

Does D support proper square arrays this way? Or does it just automate
allocation of the inner arrays?
Does it allocate all the associated memory in one block?


March 04, 2013
On Monday, 4 March 2013 at 04:49:20 UTC, Andrei Alexandrescu wrote:
> You're measuring the speed of a couple of tight loops. The smallest differences in codegen between them will be on the radar. Use straight for loops or "foreach (i; 0 .. limit)" for those loops...

Thanks Andrei!

I validated your analysis by doing a straight port of the C code to D, even using the same memory layout (matching malloc calls). The port was trivial, which was very reassuring. As evidence, I had to tweak only 8 lines of the C to get it to compile under gdc. The mmult() function in the D version remained identical to that in the C version.

Even more reassuring was that the performance of the resulting D binary matched the C to within 1% (about 200-500 msec slower on the D side, presumably due to runtime initialisation).

$ time ./gdc_compiled_ported_cpp_version
-1015380632 859379360 -367726792 -1548829944

real    1m32.418s
user    1m32.370s
sys 0m0.020s
$

It's a great feeling, knowing the bare metal is available if need be.

Thanks guys!

-J

March 04, 2013
On Monday, 4 March 2013 at 05:07:10 UTC, Manu wrote:
> Using dynamic arrays of dynamic arrays that way is pretty poor form regardless of the language.
>
> You should really use a single-dimensional array:
>   int matrix[SIZE*SIZE];
> And index via:
>   matrix[y*SIZE+x]
>
> (You can pretty this up in various ways, if this is too unsightly)
>
>
> On 4 March 2013 14:02, John Colvin <john.loughran.colvin@gmail.com> wrote:
>
>>     int[][] m = new int[][](rows, cols);
>>
>
> Does D support proper square arrays this way? Or does it just automate
> allocation of the inner arrays?
> Does it allocate all the associated memory in one block?

That's a really good point. I wonder if there is a canonical matrix type that would be preferred?

March 04, 2013
On Monday, 4 March 2013 at 08:02:46 UTC, J wrote:
> That's a really good point. I wonder if there is a canonical matrix type that would be preferred?

I'm not sure whether this is the recommended best practice for matrix handling in D at the moment (please advise if it is not), but with a little searching I found that the SciD library has nice matrices and MatrixViews (column-major storage, LAPACK compatible).

Now I like MatrixViews because they let me beat the original (clearly non-optimal) C matrix multiplication by a couple of seconds, and with operator overloading in place the D matrix multiplication code looks elegant.

One shootout benchmark down, eleven to go. :-)

- J

p.s. Am I right in concluding that there are no multimethods (multiple dispatch) in D? It seemed a little awkward to have to wrap the MatrixView in a new struct solely in order to overload multiplication. Is there a better way that I've missed?

# Beating the C/malloc/shootout-based code that ran in 1m32s:

$ time ./scid_matrix_test
-1015380632 859379360 -367726792 -1548829944

real    1m27.967s
user    1m27.930s
sys 0m0.030s
$ time ./scid_matrix_test
-1015380632 859379360 -367726792 -1548829944

real    1m28.259s
user    1m28.230s
sys 0m0.020s
$

module scid_matrix_test;

// compile: gdmd -O -release -inline -noboundscheck scid_matrix_test.d -L-L/home/you/pkg/scid/scid/ -L-lscid -L-L/usr/lib/x86_64-linux-gnu/ -L-lgfortran -L-lblas -L-llapack -L-L.

import scid.matrix;

import std.stdio, std.string, std.array, std.conv;

const int SIZE = 2000;

// Doesn't seem to have multiple dispatch / multimethods, so
// I guess we have to wrap MatrixView to implement opBinary!"*"(lhs,rhs)

struct Multipliable (T) {
  MatrixView!(T) _m;
  alias _m this;
  this(MatrixView!(T) src) {
    _m = src;
  }

  // The (op : "*") specialisation already restricts this operator to
  // multiplication, so no extra `if (op == "*")` constraint is needed.
  Multipliable!(T) opBinary(string op : "*")(ref Multipliable rhs) {
    // Allocate the result matrix with the struct's element type T.
    auto r = Multipliable!(T)(matrix!(T)(_m.rows, rhs.cols));
    return mmult(this, rhs, r);
  }

}


Multipliable!(int) mkmatrix(int rows, int cols)
{
    auto m = Multipliable!int();
    m._m = matrix!int(rows,cols);

    int count = 1;
    for(int i = 0; i < rows; ++i)
    {
        for(int j = 0; j < cols; ++j)
        {
            m[i,j] = count++;
        }
    }
    return m;
}



Multipliable!(int) mmult(ref Multipliable!(int) m1, ref Multipliable!(int) m2, ref Multipliable!(int) m3) {
    int i, j, k, val;
    ulong rows = m1.rows;
    ulong cols = m1.cols;
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            val = 0;
            for (k = 0; k < cols; k++) {
                val += m1[i, k] * m2[k, j];
            }
            m3[i, j] = val;
        }
    }
    return m3;
}


void main()
{
    auto m1 = mkmatrix(SIZE,SIZE);
    auto m2 = mkmatrix(SIZE,SIZE);
    auto mm = mkmatrix(SIZE,SIZE);

    mm = m1 * m2;

    writefln("%d %d %d %d",mm[0,0],mm[2,3],mm[3,2],mm[4,4]);

}

March 04, 2013
J wrote:

> @bearophile: Thank you!  Unfortunately the http://codepad.org/B5b4uyBM code runs a bit *slower* than the original D code. Yikes!
>
> $  gdmd -O -inline -release -noboundscheck -m64 bear.d -ofdbear
> $ time ./dbear
> -1015380632 859379360 -367726792 -1548829944
>
> real    2m36.971s
> user    2m36.910s
> sys 0m0.030s
> $ time ./dbear
> -1015380632 859379360 -367726792 -1548829944
>
> real    2m34.425s
> user    2m34.370s
> sys 0m0.020s

A bit better version:
http://codepad.org/jhbYxEgU

I think this code is about as good as the original (there are better algorithms), so maybe the introduced inefficiencies are small compiler problems worth finding and reporting.

------------------

Manu:

>> Does D support proper square arrays this way? Or does it just automate allocation of the inner arrays?
> Does it allocate all the associated memory in one block?

Maybe you should take a look at druntime code.

Bye,
bearophile

March 04, 2013
On Monday, 4 March 2013 at 14:59:21 UTC, bearophile wrote:
> Manu:
>
>> Does D support proper square arrays this way? Or does it just automate allocation of the inner arrays?
>> Does it allocate all the associated memory in one block?
>
> Maybe you should take a look at druntime code.
>
> Bye,
> bearophile

As far as I know it's just a way of automating the allocation; it expands out to the necessary loops.
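
If you do want all the data in one block while keeping the m[r][c] syntax, one option is to allocate a single slice and carve the rows out of it. Just a sketch; the contiguousMatrix name is illustrative, and this is not what new int[][](rows, cols) itself does:

int[][] contiguousMatrix(size_t rows, size_t cols)
{
    auto storage = new int[](rows * cols);   // single zero-initialised block
    auto m = new int[][](rows);              // only the array of row slices
    foreach (r; 0 .. rows)
        m[r] = storage[r * cols .. (r + 1) * cols];
    return m;
}
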
March 04, 2013
On Monday, 4 March 2013 at 04:15:41 UTC, bearophile wrote:
> John Colvin:
>
>> First things first:
>> You're not just timing the multiplication, you're timing the memory allocation as well. I suggest using http://dlang.org/phobos/std_datetime.html#StopWatch to do some proper timings in D
>
> Nope, what matters is the total program runtime.
>
> Bye,
> bearophile

The performance of the multiplication loops and the performance of the allocation are separate issues and should be measured as such, especially if one wants to make meaningful optimisations.
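
For example, a rough sketch of timing just the multiplication with the StopWatch mentioned above (this assumes the std.datetime StopWatch API of 2013-era Phobos; timeMsecs is just an illustrative helper, not something from the benchmark code):

import std.datetime;   // StopWatch
import std.stdio;

// Time a single operation in isolation, e.g. only the multiplication,
// after the matrices have already been allocated and filled.
long timeMsecs(void delegate() op)
{
    StopWatch sw;
    sw.start();
    op();
    sw.stop();
    return sw.peek().msecs;
}

// usage:  writefln("mmult: %s msecs", timeMsecs({ mmult(m1, m2, m3); }));
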
March 04, 2013
On Monday, 4 March 2013 at 14:59:21 UTC, bearophile wrote:
>
> A bit better version:
> http://codepad.org/jhbYxEgU
> Bye,
> bearophile

http://dpaste.dzfl.pl/ is back online btw

March 04, 2013
John Colvin:

> The performance of the multiplication loops and the performance of the allocation are separate issues and should be measured as such, especially if one wants to make meaningful optimisations.

If you want to improve the D compiler, druntime, etc., then I agree you have to separate the variables and test them one at a time. But if you are comparing languages+runtimes+libraries, then it's better not to cheat, and to test the whole (warmed) running time.


> http://dpaste.dzfl.pl/ is back online btw

dpaste is too JavaScript-heavy for very quick pasting. codepad is much lighter if I don't want to run D code.

Bye,
bearophile

March 04, 2013
> A bit better version:
> http://codepad.org/jhbYxEgU
>
> I think this code is good compared to the original (there are better algorithms).

You can make it much faster even without really changing the algorithm. Just by reversing the order of the inner two loops like this:

void matrixMult2(in int[][] m1, in int[][] m2, int[][] m3) pure nothrow {
    foreach (immutable i; 0 .. m1.length)
        foreach (immutable k; 0 .. m2[0].length)
            foreach (immutable j; 0 .. m3[0].length)
                m3[i][j] += m1[i][k] * m2[k][j];
}

you can make the code much more cache-friendly (because now you aren't iterating any matrix by column in the inner loop) and also allow the compiler to do auto-vectorization. Note that matrixMult2() accumulates into m3 with +=, so m3 must start out zeroed (a freshly allocated int array in D already is). matrixMult2() takes 2.6 seconds on my machine and the original version takes 72 seconds (both compiled with gdmd -O -inline -release -noboundscheck -mavx).

This isn't really relevant to the comparison with C++ in this thread, I just thought it may be useful for anyone writing matrix code.