March 04, 2013
On Monday, 4 March 2013 at 15:46:50 UTC, jerro wrote:
>> A bit better version:
>> http://codepad.org/jhbYxEgU
>>
>> I think this code is good compared to the original (there are better algorithms).
>
> You can make it much faster even without really changing the algorithm. Just by reversing the order of inner two loops like this:
>
> void matrixMult2(in int[][] m1, in int[][] m2, int[][] m3) pure nothrow {
>     foreach (immutable i; 0 .. m1.length)
>         foreach (immutable k; 0 .. m2[0].length)
>             foreach (immutable j; 0 .. m3[0].length)
>                 m3[i][j] += m1[i][k] * m2[k][j];
> }
>
> you can make the code much more cache friendly (because now you aren't iterating any matrix by column in the inner loop) and also allow the compiler to do auto vectorization. matrixMul2() takes 2.6 seconds on my machine and matrixMul() takes 72 seconds (both compiled with gdmd -O -inline -release -noboundscheck -mavx).
>
> This isn't really relevant to the comparison with C++ in this thread, I just thought it may be useful for anyone writing matrix code.

I forgot to set m3's elements to zero before adding to them:

void matrixMult2(in int[][] m1, in int[][] m2, int[][] m3) pure nothrow {
    foreach (immutable i; 0 .. m1.length)
    {
        m3[i][] = 0;

        foreach (immutable k; 0 .. m2.length) // k runs over the shared dimension (rows of m2)
            foreach (immutable j; 0 .. m3[0].length)
                m3[i][j] += m1[i][k] * m2[k][j];
    }
}

This does not make the function noticeably slower.
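For anyone who wants to convince themselves that the reordering computes the same product, here is a minimal C rendering of the two loop orders (my own sketch, not the benchmarked code; names are made up):

```c
#include <assert.h>
#include <string.h>

#define N 4

/* Naive i-j-k order: the inner loop walks m2 by column (stride N),
   which is the cache-unfriendly access pattern. */
static void mult_ijk(int m1[N][N], int m2[N][N], int m3[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int val = 0;
            for (int k = 0; k < N; k++)
                val += m1[i][k] * m2[k][j];
            m3[i][j] = val;
        }
}

/* Reordered i-k-j: the inner loop walks both m2[k] and m3[i] by row
   (stride 1), which is cache friendly and auto-vectorizable.  m3 must
   start zeroed because it now accumulates across the k loop. */
static void mult_ikj(int m1[N][N], int m2[N][N], int m3[N][N]) {
    memset(m3, 0, sizeof(int) * N * N);
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                m3[i][j] += m1[i][k] * m2[k][j];
}
```

Both produce identical results; only the memory access pattern differs.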
March 04, 2013
On Monday, 4 March 2013 at 15:44:40 UTC, bearophile wrote:
> John Colvin:
>
>> The performance of the multiplication loops and the performance of the allocation are separate issues and should be measured as such, especially if one wants to make meaningful optimisations.
>
> If you want to improve the D compiler, druntime, etc, then I agree you have to separate the variables and test them one at a time. But if you are comparing languages+runtimes+libraries then it's better to not cheat, and test the whole running (warmed) time.
>
> Bye,
> bearophile

I disagree. Information about which parts of the code are running fast and which are running slow is critical to optimisation. If you don't know whether it's the D memory allocation that's slow or the D multiplication loops, you're trying to optimise blindfolded.

Even if all you're doing is a comparison, it's a lot more useful to know *where* the slowdown is happening, so that you can make a meaningful analysis of the results.

Here's a strange example: I found that malloc'd multi-dimensional arrays were considerably faster to iterate over and assign to than D GC slices, even with the GC disabled after allocation and bounds checks turned off.

If I hadn't bothered to do separate timings of the allocation and the iteration, I would never have noticed this effect, and would instead have written it off as purely "malloc is faster at allocating than the GC".
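A minimal C sketch of the kind of split timing I mean (sizes and names are mine, purely for illustration):

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

#define N 500

/* Run the allocation phase and the iteration phase under separate
   timers, so a slowdown can be attributed to one or the other.
   Returns a checksum so the work can't be optimized away. */
static long timed_run(long *alloc_ticks, long *iter_ticks) {
    clock_t t0 = clock();
    int **m = malloc(N * sizeof *m);              /* allocation phase */
    for (int i = 0; i < N; i++)
        m[i] = malloc(N * sizeof **m);
    clock_t t1 = clock();

    long sum = 0;
    for (int i = 0; i < N; i++)                   /* iteration phase */
        for (int j = 0; j < N; j++) {
            m[i][j] = i + j;
            sum += m[i][j];
        }
    clock_t t2 = clock();

    *alloc_ticks = (long)(t1 - t0);
    *iter_ticks  = (long)(t2 - t1);
    for (int i = 0; i < N; i++) free(m[i]);
    free(m);
    return sum;
}
```

The two tick counts can then be compared across implementations independently, instead of as one opaque total.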
March 04, 2013
On 3/3/2013 8:50 PM, J wrote:
> Dump of assembler code for function _D6matrix5mmultFAAiAAiAAiZv:

Using obj2asm will get you much nicer looking assembler.
March 04, 2013
On 3/3/2013 7:48 PM, J wrote:
> void mmult(int[][] m1, int[][] m2, int[][] m3)
> {
>      foreach(int i, int[] m1i; m1)
>      {
>          foreach(int j, ref int m3ij; m3[i])
>          {
>              int val;
>              foreach(int k, int[] m2k; m2)
>              {
>                  val += m1i[k] * m2k[j];
>              }
>              m3ij = val;
>          }
>      }
> }
[...]
> ////// C++ version
> int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
>      int i, j, k, val;
>      for (i=0; i<rows; i++) {
>      for (j=0; j<cols; j++) {
>          val = 0;
>          for (k=0; k<cols; k++) {
>          val += m1[i][k] * m2[k][j];
>          }
>          m3[i][j] = val;
>      }
>      }
>      return(m3);
> }

One difference that jumps out at me is you have extra variables and ref types in the D version, and in the C++ version you have "cached" the row & column loop limits. (I.e. the C++ version assumes a rectangular matrix, while the D one has a (presumably) different length for each column.)

March 05, 2013
On 3/4/2013 9:00 AM, John Colvin wrote:
> On Monday, 4 March 2013 at 15:44:40 UTC, bearophile wrote:
>> John Colvin:
>>
>>> The performance of the multiplication loops and the performance of the
>>> allocation are separate issues and should be measured as such, especially if
>>> one wants to make meaningful optimisations.
>>
>> If you want to improve the D compiler, druntime, etc, then I agree you have to
>> separate the variables and test them one at a time. But if you are comparing
>> languages+runtimes+libraries then it's better to not cheat, and test the whole
>> running (warmed) time.
>>
>> Bye,
>> bearophile
>
> I disagree. Information about which parts of the code are running fast and which
> are running slow is critical to optimisation. If you don't know whether it's the
> D memory allocation that's slow or the D multiplication loops, you're trying to
> optimise blindfolded.

I agree with John. Correctly interpreting benchmark results will enable you to tune your application to run much faster, even without any changes to the compiler or runtime library.

March 05, 2013
On Monday, 4 March 2013 at 15:57:42 UTC, jerro wrote:
>> matrixMul2() takes 2.6 seconds on my machine and matrixMul() takes 72 seconds (both compiled with gdmd -O -inline -release -noboundscheck -mavx).

Thanks Jerro. You made me realize that help from the experts could be quite useful.  I plugged in a call to the BLAS matrix multiply routine, which SciD conveniently binds.

The result?  My 2000x2000 matrix multiply went from 98 seconds down to 1.8 seconds.  It's just hilariously faster to use 20 years of numerical experts' optimized code than to try to write your own.


// screaming fast version - uses BLAS for 50x speedup over naive code.
//
Multipliable!(T) mmult2(T)(ref Multipliable!(T) m1,
                         ref Multipliable!(T) m2,
                         ref Multipliable!(T) m3) {
    m3.array[] = 0;

    assert(m1.cols == m2.rows);

    char ntran = 'N';
    double one = 1.0;
    double zero = 0.0;
    int nrow = cast(int)m1.rows;
    int ncol = cast(int)m1.cols;
    int mcol = cast(int)m2.cols;

    scid.bindings.blas.blas.dgemm_(&ntran, // transa
                                   &ntran, // transb
                                   &nrow,  // m
                                   &mcol,  // n
                                   &ncol,  // k
                                   &one,   // alpha
                                   m1.array.ptr, // A
                                   &nrow,        // lda
                                   m2.array.ptr, // B
                                   &ncol,        // ldb
                                   &zero,        // beta
                                   m3.array.ptr, // C
                                   &nrow,        // ldc
                                   nrow,         // transa_len
                                   ncol);        // transb_len
    return m3;
}
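For readers who haven't used BLAS: DGEMM with 'N'/'N' computes C = alpha*A*B + beta*C on column-major data, which is why the call above passes &one and &zero and uses those leading dimensions. A naive C reference of the same semantics (my own sketch for checking small cases, not SciD's code):

```c
#include <assert.h>

/* Reference for what DGEMM('N', 'N', ...) computes:
   C = alpha*A*B + beta*C, with every matrix stored column-major,
   i.e. element (i, j) lives at [i + j * ld]. */
static void ref_gemm(int m, int n, int k, double alpha,
                     const double *A, int lda,
                     const double *B, int ldb,
                     double beta, double *C, int ldc) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double acc = 0.0;
            for (int p = 0; p < k; p++)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
}
```

With alpha = 1 and beta = 0 this is a plain matrix product, matching the &one/&zero arguments in the D code above.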

March 05, 2013
On Monday, 4 March 2013 at 12:28:25 UTC, J wrote:
> On Monday, 4 March 2013 at 08:02:46 UTC, J wrote:
>> That's a really good point. I wonder if there is a canonical matrix that would be preferred?
>
> I'm not sure if they are the recommended/best practice for matrix handling in D at the moment (please advise if they are not), but with a little searching, I found that the SciD library has nice matrices and MatrixViews (column-major storage, LAPACK compatible).
>
> Now I like MatrixViews because they let me beat the original (clearly non-optimal) C matrix multiplication by a couple seconds, and the D code with operator overloading in place makes matrix multiplication look elegant.
>
> One shootout benchmark down, eleven to go. :-)
>
> - J
>
> p.s. Am I right in concluding there are no multimethods (multiple dispatch) in D... it seemed a little awkward to have to wrap the MatrixView in a new struct solely in order to overload multiplication. Is there a better way that I've missed?

I'm the author of SciD.  It's great that you found it useful! :)

When I wrote scid.matrix and scid.linalg, it was because I needed some linear algebra functionality for a project at work.  I knew I wouldn't have the time to write a full-featured linear algebra library, so I intentionally gave it a minimalistic design, and only included the stuff I needed.  That's also why I used the name "MatrixView" rather than "Matrix"; it was only supposed to be a convenient way to view an array as a matrix, and not a full matrix type.

Later, Cristi Cobzarenco forked my library for Google Summer of Code 2011, with David Simcha as his mentor.  They removed almost everything but the linear algebra modules, and redesigned those from scratch.  So basically, there is pretty much nothing left of my code in their library. ;)  I don't think they ever completed the project, but I believe parts of it are usable.  You'll find it here:

    https://github.com/cristicbz/scid

AFAIK, their goal was to use expression templates in order to transform D expressions like

    x*A*B + y*C

where A, B and C are matrices, and x and y are scalars, into as few optimised BLAS calls as possible -- e.g., a single GEMM call in the example above.  I don't know how far they got on this, though.

Lars
March 05, 2013
On 03/04/2013 04:46 PM, jerro wrote:
>> A bit better version:
>> http://codepad.org/jhbYxEgU
>>
>> I think this code is good compared to the original (there are better
>> algorithms).
>
> You can make it much faster even without really changing the algorithm.
> Just by reversing the order of inner two loops like this:
>
> void matrixMult2(in int[][] m1, in int[][] m2, int[][] m3) pure nothrow {
>      foreach (immutable i; 0 .. m1.length)
>          foreach (immutable k; 0 .. m2[0].length)
>              foreach (immutable j; 0 .. m3[0].length)
>                  m3[i][j] += m1[i][k] * m2[k][j];
> }
>
> you can make the code much more cache friendly (because now you aren't
> iterating any matrix by column in the inner loop) and also allow the
> compiler to do auto vectorization. matrixMul2() takes 2.6 seconds on my
> machine and matrixMul() takes 72 seconds (both compiled with gdmd -O
> -inline -release -noboundscheck -mavx).
>
> This isn't really relevant to the comparison with C++ in this thread, I
> just thought it may be useful for anyone writing matrix code.

This is a little faster for 2000x2000 matrices and typical cache size:

import std.range : iota;
import std.algorithm : min;

void matrixMult(in int[][] m1, in int[][] m2, int[][] m3){
    enum block = 50; // optimal parameter depends on hardware
    foreach(row; m3)
        row[] = 0; // m3 accumulates across blocks, so it must start zeroed
    foreach(ib; iota(0, m1.length, block))
    foreach(kb; iota(0, m2.length, block))
    foreach(i; iota(ib, min(m1.length, ib + block)))
    foreach(k; iota(kb, min(m2.length, kb + block)))
        m3[i][] += m1[i][k] * m2[k][];
}
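The same blocking idea rendered in C, with the accumulator explicitly zeroed (the block size is illustrative; as noted above, the optimal value depends on the hardware):

```c
#include <assert.h>
#include <string.h>

#define N 8
#define BLOCK 4  /* optimal value depends on cache size */

static int min_i(int a, int b) { return a < b ? a : b; }

/* Blocked i-k-j multiply: process the i and k loops in BLOCK-sized
   tiles so the rows of m2 and m3 touched by the inner loop stay in
   cache while they are being reused. */
static void mult_blocked(int m1[N][N], int m2[N][N], int m3[N][N]) {
    memset(m3, 0, sizeof(int) * N * N);
    for (int ib = 0; ib < N; ib += BLOCK)
        for (int kb = 0; kb < N; kb += BLOCK)
            for (int i = ib; i < min_i(N, ib + BLOCK); i++)
                for (int k = kb; k < min_i(N, kb + BLOCK); k++)
                    for (int j = 0; j < N; j++)
                        m3[i][j] += m1[i][k] * m2[k][j];
}
```

The tiling changes only the order in which the partial products are accumulated, so the result is identical to the naive product.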
