March 03, 2020 Re: Improving dot product for standard multidimensional D arrays
Posted in reply to 9il

It is difficult to write an efficient matrix-matrix multiplication in any language. If you want a fair comparison, implement your naive method in Python and compare those timings.

On Tue, 3 Mar 2020 at 04:20, 9il via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote:

> On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
> > Hello again,
> >
> > Thanks to previous thread on multidimensional arrays, I managed to play around with pure D matrix representations and even benchmark a little against numpy:
> >
> > [...]
>
> Matrix multiplication is about cache-friendly blocking. https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf
>
> `mir-blas` package can be used for matrix operations for ndslice.
> `cblas` - if you want to work with your own matrix type.
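The cache-friendly blocking mentioned above can be sketched as follows. This is an illustration in Python rather than D, kept tiny for clarity; `matmul_blocked` and the block size `B` are illustrative names (a real implementation would tune `B` to the cache and use a compiled language):

```python
def matmul_naive(a, b):
    # straightforward triple loop, for checking the blocked version
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_blocked(a, b, B=2):
    # iterate over B x B tiles so each tile of a, b and c can stay
    # cache-resident while it is reused
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, B):
        for kk in range(0, m, B):
            for jj in range(0, p, B):
                for i in range(ii, min(ii + B, n)):
                    for k in range(kk, min(kk + B, m)):
                        for j in range(jj, min(jj + B, p)):
                            c[i][j] += a[i][k] * b[k][j]
    return c

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_blocked(a, b) == matmul_naive(a, b))  # True
```

The reordering of work into tiles does not change the result, only the memory access pattern; that is the whole point of the Goto paper's approach.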
March 03, 2020 Re: Improving dot product for standard multidimensional D arrays
Posted in reply to maarten van damme

On Tuesday, 3 March 2020 at 10:25:27 UTC, maarten van damme wrote:
> it is difficult to write an efficient matrix matrix multiplication in any language. If you want a fair comparison, implement your naive method in python and compare those timings.
>
> On Tue, 3 Mar 2020 at 04:20, 9il via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote:
>
>> On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:
>> > [...]
>>
>> Matrix multiplication is about cache-friendly blocking. https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf
>>
>> `mir-blas` package can be used for matrix operations for ndslice.
>> `cblas` - if you want to work with your own matrix type .
Yeah, got it. After some reading, I understand that it's not trivial once bigger matrices are involved.
March 03, 2020 Re: Improving dot product for standard multidimensional D arrays
Posted in reply to maarten van damme

On Tuesday, 3 March 2020 at 10:25:27 UTC, maarten van damme wrote:
> it is difficult to write an efficient matrix matrix multiplication in any language. If you want a fair comparison, implement your naive method in python and compare those timings.
> [snip]
And of course there's going to be a big slowdown when using native Python. NumPy basically calls BLAS in the background. A naive C implementation might be another comparison.
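The naive-Python baseline suggested above could be sketched like this (a pure-Python triple loop with a timer; `matmul_naive` is an illustrative name, absolute timings depend on the machine, and NumPy would beat this by orders of magnitude precisely because it delegates to BLAS):

```python
import time

def matmul_naive(a, b):
    # textbook triple loop, no vectorization, no blocking
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c

n = 100
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]

t0 = time.perf_counter()
c = matmul_naive(a, b)
print(f"naive python, {n}x{n}: {time.perf_counter() - t0:.3f} s")
```

Timing this against the same triple loop in D (or C) isolates the interpreter overhead from the algorithmic cost, which is the fair comparison maarten is asking for.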
March 03, 2020 Re: Improving dot product for standard multidimensional D arrays
Posted in reply to p.shkadzko

On Sunday, 1 March 2020 at 20:58:42 UTC, p.shkadzko wrote:

> pragma(inline) static int toIdx(T)(Matrix!T m, in int i, in int j)
> {
>     return m.cols * i + j;
> }

This is row-major order [1].

BTW: Why don't you make toIdx a member of Matrix? It saves one parameter. You may also define opIndex as

    ref T opIndex(in int r, in int c)

Then the innermost summation becomes more readable:

    m3[i, j] += m1[i, k] * m2[k, j];

How about performing an in-place transposition of m2 before performing the dot product? Then you can rewrite the innermost loop:

    m3[i, j] += m1[i, k] * m2[j, k]; // note: j and k swapped

This should avoid the costly jumping through memory. A good starting point for a performance analysis would be looking over the assembler code of the innermost loop.

[1] https://en.wikipedia.org/wiki/Row_major
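The transpose-then-multiply idea can be sanity-checked with a small sketch (Python here for brevity; `matmul` and `matmul_transposed` are illustrative names, not the poster's Matrix type). With row-major storage, transposing the right operand once means the inner loop walks both operands sequentially:

```python
def matmul(a, b):
    # baseline: b is walked column-wise in the inner loop (cache-hostile
    # for row-major storage)
    m, n, p = len(a), len(a[0]), len(b[0])
    c = [[0.0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_transposed(a, b):
    bt = [list(col) for col in zip(*b)]   # transpose b once up front
    m, n, p = len(a), len(a[0]), len(b[0])
    c = [[0.0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            s = 0.0
            for k in range(n):
                s += a[i][k] * bt[j][k]   # note: j and k swapped
            c[i][j] = s
    return c

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_transposed(a, b) == matmul(a, b))  # True
```

The one-time O(n²) transposition is amortized over the O(n³) multiply, so it pays off quickly as the matrices grow.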
March 04, 2020 Re: Improving dot product for standard multidimensional D arrays
Posted in reply to p.shkadzko

On 01.03.20 21:58, p.shkadzko wrote:

> ******************************************************************
> Matrix!T matrixDotProduct(T)(Matrix!T m1, Matrix!T m2)
> in
> {
>     assert(m1.rows == m2.cols);

This asserts that the result is a square matrix. I think you want `m1.cols == m2.rows` instead.

> }
> do
> {
>     Matrix!T m3 = Matrix!T(m1.rows, m2.cols);
>
>     for (int i; i < m1.rows; ++i)
>     {
>         for (int j; j < m2.cols; ++j)
>         {
>             for (int k; k < m2.rows; ++k)
>             {
>                 m3.data[toIdx(m3, i, j)] += m1[i, k] * m2[k, j];
>             }
>         }
>     }
>     return m3;
> }
> ******************************************************************
> ...
> I can see that accessing the appropriate array member in Matrix.data is costly due to toIdx operation but, I can hardly explain why it gets so much costly. Maybe there is a better way to do it after all?

Changing the order of the second and third loop probably goes a pretty long way in terms of cache efficiency:

    Matrix!T matrixDotProduct(T)(Matrix!T m1, Matrix!T m2)
    in
    {
        assert(m1.cols == m2.rows);
    }
    do
    {
        int m = m1.rows, n = m1.cols, p = m2.cols;
        Matrix!T m3 = Matrix!T(m, p);
        foreach (i; 0 .. m)
            foreach (j; 0 .. n)
                foreach (k; 0 .. p)
                    m3.data[i*p + k] += m1.data[i*n + j] * m2.data[j*p + k];
        return m3;
    }

(Untested.)
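A quick way to convince yourself that this loop reordering preserves the result is a side-by-side check, sketched here in Python for brevity and mirroring the index names of the D snippet above (`matmul_ijk` and `matmul_ikj` are illustrative names):

```python
def matmul_ijk(a, b):
    # original order: result row, result column, then the summed index;
    # the inner loop walks b column-wise
    m, n, p = len(a), len(a[0]), len(b[0])
    c = [[0.0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_ikj(a, b):
    # reordered: j now indexes the shared dimension, and the inner loop
    # walks both b and c row-wise, the cache-friendly direction
    m, n, p = len(a), len(a[0]), len(b[0])
    c = [[0.0] * p for _ in range(m)]
    for i in range(m):
        for j in range(n):
            aij = a[i][j]
            for k in range(p):
                c[i][k] += aij * b[j][k]
    return c

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(matmul_ikj(a, b) == matmul_ijk(a, b))  # True
```

Since each product a[i][j] * b[j][k] is accumulated exactly once either way, only the order of the additions changes; for floating point that can perturb rounding slightly, but the algorithm is the same.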
Copyright © 1999-2021 by the D Language Foundation