February 14, 2014
On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:
> As I am very new to D I instantly ran into certain optimization issues. E.g. the simple matrix multiplication based on my Java implementation requires about 1.5 secs for multiplying two 1000x1000 matrices; however, my D implementation requires 39 seconds without compiler optimizations and about 14 seconds with -inline, -O, -noboundscheck and -release. So I am sure that I have just missed some possible optimization routines for D and that I could be missing entire patterns used in D to write more optimized code - especially when looking at the const and immutability concepts.

Try with and without -noboundscheck. Sometimes using it seems to
make things slower, which is odd.

Also, compiling for 64-bit may help significantly. When you
compile for 32-bit, you won't get the benefit of SIMD
instructions, as they're not guaranteed to be supported by the
CPU. Since Java runs a JIT, it can use these SIMD instructions.

Also, you'll get significantly faster code if you use LDC or GDC.

> class Matrix(T = double) {
> 	private T[] data;
> 	private Dimension dim;
> }

This should be a final class, or a struct; you don't need
virtual methods.
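For instance, the two suggested alternatives could look like this (a minimal sketch; the Dimension field from the original post is omitted for brevity):

```d
// Sketch: `final` removes virtual dispatch, so methods can be inlined.
final class Matrix(T = double) {
    private T[] data;
}

// Or, avoiding the per-object class allocation entirely, a value type:
struct MatrixValue(T = double) {
    private T[] data;
}
```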

>
> This is my class with its templated data as a one dimensional array (as I don't like jagged-arrays) and a dimension (which is a struct). The dimension object has some util functionality such as getting the total size or mapping (row, col) to a one-dimensional index besides the getter for rows and columns.
>
> this(size_t rows, size_t cols) {
> 	this.dim = Dimension(rows, cols);
> 	this.data = new T[this.dim.size];
> 	enum nil = to!T(0);
> 	foreach(ref T element; this.data) element = nil;
> }

You don't need to!T for setting to 0; just use a plain 0. Using
to!T will probably result in overhead. Note that it may be
evaluated every time you use 'nil', so using a plain constant
will help.
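For example, the whole zeroing loop can be collapsed into one slice assignment (a sketch; note that a freshly allocated double array starts as double.init, i.e. NaN, not 0):

```d
// Sketch: replace `foreach(ref e; data) e = to!T(0);` with bulk assignment.
auto data = new double[](1000); // elements start as double.init (NaN)
data[] = 0;                     // array-wise assignment: no to!T, no loop
```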


> T opIndex(size_t row, size_t col) const {
> 	immutable size_t i = this.dim.offset(row, col);
> 	if (i >= this.dim.size) {
> 		// TODO - have to learn exception handling in D first. :P
> 	}
> 	return this.data[i];
> }

You're using -noboundscheck and then manually implementing bounds
checks, probably less efficiently than the compiler does.
Either change this to an assert, or remove it. In either case,
the compiler will do its own built-in bounds checking.

This call is especially expensive since every method is virtual:
you specified class instead of final class (or a struct), so
none of your Matrix calls will be inlined.

>
> Which calls:
>
> size_t size() @property const {
> 	return this.rows * this.cols;
> }
>
> I think and hope that this is getting optimized via inlining. =)
> This works similar for opIndexAssign.

Not unless you use a final class / struct.

>
> The function I am mainly benchmarking is the simple matrix multiplication where one of the multiplied matrices is tranposed first in order to improve cache hit ratio.
>
> Matrix opMul(ref const(Matrix) other) {
> 	if (this.dim.rows != other.dim.cols || this.dim.cols != other.dim.rows) {
> 		// TODO - still have to learn exception handling first ...
> 	}
> 	auto m = new Matrix(this.dim.rows, other.dim.cols);
> 	auto s = new Matrix(other).transposeAssign();
> 	size_t r1, r2, c;
> 	T sum;
> 	for (r1 = 0; r1 < this.dim.rows; ++r1) {
> 		for (r2 = 0; r2 < other.dim.rows; ++r2) {
> 			sum = to!T(0);
> 			for (c = 0; c < this.dim.cols; ++c) {
> 				sum += this[r1, c] * other[r2, c];
> 			}
> 			m[r1, r2] = sum;
> 		}
> 	}
> 	return m;
> }

These allocations may potentially hurt.
Also, again, the virtual calls here are a huge hit.
Using to!T(0) is also a waste of performance, as you can just
use a plain 0.
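For comparison, here is a sketch of the same transposed-inner-product loop written as a free function over plain slices (hypothetical names; this sidesteps the virtual opIndex calls entirely):

```d
// Sketch: multiply row-major a (rows x inner) by an already-transposed
// bT (cols x inner); each result element is a dot product of two rows.
double[] mulTransposed(const double[] a, const double[] bT,
                       size_t rows, size_t inner, size_t cols)
{
    auto result = new double[](rows * cols);
    foreach (i; 0 .. rows) {
        const arow = a[i * inner .. (i + 1) * inner];
        foreach (j; 0 .. cols) {
            const brow = bT[j * inner .. (j + 1) * inner];
            double sum = 0;
            foreach (k; 0 .. inner)
                sum += arow[k] * brow[k];
            result[i * cols + j] = sum;
        }
    }
    return result;
}
```

Multiplying a 2x2 matrix by the (transposed) identity should give the matrix back unchanged.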


> I am aware that there are faster algorithms for matrix multiplication but I am mainly learning how to write efficient D code in general and not based on any algorithm.

Using SIMD instructions manually would of course be much, much
faster, but I understand that it defeats the point somewhat and
is much more ugly.

>
> This is more or less the same algorithm I am using in Java and C++. Both require about 1.5 secs (after Java warm-up) to perform this task for two 1000x1000 matrices. D compiled with DMD takes about 14 seconds with all (known-to-me) optimization flags activated (listed above).

If you used LDC or GDC 64-bit with the changes above, I'd guess
it would be similar. I wouldn't expect D to out-perform Java much
for this particular scenario, there's very little going on and
the JIT should be able to optimize it quite well with SIMD stuff.

>
> I wanted to make s an immutable matrix in order to hopefully improve performance via this change, however I wasn't technically able to do this to be honest.

I don't think this would improve performance.
February 14, 2014
On 2/14/2014 11:00 AM, Robin wrote:
>
> class Matrix(T = double) {
>      private T[] data;
>      private Dimension dim;
> }
>

A matrix is just plain-old-data, so use a struct, you don't need a class.

A struct will be much more lightweight: A struct doesn't normally involve memory allocations like a class does, and you'll get better data locality and less indirection, even compared to a final class.

>
> I am using opIndex and opIndexAssign in order to access and assign the
> matrix values:
>
> T opIndex(size_t row, size_t col) const {
>      immutable size_t i = this.dim.offset(row, col);
>      if (i >= this.dim.size) {
>          // TODO - have to learn exception handling in D first. :P
>      }
>      return this.data[i];
> }

No need for the bounds check. D already does bounds checks automatically (unless you compile with -noboundscheck, but the whole *point* of that flag is to disable bounds checks.)

But that said, I don't know whether the compiler might already be optimizing out your bounds check anyway. So try it and profile, see what happens.

>
> Another nice thing to know would be if it is possible to initialize an
> array before it is default initialized with T.init where T is the type
> of the array's fields. In C++ e.g. there is no default initialization
> which is nice if you have to initialize every single field anyway. E.g.
> in a Matrix.random() method which creates a matrix with random values.
> There it is unnecessary that the (sometimes huge) array is completely
> initialized with the type's init value.

You can opt-out of the default initialization with a void initializer: http://dlang.org/declaration.html#VoidInitializer

Although to be honest I forget how to do that for arrays, and the functions other people already suggested for creating/initing your array probably already do that anyway.
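For dynamic arrays, the std.array helpers others mentioned look roughly like this (a sketch, assuming the standard library of the time):

```d
import std.array : minimallyInitializedArray, uninitializedArray;

// Sketch: skip the default T.init initialization of a new array.
// uninitializedArray leaves the memory completely raw (only safe if you
// overwrite every element before reading it); minimallyInitializedArray
// zeroes just enough to keep the GC from chasing garbage pointers.
auto a = uninitializedArray!(double[])(1_000_000);
a[] = 0; // the caller is responsible for filling it before use
```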

February 14, 2014
Nick Sabalausky:

>> T opIndex(size_t row, size_t col) const {
>>     immutable size_t i = this.dim.offset(row, col);
>>     if (i >= this.dim.size) {
>>         // TODO - have to learn exception handling in D first. :P
>>     }
>>     return this.data[i];
>> }
>
> No need for the bounds check. D already does bounds checks automatically

But the row-column bounds of the matrix are not the same as the 1D bounds of the 1D array. You can have out-of-column-bounds and still be inside the 1D bounds. So that test should go in the precondition and it should test row and col separately:

T opIndex(in size_t row, in size_t col) const pure nothrow
in {
    assert(row < nRows);
    assert(col < nCols);
} body {
    return data[dim.offset(row, col)];
}

Bye,
bearophile
February 15, 2014
Hiho,

wow, first of all: this community is awesome - so many kind and interesting answers. =)

With your help I was able to achieve a performance boost for several different operations.

Some benchmarks:

Allocation of five 10,000 x 10,000 matrices in a row:
Before: ~8.2 seconds
After: ~2.3 seconds (with minimallyInitializedArray!)

Multiplication of two 1000x1000 matrices:
Before: ~14.8 seconds
After: ~4.3 seconds

However, I think there is still much potential to further optimize this code. Let me tell you the main changes I have made to the code so far ...

Matrix is still a class, but I changed it to a final class, preventing matrix methods from being virtual. Dimension is now a final struct (I don't know if 'final' affects structs in any way though ...). This alone gave the multiplication a huge performance boost.

When converting the Matrix from a class to a struct, the multiplication time even dropped from ~4.3 seconds to about 3.6 seconds. However, I am currently not sure whether I want matrices to be structs (value types).

Besides that, I tried to add nothrow and pure as attributes to every possible method. However, as it turned out, I wasn't able to add the pure modifier to any method, as I always ended up calling an impure method from it (as stated by the compiler). This actually made sense most of the time, and I think the only pure methods now are the constructor methods of the Dimension struct.

What may still speed up things?

In my tests it turned out that the simple code:

auto m1 = Matrix!double.random(1000, 1000, -10, 10, 0.25);
auto m2 = Matrix!double.random(1000, 1000, -10, 10, 0.25);
auto m3 = m1 * m2;

Called the normal copy constructor. This is sad, as it would be a huge performance boost if it made use of move semantics. (In the C++ matrix code this scenario would actually call the move assignment operator for matrix m3, which is much, much faster than copying.)

But I haven't figured out yet how to use move semantics in D with class objects. Or is that just possible with struct value types?

I have also tried the LDC compiler. However, it gave me some strange bugs. E.g. it didn't like the following line:

ref Matrix transpose() const {
	return new Matrix(this).transposeAssign();
}

And it forced me to change it to:

ref Matrix transpose() const {
	auto m = new Matrix(this);
	m.transposeAssign();
	return m;
}

Which is kind of ugly ...
I hope that this gets fixed soon, as I assume it is correct D code, since DMD is able to compile it correctly.

Some of you suggested that transposing one matrix before the multiplication takes place must be slower than multiplying without the transposition. In my former Java and C++ programs I have already tested both strategies, and it turned out that cache efficiency DOES matter, of course. There are also papers explaining why a transposed matrix multiplication may be faster than one without transposing.
In the end this is just a test suite, and there is already an even faster approach to matrix multiplication which runs in O(n^2.8) instead of the O(n^3) of my current simple solution. And with the power of OpenCL (or similar) one could even lift the matrix multiplication onto the GPU and boom.^^ But the current task is to find generally optimized code for the D language - this thread has already helped me a lot!

I wasn't aware that D is actually capable of lazy evaluation beyond the pseudo-lazy short-circuiting of if conditions, which is kind of cool. However, I still have to read up on it in order to maybe find a use for it in this code.

SIMD instructions also sound extremely cool but a little bit too complex regarding the fact that I am fairly new to D and still learning basics of this language.

In the end I want to mention something which I found very irritating. Bearophile correctly stated that my pre-conditions for the index operators were all wrong and corrected my code with some smooth additions:

T opIndex(in size_t row, in size_t col) const nothrow
in {
    assert(row < nRows);
    assert(col < nCols);
} body {
    return data[dim.offset(row, col)];
}

The in and body blocks are cool, as far as I understand what they are for. However, in the benchmarks they had a clear and noticeable negative impact on the matrix multiplication, which rose from ~4 seconds to ~8 seconds with his code compared to mine. When leaving out the in and body blocks and using only one normal block as follows, the task stayed at the 4-second duration. So I think that the compiler won't optimize things in an 'in' code block - is that right?

T opIndex(in size_t row, in size_t col) const nothrow {
    assert(row < nRows);
    assert(col < nCols);
    return data[dim.offset(row, col)];
}

Thanks again for all your helpful comments and thanks in advance - I am eagerly looking forward for your future comments! =)

Robin
February 15, 2014
On Saturday, 15 February 2014 at 23:06:27 UTC, Robin wrote:

> Matrix is still a class but I changed it to a final class preventing matrix methods to be virtual. Dimension is now a final struct (don't know if 'final' is affecting structs in any way though ...).

It doesn't. It may even be disallowed someday, when it's fixed :)

> This mainly gave the multiplication a huge performance boost.
>
> When converting the Matrix to a struct from class the multiplication even lowered from ~4.3 seconds to about 3.6 seconds. However, I am currently not sure if I want matrices to be structs (value types).

Make them value types. If you're using dynamic arrays for storage, you're already using GC plenty, no need for additional allocations (and see below for copy and move semantics). Don't forget a postblit though.
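A minimal sketch of what that postblit could look like (hypothetical struct; only the array field is shown):

```d
struct Matrix(T = double) {
    private T[] data;

    this(size_t n) {
        data = new T[](n);
        data[] = 0;
    }

    // Postblit: runs right after the built-in field-wise copy,
    // so dup the storage to get real value semantics instead of
    // two matrices silently sharing one array.
    this(this) {
        data = data.dup;
    }
}
```

After `auto b = a;`, writing through b no longer affects a.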

> (In the C++ matrix codes this scenario would actually call move assignment operator for matrix m3 which is much much faster than copying.)

D performs return value optimization: it moves the result whenever it can.

> But I haven't figured out yet how to use move semantics in D with class objects. Or is that just possible with struct value types?

Yup, it "just works". In most cases, anyway.
But move semantics in D differ from C++: D doesn't have rvalue references, and thus the compiler only performs a move when it can prove that value will no longer be used (there's a lengthy description in TDPL but I'm too lazy now to fetch it for exact citation :) ).
For explicit moves, there's std.algorithm.move(), though AFAIK it's underimplemented at the moment.

> I have also tried the LDC compiler. However, it gave me some strange bugs. E.g. it didn't like the following line:
>
> ref Matrix transpose() const {
> 	return new Matrix(this).transposeAssign();
> }
>
> And it forced me to change it to:
>
> ref Matrix transpose() const {
> 	auto m = new Matrix(this);
> 	m.transposeAssign();
> 	return m;
> }
>
> Which is kind of ugly ...
> I hope that this is getting fixed soon, as I imply it as correct D code because the DMD is capable to compile this correctly.

Yes, this should be allowed. But you could also just write (new Matrix(this)).transposeAssign() :)
Also, there's no point in a ref return when you're returning a reference to a class instance.

> T opIndex(in size_t row, in size_t col) const nothrow
> in {
>     assert(row < nRows);
>     assert(col < nCols);
> } body {
>     return data[dim.offset(row, col)];
> }
>
> The in and body statements are cool as far as I realize what they are for. ...so I think that the compiler won't optimize things in an 'in' code block - is that right?

in/out contracts are debug creatures anyway, they're not present in -release builds.
February 16, 2014
Robin:

> Thanks again for all your helpful comments and thanks in advance - I am eagerly looking forward for your future comments! =)

Many of the things you say and write are still wrong or confused. Usually hard optimization of code is one of the last things you learn about a language, because to do it you need a very clear idea of what the language is doing in every spot (this is true for most languages, like C, C++, Haskell, Python, etc.). And you are not yet at that point.

Tomorrow I will answer some things in your post, but in the meantime, if you want, you can show your whole D program, and tomorrow I'll try to make it faster for LDC2 and explain it a bit.

Bye,
bearophile
February 16, 2014
In the meantime also take a look at the matrix mul code I've written here:

http://rosettacode.org/wiki/Matrix_multiplication#D

In the first entry you see no whole transposition up front.

Storing the whole matrix in a 1D array is probably more efficient.

Bye,
bearophile
February 16, 2014
On 2/15/2014 6:06 PM, Robin wrote:
>
> Matrix is still a class but I changed it to a final class preventing
> matrix methods to be virtual. Dimension is now a final struct (don't
> know if 'final' is affecting structs in any way tough ...). This mainly
> gave the multiplication a huge performance boost.
>

Like in other languages, "final" means "cannot inherit from this". Since structs, by definition, don't support inheritance anyway, "final" isn't really applicable - they're inherently "final" no matter what.

> When converting the Matrix to a struct from class the multiplication
> even lowered from ~4.3 seconds to about 3.6 seconds. However, I am
> currently not sure if I want matrices to be structs (value types).
>

They really should be structs. There's basically no benefit to making them classes. Classes are for when you need inheritance and don't need performance. Java, from what I understand, does a reasonable job compensating for the inherent inefficiency of classes by going to great lengths to have absolute top-of-the-line class-oriented optimizations and GC and strip out all the class underpinnings when possible (AIUI). But D is much more like C# with "class vs struct" being an explicit thing.

Keep in mind too, the data inside dynamic arrays is already by reference. Arrays in D are basically like this:

struct Array(T)
{
    size_t length;
    T* ptr;
}

So even when your matrix is a struct, the actual values inside the matrix are still by reference (because they're behind that pointer). So you're still not actually copying the data inside the matrix as you pass it around. You're only copying the dimension, length, and pointer. And those are copied via a straight memcpy, not "one field at a time" as I've heard C++ does (though I could be wrong; it's been forever since I did much C++).
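That reference behavior is easy to see in a few lines (a tiny sketch):

```d
// Copying a dynamic array copies only the (length, ptr) pair.
auto a = [1, 2, 3];
auto b = a;         // b.ptr == a.ptr: both slices view the same elements
b[0] = 99;
assert(a[0] == 99); // the write is visible through a as well
assert(a.ptr is b.ptr);
```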

And if you need to pass a struct by reference, there's always still "ref". Class really just isn't the right thing for a matrix, especially if you're paying attention to performance.


> I have also tried the LDC compiler. However, it gave me some strange
> bugs. E.g. it didn't like the following line:
>
> ref Matrix transpose() const {
>      return new Matrix(this).transposeAssign();
> }
>

Doing that without parens around the "new Matrix(this)" is a very new feature, so I'm not entirely surprised LDC doesn't have it yet. Like Stanislav said, you can just add parens for now.


February 16, 2014
On Sunday, 16 February 2014 at 01:25:13 UTC, bearophile wrote:
>
> Many of the things you say and write are still wrong or confused. Usually the hard optimization of code is one the last things you learn about a language
Well, actually, sometimes squeezing as much performance as you can out of a test case can be a way to find out whether a given language checks all the boxes and can be used to solve your problems.
I must admit I'm shocked by the poor performance of D here. But I also know one HAS to try LDC or GDC, or those numbers are really meaningless.
February 16, 2014
francesco cattoglio:

> Well, actually, sometimes squeezing as much performance as you can from a test case can be a way to find out if a given language checks all the boxes and can be used to solve your problems.

Unless you know a language well, you will fail at squeezing that out. This is true even in Python.


> I must admit I'm shocked by the poor performance of D here. But I also know one HAS to try LDC or GDC, or those numbers are really meaningless.

Be instead amazed at the sometimes near-C++ performance levels they have pushed Java to :-)
And I think D is doing well in a benchmark like this, I guess with ldc2 you can go about as fast as C++.

Bye,
bearophile