February 18, 2014
Stanislav Blinov:

>>> LDC yields roughly the same times.
>>
>> This is surprising.
>
> To me as well. I haven't yet tried to dig deep though.

I have compiled your code with (a single module, 32 bit Windows):

dmd -wi -vcolumns -O -release -inline -noboundscheck matrix3.d
ldmd2 -wi -O -release -inline -noboundscheck matrix3.d

DMD:

multiplicationTest ...
        Time required: 4 secs, 665 ms, 452 μs, and 2 hnsecs

LDC2:

multiplicationTest ...
        Time required: 2 secs, 522 ms, and 747 μs

Bye,
bearophile
February 18, 2014
Stanislav Blinov:

> allocationTest ...
>         Time required: 1 sec, 112 ms, 827 μs, and 3 hnsecs
> multiplicationTest ...
>         Time required: 1 sec, 234 ms, 417 μs, and 8 hnsecs

Physics teaches us that those experimental measurements are expressed with excessive precision. For such benchmarks it is wiser to write "1.11" seconds.

Bye,
bearophile
February 18, 2014
On Tuesday, 18 February 2014 at 08:50:23 UTC, bearophile wrote:
> Stanislav Blinov:
>
>>>> LDC yields roughly the same times.
>>>
>>> This is surprising.
>>
>> To me as well. I haven't yet tried to dig deep though.
>
> I have compiled your code with (a single module, 32 bit Windows):
>
> dmd -wi -vcolumns -O -release -inline -noboundscheck matrix3.d
> ldmd2 -wi -O -release -inline -noboundscheck matrix3.d
>
> DMD:
>
> multiplicationTest ...
>         Time required: 4 secs, 665 ms, 452 μs, and 2 hnsecs
>
> LDC2:
>
> multiplicationTest ...
>         Time required: 2 secs, 522 ms, and 747 μs

Interesting. For me (on 64 bit) multiplicationTest is in the same league for DMD and LDC. (I'm still using dmd 2.064.2 as my main dmd install btw).

But what is up with those first and last tests in LDC?..

$ rdmd -wi -O -release -inline -noboundscheck main.d
allocationTest ...
        Time required: 1 sec, 165 ms, 766 μs, and 1 hnsec
multiplicationTest ...
        Time required: 1 sec, 232 ms, 287 μs, and 4 hnsecs
toStringTest ...
        Time required: 997 ms, 217 μs, and 1 hnsec
transposeTest ...
        Time required: 859 ms, 152 μs, and 5 hnsecs
scalarMultiplicationTest ...
        Time required: 105 ms, 366 μs, and 5 hnsecs
matrixAddSubTest ...
        Time required: 241 ms, 705 μs, and 8 hnsecs
matrixEqualsTest ...
        Time required: 243 ms, 534 μs, and 8 hnsecs
identityMatrixTest ...
        Time required: 260 ms, 411 μs, and 4 hnsecs

$ ldmd2 -wi -O -release -inline -noboundscheck main.d neolab/core/dimension.d neolab/core/matrix.d -ofmain-ldc && main-ldc

allocationTest ...
        Time required: 2 hnsecs (o_O)
multiplicationTest ...
        Time required: 1 sec, 138 ms, 628 μs, and 2 hnsecs
toStringTest ...
        Time required: 704 ms, 937 μs, and 5 hnsecs
transposeTest ...
        Time required: 859 ms, 413 μs, and 5 hnsecs
scalarMultiplicationTest ...
        Time required: 103 ms, 250 μs, and 2 hnsecs
matrixAddSubTest ...
        Time required: 226 ms, 955 μs, and 3 hnsecs
matrixEqualsTest ...
        Time required: 470 ms and 186 μs
identityMatrixTest ...
        Time required: 4 hnsecs (o_O)
February 18, 2014
...And if I define opEquals as it was made by Robin, i.e. like this:


bool opEquals(ref const Matrix other) const pure nothrow {
    version (all) {
        if (dim != other.dim) return false;
        foreach(immutable i, const ref e; data)
            if (e != other.data[i])
                return false;
        return true;
    } else {
        return (dim == other.dim) && (data[] == other.data[]);
    }
}


then the relevant benchmark on LDC flattens to near-zero time too x.x
February 18, 2014
On Tuesday, 18 February 2014 at 09:05:33 UTC, Stanislav Blinov
wrote:
>
> allocationTest ...
>         Time required: 2 hnsecs (o_O)
> identityMatrixTest ...
>         Time required: 4 hnsecs (o_O)

LDC is probably detecting that you're never actually using the
results of the operation and that none of the steps have
side-effects, so it's optimizing the call out. One way of
avoiding this is to do something like writeln the result of an
operation.
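A hedged sketch of that pitfall (the `makeIdentity` function is a hypothetical stand-in, not code from this thread): if the benchmarked result is never consumed, an optimizing backend is free to delete the whole computation, so folding each result into a checksum that gets printed keeps the work observable.

```d
import std.stdio : writeln;
import std.datetime.stopwatch : StopWatch, AutoStart;

// Hypothetical stand-in for the benchmarked operation.
double[] makeIdentity(size_t n)
{
    auto data = new double[](n * n);
    data[] = 0.0; // new double[] initializes to NaN, so zero it first
    foreach (i; 0 .. n)
        data[i * n + i] = 1.0;
    return data;
}

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    double sink = 0; // consuming each result prevents dead-code elimination
    foreach (_; 0 .. 100)
        sink += makeIdentity(500)[0];
    sw.stop();
    // Printing the checksum makes the results observable side effects.
    writeln("checksum: ", sink, ", time: ", sw.peek.total!"msecs", " ms");
}
```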
February 18, 2014
Hiho,

I am happy to see that I could encourage so many people to discuss this topic, not only giving me interesting answers to my questions but also analysing and evaluating features and performance of D.

I have fixed the allocation performance problem via a custom destructor which manually notifies the GC that its data is garbage, so that the GC can free it in the next cycle and no longer has to determine whether it is still in use. This sped up the allocation tests enormously and also slightly improved overall runtime performance, because memory is now freed contiguously.
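A minimal sketch of that approach, assuming the matrix owns a single GC-allocated array named `data` (the names are my guess, not Robin's actual code). Freeing GC memory from a destructor is only safe when no other reference to the array can outlive the struct, which is why copying is disabled here:

```d
import core.memory : GC;

struct Matrix
{
    double[] data;

    @disable this(this); // a postblit copy would lead to a double free

    this(size_t rows, size_t cols)
    {
        data = new double[](rows * cols);
    }

    // Tell the GC the payload is garbage now, instead of making it
    // prove that during the next collection cycle.
    ~this()
    {
        if (data.ptr !is null)
            GC.free(data.ptr);
        data = null;
    }
}
```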

@Nick Sabalausky:
Why should I remove .dup from the copy constructor? In my opinion this is important to keep both matrices (source and copy) independent from each other. The suggested COW feature for matrices sounds interesting but also weird. I have to read more about that.
Move semantics in D are needed and do exist; however, I think the D compiler isn't as good as C++ compilers at determining what is movable and what is not.

Another thing, which is hopefully a bug in the current language implementation of D, is the strange behaviour of the transpose method, whose only line of code is:

return Matrix(this).transposeAssign();

In C++, for example, this compiles without any problems and produces correct transposed matrix copies. This works thanks to the existence of move semantics in C++, one of the coolest features since C++11, which has enormously simplified code for value types such as D's structs.

I have run several tests in order to see when the move assign or the move constructor is called in D, and whenever I expected them to be called, the copy constructor or the postblit was called instead, which was frustrating imo.
Perhaps I still haven't quite understood what happens behind the scenes in D, but it can't be the solution to always copy the whole data whenever I could instead move the whole data on the heap.

Besides that, one suggestion which came up was that I could merge the Dimension module into the Matrix module, as they are semantically working together. However, I find it better to have many smaller code units instead of fewer bigger ones, and that's why I keep them both separated.

I also gave scoped imports a try and hoped that they would reduce my executable size and perhaps increase the performance of my program, neither of which was true -> confused. Instead I now have more lines of code and don't instantly see what dependencies the module itself has. So what is the point of scoped imports?

The mixin keyword is also nice to have, but it feels similar to a dirty C macro to me, where text replacement with lower abstraction (unlike templates) takes place. Of course I am wrong and you will teach me why, but atm I have strange feelings about implementing code with mixins. In this special case: perhaps it isn't a wise decision to merge addition with subtraction, and perhaps I can find faster ways to do it which involve more differences between the two operations, which would require splitting both methods up again. (theoretical, but it could be)

Another weird thing is that result ~= text(tabStr, this[r, c]) in the toString method is much slower than the following two lines of code:

result ~= tabStr;
result ~= to!string(this[r, c]);

Does anybody have an answer to this?

I am now thinking that I know the most important things about how to write ordinarily (not super) efficient D code - laugh at me if you want :D - and will continue extending this benchmarking library, and whenever I feel bad about a certain algorithm's performance I will knock on this thread again, you know. :P

At the end of my post I just want to summarize the benchmark history for the matrix multiplication, as I think it is funny: (all tests for two 1000x1000 matrices!)

- The bloody first implementation of the matrix code (which worked) required about 39 seconds to finish.
- Then I finally found the optimization flags for DMD, and the multiplication performance roughly doubled, to about 14 seconds.
- I created an account here and, thanks to your massive help, the matrix multiplication shortly after required only about 5 seconds, due to better const, pure and nothrow usage.
- Through the shift from class to struct, some optimizations in memory layout, further use of const, pure and nothrow, as well as several array features and foreach loops, the performance rose once again, to about 3.7 seconds.
- The last notable optimization was the implementation based on pointer arithmetic, which improved the performance from 3.7 seconds to roughly 2 seconds.

Given this progression I see a real chance that we could beat the 1.5 seconds of Java or even the 1.3 seconds of C++! (based on my machine: 64-bit Arch Linux, dual-core 2.2 GHz, 4 GB RAM)

There are still many ways to further improve the performance, for example by using LDC on certain hardware, parallelism, and perhaps by implementing COW with no GC dependencies. And of course I may be missing many other possible optimization features of D.

For my part I can say that I have learned a lot, and that's the most important thing above everything else for me here.

Thank you all for this very interesting conversation! You - the community - are part of what makes D a great language. =)

Robin
February 19, 2014
Robin:

> the existance of move semantics in C++ and is one of the coolest features since C++11 which increased and simplified codes in many cases enormously for value types just as structs in D.

I guess Andrei doesn't agree with you (and move semantics in C++11 is quite hard to understand).


> I also gave scoped imports a try and hoped that they were able to reduce my executable file and perhaps increase the performance of my program, none of which was true -> confused. Instead I now have more lines of code and do not see instantly what dependencies the module as itself has. So what is the point in scoped imports?

Scoped imports in general can't increase performance. Their main point is to avoid importing modules that are needed only by templated code. So if you don't instantiate the template, the linker has less work to do and the binary is usually smaller (no ModuleInfo, etc).


> Another weird thing is that the result ~= text(tabStr, this[r, c]) in the toString method is much slower than the two following lines of code:
>
> result ~= tabStr;
> result ~= to!string(this[r, c]);
>
> Does anybody have an answer to this?

It doesn't look that weird. In the first case you are allocating and creating larger strings. But I don't think matrix printing is a bottleneck in a program.


> - Then I have finally found out the optimizing commands for the DMD

This is a small but common problem. Perhaps worth fixing.


> There are still many ways to further improve the performance. For examply by using LDC

Latest stable and unstable versions of LDC2, try it:
https://github.com/ldc-developers/ldc/releases/tag/v0.12.1
https://github.com/ldc-developers/ldc/releases/tag/v0.13.0-alpha1


> on certain hardwares, paralellism and perhaps by implementing COW with no GC dependencies. And of course I may miss many other possible optimization features of D.

Matrix multiplication can be improved a lot by tiling the matrix (or better, by using a cache-oblivious algorithm), using SSE/AVX2, using multiple cores, etc. As a starting point you can try std.parallelism. It could speed up your code on 4 cores with a very limited amount of added code.
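As a sketch of the std.parallelism suggestion (the function name and flat row-major layout here are hypothetical, not the code from this thread), only the outer row loop of a naive multiply needs to change, since the rows of the result are independent:

```d
import std.parallelism : parallel;
import std.range : iota;

// Naive row-major n x n multiply; rows are distributed over the task pool.
double[] multiply(const double[] a, const double[] b, size_t n)
{
    auto c = new double[](n * n);
    foreach (r; parallel(iota(n))) // each worker handles a chunk of rows
    {
        foreach (col; 0 .. n)
        {
            double sum = 0;
            foreach (k; 0 .. n)
                sum += a[r * n + k] * b[k * n + col];
            c[r * n + col] = sum;
        }
    }
    return c;
}
```

No synchronization is needed because each row of c is written by exactly one task; for small n the task overhead can outweigh the gain.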

Bye,
bearophile
February 19, 2014
On Tuesday, 18 February 2014 at 23:36:12 UTC, Robin wrote:

> Another thing which is hopefully a bug in the current language implementation of D is the strange behaviour of the transpose method with the only line of code:
>
> return Matrix(this).transposeAssign();

Matrix(this) not compiling when 'this' is const is a bug. That's why I had to replace it with assignment. Postblit still has some unresolved problems.

However, you should note that you have a bigger bug in that transposeAssign() returns a reference :) So unless transpose() returns Matrix (instead of ref Matrix), that's a problem.

> In C++ for example this compiles without any problems and results in correct transposed matrix copies - this works due to the existance of move semantics in C++ and is one of the coolest features since C++11 which increased and simplified codes in many cases enormously for value types just as structs in D.

The fact that it compiles in C++ is a problem (this illustrates your initial implementation of transpose() translated into C++):

struct Matrix
{
    // ref Matrix transpose() const;
    Matrix& transpose() const { return Matrix(*this).transposeAssign(); }

    // ref Matrix transposeAssign();
    Matrix& transposeAssign() { return *this; }
};

int main(int argc, char** argv)
{
    Matrix m1;
    Matrix& m2 = m1.transpose(); // ooops!
}

> I have ran several tests in order to test when the move assign or the move constructors are called in D and whenever I expected them to be called the copy-constructor or the postblit was called instead which was frustating imo.

I've posted a complete illustration on how D moves rvalues, browse this thread back a bit.

> I also gave scoped imports a try and hoped that they were able to reduce my executable file and perhaps increase the performance of my program, none of which was true -> confused. Instead I now have more lines of code and do not see instantly what dependencies the module as itself has. So what is the point in scoped imports?

They don't pollute outer scope with symbols. If you just need to call a couple of functions from std.random inside *one* function, there's no need to pull names from std.random into the whole module.
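A tiny sketch of what that means in practice (the function is made up for illustration): the module-level import is visible everywhere, while the scoped one is confined to a single function body.

```d
import std.stdio : writeln; // module-level: visible to the whole module

int rollDie()
{
    // Scoped import: uniform exists only inside this function, so
    // std.random's names don't leak into the module namespace.
    import std.random : uniform;
    return uniform(1, 7); // random integer in [1, 6]
}

void main()
{
    writeln("rolled a ", rollDie());
}
```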

> The mixin keyword is also nice to have but it feels similar to a dirty C-macro to me where text replacement with lower abstraction (unlike templates) takes place. Of course I am wrong

Yes you are :) It's not dirty and it's certainly not a macro. It's a great feature for generating code at compile time. For examples just browse through Phobos (e.g. std.functional), or take a look at the code in this thread:

http://forum.dlang.org/thread/mailman.158.1391156715.13884.digitalmars-d@puremagic.com

Maybe it's your syntax highlighting that throws you off? Then use q{} instead of "" :)
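A small hedged example of what mixins buy over C macros (this `Vec` type is made up for illustration, not Phobos code): the generated string is ordinary D code that the compiler parses and type-checks in place, so merging + and - into one operator template stays fully checked.

```d
struct Vec
{
    double x, y;

    // One template generates both += and -=; each mixin string is
    // compiled and type-checked exactly like handwritten code.
    ref Vec opOpAssign(string op)(const Vec other)
        if (op == "+" || op == "-")
    {
        mixin("x " ~ op ~ "= other.x;");
        mixin("y " ~ op ~ "= other.y;");
        return this;
    }
}
```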

> Another weird thing is that the result ~= text(tabStr, this[r, c]) in the toString method is much slower than the two following lines of code:
>
> result ~= tabStr;
> result ~= to!string(this[r, c]);
>
> Does anybody have an answer to this?

Probably due to the extra allocation in text(tabStr, this[r, c]), since it performs ~ under the hood, while the appender already has storage.
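A hedged sketch of that difference (simplified, not the actual toString from this thread): text() first builds a temporary string via concatenation, which is then copied into the appender, while appending the pieces separately writes straight into the appender's existing buffer.

```d
import std.array : appender;
import std.conv : text, to;

string viaText(string tab, double value)
{
    auto result = appender!string();
    result ~= text(tab, value); // builds a temporary string first, then copies it
    return result[];
}

string viaTwoAppends(string tab, double value)
{
    auto result = appender!string();
    result ~= tab;              // written directly into appender storage
    result ~= to!string(value); // only the number-to-string conversion allocates
    return result[];
}
```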
February 20, 2014
Hiho,

here are the results of both compilers (DMD and LDC2) on my system:

LDC:
=========================================
allocationTest ...
	Time required: 5 secs, 424 msecs
multiplicationTest ...
	Time required: 1 secs, 744 msecs
toStringTest ...
	Time required: 0 secs, 974 msecs
transposeTest ...
	Time required: 0 secs, 998 msecs
scalarMultiplicationTest ...
	Time required: 0 secs, 395 msecs
matrixAddSubTest ...
	Time required: 0 secs, 794 msecs
matrixEqualsTest ...
	Time required: 0 secs, 0 msecs
identityMatrixTest ...
	Time required: 0 secs, 393 msecs
=========================================

DMD:
=========================================
allocationTest ...
	Time required: 3 secs, 161 msecs
multiplicationTest ...
	Time required: 2 secs, 6 msecs
toStringTest ...
	Time required: 1 secs, 365 msecs
transposeTest ...
	Time required: 1 secs, 45 msecs
scalarMultiplicationTest ...
	Time required: 0 secs, 405 msecs
matrixAddSubTest ...
	Time required: 0 secs, 809 msecs
matrixEqualsTest ...
	Time required: 0 secs, 430 msecs
identityMatrixTest ...
	Time required: 0 secs, 350 msecs
=========================================

All in all I must say that I am more pleased with the DMD results as I am kind of worried about the LDC allocation test performance ...

I also had to rewrite parts of the code, as some functions just weren't "pure" or "nothrow", such as everything around this.data[] += other.data[].

Robin
February 20, 2014
Robin:

> All in all I must say that I am more pleased with the DMD results as I am kind of worried about the LDC allocation test performance ...

I am not sure of the causes of this, but the D GC was improved a little in the last two compiler versions.


> I also had to rewrite parts of the codes as some functions just weren't "pure" or "nothrow" such as the whole things around this.data[] += other.data[].

That's because LDC2 is one version of the language behind. It's a temporary problem.

Your multiplicationTest timings with LDC2 don't seem to have reached the performance of your C++ code yet. You can take a look at the asm output of both. Perhaps in the D code there is some allocation you are not doing in the C++ code.

Bye,
bearophile