February 16, 2014
On Sunday, 16 February 2014 at 09:53:08 UTC, bearophile wrote:
> Be instead amazed of the sometimes near-C++ performance levels they have pushed Java to :-)
Sorry, but I fail to be amazed by what huge piles of money thrown at a project for 10+ years can achieve. Those kinds of achievements are boring :P
> And I think D is doing well in a benchmark like this, I guess with ldc2 you can go about as fast as C++.
I'm sure that other "small benchmarks" on computationally intensive code have achieved C++ speed in the past, so if we can't get C++-like speed here, it's either a problem with the code or a serious performance regression.

February 16, 2014
Hiho,

thanks again for your answers!

I don't know where to start ...

@Stanislav Blinov:
I am not currently aware of "move semantics" in D or what it is called there. However, I am not quite sure whether D does an equally good job as C++ in determining if a value can be moved or not. I have made some simple tests which showed me that the swap-assignment and the move-constructor are never called in D in code where C++ would certainly use them; instead, D just copies values which were in fact rvalues ... (Maybe I just don't get the whole thing behind the scenes - and I am really hoping that's the case.)

My statement that the "in" and "body" blocks of the assignment operator slow down the whole code isn't true anymore. I don't know what fixed it though.

@bearophile:
Yeah, you are completely right - I am super confused at the moment. I always thought that D would be much easier to learn than C++, as people always say that C++ is the most complex language. From what I have learned so far, D seems to be much more complex, which isn't bad in general, but the learning curve doesn't feel so good atm, as I am mainly confused by many things instead of understanding what happens behind the scenes. (e.g. move semantics, the whole army of modifiers, immutability which results in zero performance boost in non-parallel environments; the fact that classes are by-ref and structs are by-val is also confusing from time to time, to be honest.)

When I started to write the C++ matrix library (after I had written the Java matrix library, to get to know what both languages can achieve), I also had to start learning C++. I bought a book, and after 2-3 weeks I got the concepts; everything went as expected and nothing seemed to be hidden behind the scenes from me, the programmer, which was in fact a nice experience when trying to optimize the code until it was equally fast as the Java code or even faster. D seems to be more problematic when it comes to learning how to optimize the code.

To summarize: I am still a beginner, and I am learning this language because it has cool, new and advanced features, it has not so many legacy issues (like C++), it unites object-oriented, functional, imperative and parallel programming paradigms (still amazed by that fact!), compiles to native code and isn't bound to any operating system or a huge runtime environment (Java/C#). That's why I am learning D! =)

I would love to see how much performance a D veteran like you is able to squeeze out of my code. Just tell me where I should upload it and I will do so. Don't expect too much, as I am still a beginner and this project has just started. I didn't want to implement the whole functionality before I know how to do it with good results. ;-)

The matrix multiplication you have posted has some cool features in it. I especially like the pure one with all that functional stuff. :D

@Nick Sabalausky:

Don't be shocked at the bad performance of a beginner D programmer. At least I am not shocked yet that my D implementation isn't performing as well as the highly optimized C++11 implementation or the JIT-optimized Java implementation. In general one cannot compare the results of native compilation optimization, which has to run on different hardware, with JIT optimization, which is able to pull out every single optimization trick because it knows the exact system and all its components it compiles for.

There are also LDC2 and GDC, which should improve the general performance quite a bit.

However, what is true is that it is (at least for me) harder to learn how to write efficient code in D than it was in C++. And keep in mind that my Java implementation (which runs nearly as fast as the C++ implementation) is written purely for performance. I had to do dirty things (against the general Java mentality) in order to achieve this performance. Besides that, I couldn't implement a generic solution in Java that wasn't five to ten times slower than without generics, as generics in Java are a runtime feature. So all in all, Java is only good in performance if you don't use it as Java (purely object-oriented etc.) ... but it works! :D

@All of you:
What I have done since the last time:
I have changed my mind - matrix is now a value type. However, I wasn't sure about this step because I planned to implement dense and sparse matrices which could inherit from a basic matrix implementation, and thus would require classes.

I have also removed that nasty "final struct" statement.^^

However, benchmarks stayed the same since the last post.

Hopefully I haven't forgotten to say anything important ...

Robin
February 17, 2014
On Sunday, 16 February 2014 at 23:47:08 UTC, Robin wrote:

> I am not currently aware of "move semantics" in D or what it is called there. However, I am not quite sure whether D does an equally good job as C++ in determining if a value can be moved or not. I have made some simple tests which showed me that the swap-assignment and the move-constructor are never called in D in code where C++ would certainly use them; instead, D just copies values which were in fact rvalues ... (Maybe I just don't get the whole thing behind the scenes - and I am really hoping that's the case.)

What swap assignment, what move constructor? :)

Here are the rules for moving rvalues in D:

1) All anonymous rvalues are moved;
2) Named temporaries that are returned from functions are moved;
3) That's it :) No other guarantees are offered (i.e. compilers may or may not try to be smart in other scenarios).

Sometimes when you want an explicit move you can use std.algorithm.move function. But keep in mind that its current implementation is incomplete (i.e. it doesn't handle immutability properly).

This code illustrates it all:

http://dpaste.dzfl.pl/c050c9fccbb2
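In case the paste goes away, here is a minimal self-contained sketch of the same rules (the struct and values are mine, not from the paste):

```d
import std.stdio;
import std.algorithm : move;

struct S {
    int[] payload;
    this(this) { writeln("postblit (copy)"); }  // runs only for copies, never for moves
}

S make() {
    S s;
    s.payload = [1, 2, 3];
    return s;  // rule 2: a named temporary returned from a function is moved
}

void main() {
    S a = make();   // moved in: no postblit is printed
    S b = a;        // lvalue assignment: postblit runs, b is a copy
    S c = move(a);  // explicit move; a is reset to S.init
    assert(a.payload is null);
    assert(c.payload == [1, 2, 3]);
    assert(b.payload == [1, 2, 3]);
}
```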

> @All of you:
> What I have done since the last time:
> I have changed my mind - matrix is now a value type. However, I wasn't sure about this step because I planned to implement dense and sparse matrices which could inherit from a basic matrix implementation, and thus would require classes.

Templates, perhaps?
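For instance (an illustrative sketch only - the storage-policy names are made up, not from Robin's code), a compile-time storage parameter can replace the planned class hierarchy:

```d
// Hypothetical sketch: choose the storage strategy at compile time
// instead of inheriting from a base matrix class.
struct DenseStorage(T) {
    T[] data;                 // every entry stored contiguously
}

struct SparseStorage(T) {
    T[size_t] data;           // associative array: only non-zero entries stored
}

struct Matrix(T, alias Storage = DenseStorage) {
    Storage!T storage;
    size_t rows, cols;
}

void main() {
    Matrix!double dense;                    // dense by default
    Matrix!(double, SparseStorage) sparse;  // same value-type interface, different storage
    dense.storage.data = new double[4];
    sparse.storage.data[7] = 1.0;
    assert(dense.storage.data.length == 4);
    assert(sparse.storage.data.length == 1);
}
```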
February 17, 2014
Robin:

> I always thought that D would be much easier to learn than C++, as people always say that C++ is the most complex language. From what I have learned so far, D seems to be much more complex, which isn't bad in general, but the learning curve doesn't feel so good atm, as I am mainly confused by many things instead of understanding what happens behind the scenes.

D is not a small language; it contains many features, but compared to C++ it has fewer corner cases and fewer traps, and by default it's often safer than C++. In C++ you need to take care of many details if you want to write correct code. Writing D code should be faster and safer than writing C++.


> immutability which results in zero performance boost in non-parallel environments,

A good D compiler like ldc2 is sometimes able to better optimize code that uses immutable/const. But immutability is also for the programmer: it makes the code simpler to reason about, and safer.


> the fact that classes are by-ref and structs are by-val is also confusing from time to time, to be honest.)
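A short self-contained sketch of that difference (illustrative only):

```d
class C { int v; }   // classes are reference types
struct S { int v; }  // structs are value types

void main() {
    auto c1 = new C;
    auto c2 = c1;     // c2 aliases the same object as c1
    c2.v = 2;
    assert(c1.v == 2);

    S s1;
    auto s2 = s1;     // s2 is an independent copy
    s2.v = 2;
    assert(s1.v == 0);
}
```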


> D seems to be more problematic when it comes to learning how to
> optimize the code.

Perhaps because D also focuses a little more on correctness of the code. You see this even more in the Ada language.


> I would love to see how much performance a D veteran like you is able to squeeze out of my code. Just tell me where I should upload it and I will do so.

I guess your code is composed of just one or two small D modules, so you can post it here:
http://dpaste.dzfl.pl/
You can also upload the Java code there, to help me compare and to know what you were trying to write :-)


> However, what is true is that it is (at least for me) harder to learn how to write efficient code in D than it was in C++.

But I think that on average it's a little harder to write correct code in C++ than in D.

Bye,
bearophile
February 17, 2014
Hiho,

I currently haven't got enough time to respond to everything that has been said since my last post.
However, here is the simple code:
http://dpaste.dzfl.pl/3df8694eb47d

Thanks in advance!
I am really looking forward to your editing! =)

Robin
February 17, 2014
On Monday, 17 February 2014 at 09:48:49 UTC, Robin wrote:
> Hiho,
>
> I currently haven't got enough time to respond to everything that has been said since my last post.
> However, here is the simple code:
> http://dpaste.dzfl.pl/3df8694eb47d
>
> Thanks in advance!
> I am really looking forward to your editing! =)
>
> Robin

A few quick points:

1) foreach loops are nice, use them

2) D has neat array ops that can simplify your code for == and -= etc., e.g.

	/**
	 * Subtracts all entries of this matrix by the given other matrix.
	 */
	ref Matrix opSubAssign(ref const(Matrix) other) nothrow
	in
	{
		assert(this.dim == other.dim);
	}
	body
	{
		this.data[] -= other.data[];
		return this;
	}


3) Although your generic indexing is nice, a lot of the operations you have implemented have access patterns that allow faster indexing without a multiplication, e.g.

	/**
	 * Creates a new identity matrix with the given numbers of rows and columns.
	 */
	static Matrix identity(size_t rows, size_t cols) nothrow {
		auto m = Matrix(rows, cols);
		size_t index = 0;
		foreach (i; 0 .. m.dim.min) {
			m.data[index] = 1;
			index += cols + 1;
		}
		return m;
	}
February 17, 2014
Robin:

> http://dpaste.dzfl.pl/3df8694eb47d

I think it's better to improve your code iteratively, starting with small, simple things:


>    this(size_t rows, size_t cols) nothrow pure {

=>

    this(in size_t rows, in size_t cols) nothrow pure {


>    this(ref const(Dimension) that) nothrow pure {

=>

    this(const ref Dimension that) nothrow pure {


>    ref Dimension opAssign(ref Dimension rhs) nothrow pure {

Is this method necessary?


>    bool opEquals(ref const(Dimension) other) nothrow const {

Not necessary.


>    size_t min() @property nothrow const {
>        if (this.rows <= this.cols) return this.rows;
>        else                        return this.cols;
>    }

=> (I have changed its name to reduce confusion with std.algorithm.min):

    @property size_t minSide() const pure nothrow {
        return (rows <= cols) ? rows : cols;
    }


>    size_t size() @property nothrow const {

=>

    @property size_t size() const pure nothrow {


>    size_t offset(size_t row, size_t col) nothrow const {

=>

    size_t offset(in size_t row, in size_t col) const pure nothrow {



>    this(in size_t rows, in size_t cols, bool initialize = true) nothrow pure {
>        this.dim = Dimension(rows, cols);
>        this.data = minimallyInitializedArray!(typeof(this.data))(rows * cols);
>        if (initialize) this.data[] = to!T(0);
>    }

=>

    this(in size_t rows, in size_t cols, in bool initialize = true) nothrow pure {
    ...
    if (initialize)
        this.data[] = 0;


>    /**
>     *
>     */

Sometimes I prefer to use just one line of code to reduce wasted vertical space:

/// Move constructor. From rvalue.


>    this(ref const(Matrix) that) {

=>

    this(const ref Matrix that) pure {


>        this.data = that.data.dup;

Currently array.dup is not nothrow. This will eventually be fixed.


>    ref Matrix transpose() const {
>        return Matrix(this).transposeAssign();

=>

    ref Matrix transpose() const pure nothrow {
        return Matrix(this).transposeAssign;


>    ref Matrix transposeAssign() nothrow {
>        size_t r, c;
>        for (r = 0; r < this.dim.rows; ++r) {
>            for (c = 0; c < r; ++c) {
>                std.algorithm.swap(
>                    this.data[this.dim.offset(r, c)],
>                    this.data[this.dim.offset(c, r)]);
>            }
>        }
>        return this;
>    }

Use foreach loops with immutable loop variables.
And even when you use for loops, always define the loop variable inside the for: "for (size_t r = 0; ...".
Also, in general I suggest you write a little less code.

A first try (not tested), but probably this can be done with far fewer multiplications:

    ref Matrix transposeAssign() pure nothrow {
        foreach (immutable r; 0 .. dim.rows) {
            foreach (immutable c; 0 .. r) {
                swap(data[dim.offset(r, c)], data[dim.offset(c, r)]);
            }
        }
        return this;
    }


>    Matrix opMul(ref const(Matrix) other)

Never use old-style operator overloading (opMul, opAdd, ...) in new D code. Use only the new operator overloading style (opBinary and friends).
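For illustration, here is the new style on a toy type (not the matrix code): a single opBinary template replaces separate opMul/opAdd methods:

```d
struct Vec2 {
    double x, y;

    // New-style operator overloading: one template covers +, -, *, /.
    // The operator token is a compile-time string mixed into the expression.
    Vec2 opBinary(string op)(in Vec2 o) const pure nothrow {
        return mixin("Vec2(x " ~ op ~ " o.x, y " ~ op ~ " o.y)");
    }
}

void main() {
    immutable a = Vec2(1, 2), b = Vec2(3, 4);
    assert(a + b == Vec2(4, 6));
    assert(a * b == Vec2(3, 8));  // element-wise, just for the demo
}
```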


>        auto s = Matrix(other).transposeAssign();

=>

        auto s = Matrix(other).transposeAssign;


>        size_t r1, r2, c;

Ditto.


>        foreach (ref T element; this.data) {
>            element *= scalar;
>        }

Try to use:

        data[] *= scalar;


>        for (auto i = 0; i < this.dim.size; ++i) {
>            this.data[i] += other.data[i];
>        }

Try to use:

        data[] += other.data[];


>        for (auto i = 0; i < this.dim.size; ++i) {
>            this.data[i] -= other.data[i];
>        }

Try to use:

        data[] -= other.data[];


>    bool opEquals(ref const(Matrix) other) nothrow const {
>        if (this.dim != other.dim) {
>            return false;
>        }
>        for (auto i = 0; i < this.dim.size; ++i) {
>            if (this.data[i] != other.data[i]) return false;
>        }
>        return true;
>    }

Try (in general try to write less code):

    return dim == other.dim && data == other.data;



>        auto stringBuilder = std.array.appender!string;

=>

        Appender!string result;


>        stringBuilder.put("Matrix: "
>            ~ to!string(this.dim.rows) ~ " x " ~ to!string(this.dim.cols)

=>

        result ~= text("Matrix: ", dim.rows, " x ", dim.cols


>        auto m = Matrix(rows, cols);
>        for (auto i = 0; i < m.dim.min; ++i) {
>            m[i, i] = 1;
>        }

See the comment by John Colvin.


>        auto generator = Random(unpredictableSeed);

What if I want to fill the matrix deterministically, with a given seed? In general it's better not to add a random-matrix method to your matrix struct. Better to add an external free function that uses UFCS; it's more flexible and simpler to reason about.
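For example, a hypothetical free function along these lines (Mat here is just a stand-in for the real matrix type, and fillRandom is a made-up name):

```d
import std.random;

struct Mat { double[] data; }  // stand-in for the real Matrix!T

// Hypothetical free function: deterministic when a seed is given,
// unpredictable otherwise. Works on anything exposing a .data array.
M fillRandom(M)(M m, uint seed = unpredictableSeed) {
    auto rng = Random(seed);
    foreach (ref x; m.data)
        x = uniform(0.0, 1.0, rng);
    return m;
}

void main() {
    // UFCS call style: reads like a method, lives outside the struct.
    auto a = Mat(new double[4]).fillRandom(42);
    auto b = Mat(new double[4]).fillRandom(42);
    assert(a.data == b.data);  // same seed => same matrix
}
```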



>    writeln("\tTime required: " ~ to!string(secs) ~ " secs, " ~ to!string(msecs) ~ " msecs");

Use .text to stringify, but here you don't need to stringify at all:

=>

    writeln("\tTime required: ", secs, " secs, ", msecs, " msecs");


>    sw.reset();

=>
    sw.reset;


> void multiplicationTest() {

In D there is also unittest {}. A benchmark is not exactly a unit test, so do as you prefer.
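For reference, a minimal sketch of the built-in unit test blocks (they run before main when the program is compiled with -unittest):

```d
import std.algorithm : swap;

unittest {
    int a = 1, b = 2;
    swap(a, b);
    assert(a == 2 && b == 1);
}

void main() {}  // build & run tests with: dmd -unittest -run file.d
```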

Bye,
bearophile
February 17, 2014
Hiho,

thank you for your code improvements and suggestions.

I really like the foreach loop in D, as well as the slight (but measurable) performance boost over conventional for loops. =)

Another success of my changes is that I have managed to further improve the matrix multiplication performance from 3.6 seconds for two 1000x1000 matrices to 1.9 seconds, which is already very close to Java and C++ at about 1.3 - 1.5 seconds.

The key to victory was pointer arithmetic, as I noticed that I had used it in the C++ implementation, too. xD

The toString implementation has also improved slightly due to the changes you mentioned above: 1.37 secs -> 1.29 secs.

I have also adjusted all operator overloads to the "new style" - I just hadn't known about that "new style" until now - thanks!

I will just post the whole code again so that you can see what I have changed.

Keep in mind that I am still using DMD as the compiler, and thus performance may still improve once I use another compiler!

All in all I am very happy with the code analysis and its improvements!
However, there were some strange things of which I am very confused ...

void allocationTest() {
	writeln("allocationTest ...");
	sw.start();
	auto m1 = Matrix!double(10000, 10000);
	{ auto m2 = Matrix!double(10000, 10000); }
	{ auto m2 = Matrix!double(10000, 10000); }
	{ auto m2 = Matrix!double(10000, 10000); }
	//{ auto m2 = Matrix!double(10000, 10000); }
	sw.stop();
	printBenchmarks();
}

This is the most confusing code snippet. I just changed the allocation of all m1 and m2 from new Matrix!double (on the heap) to Matrix!double (on the stack) and the performance dropped significantly - the benchmarked time rose from 2.3 seconds to over 25 seconds!! Now look at the code above. When I leave it as it is now, the code requires about 2.9 seconds of runtime; however, when enabling the currently commented-out line, the code takes 14 to 25 seconds longer! mind blown ... 0.o This is extremely confusing, as I allocate these matrices on the stack, and since I allocated them within their own scoped blocks they should instantly release their memory again, so that no more than 2 matrices are held in memory at the same time. That just wasn't the case as far as I could tell.

Another strange thing was that this new opEquals implementation:

	bool opEquals(const ref Matrix other) const pure nothrow {
		if (this.dim != other.dim) {
			return false;
		}
		foreach (immutable i; 0 .. this.dim.size) {
			if (this.data[i] != other.data[i]) return false;
		}
		return true;
	}

is actually about 20% faster than the single line you suggested: "return this.dim == other.dim && this.data[] == other.data[];".

The last thing I haven't quite understood is that I tried to replace

auto t = Matrix(other).transposeAssign();

in the matrix multiplication algorithm with its shorter and clearer form

auto t = other.transpose(); // sorry for the nasty '()', but I like them! :/

This, however, gave me wonderful segmentation faults at runtime during matrix multiplication ...

And here is the complete and improved code:
http://dpaste.dzfl.pl/7f8610efa82b

Thanks in advance for helping me! =)

Robin
February 17, 2014
Robin:

> The key to victory was pointer arithmetic, as I noticed that I had used it in the C++ implementation, too. xD

Perhaps with LDC2 it's not necessary.


> I will just post the whole code again so that you can see what I have changed.

The code looks better.

There's no need to put Dimension in another module. In D modules
contain related stuff, unlike Java.

Also feel free to use some free-standing functions. With UFCS
they get used in the same way, and they help keep
classes/structs simpler.

Some of your imports could be moved to more local scopes, instead
of being all at module-level.


>                result ~= to!string(this[r, c]);

=>

                 result ~= this[r, c].text;


> writeln("\tTime required: " ~ to!string(secs) ~ " secs, " ~ to!string(msecs) ~ " msecs");

=>

writeln("\tTime required: ", secs, " secs, ", msecs, " msecs");


>    ref Matrix opOpAssign(string s)(in T scalar) pure nothrow if (s == "*") {

=>

     ref Matrix opOpAssign(string op)(in T scalar) pure nothrow if (op == "*") {


Also you have two functions with code like:

this.data[] op= scalar;

You can define a single template (untested):


     /**
      * Adds or subtracts all entries of this matrix by the given other matrix.
      */
     ref Matrix opOpAssign(string op)(const ref Matrix other) pure nothrow
     if (op == "+" || op == "-")
     in
     {
         assert(this.dim == other.dim);
     }
     body
     {
         mixin("this.data[] " ~ op ~ "= other.data[];");
         return this;
     }



>    Matrix opBinary(string s)(const ref Matrix other) const pure if (s == "*")

Given that on a 32 bit system a Matrix is just 16 bytes, it could
be better to not accept the argument by ref, and avoid one more
level of indirection:

     Matrix opBinary(string s)(in Matrix other) const pure if (s == "*")


> However, there were some strange things of which I am very confused ...
>
> void allocationTest() {
> 	writeln("allocationTest ...");
> 	sw.start();
> 	auto m1 = Matrix!double(10000, 10000);
> 	{ auto m2 = Matrix!double(10000, 10000); }
> 	{ auto m2 = Matrix!double(10000, 10000); }
> 	{ auto m2 = Matrix!double(10000, 10000); }
> 	//{ auto m2 = Matrix!double(10000, 10000); }
> 	sw.stop();
> 	printBenchmarks();
> }
>
> This is the most confusing code snippet. I just changed the allocation of all m1 and m2 from new Matrix!double (on the heap) to Matrix!double (on the stack)

The matrix data is always on the heap. What ends up on the stack is
a very limited amount of stuff.


> This is extremely confusing, as I allocate these matrices on the stack, and since I allocated them within their own scoped blocks they should instantly release their memory

You are mistaken. minimallyInitializedArray allocates memory on
the GC heap (and there's not enough space on the stack to
allocate 10000^2 doubles). In both D and Java the deallocation of
memory managed by the GC is not deterministic, so it's not
immediately released at scope exit. Also unlike the Java GC, the
D GC is less refined, and by design it is currently not precise.
With such large arrays often there are _inbound_ pointers that
keep the memory alive, especially on 32 bit systems. So perhaps
your problems are caused by the GC.

You can have deterministic memory management in your struct-based
matrices, but you have to allocate the memory manually (from the
GC or probably better from the C heap) and free it in the struct
destructor using RAII.
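A minimal sketch of such deterministic ownership (illustrative only, not a drop-in for the matrix code):

```d
import core.stdc.stdlib : malloc, free;

struct Buffer(T) {
    T* ptr;
    size_t length;

    this(size_t n) {
        ptr = cast(T*) malloc(n * T.sizeof);  // C heap: invisible to the GC
        length = n;
    }

    @disable this(this);  // forbid implicit copies: exactly one owner

    ~this() {             // deterministic: runs exactly at scope exit
        free(ptr);
        ptr = null;
    }
}

void main() {
    {
        auto b = Buffer!double(10_000);
        b.ptr[0] = 3.14;
        assert(b.length == 10_000);
    }   // memory freed here, immediately, with no GC involvement
}
```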


> Another strange things was that the new opEquals implementation:
>
> 	bool opEquals(const ref Matrix other) const pure nothrow {
> 		if (this.dim != other.dim) {
> 			return false;
> 		}
> 		foreach (immutable i; 0 .. this.dim.size) {
> 			if (this.data[i] != other.data[i]) return false;
> 		}
> 		return true;
> 	}
>
> is actually about 20% faster than the single line you suggested: "return this.dim == other.dim && this.data[] == other.data[];".

I think this small performance bug is fixed in dmd 2.065, which is
currently in beta3.


> The last thing I haven't quite understood is that I tried to replace
>
> auto t = Matrix(other).transposeAssign();
>
> in the matrix multiplication algorithm with its shorter and clearer form
>
> auto t = other.transpose(); // sorry for the nasty '()', but I like them! :/
>
> This however gave me wonderful segmentation faults on runtime while using the matrix multiplication ...

This looks like a null-related bug.

I'll benchmark your code a little, but I think I don't have as
much RAM as you.

Bye,
bearophile
February 17, 2014
On 2/16/2014 6:47 PM, Robin wrote:
>
> @Nick Sabalausky:
>
> Don't be shocked at the bad performance of a beginner D programmer. At
> least I am not shocked yet that my D implementation isn't performing as
> well as the highly optimized C++11 implementation or the JIT-optimized
> Java implementation. In general one cannot compare the results of
> native compilation optimization, which has to run on different
> hardware, with JIT optimization, which is able to pull out every
> single optimization trick because it knows the exact system and all
> its components it compiles for.
>

I think you've misunderstood me. I was only explaining D's class vs struct system, and why in D (as opposed to Java) structs are better than classes in this particular case.