October 08, 2014
On second thought, I can see that the main problem is that my computation functions are not inlined. They look like this:

float sigmoid(float value, float alpha) nothrow
{
    return (value * alpha) / (1.0f + nfAbs(value * alpha)); // Elliott sigmoid
}

float sigmoidDeriv(float value, float alpha) nothrow
{
    return alpha / ((1.0f + nfAbs(value * alpha)) * (1.0f + nfAbs(value * alpha))); // Elliott sigmoid derivative
}

float linear(float value, float alpha) nothrow
{
    return nfMin(nfMax(value * alpha, -alpha), alpha); // clamp value * alpha to [-alpha, alpha]
}
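
For reference, the hot loops calling them look something like this (a simplified sketch; the names here are illustrative, not the actual benchmark code):

void computeOutputs(const(float)[] sums, float[] outputs, float alpha) nothrow
{
    foreach (i; 0 .. sums.length)
        outputs[i] = sigmoid(sums[i], alpha); // one non-inlined call per element
}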

Why are those calls not inlined?? Or vectorized?
October 08, 2014
Here are the abs/min/max functions:

float nfAbs(float num) nothrow
{
    return num < 0.0f ? -num : num;
}

float nfMax(float num1, float num2) nothrow
{
    return num1 < num2 ? num2 : num1;
}

float nfMin(float num1, float num2) nothrow
{
    return num2 < num1 ? num2 : num1;
}

They aren't inlined either. Why?

October 08, 2014
> I'm not an ASM expert

'-output-ll' gives you LLVM IR, which is a bit higher level.
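
For example (the file name is just a placeholder):

ldc2 -O3 -output-ll yourmodule.d

should drop the optimized IR into yourmodule.ll.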

> but as far as I can see it indeed uses some SIMD registers and instructions. For example:
> 	movss	xmm0, dword ptr [rdx]
> 	mulss	xmm0, dword ptr [r12]
> 	addss	xmm1, xmm0

If you see a 'ps' suffix (packed single-precision), it's SIMD ;)

Your helper functions are probably in a different module.
Cross-module inlining is currently problematic.
October 08, 2014
Hi,

On Wednesday, 8 October 2014 at 07:37:15 UTC, Gabor Mezo wrote:
> There is a number crunching benchmark in it that does simple gradient descent learning on a small multilayer perceptron neural network. The core of the benchmark is some loops doing basic computations on numbers in float[] arrays (add, mul, exp, abs).

Would it be possible to publish the relevant parts of the code, i.e. what is needed to reproduce the performance problem? I'm currently working on a D compiler performance tracking project, so real-world test-cases where one compiler does much better than another are interesting to me.

If the code is proprietary, would it be possible for me or another compiler dev to have a look at the code, so we can determine the issues more quickly?

> DMD 2.066 -O -release -inline -boundscheck=off : 0.06 secs
> LDC2 0.14 -O3 -release                         : 0.051 secs

Note that array bounds checks are still enabled for LDC here if your code is @safe.

David
October 08, 2014
On Wednesday, 8 October 2014 at 16:23:19 UTC, Gabor Mezo wrote:
> I'm not an ASM expert, but as far as I can see it indeed uses some SIMD registers and instructions.

On x86_64, scalar single and double precision math uses the SSE registers and instructions by default, too. The relevant mnemonics (mostly) end with "ss", which stands for "scalar single". Vectorized code, on the other hand, would use e.g. the instructions ending in "ps", for "packed single" (multiple values in one SSE register).
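
For comparison, explicitly vectorized code written with core.simd ends up using the packed instructions. A minimal sketch (illustrative only, not your code):

import core.simd;

void scaleAll(float4[] data, float4 factor) nothrow
{
    foreach (ref v; data)
        v *= factor; // one mulps multiplies four floats at once
}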

Your snippet has not actually been vectorized. Assuming the code you posted is from a hot loop, though, the many function calls are a much bigger problem.

David
October 08, 2014
On Wednesday, 8 October 2014 at 16:26:02 UTC, Gabor Mezo wrote:
> Why are those calls not inlined?

They are likely in a different module than the code using them, right? Modules in D are supposed to be their own separate compilation units, just like .cpp files in C++. Thus, by default no inlining across module boundaries will take place, unless you use something like link-time optimization.

Now of course this is rather undesirable and a big problem for trivial helper functions. If you just compile a single executable, you can pass -singleobj to LDC to instruct it to generate only one object file, so that the optimization boundaries disappear (arguably, this should be the default).
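
For example, assuming your sources are app.d and nfutils.d (placeholder names):

ldc2 -O3 -release -singleobj app.d nfutils.d

This emits one object file for both modules, so the optimizer can inline across them.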

Furthermore, both DMD and LDC actually attempt to work around this by also analyzing imported modules so that functions in them can be inlined. Unfortunately, the LDC implementation of this has been defunct since a couple of DMD frontend merges ago. Thus, not even simple cases like the one in your example are covered. I'm working on a reimplementation right now, hopefully to appear in master soon.

Cheers,
David
October 09, 2014
Hi David,

Thanks for trying to help me out.

Indeed, the helper functions reside in separate modules. They are @system functions. I'll try converting my helper function system to mixins, then.
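
Something like this, I guess (just a sketch of the idea):

mixin template NFHelpers()
{
    float nfAbs(float num) nothrow
    {
        return num < 0.0f ? -num : num;
    }

    float nfMin(float num1, float num2) nothrow
    {
        return num2 < num1 ? num2 : num1;
    }

    // ... nfMax etc.
}

// In each computation module:
// mixin NFHelpers; // the bodies get instantiated here, so they can be inlined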
October 09, 2014
> Would it be possible to publish the relevant parts of the code, i.e. what is needed to reproduce the performance problem? I'm currently working on a D compiler performance tracking project, so real-world test-cases where one compiler does much better than another are interesting to me.
>
> If the code is proprietary, would it be possible for me or another compiler dev to have a look at the code, so we can determine the issues more quickly?
>
>> DMD 2.066 -O -release -inline -boundscheck=off : 0.06 secs
>> LDC2 0.14 -O3 -release                         : 0.051 secs

Of course. The code will be accessible on GitHub this week. This is an LGPL-licensed hobbyist project, not confidential. ;)
October 09, 2014
Let me introduce my project to you guys.

Here is the blog:

http://neuroflowblog.wordpress.com/

I started working on it almost 10 years ago. It was a C# project, and the productivity of the language allowed me to implement advanced machine learning algorithms like Real-Time Recurrent Learning and Scaled Conjugate Gradient. Sadly, the performance was not that good, so I learned OpenCL. I implemented a provider model in my framework, so I could use managed and OpenCL implementations in the same system. But because my experimental code was implemented in C#, and managed code was slow, my experiments went really slowly.

Then I decided to move my experimental layer to C++11, and my framework became pure native. Sadly, the productivity of C++ is poor compared to C#, so even though my experimental code was fast, my experiments became slower than they had been with the managed version.

Then I decided to learn D, and the result is on GitHub (a DUB project):

https://github.com/unbornchikken/neuroflow-D

This is a console application; when you run it, the mentioned benchmark starts.

Please note, this is my first D code. There are constructs that seem to lead nowhere, but they will gain purpose when I port all of the planned functionality. Because there is a provider model, to allow OpenCL and D (and whatever else) based implementations to exist in parallel, I wasn't able to avoid downcasting in my design. Because downcasting can hugely affect performance, I implemented some ugly but performant void * magic. Sorry for that. :) Conversion of the OpenCL implementation to D is still a TODO. Recurrent learning is not implemented right now.
October 09, 2014
Hey,

We have made progress. I've merged my computation code into a single module, and now the LDC build is as performant as the Clang one! The benchmark took around 0.044 secs. It's slower than the GDC version, but it is amazing that the D language can be as performant as C++ when using the same compiler backend, so no magic involved.

Results pushed in.