February 13, 2013
On Wed, 13 Feb 2013 14:48:21 +0100, Joseph Rushton Wakeling <joseph.wakeling@webdrake.net> wrote:

> On 02/13/2013 02:26 PM, Marco Leise wrote:
> > You get both: 50% more speed and more precision! It is a win-win situation. Also take a look at Phobos' std.math, which returns real everywhere.
> 
> I have to say, it's not been my experience that using real improves speed. Exactly what optimizations are you using when compiling?

The target is Linux, AMD64 and the compiler arguments are:

ldc2 -O5 -check-printf-calls -fdata-sections -ffunction-sections -release -singleobj -strip-debug -wi -L=--gc-sections -L=-s
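
As an aside on the std.math point quoted above, here is a minimal way to check at which precision a Phobos function computes (just a sketch; which functions are declared with real parameters varies between Phobos releases):

import std.math : exp;
import std.stdio : writeln;

void main()
{
	// If exp is declared as real exp(real), the float argument is
	// promoted and the result type is real (80-bit x87 precision
	// on AMD64).
	writeln(typeof(exp(1.0f)).stringof);
}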

-- 
Marco

February 13, 2013
On 02/13/2013 03:29 PM, Marco Leise wrote:
> They are actual storage in memory, where every increase in
> size hurts.

When I replaced the hardcoded type with TReal, it sped things up for double.

February 13, 2013
On Wed, 13 Feb 2013 15:00:21 +0100, Joseph Rushton Wakeling <joseph.wakeling@webdrake.net> wrote:

> Compiling with ldmd2 -O -inline -release on 64-bit Ubuntu, latest from-GitHub LDC, LLVM 3.2:
> 
>    D code serial with dimension 32768 ...
>      using floats Total time: 4.751 [sec]
>      using doubles Total time: 4.362 [sec]
>      using reals Total time: 5.95 [sec]

Ok, I get pretty much the same numbers as before with:
  ldmd2 -O -inline -release
It's even a bit faster than my loooong command line.
Do these numbers tell us that there are such huge differences
in the handling of floating-point values between different
AMD64 CPUs? I can't quite make rhyme or reason of it yet.
What version of LLVM are you using? Mine is 3.1; 3.0 is the
minimum and 3.2 is recommended for LDC2.

> Using double is indeed marginally faster than float, but real is slower than both.
> 
> What's disturbing is that when compiled instead with gdmd -O -inline -release the code is dramatically slower:
> 
>    D code serial with dimension 32768 ...
>      using floats Total time: 22.108 [sec]
>      using doubles Total time: 21.203 [sec]
>      using reals Total time: 23.717 [sec]
> 
> It's the first time I've encountered such a dramatic difference between GDC and LDC, and I'm wondering whether it's down to a bug or some change between D releases 2.060 and 2.061.

_THAT_ I can reproduce with GDC:

D code serial with dimension 32768 ...
  using floats Total time: 24.415 [sec]
  using doubles Total time: 23.268 [sec]
  using reals Total time: 25.168 [sec]

It's the exact same pattern.

-- 
Marco

February 13, 2013
On Wed, 13 Feb 2013 15:45:13 +0100, Joseph Rushton Wakeling <joseph.wakeling@webdrake.net> wrote:

> On 02/13/2013 03:29 PM, Marco Leise wrote:
> > They are actual storage in memory, where every increase in size hurts.
> 
> When I replaced the hardcoded type with TReal, it sped things up for double.

Give me that stuff your northbridge is on!
But I still want to rule out the LLVM version, since GDC seems
to produce code with similar runtimes on both our systems,
while LDC2 diverges so much.

-- 
Marco

February 13, 2013
On 02/13/2013 03:56 PM, Marco Leise wrote:
> Ok, I get pretty much the same numbers as before with:
>    ldmd2 -O -inline -release
> It's even a bit faster than my loooong command line.

My experience has been that the higher -O values of LDC don't do much, but of course that's going to vary depending on your code. I think above -O3 it's all link-time optimization, no?

> Do these numbers tell us that there are such huge differences
> in the handling of floating-point values between different
> AMD64 CPUs? I can't quite make rhyme or reason of it yet.

AMD vs Intel might make a difference (my machine is an i7).

> What version of LLVM are you using? Mine is 3.1; 3.0 is the
> minimum and 3.2 is recommended for LDC2.

LLVM 3.2.

> _THAT_ I can reproduce with GDC:
>
> D code serial with dimension 32768 ...
>    using floats Total time: 24.415 [sec]
>    using doubles Total time: 23.268 [sec]
>    using reals Total time: 25.168 [sec]
>
> It's the exact same pattern.

I've never, EVER had LDC-compiled code run four times faster than GDC-compiled code. In fact, I don't think I've ever had LDC-compiled code run faster than GDC-compiled code at all, except where the choice of optimizations was different. That's what makes me concerned that there's some kind of bug in play here...

February 13, 2013
Good point about choosing the right floating-point type.
Conclusion: when there's enough space, always pick double over float.
Tested with GDC on Win64: floats: 16.0s / doubles: 14.1s / reals: 11.2s.
I thought to myself: cool, I almost beat the 13.4s I got with C++, until I changed the C++ code to also use doubles and... got a massive speedup: 7.1s!

February 13, 2013
On Wed, 13 Feb 2013 15:45:13 +0100, Joseph Rushton Wakeling <joseph.wakeling@webdrake.net> wrote:

> On 02/13/2013 03:29 PM, Marco Leise wrote:
> > They are actual storage in memory, where every increase in size hurts.
> 
> When I replaced the hardcoded type with TReal, it sped things up for double.

Oh, this gets even better... I only added double as a last step to that code, so I didn't notice this effect. Looks like we've got:

- CPUs that are good at converting to double
- 64-bit, so the size of a double matches
- only 16 bytes of memory in total

With double struct fields, the 'double' case gains 50% speed for me, making it the overall fastest now (on LDC). I'd still bet a dollar that with an array of values, floats would outperform doubles when cache misses happen (e.g. more or less random memory access).
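
To make the storage difference concrete, here is a minimal sketch (this Complex template is only illustrative, not the benchmark's ComplexStruct; the sizes shown are for x86-64 Linux):

import std.stdio;

// Two scalar fields, like the complex number in the benchmark.
struct Complex(TReal) { TReal r, i; }

void main()
{
	// A large array of these costs twice the cache space and
	// memory bandwidth with double as with float.
	writeln(Complex!float.sizeof);  // 8
	writeln(Complex!double.sizeof); // 16
	writeln(Complex!real.sizeof);   // 32 (real is 16-byte aligned here)
}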

-- 
Marco

February 13, 2013
On 2013-02-13 16:26, Marco Leise wrote:
> I'd still bet a dollar that with an array of values, floats would
> outperform doubles when cache misses happen (e.g. more or less
> random memory access).

I'll play it safe and only bet my opDollar. :)

February 13, 2013
On 02/13/2013 04:17 PM, FG wrote:
> Good point about choosing the right floating-point type.
> Conclusion: when there's enough space, always pick double over float.
> Tested with GDC on Win64: floats: 16.0s / doubles: 14.1s / reals: 11.2s.
> I thought to myself: cool, I almost beat the 13.4s I got with C++, until I
> changed the C++ code to also use doubles and... got a massive speedup: 7.1s!

Yea, ditto for C++: 5.3 sec with double, 9.3 with float (using g++ -O3).

February 13, 2013
On Wed, 13 Feb 2013 16:17:12 +0100, FG <home@fgda.pl> wrote:

> Good point about choosing the right floating-point type.
> Conclusion: when there's enough space, always pick double over float.
> Tested with GDC on Win64: floats: 16.0s / doubles: 14.1s / reals: 11.2s.
> I thought to myself: cool, I almost beat the 13.4s I got with C++, until I
> changed the C++ code to also use doubles and... got a massive speedup: 7.1s!

Yeah, we are living in the 32-bit past. ;)

Still, be aware that we only write to two memory locations in
that program! We have neither exceeded the L1 cache size with
that, nor put any strain on the prefetcher and memory bandwidth.
With the modification below it is clearer why I said "use
float for storage". The result with LDC2 for me is:

D code serial with dimension 8192 ...
  using floats Total time: 4.235 [sec]
  using doubles Total time: 5.58 [sec] // ~+32% over float
  using reals Total time: 6.432 [sec]

So all the in-CPU performance gain from using doubles is more than lost when you run out of memory bandwidth.

---8<-----------------------------------

module main;

import std.datetime;
import std.metastrings;
import std.stdio;
import std.typetuple;
import std.random;
import core.stdc.stdlib;


enum DIM = 8 * 1024;

// Global sink for the result so the computation cannot be
// optimized away entirely.
int juliaValue;

// Shuffled index table used to scatter the benchmark's writes
// across the whole squares array.
size_t* randomAcc;

static this()
{
	// 200 extra slots are mirrored from the first 200 indices (below),
	// so randomAcc[DIM * x + y + i] stays in bounds for i < 200.
	randomAcc = cast(size_t*) malloc((DIM * DIM + 200) * size_t.sizeof);
	foreach (i; 0 .. DIM * DIM)
		randomAcc[i] = i;
	randomAcc[0 .. DIM * DIM].randomShuffle();
	randomAcc[DIM * DIM .. DIM * DIM + 200] = randomAcc[0 .. 200];
}

static ~this() { free(randomAcc); }

template Julia(TReal)
{
	// One DIM*DIM element result array per tested floating-point type.
	TReal* squares;

	static this() { squares = cast(TReal*) malloc(DIM * DIM * TReal.sizeof); }

	static ~this() { free(squares); }

	struct ComplexStruct
	{
		TReal r;
		TReal i;

		// One iteration of z = z^2 + c; returns the squared magnitude.
		TReal squarePlusMag(const ComplexStruct another)
		{
			TReal r1 = r*r - i*i + another.r;
			TReal i1 = 2.0*i*r + another.i;

			r = r1;
			i = i1;

			return (r1*r1 + i1*i1);
		}
	}

	int juliaFunction( int x, int y )
	{
		auto c = ComplexStruct(0.8, 0.156);
		auto a = ComplexStruct(x, y);

		// Write each iteration's result to a shuffled location so the
		// prefetcher cannot keep up and memory bandwidth becomes the
		// limiting factor.
		foreach (i; 0 .. 200) {
			size_t idx = randomAcc[DIM * x + y + i];
			squares[idx] = a.squarePlusMag(c);
			if (squares[idx] > 1000)
				return 0;
		}
		return 1;
	}

	void kernel()
	{
		foreach (x; 0 .. DIM) {
			foreach (y; 0 .. DIM) {
				juliaValue = juliaFunction( x, y );
			}
		}
	}
}

void main()
{
	writeln("D code serial with dimension " ~ toStringNow!DIM ~ " ...");
	StopWatch sw;
	foreach (Math; TypeTuple!(float, double, real))
	{
		sw.start();
		Julia!(Math).kernel();
		sw.stop();
		writefln("  using %ss Total time: %s [sec]",
		         Math.stringof, (sw.peek().msecs * 0.001));
		sw.reset();
	}
}
-- 
Marco