May 31, 2013
On 31 May 2013 11:26, finalpatch <fengli@gmail.com> wrote:

> Recently I ported a simple ray tracer I wrote in C++11 to D. Thanks to the similarity between D and C++ it was almost a line-by-line translation, in other words, very very close. However, the D version runs much slower than the C++11 version. On Windows, with MinGW GCC and GDC, the C++ version is twice as fast as the D version. On OSX, I used Clang++ and LDC, and the C++11 version was 4x faster than the D version.  Since the comparisons were between compilers that share the same codegen backends, I suppose that's a relatively fair comparison.  (flags used for GDC: -O3 -fno-bounds-check -frelease,  flags used for LDC: -O3 -release)
>
> I really like the features offered by D, but it's the raw performance that's worrying me. From what I read, D should offer similar performance when doing similar things, but my own test results are not consistent with this claim. I want to know whether this slowness is inherent to the language or whether it's something I was not doing right (very possible, because I have only a few days of experience with D).
>
> Below is the link to the D and C++ code, in case anyone is interested to have a look.
>
> https://dl.dropboxusercontent.com/u/974356/raytracer.d https://dl.dropboxusercontent.com/u/974356/raytracer.cpp
>

Can you paste the disassembly of the inner loop (trace()) for each of G++/GDC,
or LDC/Clang++?

That said, I can see almost innumerable red flags (on basically every line).
The fact that it takes 200ms to render a frame in C++ (I would expect
<10ms) suggests that your approach is amazingly slow to begin with, at
which point I would start looking for much higher level problems.
Once you have an implementation that's approaching optimal, then we can
start making comparisons.

Here are some thoughts at first glance:
* The fact that you use STL makes me immediately concerned. Generic code
for this sort of work will never run well.
   That said, STL has decades more time spent optimising, so it stands to
reason that the C++ compiler will be able to do more to improve the STL
code.
* Your vector classes in both C++ and D are pretty nasty. Use 4D SIMD vectors.
* So many integer divisions!
* There are countless float <-> int casts.
* Innumerable redundant loads/stores.
* I would have raised the virtual-by-default travesty, but Andrei did it
for me! ;)
* intersect() should be __forceinline.
* intersect() is full of if's (it's hard to predict whether the optimiser can
work across those if's; maybe it can...)

What's taking the most time?
The lighting loop is so template-tastic, I can't get a feel for how fast
that loop would be.

I believe the reason for the difference is not going to be so easily
revealed. It's probably hidden largely in the fact that C++ has had a good
decade of optimisation spent on STL over D.
It's also possible that the C++ compiler hooks many of those STL functions
as compiler intrinsics with internalised logic.

Frankly, this is a textbook example of why STL is the spawn of satan. For some reason people are TAUGHT that it's reasonable to write code like this.


May 31, 2013
On 31 May 2013 15:49, finalpatch <fengli@gmail.com> wrote:

> Thanks Nazriel,
>
> It is very cool you are able to narrow the gap to within 1.5x of c++ with a few simple changes.
>
> I checked your version, there are 3 changes (correct me if i missed any):
>
> * Change the (float) constructor from v= [x,x,x] to v[0] = x; v[1] = x;
> v[2] = x;
> * Get rid of the (float[]) constructor and use 3 floats instead
> * Change class methods to final
>
> The first change alone shaved 220ms off the runtime, the 2nd one cut
> 130ms,
> and the 3rd one cut 60ms.
>
> Lesson learned: be very, very careful about dynamic arrays.


Yeah, I've actually noticed this too on a few occasions. It would be nice if array operations would unroll for short arrays. Particularly so for static arrays!


May 31, 2013
On Friday, 31 May 2013 at 05:49:55 UTC, finalpatch wrote:
> Thanks Nazriel,
>
> It is very cool you are able to narrow the gap to within 1.5x of c++ with a few simple changes.
>
> I checked your version, there are 3 changes (correct me if i missed any):
>
> * Change the (float) constructor from v= [x,x,x] to v[0] = x; v[1] = x; v[2] = x;
> * Get rid of the (float[]) constructor and use 3 floats instead

I thought GDC or LDC has something like:
float[$] v = [x, x, x];
which is converted to
float[3] v = [x, x, x];

Am I wrong?
DMD needs something like this too.
May 31, 2013
On 2013-05-31 06:06, deadalnix wrote:

>   - Introduce a virtual keyword.
>
> Virtual by default isn't such a big deal if you can do final: and
> reverse the default behavior. However, once you step into final land,
> you are trapped there; you can't get out. Introducing a virtual keyword
> would allow for aggressive final: declarations.

I guess this would be the least intrusive change.

-- 
/Jacob Carlborg
May 31, 2013
On 31 May 2013 14:06, deadalnix <deadalnix@gmail.com> wrote:

> On Friday, 31 May 2013 at 02:56:25 UTC, Andrei Alexandrescu wrote:
>
>> On 5/30/13 9:26 PM, finalpatch wrote:
>>
>>> https://dl.dropboxusercontent.com/u/974356/raytracer.d https://dl.dropboxusercontent.com/u/974356/raytracer.cpp
>>>
>>
>> Manu's gonna love this one: make all methods final.
>>
>>
> I don't think going as far as making things final by default makes sense at this point. But we sure need a way to be able to finalize methods. We had an extensive discussion with Don and Manu at DConf; here are some ideas that came out:
>
>  - Final by default
>
> This one is really a plus when it comes to performance code. However, virtual by default has proven itself very useful when performance isn't that big of a deal (and that is the case for 90% of a program's code, usually), and final by default would limit the usage of patterns like decorator (which have also proven to be useful). This is also a huge breakage.
>

The correct choice :)

> virtual by default has proven itself very useful when performance isn't
that big of a deal

Has it? I'm not sure it's ever proven itself useful; can you point to any
such proof, or evidence?
I think it's only *proven* itself to be a massive mistake; you've all
heard my arguments a million times, Don will also give a massive rant, and
others.

People already have to type 'override' in every derived class, and they're
happy to do that. Requiring to type 'virtual' in the base is hardly an
inconvenience by contrast. Actually, it's quite orthogonal.
D tends to prefer being explicit. Why bend the rules in this case,
especially considering the counterpart (override) is expected to be
explicit? Surely both being explicit is what people would expect?

There's another detail that came up late at night at DConf; I was talking to Daniel Murphy about the extern(C++) class work he's doing to write the D DMD front end, and apparently the problem would be a lot easier if virtual were explicit there too. So perhaps it also offers improved interoperability with C++, which I think most of us have agreed is of heightened importance recently.

> and it is the case for 90% of a program's code usually

Can you justify this? I believe it's more like 5% of methods, or less. In my experience, virtuals account for 1-2 functions, and trivial properties (which should avoid being virtual at all costs) account for almost all of them.

int getPrivateMember() { return privateMember; } // <- most methods are like this; should NEVER be virtual

> This is also a huge breakage.

I think this is another fallacy. The magnitude of the breakage is already
known. It's comparable to the recent change where 'override' became an
explicit requirement, which was hardly catastrophic.
In that case, the breakage occurred in *every derived class*, which is a
many:1 relationship to base classes.
Explicit virtual would only affect _base_ classes. The magnitude of which
is far smaller than what was recently considered acceptable, precisely:
magnitudeOfPriorBreakage / numberOfDerivedClasses.

So considering explicit override was an argument won, I'd like to think that explicit virtual deserves the same treatment for even more compelling reasons, and at a significantly lesser cost in breakage.

> - Introduce a virtual keyword.
>
> Virtual by default isn't such a big deal if you can do final: and reverse the default behavior. However, once you step into final land, you are trapped there; you can't get out. Introducing a virtual keyword would allow for aggressive final: declarations.
>

This needs to be done either way.

> - Require explicit export when you want to create shared objects.
>
> This one is an enabler for an optimizer to finalize virtual methods. With this mechanism in place, the compiler knows all the overrides and can finalize many calls during LTO. I especially like that one, as it allows for stripping many symbols at link time and allows for other LTO in general (for instance, the compiler can choose custom calling conventions for some methods, knowing all call sites).
>
> The explicit export one has my preference; however, it requires that symbols used in shared libs are explicitly declared export. I think we shouldn't break the virtual by default behavior, but we still need to figure out a way to make things more performant on this point.
>

You're concerned about the magnitude of breakage introducing an explicit virtual requirement. This would seem to be a much much bigger breakage to me, and I don't think it's intuitive.


May 31, 2013
On Friday, 31 May 2013 at 04:06:58 UTC, deadalnix wrote:

> I don't think going as far as making things final by default makes sense at this point. But we sure need a way to be able to finalize methods. We had an extensive discussion with Don and Manu at DConf; here are some ideas that came out:
>
>  - Final by default
[..]
>  - Introduce a virtual keyword.
[..]

C# has final by default, and a mandatory virtual keyword.

Barely anyone complained. It is mostly considered a good thing to be explicit. As for D, since the override keyword is mandatory, it would be a compiler error to omit a virtual keyword on the base method, and thus easily fixable.

As for the methods that are supposed to be virtual but are never actually overridden - there you must know which ones those should be, and make that explicit. This applies only to libraries, and only the ones that use a lot of methods intended to be overridden (generally questionable design) have some work to do on being explicit about virtual.
May 31, 2013
On 05/31/2013 08:34 AM, Manu wrote:
> What's taking the most time?
> The lighting loop is so template-tastic, I can't get a feel for how fast that
> loop would be.

Hah, I found this out the hard way recently -- have been doing some experimental reworking of code where some key inner functions were templatized, and it had a nasty effect on performance.  I'm guessing it made it impossible for the compilers to inline these functions :-(

May 31, 2013
On 05/31/2013 12:58 PM, Joseph Rushton Wakeling wrote:
> On 05/31/2013 08:34 AM, Manu wrote:
>> What's taking the most time?
>> The lighting loop is so template-tastic, I can't get a feel for how fast that
>> loop would be.
>
> Hah, I found this out the hard way recently -- have been doing some experimental
> reworking of code where some key inner functions were templatized, and it had a
> nasty effect on performance.  I'm guessing it made it impossible for the
> compilers to inline these functions :-(
>

That wouldn't make any sense though, since after template expansion there is no difference between the generated version and a particular handwritten version.
May 31, 2013
On 05/31/2013 01:05 PM, Timon Gehr wrote:
> That wouldn't make any sense though, since after template expansion there is no difference between the generated version and a particular handwritten version.

That's what I'd assumed too, but there _is_ a speed difference.  I'm open to suggestions as to why.

Compare these profile results for the core inner function -- the original (even though there's a 'New' in there somewhere):

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 93.33      4.69     4.69  4084637     0.00     0.00
_D8infected5model115__T13NewSimulationS248infected5model8StateSISS238infected5model7SeedSISS388infected5model21NewUpdateMeanFieldSISTdZ13NewSimulation19__T11UpdateStateTdZ6updateMFKC8infected5model15__T8StateSISTdZ8StateSISKxC8infected5model14__T7SeedSISTdZ7SeedSISZd


... and the newer version:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 92.73      5.23     5.23  4078287     0.00     0.00
_D8infected5model292__T10SimulationS358infected5model18UpdateMeanFieldSISTC8infected5model15__T8StateSISTdZ8StateSISTC8infected5model164__T7SeedSISTdTAyAS3std8typecons50__T5TupleTmVAyaa2_6964TdVAyaa9_696e666c75656e6365Z5TupleTyS3std8typecons50__T5TupleTmVAyaa2_6964TdVAyaa9_696e666c75656e6365Z5TupleZ7SeedSISTydZ10Simulation17__T11UpdateStateZ163__T6updateTdTAyAS3std8typecons50__T5TupleTmVAyaa2_6964TdVAyaa9_696e666c75656e6365Z5TupleTyS3std8typecons50__T5TupleTmVAyaa2_6964TdVAyaa9_696e666c75656e6365Z5TupleZ6updateMFNbNfKC8infected5model15__T8StateSISTdZ8StateSISKxC8infected5model164__T7SeedSISTdTAyAS3std8typecons50__T5TupleTmVAyaa2_6964TdVAyaa9_696e666c75656e6365Z5TupleTyS3std8typecons50__T5TupleTmVAyaa2_6964TdVAyaa9_696e666c75656e6365Z5TupleZ7SeedSISZd

I'm not sure what, other than a change of template design, could be responsible here.  The key bits of code follow -- the original version:

	mixin template UpdateState(T)
	{
		T update(ref StateSIS!T st, const ref SeedSIS!T sd)
		{
			T d = to!T(0);
			static T[] sick;
			sick.length = st.infected.length;
			sick[] = st.infected[];

			foreach(i; 0..sick.length) {
				T noTransmission = to!T(1);
				foreach(link; sd.network[i])
					noTransmission *= (to!T(1) - sick[link.id] * link.influence);
				T getSick = (to!T(1) - sick[i]) * (sd.susceptible[i] + (to!T(1) - sd.susceptible[i]) * (to!T(1) - noTransmission));
				T staySick = sick[i] * (to!T(1) - sd.recover[i]);
				st.infected[i] = (to!T(1) - sd.immune[i]) * (getSick + staySick);
				assert(to!T(0) <= st.infected[i]);
				assert(st.infected[i] <= to!T(1));
				d = max(abs(st.infected[i] - sick[i]), d);
			}

			return d;
		}
	}

... and for clarity, the StateSIS and SeedSIS classes:

	class StateSIS(T)
	{
		T[] infected;

		this(){}

		this(T[] inf)
		{
			infected = inf;
		}

		auto size() @property pure const nothrow
		{
			return infected.length;
		}

		T infection() @property pure const nothrow
		{
			return reduce!"a+b"(to!T(0), infected);
		}
	}

	class SeedSIS(T)
	{
		T[] immune;
		T[] susceptible;
		T[] recover;
		Link!T[][] network;

		this() {}

		this(T[] imm, T[] sus, T[] rec, Link!T[][] net)
		{
			immune = imm;
			susceptible = sus;
			recover = rec;
			network = net;
		}

		auto size() @property pure const nothrow
			in
			{
				assert(immune.length == susceptible.length);
				assert(immune.length == recover.length);
				assert(immune.length == network.length);
			}
		body
		{
			return immune.length;
		}
	}

... and the "Link" template:

	template Link(T)
	{
		alias Tuple!(size_t, "id", T, "influence") Link;
	}

... and now for comparison the new versions:

	mixin template UpdateState()
	{
		T update(T, N : L[][], L)(ref StateSIS!T st, const ref SeedSIS!(T, N, L) sd)
		{
			T d = to!T(0);
			static T[] sick;
			sick.length = st.infected.length;
			sick[] = st.infected[];

			foreach(i; 0..sick.length) {
				T noTransmission = to!T(1);
				foreach(link; sd.network[i])
					noTransmission *= (to!T(1) - sick[link.id] * link.influence);
				T getSick = (to!T(1) - sick[i]) * (sd.susceptible[i] + (to!T(1) - sd.susceptible[i]) * (to!T(1) - noTransmission));
				T staySick = sick[i] * (to!T(1) - sd.recover[i]);
				st.infected[i] = (to!T(1) - sd.immune[i]) * (getSick + staySick);
				assert(to!T(0) <= st.infected[i]);
				assert(st.infected[i] <= to!T(1));
				d = max(abs(st.infected[i] - sick[i]), d);
			}

			return d;
		}
	}

	class StateSIS(T)
	{
		T[] infected;

		this() {}

		this(T[] inf)
		{
			infected = inf;
		}

		auto size() @property pure const nothrow
		{
			return infected.length;
		}

		T infection() @property pure const nothrow
		{
			return reduce!"a+b"(to!T(0), infected);
		}
	}

	auto stateSIS(T)(T[] inf)
	{
		return new StateSIS!T(inf);
	}

	class SeedSIS(T, Network : L[][], L)
	{
		T[] immune;
		T[] susceptible;
		T[] recover;
		Network network;

		this() {}

		this(T[] imm, T[] sus, T[] rec, Network net)
		{
			immune = imm;
			susceptible = sus;
			recover = rec;
			network = net;
		}

		auto size() @property pure const nothrow
			in
			{
				assert(immune.length == susceptible.length);
				assert(immune.length == recover.length);
				assert(immune.length == network.length);
			}
		body
		{
			return immune.length;
		}
	}

	auto seedSIS(T, Network : L[][], L)(T[] imm, T[] sus, T[] rec, Network net)
	{
		return new SeedSIS!(T, Network, L)(imm, sus, rec, net);
	}

... note that the Network that is passed to SeedSIS is still always a Link!T[][].
May 31, 2013
On 31 May 2013 20:58, Joseph Rushton Wakeling <joseph.wakeling@webdrake.net>wrote:

> On 05/31/2013 08:34 AM, Manu wrote:
> > What's taking the most time?
> > The lighting loop is so template-tastic, I can't get a feel for how fast
> that
> > loop would be.
>
> Hah, I found this out the hard way recently -- have been doing some
> experimental
> reworking of code where some key inner functions were templatized, and it
> had a
> nasty effect on performance.  I'm guessing it made it impossible for the
> compilers to inline these functions :-(
>

I find that using templates actually makes it more likely for the compiler
to properly inline. But I think the totally generic expressions produce
cases where the compiler is considering too many possibilities that inhibit
many optimisations.
It might also be that the optimisations get a lot more complex when the
code fragments span across a complex call tree with optimisation
dependencies on non-deterministic inlining.

One of the most important jobs for the optimiser is code re-ordering.
Generic code is often written in such a way that makes it hard/impossible
for the optimiser to reorder the flattened code properly.
Hand written code can have branches and memory accesses carefully placed at
the appropriate locations.
Generic code will usually package those sorts of operations behind little
templates that often flatten out in a different order.
The optimiser is rarely able to re-order code across if statements or
pointer accesses. __restrict is very important in generic code to allow the
optimiser to reorder across any indirection; otherwise compilers typically
have to be conservative, presume that something somewhere may have
changed the destination of a pointer, and leave the order as the template
expanded it. Sadly, D doesn't even support __restrict, and nobody ever uses it
in C++ anyway.

I've always had better results writing precisely what I intend the compiler to do, and using __forceinline where it needs a little extra encouragement.