August 19, 2015
On Tuesday, 18 August 2015 at 10:45:49 UTC, Walter Bright wrote:
> Martin ran some benchmarks recently that showed that ddmd compiled with dmd was about 30% slower than when compiled with gdc/ldc. This seems to be fairly typical.
>
> I'm interested in ways to reduce that gap.
>
> There are 3 broad kinds of optimizations that compilers do:
>
> 1. source translations like rewriting x*2 into x<<1, and function inlining
>
> 2. instruction selection patterns like should one generate:
>
>     SETC AL
>     MOVZX EAX,AL
>
> or:
>     SBB EAX,EAX
>     NEG EAX
>
> 3. data flow analysis optimizations like constant propagation, dead code elimination, register allocation, loop invariants, etc.
>
> Modern compilers (including dmd) do all three.
>
> So if you're comparing code generated by dmd/gdc/ldc, and notice something that dmd could do better at (1, 2 or 3), please let me know. Often this sort of thing is low hanging fruit that is fairly easily inserted into the back end.
>
> For example, recently I improved the usage of the SETcc instructions.
>
> https://github.com/D-Programming-Language/dmd/pull/4901
> https://github.com/D-Programming-Language/dmd/pull/4904
>
> A while back I improved usage of BT instructions, the way switch statements were implemented, and fixed integer divide by a constant with multiply by its reciprocal.
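To make kind (1) concrete, here is a small illustrative sketch (an editor's example, not code from the post) of the source-level rewrites described:

```d
// Illustration of kind (1): source translations an optimizer may apply.
int timesTwo(int x)
{
    // A compiler may rewrite a multiply by a power of two as a shift:
    return x * 2; // candidate for lowering to: x << 1
}

int square(int x) { return x * x; }

int sumOfSquares(int a, int b)
{
    // Function inlining: the calls to square() can be replaced by its
    // body, yielding the equivalent of: a*a + b*b
    return square(a) + square(b);
}
```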

I lack the assembly language skills to determine the cause(s) myself, but my [CheckedInt](https://github.com/tsbockman/CheckedInt) benchmark runs about 10x slower when compiled with DMD rather than GDC. I'm sure there's some low-hanging fruit in there somewhere...

Note that while it's far from being a minimal test case, the runtime code is nowhere near as complicated as it might appear at first - the vast majority of the complexity in the code is compile-time logic. I could produce a similar example without most of the compile-time obfuscation, if requested.

Also note that the speed difference has nothing to do with the use of core.checkedint intrinsics, as it was there before those were implemented in GDC.
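For context, the core.checkedint intrinsics mentioned above are used roughly like this (a minimal sketch, not code taken from CheckedInt itself):

```d
import core.checkedint : adds, muls;
import std.stdio : writeln;

void main()
{
    bool overflow = false;

    // adds() returns the wrapped sum and sets the flag on overflow.
    int sum = adds(int.max, 1, overflow);
    writeln(overflow); // true: int.max + 1 overflowed

    overflow = false;
    int prod = muls(3, 4, overflow); // 12, flag stays false
    writeln(prod, " ", overflow);
}
```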
August 19, 2015
On 08/18/2015 12:45 PM, Walter Bright wrote:
> Martin ran some benchmarks recently that showed that ddmd compiled with
> dmd was about 30% slower than when compiled with gdc/ldc. This seems to
> be fairly typical.
> [...]

Maybe relevant: There's some work on automatically discovering peephole optimizations that a compiler misses, e.g. http://blog.regehr.org/archives/1109
August 20, 2015
On Wednesday, 19 August 2015 at 22:58:35 UTC, Timon Gehr wrote:
> Maybe relevant: There's some work on automatically discovering peephole optimizations that a compiler misses, e.g. http://blog.regehr.org/archives/1109

There's a talk about this:
https://www.youtube.com/watch?v=Ux0YnVEaI6A
August 20, 2015
On Tuesday, 18 August 2015 at 22:55:06 UTC, Walter Bright wrote:
> On 8/18/2015 3:17 PM, welkam wrote:
>> People are lazy, and if it takes more than one click, people won't use it.
>> Just like unit testing: everyone agrees that it's good to write tests, but
>> nobody does. When you put unit testing in the compiler, more people write
>> tests. PGO is awesome, but it needs to be made much simpler before people
>> will use it every day.
>
> Exactly. That's why people just want to type "-O" and it optimizes.

At least without separate compilation, it probably wouldn't be that hard to add a compiler flag that ran the unit tests after the code was built and then had the compiler rebuild the program using the profiling results. However, that would base the optimizations on the unit tests rather than on the actual program, which probably wouldn't be a good idea in general.

Another possibility would be to build something into dub. If it handled it for you automatically, then that would make it comparable to just slapping on the -O flag.
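As an aside, the manual PGO workflow that such a flag (or dub integration) would automate looks roughly like this with LDC. This is a sketch, not a tested recipe; the flags below are LDC's clang-style profiling options and may differ by compiler version.

```sh
# 1. Build an instrumented binary.
ldc2 -O -fprofile-instr-generate=app.profraw app.d -of=app_instr

# 2. Run it on a representative workload to collect the profile.
./app_instr

# 3. Merge the raw profile data.
ldc-profdata merge app.profraw -o app.profdata

# 4. Rebuild, letting the optimizer use the collected profile.
ldc2 -O -fprofile-instr-use=app.profdata app.d -of=app
```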

- Jonathan M Davis
August 20, 2015
On 2015-08-19 20:41, Walter Bright wrote:

> It's true that if generating an exe, the compiler can mark leaf classes
> as final and get devirtualization.

Since D has the "override" keyword, it should be possible to devirtualize calls into non-leaf classes as well, as long as the method is not overridden.

> (Of course, you can manually add 'final' to classes.)

Just as you can manually apply many of the other optimizations as well.

One case where "final" does not help is when a piece of code is shared between two projects. In one project a subclass overrides a method, but in the other it does not.

// shared code
module base;

class Base
{
    void foo () {}
}

// project A
import base;

class A : Base { }

// project B
import base;

class B : Base
{
    override void foo () {}
}

-- 
/Jacob Carlborg
August 21, 2015
On Tue, Aug 18, 2015 at 04:30:26PM -0700, Walter Bright via Digitalmars-d wrote:
> On 8/18/2015 4:05 PM, H. S. Teoh via Digitalmars-d wrote:
> >Maybe when I get some free time this week, I could look at the disassembly of one of my programs again to give some specific examples.
> 
> Please do.

Rather than try to reduce one of my programs into a self-contained test case, I decided instead to write a contrived range-based program that exemplifies the limitations of dmd as I have observed. The program code follows; the comments at the end describe the results, analysis, and my conclusions.

I didn't include the assembly listings, as it's rather long, but if people are interested I'll trim them down to the parts of interest and post them in a follow-up post.

----------------------snip-----------------------
/* Crude benchmark for comparing dmd/gdc output. */

import std.algorithm : map, filter, reduce;
import std.conv : to;
import std.range : iota;
import std.stdio : writeln;

auto fun(int n)
{
	return iota(n)
		.map!(a => a*7)
		.filter!(a => a % 2 == 0)
		.reduce!((a,b) => a/2 + b);
}

void main() {
	writeln(fun(100_000_000));
}

/* RESULTS:

Compiled with:

	dmd -release -O -inline test.d -oftest.dmd
	gdc -frelease -finline -O3 test.d -o test.gdc

(dmd git HEAD, gdc 5.2.1)

Execution times:

	% time test.dmd ; time test.gdc
	1399999944

	real	0m0.628s
	user	0m0.627s
	sys	0m0.001s
	1399999944

	real	0m0.168s
	user	0m0.167s
	sys	0m0.000s
	%

As can be seen, the executable produced by gdc runs about 3.7 times faster than the executable produced by dmd. Why?

Looking at the disassembly, the first thing that stands out is that gdc has inlined the call to writeln, whereas dmd calls a separate function. While this isn't the bottleneck, it gives a hint of things to come. We look next at fun(), which is a standalone function in both cases.

The dmd version of fun() is pretty straightforward: it calls iota() to create
the range, then map(), then filter(), and finally reduce() where most of the
work takes place. This is pretty much a direct translation of the code.

The gdc version of fun() is markedly different. We immediately notice that the
only function call in it is a call to enforce(). Not only the first level calls
to iota, map, filter, reduce have been inlined, pretty much their *entire* call
trees have been inlined, except for the call to enforce().

Here I'd like to highlight the fact that in the dmd version of the code, the call to iota() has another function call to the ctor of the returned range. So we see that gdc has inlined two levels of function calls where dmd has inlined none, even though one would expect that with -inline, at least the call from iota() to the ctor of the range should have been inlined, since it's the only place where that ctor would be called; iota() itself being merely a thin wrapper around it. (The ctor itself is also pretty simple; I'm not sure why dmd fails to inline it.)

Similar remarks apply to the calls to map() and filter() as well.

Now let's look at reduce(), which is where the actual action takes place. The dmd version, of course, involves a separate function call, which in the grand scheme of things isn't all that important, since it's only a single function call. However, a look at the disassembly of reduce() shows that dmd has not inlined the calls to .empty, .front, and .popFront. In fact, the function calls yet another function -- reduceImpl -- where the main loop sits. Inside this main loop, .empty, .front, and .popFront are again called with no inlining -- even though .empty has a trivial body, .front involves only 1 multiplication, and .popFront only 1 multiplication and a single odd/even test.  On top of this, each of these nested function calls involve a certain amount of boilerplate: twiddling with the stack registers, shuffling call arguments about, etc., that add up to quite a large overhead in reduceImpl's inner loop.

The gdc version, by contrast, inlines *everything*, except the call to enforce() which is outside the inner loop. This aggressive inlining allowed gdc to trim the loop body down to about only 18 instructions with no function calls.  While the dmd inner loop itself has only 15 instructions, it involves 3 function calls, with .front having 8 instructions, .empty also 8 instructions, and .popFront 13 instructions, making a total of 44 instructions per iteration. A significant percentage of these instructions are function call boilerplate. The entire inner loop in the gdc version would fit in about 4-5 CPU cache lines, whereas the dmd version would require a lot more.

To dmd's credit, it did manage to inline the nested calls in .empty, .front, and .popFront(), which would have involved more function calls when no inlining at all is done (each wrapper range forwards the calls to the next). This probably helped to reduce the cost of running their respective function bodies. However, this isn't quite enough, since the overhead of 3 function calls in the inner loop is pretty expensive when the cost could have been eliminated completely, as gdc had done.

*/
----------------------snip-----------------------
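For comparison, here is a hand-inlined loop equivalent of fun() above (an editor's sketch, not part of the original message); this is roughly what gdc's aggressive inlining reduces the range pipeline to:

```d
import std.stdio : writeln;

// Hand-inlined equivalent of fun(): reduce with no explicit seed uses the
// first element of the range as the accumulator's initial value.
int funByHand(int n)
{
    int acc;
    bool seeded = false;
    foreach (a; 0 .. n)
    {
        immutable v = a * 7;          // map!(a => a*7)
        if (v % 2 != 0) continue;     // filter!(a => a % 2 == 0)
        if (!seeded) { acc = v; seeded = true; }
        else acc = acc / 2 + v;       // reduce!((a,b) => a/2 + b)
    }
    return acc;
}

void main()
{
    writeln(funByHand(100_000_000)); // prints 1399999944, matching fun()
}
```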


T

-- 
Political correctness: socially-sanctioned hypocrisy.
August 21, 2015
On 8/20/2015 4:56 PM, H. S. Teoh via Digitalmars-d wrote:
> I didn't include the assembly listings, as it's rather long, but if
> people are interested I'll trim them down to the parts of interest and
> post them in a follow-up post.

Thank you. This belongs as an enhancement request in bugzilla, using the 'performance' keyword. I don't think adding the assembler listings is necessary for this one, what you've posted here is sufficient.

August 21, 2015
On Thu, Aug 20, 2015 at 05:30:25PM -0700, Walter Bright via Digitalmars-d wrote:
> On 8/20/2015 4:56 PM, H. S. Teoh via Digitalmars-d wrote:
> >I didn't include the assembly listings, as it's rather long, but if people are interested I'll trim them down to the parts of interest and post them in a follow-up post.
> 
> Thank you. This belongs as an enhancement request in bugzilla, using the 'performance' keyword. I don't think adding the assembler listings is necessary for this one, what you've posted here is sufficient.

https://issues.dlang.org/show_bug.cgi?id=14943


T

-- 
People tell me that I'm skeptical, but I don't believe them.
August 21, 2015
On Friday, 21 August 2015 at 00:00:09 UTC, H. S. Teoh wrote:
>
> The gdc version, by contrast, inlines *everything*,

This could be why I've observed performance differences in dmd when writing some loops manually rather than using the facilities in std.algorithm.
August 21, 2015
On Fri, Aug 21, 2015 at 01:20:25AM +0000, jmh530 via Digitalmars-d wrote:
> On Friday, 21 August 2015 at 00:00:09 UTC, H. S. Teoh wrote:
> >
> >The gdc version, by contrast, inlines *everything*,
> 
> This could be why I've observed performance differences in dmd when writing some loops manually rather than using the facilities in std.algorithm.

Very likely, I'd say. IME dmd tends to give up inlining rather easily. This is very much something that needs to improve, since ranges in D are supposed to be a big selling point. Wouldn't want them to perform poorly compared to hand-written loops.

Have you tried using gdc -O3 (or ldc) to see if there's a big
difference?


T

-- 
Prosperity breeds contempt, and poverty breeds consent. -- Suck.com