August 18, 2015
On 8/18/2015 2:32 PM, Joseph Rushton Wakeling wrote:
> However true that may be in general, those almost certainly aren't the reasons
> why ddmd benchmarks 30% slower than dmd.  I would suspect that particular speed
> difference is heavily backend-dependent.

That's exactly why I started this thread. To find out why, and if there are low cost steps we can take to close that gap.

August 18, 2015
On Tue, Aug 18, 2015 at 02:25:34PM -0700, Walter Bright via Digitalmars-d wrote:
> On 8/18/2015 1:47 PM, deadalnix wrote:
> >Realistically, D does not have the manpower required to reach the same level of optimization, and has many higher-impact tasks to spend that manpower on.
> 
> dmd also does a sludge of patterns. I'm just looking for a few that would significantly impact the result.

From the little that I've seen of dmd's output, it seems that it's
rather weak in the areas of inlining and loop unrolling / refactoring.

Inner loops especially would benefit from much more aggressive inlining, which dmd seems to be unable to do once the loop body gets moderately complex. Unrolling in dmd seems to be very minimal or absent.

In both cases, it seems that the optimizer gives up too quickly: an if-else function body will get inlined, but an if without an else won't, etc. This means the slightest bump in the road causes the inliner to throw up its hands and not inline, which in turn causes missed opportunities for further refactoring / simplification in the calling function. Especially in range-based code, this can make a big difference.

There are also the more general optimizations, like eliminating redundant loads, eliding useless allocation of stack space in functions that return constant values, etc. While DMD does do some of this, it's not as thorough as GDC. Individually these may sound like small differences, but when they happen to run inside an inner loop, they can add up to quite a significant difference.

DMD needs to be much more aggressive in eliminating useless / redundant code blocks; a lot of this comes not from people writing unusually redundant code, but from template expansions and inlined range-based code, which sometimes produce a lot of redundant operations if translated directly.  Aggressively reducing these generated code blocks will often open up further optimization opportunities.


T

-- 
The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5
August 18, 2015
On Tuesday, 18 August 2015 at 21:53:43 UTC, Meta wrote:
> On Tuesday, 18 August 2015 at 21:45:42 UTC, rsw0x wrote:
>> If you want D to have a GC, you have to design the language around having a GC. Right now, D could be likened to using C++ with Boehm.
>
> The irony is that most GC-related complaints are the exact opposite - that the language depends too much on the GC.

Phobos relies on the GC, but the language itself is not designed around being GC friendly.
August 18, 2015
On 18 August 2015 at 23:25, Walter Bright via Digitalmars-d <digitalmars-d@puremagic.com> wrote:

> On 8/18/2015 1:47 PM, deadalnix wrote:
>
>> Realistically, D does not have the manpower required to reach the same
>> level of optimization, and has many higher-impact tasks to spend that
>> manpower on.
>
> dmd also does a sludge of patterns. I'm just looking for a few that would significantly impact the result.

Speculative devirtualization? http://hubicka.blogspot.de/2014/02/devirtualization-in-c-part-4-analyzing.html


August 18, 2015
On Tuesday, 18 August 2015 at 21:55:26 UTC, Walter Bright wrote:
> On 8/18/2015 2:33 PM, deadalnix wrote:
>> There is none. There are a ton of 0.5% ones that add up to the 30% difference.
>
> I regard a simple pattern that nets 0.5% as quite a worthwhile win. That's only 60 of those to make up the difference.
>
> If you've got any that you know of that would net 0.5% for dmd, lay it on me!
>
>
>> If I were to bet on what would impact DMD's performance the most, I'd go for SRAO, and
>> an inliner in the middle end that works bottom-up:
>>   - Explore the call graph top-down, optimizing functions along the way.
>>   - Backtrack bottom-up and check for inlining opportunities.
>>   - Rerun optimizations on the functions that inlining was done in.
>
> That's how the inliner already works. The data flow analysis optimizer also runs repeatedly as each optimization exposes more possibilities. The register allocator also runs repeatedly.
>

My understanding is that the inliner is in the front end. It definitely does not work the way I describe it here.

> (I am unfamiliar with the term SRAO. Google comes up with nothing for it.)
>

That is because I made a typo, sorry.

It stands for 'scalar replacement of aggregates', aka SROA (not SRAO).

You can find literature on the subject.

>> It requires a fair amount of tweaking and probably needs a way for the backends to
>> provide a cost heuristic for various functions,
>
> A cost heuristic is already used for the inliner (it's in the front end, in 'inline.d'). A cost heuristic is also used for the register allocator.

I'm not sure how this can be made to work the way I describe it in the frontend.

August 18, 2015
On Tuesday, 18 August 2015 at 21:48:13 UTC, rsw0x wrote:
> D's current GC could see improvements, but it will never ever catch up to the GC of any other major language without changes to the language itself.

There have been lots and lots of forum discussions in the past few years about how the language could be changed to support better GC latency, but… Walter does not want to carry ownership information in pointer types. Which I don't think is all that consistent, given const and shared, but there you go.

So the most likely path to success is to minimize the role of the GC and leave it as a "prototyping tool".
August 18, 2015
On Tuesday, 18 August 2015 at 21:58:58 UTC, Walter Bright wrote:
> On 8/18/2015 2:32 PM, Joseph Rushton Wakeling wrote:
>> However true that may be in general, those almost certainly aren't the reasons
>> why ddmd benchmarks 30% slower than dmd.  I would suspect that particular speed
>> difference is heavily backend-dependent.
>
> That's exactly why I started this thread. To find out why, and if there are low cost steps we can take to close that gap.

Well, yes, quite.  I was backing up your rationale, even if I disagree with your prioritizing these concerns at this stage of the dmd => ddmd transition.
August 18, 2015
On 8/18/2015 3:04 PM, deadalnix wrote:
> My understanding is that the inliner is in the front end. It definitely does
> not work the way I describe it here.

But it uses a cost function and runs repeatedly until there is no more inlining to be done.


> It stands for 'scalar replacement of aggregates', aka SROA (not SRAO).
> You can find literature on the subject.

I'm aware of the technique, though I didn't know the name for it (I always called it "horizontal slicing of aggregates"). It is one optimization that dmd could significantly benefit from, and wouldn't be too hard to implement. The ubiquitous use of ranges makes it a much more important optimization.

I suspect it would net much more than 0.5%.
August 18, 2015
On Tuesday, 18 August 2015 at 22:14:52 UTC, Walter Bright wrote:
> On 8/18/2015 3:04 PM, deadalnix wrote:
>> My understanding is that the inliner is in the front end. It definitely does
>> not work the way I describe it here.
>
> But it uses a cost function and runs repeatedly until there is no more inlining to be done.
>
>

You need to have the optimizations done on the way down, or the cost function only tells you about the unoptimized cost, which doesn't really mean much, especially since the code can become fairly redundant after inlining is done.

>> It stands for 'scalar replacement of aggregates', aka SROA (not SRAO).
>> You can find literature on the subject.
>
> I'm aware of the technique, though I didn't know the name for it (I always called it "horizontal slicing of aggregates"). It is one optimization that dmd could significantly benefit from, and wouldn't be too hard to implement. The ubiquitous use of ranges makes it a much more important optimization.
>
> I suspect it would net much more than 0.5%.

Yes. It would make many things apparent to other parts of the optimizer.
August 18, 2015
On Tuesday, 18 August 2015 at 21:43:44 UTC, Walter Bright wrote:
> On 8/18/2015 1:33 PM, Jacob Carlborg wrote:
>> There's profile guided optimization, which LLVM supports.
>
> dmd does have that to some extent. If you run with -profile, the profiler will emit a trace.def file. This is a script which can be fed to the linker which controls the layout of functions in the executable. The layout is organized so that strongly connected functions reside in the same page, minimizing swapping and maximizing cache hits.
>
> Unfortunately, nobody makes use of it, which makes me reluctant to expend further effort on PGO.
>
>   http://www.digitalmars.com/ctg/trace.html
>
> I wonder how many people actually use the llvm profile guided optimizations. I suspect very, very few.

People are lazy, and if it takes more than one click, they won't use it. It's just like unit testing: everyone agrees it's good to write tests, but nobody does. When you put unit testing in the compiler, more people write tests. PGO is awesome, but it needs to be made much simpler before people will use it every day.