Thread overview: Re: Disappointing performance from DMD/Phobos
  Jun 26, 2018  Manu
  Jun 26, 2018  Nathan S.
  Jun 26, 2018  Basile B.
  Jun 26, 2018  Jonathan M Davis
  Jun 26, 2018  H. S. Teoh
  Jun 26, 2018  Manu
  Jun 26, 2018  kinke
  Jun 26, 2018  Jonathan M Davis
June 25, 2018
On Mon, 25 Jun 2018 at 19:10, Manu <turkeyman@gmail.com> wrote:
>
> Some code:
> ---------------------------------
> struct Entity
> {
>   enum NumSystems = 4;
>   struct SystemData
>   {
>     uint start, length;
>   }
>   SystemData[NumSystems] systemData;
>   @property uint systemBits() const { return systemData[].map!(e => e.length).sum; }
> }
> Entity e;
> e.systemBits(); // <- call the function, notice the codegen
> ---------------------------------
>
> This property sums 4 ints... that should be insanely fast. It should
> also be something like 5-8 lines of asm.
> Turns out, that call to sum() is eating 2.5% of my total perf
> (significant among a substantial workload), and the call tree is quite
> deep.
>
> Basically, the inliner tried, but failed to seal the deal, and leaves a call stack 7 levels deep.
>
> Pipeline programming is hip and also *recommended* D usage. The
> optimiser must do a good job. This is such a trivial work loop, with a
> constant length (4).
> I expect 3 integer adds to unroll and inline. A call-tree 7 levels
> deep is quite a ways from the mark.
>
> Maybe this is another instance of Walter's "phobos begat madness" observation?
> The unoptimised callstack is mental. Compiling with -O trims most of
> the noise in the call tree, but it fails to inline the remaining work,
> which ends up 7 levels down a redundant call-tree.

I optimised another major gotcha eating perf, and now this issue is taking 13% of my entire work time... bummer.
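
For anyone hitting the same wall: a hand-unrolled member sidesteps the map/sum call tree entirely. This is a sketch rather than code from the thread; the name systemBitsUnrolled is hypothetical, and static foreach needs a 2.076+ frontend.
----
// Drop-in alternative inside struct Entity. NumSystems is a compile-time
// constant, so static foreach unrolls this into four loads and three adds
// with no function calls for the inliner to chew on.
@property uint systemBitsUnrolled() const
{
    uint total = 0;
    static foreach (i; 0 .. NumSystems)
        total += systemData[i].length;
    return total;
}
----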
June 25, 2018
On Monday, June 25, 2018 19:10:17 Manu via Digitalmars-d wrote:
> Some code:
> ---------------------------------
> struct Entity
> {
>   enum NumSystems = 4;
>   struct SystemData
>   {
>     uint start, length;
>   }
>   SystemData[NumSystems] systemData;
>   @property uint systemBits() const { return systemData[].map!(e => e.length).sum; }
> }
> Entity e;
> e.systemBits(); // <- call the function, notice the codegen
> ---------------------------------
>
> This property sums 4 ints... that should be insanely fast. It should
> also be something like 5-8 lines of asm.
> Turns out, that call to sum() is eating 2.5% of my total perf
> (significant among a substantial workload), and the call tree is quite
> deep.
>
> Basically, the inliner tried, but failed to seal the deal, and leaves a call stack 7 levels deep.
>
> Pipeline programming is hip and also *recommended* D usage. The
> optimiser must do a good job. This is such a trivial work loop, with a
> constant length (4).
> I expect 3 integer adds to unroll and inline. A call-tree 7 levels
> deep is quite a ways from the mark.
>
> Maybe this is another instance of Walter's "phobos begat madness" observation? The unoptimised callstack is mental. Compiling with -O trims most of the noise in the call tree, but it fails to inline the remaining work, which ends up 7 levels down a redundant call-tree.

dmd's inliner is notoriously poor, but I don't know how much effort has really been put into fixing the problem. I do recall it being argued several times that inlining should only be done in the backend and that there shouldn't be an inliner in the frontend, but either way, the typical solution seems to be to use ldc instead of dmd if you really care about the performance of the generated binary.

I don't follow dmd PRs closely, but I get the impression that far more effort gets put into feature-related stuff and bug fixes than into performance improvements. Walter at least occasionally does performance improvements, but when he talks about them, it seems like a number of folks react negatively, thinking that his time would be better spent on features and the like and that folks can just use ldc for performance.

So, all in all, the result is not great for dmd's performance. I don't know what the solution is, though I agree that we're better off if dmd generates fast code in general even if it's not as good as what ldc does.

Regardless, if you can give simple test cases that clearly should be generating far better code than they are, then at least there's a clear target for improvement rather than just "dmd should generate faster code," so there's something clearly actionable.

- Jonathan M Davis

June 26, 2018
On Tuesday, 26 June 2018 at 02:20:37 UTC, Manu wrote:
> I optimised another major gotcha eating perf, and now this issue is taking 13% of my entire work time... bummer.

Without disagreeing with you, ldc2 optimizes this fine.

https://run.dlang.io/is/NJct6U
----
const @property uint onlineapp.Entity.systemBits():
	.cfi_startproc
	movl	4(%rdi), %eax
	addl	12(%rdi), %eax
	addl	20(%rdi), %eax
	addl	28(%rdi), %eax
	retq
----

That matches the hand-written ideal: SystemData is 8 bytes (two uints), so the four length fields sit at byte offsets 4, 12, 20, and 28 of the struct behind %rdi, and the whole property compiles to one load and three adds.
June 26, 2018
When I see "DMD" and "performance" in the same sentence, my first reaction is, "why aren't you using LDC or GDC?"

Seriously, doing performance measurements with DMD is a waste of time, because its optimizer has been proven time and again to be suboptimal (har!).  DMD frequently produces suboptimal code for things that ldc/gdc easily optimize away.  If you're doing anything where performance is important, use ldc/gdc.

My personal recommendation is, use dmd for compilation speed and/or latest features, use ldc/gdc for anything performance-related.


T

-- 
They pretend to pay us, and we pretend to work. -- Russian saying
June 26, 2018
On Mon, 25 Jun 2018 at 20:17, Jonathan M Davis via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
>
> dmd's inliner is notoriously poor,

I know, but it's still the reference compiler, and it should at least
do a reasonable job on the kind of D code that it's *recommended* that
users write.
That line of code is the sort of line that should showcase what you
can do with D that you can't do with other languages... but DMD will
lead you to believe that you can't do it with D either.
My point is, it's a really bad thing to present to users. DMD should
really care about that impression.

> but I don't know how much effort has
> really been put into fixing the problem. I do recall it being argued several
> times that inlining should only be done in the backend and that there shouldn't be
> an inliner in the frontend, but either way, the typical solution seems to be to use
> ldc instead of dmd if you really care about the performance of the generated
> binary.

I'm using unreleased 2.081, which isn't in LDC yet. Also, LDC seems to
have more problems with debuginfo than DMD.
Once LDC is on 2.081, I might have to flood their bugtracker with
debuginfo related issues.

> So, all in all, the result is not great for dmd's performance. I don't know what the solution is, though I agree that we're better off if dmd generates fast code in general even if it's not as good as what ldc does.

It's not so much that I expect perfect optimisation, but this is a
failure to detect a prolific and recommended pattern used in a lot of
D code.
I think extra care should be taken to make such common code at least
approach a user's expectations for codegen.

> Regardless, if you can give simple test cases that clearly should be generating far better code than they are, then at least there's a clear target for improvement rather than just "dmd should generate faster code," so there's something clearly actionable.

I'm pretty sure that's exactly what I did above...
Build that code; suggestion: don't generate a call stack 7 levels deep.
Ideally, observe inline code that adds 4 ints together.
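
To make such a report self-contained, the snippet can be packaged as a single file; everything below is the thread's own example plus the imports it needs, while the file name and build line in the comment are suggestions, not something specified in the thread.
----
// repro.d (hypothetical name) -- build with `dmd -O -inline repro.d` and
// inspect the code generated for systemBits.
import std.algorithm : map, sum;

struct Entity
{
    enum NumSystems = 4;
    struct SystemData
    {
        uint start, length;
    }
    SystemData[NumSystems] systemData;
    @property uint systemBits() const { return systemData[].map!(e => e.length).sum; }
}

void main()
{
    Entity e;
    auto bits = e.systemBits(); // ideal codegen: four loads, three adds, no calls
}
----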
June 26, 2018
On Tuesday, 26 June 2018 at 17:38:42 UTC, Manu wrote:
> I know, but it's still the reference compiler, and it should at least
> to a reasonable job at the kind of D code that it's *recommended* that
> users write.

I get your point, but IMO it's all about efficient allocation of the manpower we have. Spending it on improving the inliner/optimizer, or even adding ARM codegen as was recently suggested IIRC, would be very unwise; LDC and GDC are already there (and way beyond ;)), so I consider fixing bugs, tightening the language spec, improving tooling, druntime/Phobos, C++ interop (thx for your recent contributions there!) and so forth *way* more important than improving DMD codegen.
Not too long ago (when the backend was still closed-source), DMD as reference compiler wasn't provided by Linux package managers (but LDC and GDC were), so it's not like D would automatically imply DMD. I don't recall ever being interested in whether a compiler for some language was the 'reference' one or not.

> I'm using unreleased 2.081, which isn't in LDC yet.

We're (unofficially) at 2.081-beta.2 and fully green except for a very minor OSX debuginfo thingy, see https://github.com/ldc-developers/ldc/pull/2752. So expect a 1.11 beta as soon as 2.081 is released.

> Also, LDC seems to have more problems with debuginfo than DMD.
> Once LDC is on 2.081, I might have to flood their bugtracker with debuginfo related issues.

Looking forward to that. ;) - CodeView debuginfo support is pretty new in LLVM (and not backed by Microsoft AFAIK, plus there have been regressions with LLVM 6). With LLVM 7, there's a new debuginfo intrinsic which I hope will allow us to significantly improve DI for LDC.
June 26, 2018
On Tuesday, June 26, 2018 10:38:42 Manu via Digitalmars-d wrote:
> On Mon, 25 Jun 2018 at 20:17, Jonathan M Davis via Digitalmars-d
>
> <digitalmars-d@puremagic.com> wrote:
> > dmd's inliner is notoriously poor,
>
> I know, but it's still the reference compiler, and it should at least
> do a reasonable job on the kind of D code that it's *recommended* that
> users write.
> That line of code is the sort of line that should showcase what you
> can do with D that you can't do with other languages... but DMD will
> lead you to believe that you can't do it with D either.
> My point is, it's a really bad thing to present to users. DMD should
> really care about that impression.

I think that it mainly comes down to priorities, and performance is not at the top of the list for the work being done on dmd. It's desirable to be sure, but with everything else that needs to be done, it tends to fall by the wayside even when it arguably shouldn't.

> > but I don't know how much effort has
> > really been put into fixing the problem. I do recall it being argued
> > several times that inlining should only be done in the backend and that
> > there shouldn't be an inliner in the frontend, but either way, the typical solution
> > seems to be to use ldc instead of dmd if you really care about the
> > performance of the generated binary.
>
> I'm using unreleased 2.081, which isn't in LDC yet. Also, LDC seems to
> have more problems with debuginfo than DMD.
> Once LDC is on 2.081, I might have to flood their bugtracker with
> debuginfo related issues.

Well, if you're using the newest stuff, then sadly, that tends to mean that you're stuck with dmd and that your performance just isn't going to be as good, and I doubt that that will be fixed anytime soon (though specific cases may be improved).

> > Regardless, if you can give simple test cases that clearly should be generating far better code than they are, then at least there's a clear target for improvement rather than just "dmd should generate faster code," so there's something clearly actionable.
>
> I'm pretty sure that's exactly what I did above...
> Build that code; suggestion: don't generate a call stack 7 levels deep.
> Ideally, observe inline code that adds 4 ints together.

I wasn't trying to say that you didn't have a test case. My point was that if you have an actual, reasonably-sized test case (which you do), then a bug report can be opened for that particular example, and it has some chance of being fixed. Too often with stuff like this, folks complain that "dmd's inliner is bad" or that "dmd's error messages aren't clear enough" without giving concrete examples, which makes it effectively unactionable even if someone is trying to spend time on it (Walter has complained in the past about folks saying that something about dmd isn't good enough without giving concrete examples, which makes it really hard for him to fix the problem).

- Jonathan M Davis

June 26, 2018
On Tuesday, 26 June 2018 at 02:20:37 UTC, Manu wrote:
> On Mon, 25 Jun 2018 at 19:10, Manu <turkeyman@gmail.com> wrote:
>>
>> Some code:
>> ---------------------------------
>> struct Entity
>> {
>>   enum NumSystems = 4;
>>   struct SystemData
>>   {
>>     uint start, length;
>>   }
>>   SystemData[NumSystems] systemData;
>>   @property uint systemBits() const { return systemData[].map!(e => e.length).sum; }
>> }
>> Entity e;
>> e.systemBits(); // <- call the function, notice the codegen
>> ---------------------------------
>>
>> This property sums 4 ints... that should be insanely fast. It should
>> also be something like 5-8 lines of asm.
>> Turns out, that call to sum() is eating 2.5% of my total perf
>> (significant among a substantial workload), and the call tree is quite
>> deep.
>>
>> Basically, the inliner tried, but failed to seal the deal, and leaves a call stack 7 levels deep.
>>
>> Pipeline programming is hip and also *recommended* D usage. The
>> optimiser must do a good job. This is such a trivial work loop, with a
>> constant length (4).
>> I expect 3 integer adds to unroll and inline. A call-tree 7 levels
>> deep is quite a ways from the mark.
>>
>> Maybe this is another instance of Walter's "phobos begat madness" observation?
>> The unoptimised callstack is mental. Compiling with -O trims most of
>> the noise in the call tree, but it fails to inline the remaining work,
>> which ends up 7 levels down a redundant call-tree.
>
> I optimised another major gotcha eating perf, and now this issue is taking 13% of my entire work time... bummer.

The inliner bug is that nested types are not handled.
Here are a repro and the relevant issue:

https://run.dlang.io/is/tVYrh9
https://issues.dlang.org/show_bug.cgi?id=16360

Even a `static` struct inside (no context ptr) is enough to bug the inliner.
Bye-bye Voldemort types, see ya.
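
To make the failure mode concrete, here is a sketch of the pattern issue 16360 describes; the names are hypothetical. The function returns a type declared inside it (a Voldemort type), and per the issue dmd's inliner gives up on it even though the struct is `static` and carries no context pointer.
----
// Hypothetical example: callers of makeCounter receive a Voldemort type,
// since Counter cannot be named outside the function.
auto makeCounter(uint start)
{
    static struct Counter // static: no hidden context pointer, yet per
    {                     // issue 16360 dmd's inliner still bails out.
        uint value;
        uint next() { return value++; }
    }
    return Counter(start);
}
----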