April 26, 2015 [hackathon] My and Walter's ideas
I've been on this project at work that took the "functionality first, performance later" approach. It has a Java-style design of using class objects throughout and allocating them casually.

So now we have a project that works but is kinda slow. Profiling shows it spends a fair amount of time collecting garbage (which is easily visible by just looking at the code). Yet there is no tooling that tells us where most allocations happen.

Since it's trivial to make D applications a lot faster by avoiding the big-ticket allocations and leaving only the peanuts for the heap, there should be a simple tool to e.g. count how many objects of each type were allocated at the end of a run. This is the kind of tool that should be embarrassingly easy to turn on and use to draw great insights about the allocation behavior of any application.

A first shot is a really simple proof of concept at http://dpaste.dzfl.pl/8baf3a2c4a38. I manually replaced all "new T(args)" with "make!T(args)" and all "new T[n]" with "makeArray!T(n)". I didn't even worry about concatenations and array literals in this first approximation.

The support code collects in a thread-local table the locus of each allocation (file, line, and function of the caller) alongside the type created. Total bytes allocated for each locus are tallied.

When a thread exits, its table is dumped wholesale into a global table, which is synchronized. It's fine to use a global lock because the global table is only updated when a thread exits, not with each increment.

When the process exits, the global table is printed out.

This was extraordinarily informative, essentially taking us from "well, let's grep for new and reduce those, and replace class with struct where sensible" to a much more focused approach that targeted the top allocation sites. The distribution is Pareto, e.g. the locus with the most allocations accounts for four times more bytes than the second, and the top few are responsible for statistically all allocations that matter. I'll post some sample output soon.

Walter will help me with hooking the places that allocate in the runtime (the new operator, concatenations, array literals, etc.) to allow building this into druntime. At the end we'll write an article about all this.

Andrei
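A minimal sketch of what such a `make!T`/`makeArray!T` pair could look like. The names, table layout, and merge-on-thread-exit mechanism below are my guesses at the approach described, not the actual code from the dpaste link:

```d
// Hypothetical sketch of locus-tracking allocation wrappers; names and
// layout are illustrative, not the actual proof-of-concept code.
module alloctrace;

import core.sync.mutex : Mutex;

struct Locus
{
    string file;
    size_t line;
    string func;
    string type;
}

// Per-thread tally: no locking on the hot path.
size_t[Locus] tlsBytes;

// Global tally, merged into only when a thread exits.
__gshared size_t[Locus] globalBytes;
__gshared Mutex globalLock;

shared static this() { globalLock = new Mutex; }

T make(T, Args...)(auto ref Args args,
        string file = __FILE__, size_t line = __LINE__,
        string func = __FUNCTION__) if (is(T == class))
{
    // Charge the caller's locus with the instance size, then allocate.
    tlsBytes[Locus(file, line, func, T.stringof)]
        += __traits(classInstanceSize, T);
    return new T(args);
}

T[] makeArray(T)(size_t n,
        string file = __FILE__, size_t line = __LINE__,
        string func = __FUNCTION__)
{
    tlsBytes[Locus(file, line, func, T.stringof ~ "[]")] += T.sizeof * n;
    return new T[n];
}

// Thread-local module destructor: runs when the thread ends, dumping the
// thread's table into the global one under the lock.
static ~this()
{
    globalLock.lock();
    scope (exit) globalLock.unlock();
    foreach (locus, bytes; tlsBytes)
        globalBytes[locus] += bytes;
}
```

Because `__FILE__`, `__LINE__`, and `__FUNCTION__` default arguments are evaluated at the call site in D, calling `make!Foo(1, 2)` instead of `new Foo(1, 2)` attributes the bytes to the caller's locus with no extra ceremony; at process exit the merged `globalBytes` table can be sorted by value and printed.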
April 26, 2015 Re: [hackathon] My and Walter's ideas
Posted in reply to Andrei Alexandrescu | On Sunday, 26 April 2015 at 03:40:43 UTC, Andrei Alexandrescu wrote:
> Since it's trivial to make D applications a lot faster by avoiding big
> ticket allocations and leave only the peanuts for the heap, there
> should be a simple tool to e.g. count how many objects of each type
> were allocated at the end of a run. This is the kind of tool that
> should be embarrassingly easy to turn on and use to draw great
> insights about the allocation behavior of any application.

https://github.com/CyberShadow/Diamond

Among other features:

> can display "top allocators" - call stacks that allocated most bytes

Unfortunately still D1-only.
April 26, 2015 Re: [hackathon] My and Walter's ideas
Posted in reply to Vladimir Panteleev | On 4/25/15 8:49 PM, Vladimir Panteleev wrote:
> On Sunday, 26 April 2015 at 03:40:43 UTC, Andrei Alexandrescu wrote:
>> Since it's trivial to make D applications a lot faster by avoiding big
>> ticket allocations and leave only the peanuts for the heap, there
>> should be a simple tool to e.g. count how many objects of each type
>> were allocated at the end of a run. This is the kind of tool that
>> should be embarrassingly easy to turn on and use to draw great
>> insights about the allocation behavior of any application.
>
> https://github.com/CyberShadow/Diamond
>
> Among other features:
>
>> can display "top allocators" - call stacks that allocated most bytes

(Enthusiasm rises)

> Unfortunately still D1-only.

(Enthusiasm decreases)

Andrei
April 26, 2015 Re: [hackathon] My and Walter's ideas
Posted in reply to Andrei Alexandrescu | On Sunday, 26 April 2015 at 03:58:44 UTC, Andrei Alexandrescu wrote:
> On 4/25/15 8:49 PM, Vladimir Panteleev wrote:
>> On Sunday, 26 April 2015 at 03:40:43 UTC, Andrei Alexandrescu wrote:
>>> Since it's trivial to make D applications a lot faster by avoiding big
>>> ticket allocations and leave only the peanuts for the heap, there
>>> should be a simple tool to e.g. count how many objects of each type
>>> were allocated at the end of a run. This is the kind of tool that
>>> should be embarrassingly easy to turn on and use to draw great
>>> insights about the allocation behavior of any application.
>>
>> https://github.com/CyberShadow/Diamond
>>
>> Among other features:
>>
>>> can display "top allocators" - call stacks that allocated most bytes
>
> (Enthusiasm rises)
>
>> Unfortunately still D1-only.
>
> (Enthusiasm decreases)
Maybe I should work on it for this hackathon. But I also have two other interesting D projects in the pipeline, much closer to being ready (or at least, announce-ready).
April 27, 2015 Re: [hackathon] My and Walter's ideas
Posted in reply to Andrei Alexandrescu | On 4/25/15 11:40 PM, Andrei Alexandrescu wrote:
> I've been on this project at work that took the "functionality first,
> performance later" approach. It has a Java-style approach of using class
> objects throughout and allocating objects casually.
>
> So now we have a project that works but is kinda slow. Profiling shows
> it spends a fair amount of time collecting garbage (which is easily
> visible by just looking at code). Yet there is no tooling that tells
> where most allocations happen.
>
> Since it's trivial to make D applications a lot faster by avoiding big
> ticket allocations and leave only the peanuts for the heap, there should
> be a simple tool to e.g. count how many objects of each type were
> allocated at the end of a run. This is the kind of tool that should be
> embarrassingly easy to turn on and use to draw great insights about the
> allocation behavior of any application.
>
> First shot is a really simple proof of concept at
> http://dpaste.dzfl.pl/8baf3a2c4a38. I manually replaced all "new
> T(args)" with "make!T(args)" and all "new T[n]" with "makeArray!T(n)". I
> didn't even worry about concatenations and array literals in the first
> approximation.
>
> The support code collects in a thread-local table the locus of each
> allocation (file, line, and function of the caller) alongside the
> type created. Total bytes allocated for each locus are tallied.
>
> When a thread exits, its table is dumped wholesale into a global table,
> which is synchronized. It's fine to use a global lock because the global
> table is only updated when a thread exits, not with each increment.
>
> When the process exits, the global table is printed out.
>
> This was extraordinarily informative essentially taking us from "well
> let's grep for new and reduce those, and replace class with struct where
> sensible" to a much more focused approach that targeted the top
> allocation sites. The distribution is Pareto, e.g. the locus with most
> allocations accounts for four times more bytes than the second, and the
> top few are responsible for statistically all allocations that matter.
> I'll post some sample output soon.
>
> Walter will help me with hooking places that allocate in the runtime
> (new operator, catenations, array literals etc) to allow building this
> into druntime. At the end we'll write an article about this all.
Everything to alter is in lifetime.d. It would be trivial to create this. The only thing is to have a malloc-based AA for tracking, so the tracking doesn't track itself (as that would likely be the most allocations of all!). Where's that std.allocator?

I think it's something that can easily be turned on via a runtime variable; since allocating is so expensive, the cost of checking a bool to see whether you should track would be nonexistent performance-wise.

Doing it by altering calls to new would be very invasive.

However, note that this wouldn't track the allocations the compiler does for closures. I don't know how that works, as there's no appropriate lifetime.d function for it. If we wanted a generic, comprehensive solution, that would need to be added.
-Steve
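The bool-guard and malloc-based-table ideas might combine like this. This is only a sketch: `traceAllocations` and `recordAlloc` are hypothetical names, not actual druntime hooks, and a real version would use a proper malloc-backed hash table (and a lock or thread-local storage for multithreaded use) rather than this linked list:

```d
// Hypothetical sketch: a cheap runtime switch guarding allocation tracking.
// recordAlloc would be called from the druntime allocation paths; storage
// is plain malloc, so the tracker never allocates through the GC it measures.
module tracktoggle;

import core.stdc.stdlib : malloc;
import core.stdc.string : strcmp;

__gshared bool traceAllocations; // e.g. set once at startup; off by default

struct Entry
{
    const(char)* file;
    size_t line;
    size_t bytes;
    Entry* next;
}

// Simple malloc-backed list standing in for a real hash table.
__gshared Entry* entries;

void recordAlloc(size_t bytes, const(char)* file, size_t line)
{
    if (!traceAllocations)
        return; // one predictable branch; negligible next to an allocation

    // Tally into an existing entry for this locus if we have one.
    for (auto e = entries; e !is null; e = e.next)
    {
        if (e.line == line && strcmp(e.file, file) == 0)
        {
            e.bytes += bytes;
            return;
        }
    }
    // First allocation from this locus: prepend a new malloc'd entry.
    auto e = cast(Entry*) malloc(Entry.sizeof);
    *e = Entry(file, line, bytes, entries);
    entries = e;
}
```

The design point is the one made above: since the check only runs when an allocation is already happening, the disabled case costs a single well-predicted branch, so the hook can ship enabled-by-flag in druntime without a measurable penalty.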
April 27, 2015 Re: [hackathon] My and Walter's ideas
Posted in reply to Steven Schveighoffer | On Monday, 27 April 2015 at 10:56:17 UTC, Steven Schveighoffer wrote:
> Everything to alter is in lifetime.d. It would be trivial to create this.

https://issues.dlang.org/show_bug.cgi?id=13988

> The only thing is to have a malloc-based AA for tracking

https://github.com/D-Programming-Language/druntime/blob/18d57ffe3eed8674ca2052656bb3f410084379f6/src/rt/util/container/hashtab.d

> However, note that this wouldn't track allocations that the compiler did for closures.

Plain _d_allocmemory.
April 27, 2015 Re: [hackathon] My and Walter's ideas
Posted in reply to Martin Nowak | On 4/27/15 7:10 AM, Martin Nowak wrote:
> On Monday, 27 April 2015 at 10:56:17 UTC, Steven Schveighoffer wrote:
>> The only thing is to have a malloc-based AA for tracking
>
> https://github.com/D-Programming-Language/druntime/blob/18d57ffe3eed8674ca2052656bb3f410084379f6/src/rt/util/container/hashtab.d

Sweet, that makes things REALLY trivial :)

>> However, note that this wouldn't track allocations that the compiler
>> did for closures.
>
> Plain _d_allocmemory.

OK, I wasn't sure how it worked. But this really doesn't help much for fine-grained statistics gathering. In what situations does this function get called by the compiler? If it's just for closures, we can lump all closure allocations together in one stat. If you see closures are your big nemesis, it may be time to redesign :)

-Steve
Copyright © 1999-2021 by the D Language Foundation