[xmlp] the recent garbage collector performance improvements
February 01, 2012
Last night I tried loading a ~20 megabyte xml file using xmlp (via the
DocumentBuilder.LoadFile function) and a recent dmd build, and found that it
took ~48 seconds to complete, which is rather poor.
I tried running it through a profiler, and that said that almost all the
runtime was spent inside the garbage collector.

I then tried the same test using the latest Git versions of dmd/druntime (with pull request 108 merged in), and that took less than 10 seconds. This is a rather nice improvement, though still somewhat on the slow side.

Some profiler numbers, if anyone is interested:

Old version:
Gcxfullcollect: 31.14 seconds, 69.26% runtime.
Gcxmark: 4.84 seconds, 10.77% runtime.
Gcxfindpool: 2.10 seconds, 4.67% runtime.

New version:
Gcxmark: 11.67 seconds, 50.77% runtime.
Gcxfindpool: 3.58 seconds, 15.55% runtime.
Gcxfullcollect: 1.69 seconds, 7.37% runtime.

(Assuming that Sleepy is giving me accurate numbers. The new version is
definitely faster though).
February 01, 2012
Interesting.  I'm glad my improvements seem to matter in the real world, though I'm thoroughly impressed with the amount of speedup.  Even the small allocation benchmark that I was optimizing only sped up by ~50% from 2.057 to 2.058 overall and ~2x in collection time.  I'd be very interested if you could make a small, self-contained test program to use as a benchmark.

GC performance is one of D's biggest weak spots, so it would probably be a good bit of marketing to show that the performance is substantially better than it used to be even if it's not great yet.  Over the past year I've been working on and off at speeding it up.  It's now at least ~2x faster than it was last year at this time on every benchmark I've tried and up to several hundred times faster in the extreme case of huge allocations.

On Wednesday, 1 February 2012 at 18:33:58 UTC, Richard Webb wrote:
> Last night I tried loading a ~20 megabyte xml file using xmlp (via the
> DocumentBuilder.LoadFile function) and a recent dmd build, and found that it
> took ~48 seconds to complete, which is rather poor.
> I tried running it through a profiler, and that said that almost all the
> runtime was spent inside the garbage collector.
>
> I then tried the same test using the latest Git versions of dmd/druntime (with
> pull request 108 merged in), and that took less than 10 seconds. This is a
> rather nice improvement, though still somewhat on the slow side.
>
> Some profiler numbers, if anyone is interested:
>
> Old version:
> Gcxfullcollect: 31.14 seconds, 69.26% runtime.
> Gcxmark: 4.84 seconds, 10.77% runtime.
> Gcxfindpool: 2.10 seconds, 4.67% runtime.
>
> New version:
> Gcxmark: 11.67 seconds, 50.77% runtime.
> Gcxfindpool: 3.58 seconds, 15.55% runtime.
> Gcxfullcollect: 1.69 seconds, 7.37% runtime.
>
> (Assuming that Sleepy is giving me accurate numbers. The new version is
> definitely faster though).


February 01, 2012
XML is a prime example of a shallow tree structure,
so it's the prime target for the recent optimization.
February 01, 2012
On Wednesday, 1 February 2012 at 18:33:58 UTC, Richard Webb wrote:
> Last night I tried loading a ~20 megabyte xml file using xmlp (via the
> DocumentBuilder.LoadFile function) and a recent dmd build, and found that it
> took ~48 seconds to complete, which is rather poor.
> I tried running it through a profiler, and that said that almost all the
> runtime was spent inside the garbage collector.

Interesting. I've been using XmlVisitor on two 30 MB files and one 10 MB file, with a load time of about 30 seconds for all three. That's actually an improvement over an older version of xmlp, which took 60 seconds. Not sure if DocumentBuilder would be much different.
February 01, 2012
On 01/02/2012 20:02, Jesse Phillips wrote:
>
> Interesting, I've been using XmlVisitor on two 30meg and one 10meg, load
> time for all files being 30secs, this is actually an improvement from an
> older version of xmlp which took 60secs. Not sure if DocumentBuilder
> would be much different.


For reference, the file I was testing with has ~50000 root nodes, each of which has several children.
The number of nodes seems to have a much larger effect on the speed than the amount of data.


I tried it with an old version of KXML and that seems somewhat faster, although the latest version doesn't seem to build with 2.058 and it's a more basic library.
February 01, 2012
On Wednesday, 1 February 2012 at 22:53:11 UTC, Richard Webb wrote:
> For reference, the file I was testing with has ~50000 root nodes, each of which has several children.
> The number of nodes seems to have a much larger effect on the speed than the amount of data.
>

Sounds about right.  For very small allocations sweeping time dominates the total GC time.  You can see the breakdown at https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 .  The Tree1 benchmark is the very small allocation benchmark.  Sweeping takes time linear in the number of memory blocks allocated and, for blocks <1 page, constant time in the size of the blocks.
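To illustrate the pattern (a hypothetical sketch, not the actual Tree1 benchmark code): each iteration below allocates a tiny heap block, so a collection has to mark and sweep one block per node, regardless of how little data each node holds.

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writeln;
import core.memory : GC;

class Node { Node next; int val; }

// Build a long chain of tiny GC-allocated nodes: one heap block per node,
// so sweep cost scales with the node count, not with the bytes stored.
Node buildChain(int count) {
    Node head;
    foreach (i; 0 .. count) {
        auto n = new Node;
        n.val = i;
        n.next = head;
        head = n;
    }
    return head;
}

void main() {
    auto sw = StopWatch(AutoStart.yes);
    auto head = buildChain(1_000_000);
    GC.collect();   // force a full mark/sweep while the whole chain is live
    writeln("elapsed: ", sw.peek.total!"msecs", " ms, head.val = ", head.val);
}
```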
February 01, 2012
On Thu, Feb 02, 2012 at 12:09:01AM +0100, dsimcha wrote:
> On Wednesday, 1 February 2012 at 22:53:11 UTC, Richard Webb wrote:
> >For reference, the file i was testing with has ~50000 root nodes,
> >each of which has several children.
> >The number of nodes seems to have a much larger effect on the
> >speed that the amount of data.
> >
> 
> Sounds about right.  For very small allocations sweeping time dominates the total GC time.  You can see the breakdown at https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 . The Tree1 benchmark is the very small allocation benchmark. Sweeping takes time linear in the number of memory blocks allocated and, for blocks <1 page, constant time in the size of the blocks.

Out of curiosity, is there a way to optimize for the "many small allocations" case? E.g., if a function allocates, as temporary storage, a tree with a large number of nodes, which becomes garbage when it returns. Perhaps a way to sweep the entire space used by the tree in one go?

Not sure if such a thing is possible.


T

-- 
Tell me and I forget. Teach me and I remember. Involve me and I understand. -- Benjamin Franklin
February 01, 2012
On 2/2/12 12:44 AM, H. S. Teoh wrote:
> Out of curiosity, is there a way to optimize for the "many small
> allocations" case? E.g., if a function allocates, as temporary storage,
> a tree with a large number of nodes, which becomes garbage when it
> returns. Perhaps a way to sweep the entire space used by the tree in one
> go?

The easiest way would be to use a specialized allocator for this – I'm thinking of David Simcha's proposed RegionAllocator (formerly TempAlloc).
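For illustration, here is a toy region allocator in that spirit (hypothetical; this is not the API of the proposed RegionAllocator): allocations are bump-pointer carvings from one buffer, and the entire region is released in a single step, so nothing has to be swept per node.

```d
// Toy region allocator sketch -- not the actual proposed RegionAllocator API.
// Everything carved out of the region dies together when release() is called.
struct Region {
    private ubyte[] buf;
    private size_t used;

    this(size_t capacity) { buf = new ubyte[capacity]; }

    // Carve n bytes out of the region; no per-allocation bookkeeping.
    void[] allocate(size_t n) {
        immutable aligned = (n + 15) & ~cast(size_t) 15;  // 16-byte slots
        assert(used + aligned <= buf.length, "region exhausted");
        auto slice = buf[used .. used + n];
        used += aligned;
        return slice;
    }

    // Discard every allocation at once -- the whole point of a region.
    void release() { used = 0; }
}
```

Usage would look like `auto r = Region(1 << 20); auto mem = r.allocate(64); /* ... */ r.release();` -- the tree's nodes would live in `buf` and vanish together, without the collector ever tracing them individually.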

David
February 02, 2012
dsimcha has done good work on the GC. After this change I have seen that some of my GC-heavy programs are just a bit slower, while most of them are significantly faster, so overall it's a good improvement. In my opinion this is one of the best things that DMD 2.058 brings (the short lambda syntax is another one; I am appreciating it a lot. Sometimes syntactic sugar matters).

Still, even with dsimcha's improvements, I think the current D GC is almost a toy compared to some of the more advanced GCs you can find around. So is it right to work on small (but useful) incremental improvements to the current GC, instead of creating a fully redesigned D GC (a large undertaking)? I don't know. But surely even incremental improvements are welcome, because there are (or were) enough low-hanging fruits to pick.

I don't know how the current D GC's performance scales to heaps of 5 or 30 GB composed mostly of small objects. From what I've seen this is a hard problem even for almost-state-of-the-art GCs.


David Nadlinger:

> The easiest way would be to use a specialized allocator for this – I'm thinking of David Simcha's proposed RegionAllocator (formerly TempAlloc).

I'd like a simple way to define a hierarchy of such allocations (and when you free a node of the tree, the whole sub-tree gets deallocated).

Bye,
bearophile
February 02, 2012
On Wednesday, 1 February 2012 at 23:43:24 UTC, H. S. Teoh wrote:
> Out of curiosity, is there a way to optimize for the "many small
> allocations" case? E.g., if a function allocates, as temporary storage,
> a tree with a large number of nodes, which becomes garbage when it
> returns. Perhaps a way to sweep the entire space used by the tree in one
> go?
>
> Not sure if such a thing is possible.
>
>
> T

My RegionAllocator is probably the best thing for this if the lifetime is deterministic as you describe.  I rewrote the Tree1 benchmark using RegionAllocator a while back just for comparison.  D Tree1 + RegionAllocator had comparable speed to a Java version of Tree1 run under HotSpot.  (About 6 seconds on my box vs. in the low 30s for Tree1 with the 2.058 GC.)

If all the objects are going to die at the same time but not at a deterministic time, you could just allocate a big block from the GC and place class instances in it using emplace().
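A rough sketch of that emplace() idea, assuming all the objects share one lifetime (the `Node` class and the slot layout here are made up for illustration):

```d
import std.conv : emplace;

class Node {
    int val;
    this(int v) { val = v; }
}

// Place `count` Node instances into a single GC block, so the collector
// tracks one big allocation instead of one block per object.
Node[] makeNodes(int count) {
    enum rawSize = __traits(classInstanceSize, Node);
    enum slot = (rawSize + 15) & ~cast(size_t) 15;  // keep each slot 16-byte aligned
    auto block = new void[slot * count];
    auto nodes = new Node[count];
    foreach (i; 0 .. count)
        nodes[i] = emplace!Node(block[i * slot .. i * slot + rawSize], i);
    return nodes;
}
```

Note that none of the `Node` instances can outlive `block`: when the block becomes garbage, every object placed in it goes with it, which is exactly the shared-lifetime assumption above.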