Thread overview
[Issue 5623] New: Slow GC with large heaps
Feb 20, 2011  David Simcha
Feb 20, 2011  David Simcha
Feb 20, 2011  Walter Bright
Feb 20, 2011  David Simcha
Feb 21, 2011  David Simcha
Feb 21, 2011  David Simcha
Feb 22, 2011  David Simcha
Feb 22, 2011  Sean Kelly
Feb 24, 2011  David Simcha
Feb 24, 2011  David Simcha
Feb 25, 2011  David Simcha
Aug 27, 2011  David Simcha
February 20, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=5623

           Summary: Slow GC with large heaps
           Product: D
           Version: D2
          Platform: Other
        OS/Version: Windows
            Status: NEW
          Keywords: patch, performance
          Severity: normal
          Priority: P2
         Component: druntime
        AssignedTo: nobody@puremagic.com
        ReportedBy: dsimcha@yahoo.com


--- Comment #0 from David Simcha <dsimcha@yahoo.com> 2011-02-20 12:10:57 PST ---
Created an attachment (id=918)
The patch.

Here's a GC benchmark and its performance on the stock GC.

import std.stdio, std.datetime, core.memory, std.conv;

void main(string[] args) {
    if(args.length < 2) {
        stderr.writeln("Need size.");
        return;
    }

    immutable mul = to!size_t(args[1]);
    auto ptr = GC.malloc(mul * 1_048_576, GC.BlkAttr.NO_SCAN);

    auto sw = StopWatch(AutoStart.yes);
    GC.collect();
    immutable msec = sw.peek.msecs;
    writefln("Collected a %s megabyte heap in %s milliseconds.",
        mul, msec);
}

Outputs for various sizes (pre-patch):

Collected a 10 megabyte heap in 1 milliseconds.
Collected a 50 megabyte heap in 4 milliseconds.
Collected a 200 megabyte heap in 16 milliseconds.
Collected a 500 megabyte heap in 41 milliseconds.
Collected a 1000 megabyte heap in 80 milliseconds.
Collected a 5000 megabyte heap in 397 milliseconds.
Collected a 10000 megabyte heap in 801 milliseconds.
Collected a 30000 megabyte heap in 2454 milliseconds.
Collected a 50000 megabyte heap in 4096 milliseconds.

Attached is a patch that splits the GC's pools into separate large- and small-object pools, stores the GC bits for large-object pools only at page-size offsets (rather than 16-byte offsets), and stores, rather than recomputes, the offsets of B_PAGEPLUS pages from their owning B_PAGE page.  So far, all unit tests for Phobos, dstats and std.parallelism/parallelfuture pass with this enabled.
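For anyone who doesn't want to read the patch itself, here is a minimal sketch of the B_PAGEPLUS bookkeeping described above.  All names here (LargePool, offsetToBlockStart, findBlockStart, the PAGESIZE value) are illustrative assumptions, not the patch's actual symbols:

import std.stdio;

enum PAGESIZE = 4096;

struct LargePool
{
    void* baseAddr;
    // For page i: 0 means the page starts a block (B_PAGE); a positive
    // value is the distance, in pages, back to the owning B_PAGE page
    // (B_PAGEPLUS).  Stored rather than recomputed, so finding a block's
    // start from an interior pointer is O(1) instead of a linear scan.
    size_t[] offsetToBlockStart;

    // Find the start of the block containing an interior pointer.
    void* findBlockStart(void* p)
    {
        immutable pageNo = (p - baseAddr) / PAGESIZE;
        return baseAddr + (pageNo - offsetToBlockStart[pageNo]) * PAGESIZE;
    }
}

void main()
{
    auto mem = new ubyte[](16 * PAGESIZE);
    auto pool = LargePool(mem.ptr, new size_t[16]);

    // Pretend pages 4..9 form one six-page block: page 4 is B_PAGE,
    // pages 5..9 are B_PAGEPLUS with stored offsets 1..5.
    foreach (i; 5 .. 10)
        pool.offsetToBlockStart[i] = i - 4;

    assert(pool.findBlockStart(mem.ptr + 7 * PAGESIZE + 123)
            == mem.ptr + 4 * PAGESIZE);
    writeln("found block start in O(1)");
}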

Here are the new benchmarks (post-patch):

Collected a 10 megabyte heap in 0 milliseconds.
Collected a 50 megabyte heap in 0 milliseconds.
Collected a 250 megabyte heap in 1 milliseconds.
Collected a 500 megabyte heap in 0 milliseconds.
Collected a 1000 megabyte heap in 1 milliseconds.
Collected a 5000 megabyte heap in 3 milliseconds.
Collected a 10000 megabyte heap in 6 milliseconds.
Collected a 30000 megabyte heap in 16 milliseconds.
Collected a 50000 megabyte heap in 26 milliseconds.

--- Comment #1 from David Simcha <dsimcha@yahoo.com> 2011-02-20 12:12:02 PST ---
Created an attachment (id=919)
The whole gcx.d file, in case you have trouble applying the patch.

Here's the whole gcx.d file, in case you have trouble applying the patch. (These things never quite work for me.)

Walter Bright <bugzilla@digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugzilla@digitalmars.com


--- Comment #2 from Walter Bright <bugzilla@digitalmars.com> 2011-02-20 13:03:57 PST ---
More explanation from David, from the newsgroup:

I've found a way to speed up the GC massively on large heaps without excessive
ripple effects.  Technically it's still O(N), but with about a hundredfold
smaller constant in the case of large heaps where most of the heap is not
scanned.  The O(N) term (where N is the total size of the heap) now has such a
small constant that, for almost all practical purposes, the GC is O(S) (where
S is the size of the scanned portion of the heap).  It also no longer has an
O(N^2) pathological case (which I had discovered while reading the code).

So far all unittests for Phobos, dstats and std.parallelism/parallelfuture pass with this enabled.  Please test some other code so we can wring out the corner cases in time for the next release.

Basically all I did was diverge the Pool struct slightly into large and small object sub-varieties.  The large object sub-variety is used to allocate objects of at least a page.  It only stores gcbits at page-size offsets, and tracks the offsets of B_PAGEPLUS bins from the nearest B_PAGE bin so that they can be found in O(1).

I also added a field to the Pool struct so that the number of free pages in a pool can be tracked in O(1).  This should drastically lessen the time it takes to perform large allocations on large heaps.  Right now a free memory region is found by a linear search through the pools in the case of large allocations, and unfortunately I don't see any easy way to fix that.  This patch at least allows a large number of pools to be short-circuited when a pool doesn't have enough free space in total, let alone enough contiguous space.
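To make that short-circuit concrete, here is a hedged sketch.  Pool, npages, freepages, scanForContiguous and allocPages are illustrative stand-ins for the real druntime code, not its actual fields and functions:

struct Pool
{
    size_t npages;     // total pages in this pool
    size_t freepages;  // maintained incrementally on alloc/free, O(1) to read

    // Placeholder for the (still linear) scan for n contiguous free pages;
    // returns size_t.max if no such run exists.
    size_t scanForContiguous(size_t n) { return size_t.max; }
}

// Find the index of a pool that can host an n-page allocation.
size_t allocPages(Pool[] pools, size_t n)
{
    foreach (i, ref pool; pools)
    {
        // Cheap O(1) rejection: if the pool lacks n free pages in total,
        // it certainly lacks n *contiguous* free pages, so skip the scan.
        if (pool.freepages < n)
            continue;
        if (pool.scanForContiguous(n) != size_t.max)
            return i;
    }
    return size_t.max;  // no pool fits; the caller must create a new one
}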

bearophile_hugs@eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bearophile_hugs@eml.cc


--- Comment #3 from bearophile_hugs@eml.cc 2011-02-20 13:19:15 PST ---
A very simple benchmark, in D and Java. You could run the D version with and without your patch, then the Java version, and show the three timings, with N = 15, 18, or more:

http://codepad.org/yMGK34cb
http://codepad.org/DIOeYn6p

--- Comment #4 from David Simcha <dsimcha@yahoo.com> 2011-02-20 14:48:45 PST ---
(In reply to comment #3)
> A very simple benchmark, in D and Java. You may use the D version with and without your patch, and then the Java version too, and show the three timings, with N = 15 or 18 or more:
> 
> http://codepad.org/yMGK34cb
> http://codepad.org/DIOeYn6p

Using n = 18:

D (with patch):  62 seconds
D (without patch):  68 seconds
Java:  6 seconds

I'm not sure what this tells us that we didn't already know, though.  We already knew Java's GC is much better than D's.  My patch is meant to address issues with large object allocation, not small object allocation.  (Large is defined as at least one full page.  If no large objects are allocated for the duration of the program, my patch has virtually no effect.)  This benchmark does tons of small object allocation and no large object allocation.  The small difference between the with-patch and without-patch timings may even just be noise.

--- Comment #5 from bearophile_hugs@eml.cc 2011-02-20 15:07:17 PST ---
(In reply to comment #4)

> I'm not sure what this told us that we didn't already know, though.

Thank you for the timings. This synthetic benchmark tells us something important: in a different situation (an important one, because the benchmark, despite being so simple, is quite revealing), your patch doesn't make the GC significantly worse, and may even improve things a bit.

--- Comment #6 from bearophile_hugs@eml.cc 2011-02-20 15:11:18 PST ---
(In reply to comment #4)

> We already knew Java's GC is much better than D's.

That's the purpose of a three-legged comparison: the Java timing is used as a rough zero point, allowing an estimate of the absolute difference between the other two timings.

David Simcha <dsimcha@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #918 is obsolete|0                           |1


--- Comment #7 from David Simcha <dsimcha@yahoo.com> 2011-02-20 18:29:08 PST ---
Created an attachment (id=920)
New patch.

Here's a new patch that adds one small but important optimization that I forgot.  Now that we store the number of pages in each B_PAGE block, allocPages() can skip over large allocated blocks in O(1) time.
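Roughly, the page scan can now do something like the following sketch (findNFreePages, bins and pageCounts are illustrative names, not the actual patch symbols):

enum Bins : ubyte { B_PAGE, B_PAGEPLUS, B_FREE }

// Search for a run of n contiguous free pages, hopping over whole
// allocated blocks via their stored page counts.
size_t findNFreePages(const Bins[] bins, const size_t[] pageCounts, size_t n)
{
    size_t i = 0;
    while (i + n <= bins.length)
    {
        if (bins[i] == Bins.B_PAGE)
        {
            // O(1) skip over the entire block instead of walking
            // each of its B_PAGEPLUS pages.
            i += pageCounts[i];
            continue;
        }
        if (bins[i] == Bins.B_PAGEPLUS)
        {
            ++i;  // interior page; shouldn't be hit if skips are taken
            continue;
        }
        // bins[i] == B_FREE: check for a run of n free pages.
        size_t run = 1;
        while (run < n && bins[i + run] == Bins.B_FREE)
            ++run;
        if (run == n)
            return i;
        i += run;  // resume the search just past the too-short run
    }
    return size_t.max;  // no suitable run in this pool
}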


--- Comment #8 from David Simcha <dsimcha@yahoo.com> 2011-02-20 18:29:43 PST ---
Created an attachment (id=921)
New gcx.d


--- Comment #9 from David Simcha <dsimcha@yahoo.com> 2011-02-21 16:12:54 PST ---
Here's another benchmark.  This one is designed to resemble reasonably common scientific-computing, allocation-intensive use cases with a moderately large overall heap, rather than to highlight the specific performance problem at hand.

import std.random, core.memory, std.datetime, std.stdio;

enum nIter = 1000;

void main() {
    auto ptrs = new void*[1024];

    auto sw = StopWatch(AutoStart.yes);

    // On each iteration, overwrite all 1024 pointers with fresh large blocks
    // whose sizes are uniformly distributed between 1 KB and 128 KB, turning
    // the previous blocks into garbage.
    foreach(i; 0..nIter) {
        foreach(ref ptr; ptrs) {
            ptr = GC.malloc(uniform(1024, 128 * 1024 + 1), GC.BlkAttr.NO_SCAN);
        }
    }

    writefln("Done %s iter in %s milliseconds.", nIter, sw.peek.msecs);
}

With patch:

Done 1000 iter in 7410 milliseconds.

Without patch:

Done 1000 iter in 38737 milliseconds.

Memory usage is about the same (based on looking at Task Manager): about 88 MB with the patch, about 92 MB without.

To verify that this patch doesn't have deleterious effects when allocating lots of objects close to the border between "large" and "small", I also tried it with uniform(1024, 8 * 1024 + 1) and got:

With patch:

Done 1000 iter in 2122 milliseconds.

Without patch:

Done 1000 iter in 4153 milliseconds.

The relatively small difference here isn't surprising, as there are no huge allocations being done (max size is 8k).  Again, the difference in memory usage was negligible (within epsilon of 12 MB for both).
