| Thread overview | ||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
December 05, 2011 rt_finalize WTFs? | ||||
|---|---|---|---|---|
| ||||
I'm at my traditional passtime of trying to speed up D's garbage collector again, and I've stumbled on the fact that rt_finalize is taking up a ridiculous share of the time (~30% of total runtime) on a benchmark where huge numbers of classes **that don't have destructors** are being created and collected. Here's the code to this function, from lifetime.d:
extern (C) void rt_finalize(void* p, bool det = true)
{
debug(PRINTF) printf("rt_finalize(p = %p)\n", p);
if (p) // not necessary if called from gc
{
ClassInfo** pc = cast(ClassInfo**)p;
if (*pc)
{
ClassInfo c = **pc;
byte[] w = c.init;
try
{
if (det || collectHandler is null || collectHandler(cast(Object)p))
{
do
{
if (c.destructor)
{
fp_t fp = cast(fp_t)c.destructor;
(*fp)(cast(Object)p); // call destructor
}
c = c.base;
} while (c);
}
if ((cast(void**)p)[1]) // if monitor is not null
_d_monitordelete(cast(Object)p, det);
(cast(byte*) p)[0 .. w.length] = w[]; // WTF?
}
catch (Throwable e)
{
onFinalizeError(**pc, e);
}
finally // WTF?
{
*pc = null; // zero vptr
}
}
}
}
Getting rid of the stuff I've marked with //WTF? comments (namely the finally block and the re-initializing of the memory occupied by the finalized object) speeds things up by ~15% on the benchmark in question. Why do we care what state the blob of memory is left in after we finalize it? I can kind of see that we want to clear things if delete/clear was called manually and we want to leave the object in a state that doesn't look valid. However, this has significant performance costs and IIRC is already done in clear() and delete is supposed to be deprecated. Furthermore, I'd like to get rid of the finally block entirely, since I assume its presence and the effect on the generated code is causing the slowdown, not the body, which just assigns a pointer.
Is there any good reason to keep this code around?
| ||||
December 05, 2011 Re: rt_finalize WTFs? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to dsimcha | On Mon, 05 Dec 2011 02:46:27 +0100, dsimcha <dsimcha@yahoo.com> wrote: > I'm at my traditional passtime of trying to speed up D's garbage collector again, and I've stumbled on the fact that rt_finalize is taking up a ridiculous share of the time (~30% of total runtime) on a benchmark where huge numbers of classes **that don't have destructors** are being created and collected. Here's the code to this function, from lifetime.d: > > extern (C) void rt_finalize(void* p, bool det = true) > { > debug(PRINTF) printf("rt_finalize(p = %p)\n", p); > > if (p) // not necessary if called from gc > { > ClassInfo** pc = cast(ClassInfo**)p; > > if (*pc) > { > ClassInfo c = **pc; > byte[] w = c.init; > > try > { > if (det || collectHandler is null || collectHandler(cast(Object)p)) > { > do > { > if (c.destructor) > { > fp_t fp = cast(fp_t)c.destructor; > (*fp)(cast(Object)p); // call destructor > } > c = c.base; > } while (c); > } > if ((cast(void**)p)[1]) // if monitor is not null > _d_monitordelete(cast(Object)p, det); > (cast(byte*) p)[0 .. w.length] = w[]; // WTF? > } > catch (Throwable e) > { > onFinalizeError(**pc, e); > } > finally // WTF? > { > *pc = null; // zero vptr > } > } > } > } > > Getting rid of the stuff I've marked with //WTF? comments (namely the finally block and the re-initializing of the memory occupied by the finalized object) speeds things up by ~15% on the benchmark in question. Why do we care what state the blob of memory is left in after we finalize it? I can kind of see that we want to clear things if delete/clear was called manually and we want to leave the object in a state that doesn't look valid. However, this has significant performance costs and IIRC is already done in clear() and delete is supposed to be deprecated. Furthermore, I'd like to get rid of the finally block entirely, since I assume its presence and the effect on the generated code is causing the slowdown, not the body, which just assigns a pointer. > > Is there any good reason to keep this code around? Not for the try block. With errors being not recoverable you don't need to care about zeroing the vtbl or you could just copy the code into the catch handler. This seems to cause less spilled variables. Most expensive is the call to a memcpy@PLT, replace it with something inlineable. Zeroing is not much faster than copying init[] for small classes. At least zeroing should be worth it, unless the GC would not scan the memory otherwise. gcbench/tree1 => 41.8s => https://gist.github.com/1432117 => gcbench/tree1 => 33.4s Please add useful benchmarks to druntime. martin | |||
December 05, 2011 Re: rt_finalize WTFs? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to dsimcha | dsimcha: > I've stumbled on the fact that rt_finalize is taking up a ridiculous share of the time (~30% of total runtime) on a benchmark where huge numbers of classes **that don't have destructors** are being created and collected. DMD or LDC/GDC compiler? > extern (C) void rt_finalize(void* p, bool det = true) > { > debug(PRINTF) printf("rt_finalize(p = %p)\n", p); > > if (p) // not necessary if called from gc > { That if(p) seems fit to become a static if on bool template argument (but it needs to become D code or two C functions): void rt_finalize(bool byGC)(void* p, bool det = true) { ... Bye, bearophile | |||
December 05, 2011 Re: rt_finalize WTFs? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Martin Nowak | Thanks for the benchmark. I ended up deciding to just create a second function, rt_finalize_gc, that gets rid of a whole bunch of cruft that isn't necessary in the GC case. I think it's worth the small amount of code duplication it creates. Here are the results of my efforts so far: https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 . I've got one other good idea that I think will shave a few seconds off the Tree1 benchmark if I don't run into any unforeseen obstacles in implementing it. On 12/4/2011 10:07 PM, Martin Nowak wrote: > On Mon, 05 Dec 2011 02:46:27 +0100, dsimcha <dsimcha@yahoo.com> wrote: > >> I'm at my traditional passtime of trying to speed up D's garbage >> collector again, and I've stumbled on the fact that rt_finalize is >> taking up a ridiculous share of the time (~30% of total runtime) on a >> benchmark where huge numbers of classes **that don't have >> destructors** are being created and collected. Here's the code to this >> function, from lifetime.d: >> >> extern (C) void rt_finalize(void* p, bool det = true) >> { >> debug(PRINTF) printf("rt_finalize(p = %p)\n", p); >> >> if (p) // not necessary if called from gc >> { >> ClassInfo** pc = cast(ClassInfo**)p; >> >> if (*pc) >> { >> ClassInfo c = **pc; >> byte[] w = c.init; >> >> try >> { >> if (det || collectHandler is null || collectHandler(cast(Object)p)) >> { >> do >> { >> if (c.destructor) >> { >> fp_t fp = cast(fp_t)c.destructor; >> (*fp)(cast(Object)p); // call destructor >> } >> c = c.base; >> } while (c); >> } >> if ((cast(void**)p)[1]) // if monitor is not null >> _d_monitordelete(cast(Object)p, det); >> (cast(byte*) p)[0 .. w.length] = w[]; // WTF? >> } >> catch (Throwable e) >> { >> onFinalizeError(**pc, e); >> } >> finally // WTF? >> { >> *pc = null; // zero vptr >> } >> } >> } >> } >> >> Getting rid of the stuff I've marked with //WTF? comments (namely the >> finally block and the re-initializing of the memory occupied by the >> finalized object) speeds things up by ~15% on the benchmark in >> question. Why do we care what state the blob of memory is left in >> after we finalize it? I can kind of see that we want to clear things >> if delete/clear was called manually and we want to leave the object in >> a state that doesn't look valid. However, this has significant >> performance costs and IIRC is already done in clear() and delete is >> supposed to be deprecated. Furthermore, I'd like to get rid of the >> finally block entirely, since I assume its presence and the effect on >> the generated code is causing the slowdown, not the body, which just >> assigns a pointer. >> >> Is there any good reason to keep this code around? > > Not for the try block. With errors being not recoverable you don't need > to care > about zeroing the vtbl or you could just copy the code into the catch > handler. > This seems to cause less spilled variables. > > Most expensive is the call to a memcpy@PLT, replace it with something > inlineable. > Zeroing is not much faster than copying init[] for small classes. > > At least zeroing should be worth it, unless the GC would not scan the > memory otherwise. > > gcbench/tree1 => 41.8s => https://gist.github.com/1432117 => > gcbench/tree1 => 33.4s > > Please add useful benchmarks to druntime. > > martin | |||
December 05, 2011 Re: rt_finalize WTFs? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to dsimcha | On Sunday, December 04, 2011 23:41:08 dsimcha wrote:
> Thanks for the benchmark. I ended up deciding to just create a second
> function, rt_finalize_gc, that gets rid of a whole bunch of cruft that
> isn't necessary in the GC case. I think it's worth the small amount of
> code duplication it creates. Here are the results of my efforts so far:
> https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 .
> I've got one other good idea that I think will shave a few seconds off
> the Tree1 benchmark if I don't run into any unforeseen obstacles in
> implementing it.
It's best to avoid code duplication in general, but I'd argue that if only a small amount of code duplication is required to get a big gain from the GC, it's well worth it. In general, anything we can do to improve the GC is worthwhile, since it's _the_ place where D risks being inefficient in comparison to languages such as C++.
- Jonathan M Davis
| |||
December 05, 2011 Re: rt_finalize WTFs? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to dsimcha | Last time I looked at it, the try/catch/finally block was rather expensive because it always invokes the exception handler unwinding mechanism, even if no exception occurs.
Try moving the try/catch block out of the loops that call rt_finalize. (maybe just remove it, onFinalizeError just rethrows, and it seems noone is using the information that the exception is thrown from inside the finalizer)
On 05.12.2011 02:46, dsimcha wrote:
> I'm at my traditional passtime of trying to speed up D's garbage
> collector again, and I've stumbled on the fact that rt_finalize is
> taking up a ridiculous share of the time (~30% of total runtime) on a
> benchmark where huge numbers of classes **that don't have destructors**
> are being created and collected. Here's the code to this function, from
> lifetime.d:
>
> extern (C) void rt_finalize(void* p, bool det = true)
> {
> debug(PRINTF) printf("rt_finalize(p = %p)\n", p);
>
> if (p) // not necessary if called from gc
> {
> ClassInfo** pc = cast(ClassInfo**)p;
>
> if (*pc)
> {
> ClassInfo c = **pc;
> byte[] w = c.init;
>
> try
> {
> if (det || collectHandler is null || collectHandler(cast(Object)p))
> {
> do
> {
> if (c.destructor)
> {
> fp_t fp = cast(fp_t)c.destructor;
> (*fp)(cast(Object)p); // call destructor
> }
> c = c.base;
> } while (c);
> }
> if ((cast(void**)p)[1]) // if monitor is not null
> _d_monitordelete(cast(Object)p, det);
> (cast(byte*) p)[0 .. w.length] = w[]; // WTF?
> }
> catch (Throwable e)
> {
> onFinalizeError(**pc, e);
> }
> finally // WTF?
> {
> *pc = null; // zero vptr
> }
> }
> }
> }
>
> Getting rid of the stuff I've marked with //WTF? comments (namely the
> finally block and the re-initializing of the memory occupied by the
> finalized object) speeds things up by ~15% on the benchmark in question.
> Why do we care what state the blob of memory is left in after we
> finalize it? I can kind of see that we want to clear things if
> delete/clear was called manually and we want to leave the object in a
> state that doesn't look valid. However, this has significant performance
> costs and IIRC is already done in clear() and delete is supposed to be
> deprecated. Furthermore, I'd like to get rid of the finally block
> entirely, since I assume its presence and the effect on the generated
> code is causing the slowdown, not the body, which just assigns a pointer.
>
> Is there any good reason to keep this code around?
| |||
December 05, 2011 Re: rt_finalize WTFs? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Rainer Schuetze | On Mon, 05 Dec 2011 10:14:00 +0200, Rainer Schuetze <r.sagitario@gmx.de> wrote: > Try moving the try/catch block out of the loops that call rt_finalize. This sounds like a good idea. Just make sure that all code paths that lead to rt_finalize (e.g. delete) can recover from an exception. > (maybe just remove it, onFinalizeError just rethrows, and it seems noone is using the information that the exception is thrown from inside the finalizer) The point of onFinalizeError is that it throws an *error* (it is unrecoverable). Currently, the GC cannot recover from an exception thrown by a finalizer, which is why onFinalizeError exists. -- Best regards, Vladimir mailto:vladimir@thecybershadow.net | |||
December 05, 2011 Re: rt_finalize WTFs? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to dsimcha | On Sun, 04 Dec 2011 20:46:27 -0500, dsimcha <dsimcha@yahoo.com> wrote:
> I'm at my traditional passtime of trying to speed up D's garbage collector again, and I've stumbled on the fact that rt_finalize is taking up a ridiculous share of the time (~30% of total runtime) on a benchmark where huge numbers of classes **that don't have destructors** are being created and collected. Here's the code to this function, from lifetime.d:
>
> extern (C) void rt_finalize(void* p, bool det = true)
> {
> debug(PRINTF) printf("rt_finalize(p = %p)\n", p);
>
> if (p) // not necessary if called from gc
> {
> ClassInfo** pc = cast(ClassInfo**)p;
>
> if (*pc)
> {
> ClassInfo c = **pc;
> byte[] w = c.init;
>
> try
> {
> if (det || collectHandler is null || collectHandler(cast(Object)p))
> {
> do
> {
> if (c.destructor)
> {
> fp_t fp = cast(fp_t)c.destructor;
> (*fp)(cast(Object)p); // call destructor
> }
> c = c.base;
> } while (c);
> }
> if ((cast(void**)p)[1]) // if monitor is not null
> _d_monitordelete(cast(Object)p, det);
> (cast(byte*) p)[0 .. w.length] = w[]; // WTF?
> }
> catch (Throwable e)
> {
> onFinalizeError(**pc, e);
> }
> finally // WTF?
> {
> *pc = null; // zero vptr
> }
> }
> }
> }
>
> Getting rid of the stuff I've marked with //WTF? comments (namely the finally block and the re-initializing of the memory occupied by the finalized object) speeds things up by ~15% on the benchmark in question. Why do we care what state the blob of memory is left in after we finalize it? I can kind of see that we want to clear things if delete/clear was called manually and we want to leave the object in a state that doesn't look valid. However, this has significant performance costs and IIRC is already done in clear() and delete is supposed to be deprecated. Furthermore, I'd like to get rid of the finally block entirely, since I assume its presence and the effect on the generated code is causing the slowdown, not the body, which just assigns a pointer.
I think it might be a good idea to move those two operations to the clear function. Currently, clear(obj) simply calls rt_finalize(obj).
It does seem rather strange to have that finally there, why do we care on an exception/error that we zero the vptr? I'd at least put that code up where the first WTF is.
But I think we do need to have the reinitialization of the object and zeroing of vptr for clear, since it's part of clear's charter.
I'd support any effort to speed up the GC, it's definitely the worst offender for performance. Looks like there's quite a few good ideas in this thread.
-Steve
| |||
December 05, 2011 Re: rt_finalize WTFs? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Rainer Schuetze | On Mon, 05 Dec 2011 09:14:00 +0100, Rainer Schuetze <r.sagitario@gmx.de> wrote: > Last time I looked at it, the try/catch/finally block was rather expensive because it always invokes the exception handler unwinding mechanism, even if no exception occurs. > Try moving the try/catch block out of the loops that call rt_finalize. (maybe just remove it, onFinalizeError just rethrows, and it seems noone is using the information that the exception is thrown from inside the finalizer) > Just an unconditional jump into the finally body here, but it still affects register assignment. Install an exception handler at sweep scope would save quite some Moving the exception handler to the sweep scope seems promising, can save lots of register saves. I appreciate the recursion during mark, wanted to do this myself sometime ago but expected a little more gain. Some more ideas: - Do a major refactoring of the GC code, making it less reluctant to changes. Adding sanity checks or unit tests would be great. This probably reveals some obfuscated performance issues. - Add more realistic GC benchmarks, just requires adding to druntime/test/gcbench using the new runbench. The tree1 mainly uses homogeneous classes, so this is very synthesized. - There is one binary search pool lookup for every scanned address in range. Should be a lot to gain here, but it's difficult. It needs a multilevel mixture of bitset/hashtab. - Reduce the GC roots range. I will have to work on this for shared library support anyhow. martin > On 05.12.2011 02:46, dsimcha wrote: >> I'm at my traditional passtime of trying to speed up D's garbage >> collector again, and I've stumbled on the fact that rt_finalize is >> taking up a ridiculous share of the time (~30% of total runtime) on a >> benchmark where huge numbers of classes **that don't have destructors** >> are being created and collected. Here's the code to this function, from >> lifetime.d: >> >> extern (C) void rt_finalize(void* p, bool det = true) >> { >> debug(PRINTF) printf("rt_finalize(p = %p)\n", p); >> >> if (p) // not necessary if called from gc >> { >> ClassInfo** pc = cast(ClassInfo**)p; >> >> if (*pc) >> { >> ClassInfo c = **pc; >> byte[] w = c.init; >> >> try >> { >> if (det || collectHandler is null || collectHandler(cast(Object)p)) >> { >> do >> { >> if (c.destructor) >> { >> fp_t fp = cast(fp_t)c.destructor; >> (*fp)(cast(Object)p); // call destructor >> } >> c = c.base; >> } while (c); >> } >> if ((cast(void**)p)[1]) // if monitor is not null >> _d_monitordelete(cast(Object)p, det); >> (cast(byte*) p)[0 .. w.length] = w[]; // WTF? >> } >> catch (Throwable e) >> { >> onFinalizeError(**pc, e); >> } >> finally // WTF? >> { >> *pc = null; // zero vptr >> } >> } >> } >> } >> >> Getting rid of the stuff I've marked with //WTF? comments (namely the >> finally block and the re-initializing of the memory occupied by the >> finalized object) speeds things up by ~15% on the benchmark in question. >> Why do we care what state the blob of memory is left in after we >> finalize it? I can kind of see that we want to clear things if >> delete/clear was called manually and we want to leave the object in a >> state that doesn't look valid. However, this has significant performance >> costs and IIRC is already done in clear() and delete is supposed to be >> deprecated. Furthermore, I'd like to get rid of the finally block >> entirely, since I assume its presence and the effect on the generated >> code is causing the slowdown, not the body, which just assigns a pointer. >> >> Is there any good reason to keep this code around? | |||
December 05, 2011 Re: rt_finalize WTFs? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Martin Nowak | == Quote from Martin Nowak (dawg@dawgfoto.de)'s article > I appreciate the recursion during mark, wanted to do this myself sometime ago but expected a little more gain. The reason the gain wasn't huge is because on the benchmark I have that involves a deep heap graph, sweeping time dominates marking time. The performance gain for the mark phase only (which is important b/c this is when the world needs to be stopped) is ~20-30%. > Some more ideas: > - Do a major refactoring of the GC code, making it less reluctant > to changes. Adding sanity checks or unit tests would be great. > This probably reveals some obfuscated performance issues. Not just obfuscated ones. I've wanted to fix an obvious perf bug for two years and haven't done it because the necessary refactoring would be unbelievably messy and I'm too afraid I'll break something. Basically, malloc() sets the bytes between the size you requested and the size of the block actually allocated to zero to prevent false pointers. This is reasonable. The problem is that it does so **while holding the GC's lock**. Fixing it for just the case when malloc() is called by the user is also easy. The problem is fixing it when malloc() gets called from realloc(), calloc(), etc. > - Add more realistic GC benchmarks, just requires adding to > druntime/test/gcbench using the new runbench. The tree1 mainly > uses homogeneous classes, so this is very synthesized. I'll crowdsource this. I can't think of any good benchmarks that are < a few hundred lines w/ no dependencies but aren't pretty synthetic. > - There is one binary search pool lookup for every scanned address in > range. > Should be a lot to gain here, but it's difficult. It needs a multilevel > mixture of bitset/hashtab. I understand the problem, but please elaborate on the proposed solution. You've basically got a bunch of pools, each of which represents a range of memory addresses, not a single address (so a basic hashtable is out). You need to know which range some pointer fits in. How would you beat binary search/O(log N) for this? > - Reduce the GC roots range. I will have to work on this for > shared library support anyhow. Please clarify what you mean by "reduce" the roots range. Thanks for the feedback/suggestions. | |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply