December 05, 2011
On 05/12/2011 01:46, dsimcha wrote:
> I'm at my traditional pastime of trying to speed up D's garbage
> collector again

Have you thought about pushing for the inclusion of CDGC at all/working on the tweaks needed to make it the main GC?

-- 
Robert
http://octarineparrot.com/
December 05, 2011
On Mon, 05 Dec 2011 23:07:09 +0200, dsimcha <dsimcha@yahoo.com> wrote:

> I understand the problem, but please elaborate on the proposed solution.  You've
> basically got a bunch of pools, each of which represents a range of memory
> addresses, not a single address (so a basic hashtable is out).  You need to know
> which range some pointer fits in.  How would you beat binary search/O(log N) for this?

A tree with a few bits of the address space per level. Lookup cost then becomes bounded by the size of the address space rather than by the number of pools: e.g. with 4 bits per level, a 32-bit address space needs at most 8 levels.

-- 
Best regards,
 Vladimir                            mailto:vladimir@thecybershadow.net
December 05, 2011
On Mon, 05 Dec 2011 22:07:09 +0100, dsimcha <dsimcha@yahoo.com> wrote:

> == Quote from Martin Nowak (dawg@dawgfoto.de)'s article
>> I appreciate the recursion during mark, wanted to do this myself
>> sometime ago but expected a little more gain.
>
> The reason the gain wasn't huge is because on the benchmark I have that involves a
> deep heap graph, sweeping time dominates marking time.  The performance gain for
> the mark phase only (which is important b/c this is when the world needs to be
> stopped) is ~20-30%.
>
>> Some more ideas:
>>   - Do a major refactoring of the GC code, making it less reluctant
>>     to changes. Adding sanity checks or unit tests would be great.
>>     This probably reveals some obfuscated performance issues.
>
> Not just obfuscated ones.  I've wanted to fix an obvious perf bug for two years
> and haven't done it because the necessary refactoring would be unbelievably messy
> and I'm too afraid I'll break something.  Basically, malloc() sets the bytes
> between the size you requested and the size of the block actually allocated to
> zero to prevent false pointers.  This is reasonable.  The problem is that it does
> so **while holding the GC's lock**.  Fixing it for just the case when malloc() is
> called by the user is also easy.  The problem is fixing it when malloc() gets
> called from realloc(), calloc(), etc.
>
>>   - Add more realistic GC benchmarks, just requires adding to
>>     druntime/test/gcbench using the new runbench. The tree1 mainly
>>     uses homogeneous classes, so this is very synthesized.
>
> I'll crowdsource this.  I can't think of any good benchmarks that are < a few
> hundred lines w/ no dependencies but aren't pretty synthetic.
>
Probably write a tool that can replay recorded allocation profiles.
It will be difficult to fake internal pointers if they are not recorded, though.
If we were able to enable recording in the runtime, people having performance
issues with the GC could then send a trace file that can be used for optimization.
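A minimal sketch of what such a record/replay pair could look like (the record layout, field names and the replay loop are just assumptions, nothing that exists in druntime today):

-----

import core.memory : GC;
import std.stdio : File;

// Hypothetical trace format: one fixed-size record per GC allocation event.
enum Op : ubyte { alloc, dealloc, resize }

struct TraceRecord
{
   Op     op;
   bool   scan;   // whether the block may contain pointers
   size_t size;   // requested size (unused for dealloc)
   size_t id;     // identifies the same block across records
}

// Replay a recorded trace against the current GC.  What this cannot
// reproduce is which words inside each block actually held (internal)
// pointers -- exactly the hard part mentioned above.
void replay(string path)
{
   void*[size_t] live;   // trace id -> replayed block
   foreach (chunk; File(path).byChunk(TraceRecord.sizeof))
   {
       auto r = *cast(const(TraceRecord)*) chunk.ptr;
       uint attrs = r.scan ? 0 : GC.BlkAttr.NO_SCAN;
       final switch (r.op)
       {
       case Op.alloc:
           live[r.id] = GC.malloc(r.size, attrs);
           break;
       case Op.resize:
           if (auto p = r.id in live)
               live[r.id] = GC.realloc(*p, r.size, attrs);
           break;
       case Op.dealloc:
           if (auto p = r.id in live)
           {
               GC.free(*p);
               live.remove(r.id);
           }
           break;
       }
   }
}

-----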

>>   - There is one binary search pool lookup for every scanned address in
>> range.
>>     Should be a lot to gain here, but it's difficult. It needs a multilevel
>>     mixture of bitset/hashtab.
>
> I understand the problem, but please elaborate on the proposed solution.  You've
> basically got a bunch of pools, each of which represents a range of memory
> addresses, not a single address (so a basic hashtable is out).  You need to know
> which range some pointer fits in.  How would you beat binary search/O(log N) for this?
>
It is possible to use a hashtable or a bitset because each pool occupies
at least PAGESIZE. One would need to use addresses relative to the minimum
address, but that still needs 256K entries per GByte (2^30 / 2^12 page-sized
slots), so the maintenance overhead is likely too big.

More promising is to put pool addresses ranges in a trie.

addr[7]          [...      .     ...]
                     /     |    \
addr[6]     [...   .   ...]    [...   .   ...]
               /   |    \         /   |   \
addr[5]     pool:8             [...   .  ...]
                              /   |   \
addr[4]                  pool:8 [....] pool:5

Lookup is bounded constant time, with at most 8 memory accesses on 64-bit (4 on 32-bit).
Maintenance is only required when adding/removing pools, though it is a little involved.

-----

alias Node Trie;

struct Node
{
   union
   {
      size_t     _type;     // low bits tag the union, the rest is the pointer
      Pool*      _pool;
      Node[16]*  _nodes16;  // uses upper 4 bits per level
      ...
      Node[256]* _nodes256; // uses full 8 bits per level
   }

   // use lower 3 bits to encode type, given guaranteed 8-byte alignment of malloc
   enum NTypeBits = 3;
   enum TypeMask = (1 << NTypeBits) - 1;
   enum PtrMask = ~TypeMask;

   enum Type { Empty, Pool, Nodes16, Nodes256 }
   static assert(Type.max <= TypeMask);

   Pool* getPool(void* p)
   {
       // start at the most significant byte of the pointer value (little-endian layout)
       return getPoolImpl((cast(ubyte*)&p) + size_t.sizeof - 1);
   }

   Pool* getPoolImpl(ubyte* p)
   {
       Node* node = void;

       final switch (type)
       {
       case Type.Empty:
          return null;
       case Type.Pool:
          return pool;
       case Type.Nodes16:
          node = &(*nodes16)[*p >> 4];  // upper 4 bits of this address byte
          break;
       case Type.Nodes256:
          node = &(*nodes256)[*p];      // full address byte
          break;
       }

       if (node.type == Type.Pool)
           return node.pool;
       else
           return node.getPoolImpl(p - 1);
   }

   void insertPool(Pool* npool, uint level=0)
   {
       final switch (type)
       {
       case Type.Empty:
           pool = npool;
           break;

       case Type.Pool:
           Pool* opool = pool;
           if (opool != npool)
           {
               nodes16 = new Node[16];

               foreach (i; 0 .. 16)
               {
                   // non-trivial code here:
                   // needs to figure out into which subnodes opool and npool
                   // should be inserted, using pool.minAddr and pool.maxAddr
                   (*nodes16)[i].insertPool(opool, level + 1);
                   (*nodes16)[i].insertPool(npool, level + 1);
               }
           }
           break;

       case Type.Nodes16:
           // count empty nodes and decide whether
           // to switch to more child nodes.
           break;

       case Type.Nodes256:
           // descend into the child selected by this address byte.
           break;
       }
   }

   void removePool(Pool* npool, uint level=0)
   {
       // reverse of above
   }

   @property Type type()
   {
       return cast(Type)(_type & TypeMask);
   }

   @property void type(Type t)
   {
       assert(t <= TypeMask);
       _type = (_type & PtrMask) | t;  // clear old type bits before setting new ones
   }

   @property Pool* pool()
   {
       return cast(Pool*)(cast(size_t)_pool & PtrMask);
   }

   @property void pool(Pool* p)
   {
       _pool = p;
       type = Type.Pool;
   }

   // ... analogous @property accessors for _nodes16 / _nodes256
}

-----
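Usage from the scan loop would then look roughly like this (just a sketch; `gcTrie` and `markBit` are made-up names, not existing GC symbols):

-----

// Hypothetical: one bounded trie lookup per scanned word instead of a
// binary search over the sorted pool table.
Trie gcTrie;

void scanRange(void** p, void** pend)
{
    for (; p < pend; ++p)
    {
        if (auto pool = gcTrie.getPool(*p))
        {
            // *p points into a GC pool; set the corresponding mark bit
            markBit(pool, *p);
        }
    }
}

-----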

>>   - Reduce the GC roots range. I will have to work on this for
>>     shared library support anyhow.
>
> Please clarify what you mean by "reduce" the roots range.
>
We currently add the range from etext to _end (rt.memory) as a static root.
This probably contains read-only sections, data from other languages
and (unlikely) other unrelated sections.
I think some help from the compiler will be necessary to support
shared libraries anyhow, so we should be able to get a precise range.

> Thanks for the feedback/suggestions.
December 05, 2011
On 5 December 2011 22:34, Martin Nowak <dawg@dawgfoto.de> wrote:
> We currently add something from etext to _end (rt.memory) as static root.
> This probably contains read-only sections, data from other languages
> and (unlikely) other unrelated sections.
> I think some help from the compiler will be necessary to support
> shared libraries anyhow, so we should be able to get a precise range.
>

Indeed.  The current implementation kicking around is to scan /proc/self/maps on init (this is true for the runtime that ships with GDC at least).  It's not the most pleasant of things to do - not to mention it only supports Unix-flavoured systems - but it works well enough when linking against shared D libraries.
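Roughly, the maps scanning amounts to something like this (a simplified sketch, not the actual runtime code; the filtering is an assumption):

-----

import std.algorithm : findSplit;
import std.array : split;
import std.conv : to;
import std.stdio : File;

struct Range { size_t lo, hi; }

// Collect writable, private mappings from /proc/self/maps as candidate
// GC root ranges.  Linux-only; a real implementation would also exclude
// the GC's own pools and the thread stacks it already scans.
Range[] candidateRoots()
{
    Range[] roots;
    foreach (line; File("/proc/self/maps").byLine())
    {
        auto fields = split(line);      // "start-end perms offset dev inode [path]"
        if (fields.length < 2)
            continue;
        auto perms = fields[1];         // e.g. "rw-p"
        if (perms.length < 4 || perms[1] != 'w' || perms[3] != 'p')
            continue;                   // keep only writable, private mappings
        auto bounds = findSplit(fields[0], "-");
        roots ~= Range(to!size_t(bounds[0], 16), to!size_t(bounds[2], 16));
    }
    return roots;
}

-----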

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
December 05, 2011
> More promising is to put pool addresses ranges in a trie.
>
> addr[7]          [...      .     ...]
>                       /     |    \
> addr[6]     [...   .   ...]    [...   .   ...]
>                 /   |    \         /   |   \
> addr[5]     pool:8             [...   .  ...]
>                                /   |   \
> addr[4]                  pool:8 [....] pool:5
>
Actually, 64-bit should use a hashtable for the upper 32 bits and then
the 32-bit trie for the lower half.
December 05, 2011
== Quote from Martin Nowak (dawg@dawgfoto.de)'s article
> > More promising is to put pool addresses ranges in a trie.
> >
> > addr[7]          [...      .     ...]
> >                       /     |    \
> > addr[6]     [...   .   ...]    [...   .   ...]
> >                 /   |    \         /   |   \
> > addr[5]     pool:8             [...   .  ...]
> >                                /   |   \
> > addr[4]                  pool:8 [....] pool:5
> >
> Actually, 64-bit should use a hashtable for the upper 32 bits and then the 32-bit trie for the lower half.

Why do you expect this to be faster than a binary search?  I'm not saying it won't be, just that it's not a home run that deserves a high priority as an optimization.  You still have a whole bunch of indirections, probably more than you would ever have for binary search.
December 05, 2011
> On 05/12/2011 01:46, dsimcha wrote:
>> I'm at my traditional pastime of trying to speed up D's garbage
>> collector again
>
> Have you thought about pushing for the inclusion of CDGC at all/working on the tweaks needed to make it the main GC?

So true, it's been rotting in that branch.
December 06, 2011
On 12/5/2011 6:39 PM, Trass3r wrote:
>> On 05/12/2011 01:46, dsimcha wrote:
>>> I'm at my traditional pastime of trying to speed up D's garbage
>>> collector again
>>
>> Have you thought about pushing for the inclusion of CDGC at
>> all/working on the tweaks needed to make it the main GC?
>
> So true, it's been rotting in that branch.

IIRC CDGC includes two major enhancements:

1.  The snapshot GC for Linux.  (Does this work on OSX/FreeBSD/anything Posix, or just Linux?  I'm a bit skeptical about whether a snapshot GC is really that great an idea given its propensity to waste memory on long collect cycles with a lot of mutation.)

2.  I think there was some precise heap scanning-related stuff in it.  I originally tried to implement precise heap scanning a couple years ago, but it went nowhere for reasons too complicated to explain here.  Given this experience, I'm not inclined to try again until the compiler has extensions for generating pointer offset information.
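For illustration, the kind of pointer-offset metadata such a compiler extension might emit could look roughly like this (a purely hypothetical layout):

-----

// Hypothetical per-type pointer bitmap: one bit per machine word of the
// type, set where that word may contain a GC pointer.
struct PointerMap
{
    size_t   wordCount;  // size of the type in machine words
    size_t[] bits;       // bit i set => word i may hold a pointer
}

// e.g. for  struct S { size_t len; void* data; double x; }
// the compiler would emit: wordCount = 3, bits = [0b010]

-----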
December 06, 2011
On Tue, 06 Dec 2011 00:16:01 +0100, dsimcha <dsimcha@yahoo.com> wrote:

> == Quote from Martin Nowak (dawg@dawgfoto.de)'s article
>> > More promising is to put pool addresses ranges in a trie.
>> >
>> > addr[7]          [...      .     ...]
>> >                       /     |    \
>> > addr[6]     [...   .   ...]    [...   .   ...]
>> >                 /   |    \         /   |   \
>> > addr[5]     pool:8             [...   .  ...]
>> >                                /   |   \
>> > addr[4]                  pool:8 [....] pool:5
>> >
>> Actually, 64-bit should use a hashtable for the upper 32 bits and then
>> the 32-bit trie for the lower half.
>
> Why do you expect this to be faster than a binary search?  I'm not saying it won't
> be, just that it's not a home run that deserves a high priority as an
> optimization.  You still have a whole bunch of indirections, probably more than
> you would ever have for binary search.

It's the tightest loop in the garbage collector.
It ends up as a few array accesses, and the locality of the
memory being pointed to is mirrored by locality in the tree,
so there's a good chance of finding the nodes in the cache.

What this gives you is that it scales very well to many pools/regions.
Thus you can better tweak the pool granularity against pool sizes.

Also:
for (; p < pend; ++p)
  if (*p in lastPool)
can become:
for (; p < pend; ++p)
  if (*p in lastNode)
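A rough sketch of that fast path (`findPool` stands in for whatever lookup ends up being used, and `markAddr` is a placeholder; minAddr/maxAddr as in the trie sketch above):

-----

// Hypothetical: remember the last pool (or trie leaf) that matched and
// test it first, falling back to the full lookup only on a miss.
void markRange(void** p, void** pend, ref Pool* last)
{
    for (; p < pend; ++p)
    {
        auto addr = *p;
        if (last !is null && addr >= last.minAddr && addr < last.maxAddr)
        {
            markAddr(last, addr);        // hit: no lookup at all
            continue;
        }
        if (auto pool = findPool(addr))  // miss: binary search or trie walk
        {
            last = pool;
            markAddr(pool, addr);
        }
    }
}

-----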

It is definitely no home run, but I'll try it
out when I find some time.
December 06, 2011
> IIRC CDGC includes two major enhancements:
>
> 1.  The snapshot GC for Linux.  (Does this work on OSX/FreeBSD/anything Posix, or just Linux?  I'm a bit skeptical about whether a snapshot GC is really that great an idea given its propensity to waste memory on long collect cycles with a lot of mutation.)

I guess, at least it uses fork IIRC.


> 2.  I think there was some precise heap scanning-related stuff in it.  I originally tried to implement precise heap scanning a couple years ago, but it went nowhere for reasons too complicated to explain here.  Given this experience, I'm not inclined to try again until the compiler has extensions for generating pointer offset information.

What's the status of that anyway? Every patch in that bugzilla entry is marked obsolete and there's no real final answer.