Thread overview
Fast Memory Allocation Strategy
Mar 16, 2005 - Craig Black
Mar 16, 2005 - Ilya Minkov
Mar 17, 2005 - Craig Black
Mar 17, 2005 - Ilya Minkov
Mar 17, 2005 - Craig Black
Mar 18, 2005 - Ilya Minkov
Mar 18, 2005 - David Medlock
Mar 21, 2005 - Ilya Minkov
Mar 25, 2005 - Georg Wrede
March 16, 2005
I've been looking at D's approach to memory allocation and I have an idea that I think will cause memory allocation to be blazingly fast.

I don't know what to call it, and there probably is a name for it.  It's kind of similar to D's mark and release approach but slightly more involved.

Objects are organized into buffers according to their size.  So if an object is 10 words long, it will go into the 10-word buffer.  If an object is 5 words long, it goes into the 5-word buffer, etc.  Thus all objects in each buffer are of equal size.

The buffers must be resizable.  When an object is requested from a buffer that is full, the buffer is enlarged to accommodate more objects.  When the buffer is resized, the GC must move all of its objects.  When an object is freed, the gap in the buffer is filled with the last item in the buffer, which requires the GC to move that block of memory.  Thus there is no memory fragmentation within each buffer.
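The swap-with-last trick above can be sketched in a few lines. This is a hypothetical illustration (in Python rather than D): objects live in a dense array, and an indirection table of handles, which is an assumption of this sketch, records where each object currently sits, so that whatever plays the role of the GC knows which reference to patch when an object moves into a freed gap.

```python
# Hypothetical sketch of one fixed-size "buffer" (pool) using the
# swap-with-last idea from the post: freeing a slot moves the last
# live object into the gap, so the pool never fragments internally.
# Objects are addressed by stable handles; the handle table is what
# the GC would have to patch when an object moves.

class FixedSizePool:
    def __init__(self, obj_size_words):
        self.obj_size = obj_size_words  # all objects here share this size
        self.slots = []        # dense array of live objects
        self.handle_of = []    # slot index -> handle occupying that slot
        self.slot_of = {}      # handle -> slot index (the indirection table)
        self.next_handle = 0

    def alloc(self, obj):
        handle = self.next_handle
        self.next_handle += 1
        self.slot_of[handle] = len(self.slots)
        self.slots.append(obj)          # growing may "resize the buffer"
        self.handle_of.append(handle)
        return handle

    def free(self, handle):
        gap = self.slot_of.pop(handle)
        last = len(self.slots) - 1
        if gap != last:
            # fill the gap with the last object and patch its handle
            self.slots[gap] = self.slots[last]
            moved = self.handle_of[last]
            self.handle_of[gap] = moved
            self.slot_of[moved] = gap
        self.slots.pop()
        self.handle_of.pop()

    def get(self, handle):
        return self.slots[self.slot_of[handle]]
```

Note that `free` is O(1) and leaves the array dense, which is exactly why no fragmentation can build up inside a buffer.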

The problem with this approach is that it requires interaction with the GC that I don't think D currently provides, so Walter may have to get involved for this to be implemented.

All you smart D folks, let me know what you think.

-Craig


March 16, 2005
That's how Doug Lea's memory manager (i.e. malloc/free implementation) works, and it is known to be one of the least fragmenting memory managers. And it's fast! The GNU C library's malloc is derived from it, and this is one of the many reasons for good Linux performance.

The difference is that it naturally cannot move anything, but that hardly has any negative consequence, since the program will likely allocate more data of the same type anyway. Even otherwise, there are different strategies: if we are speaking of very small objects, the deficiency can safely be ignored; for multi-kilobyte ones, one can consider assigning the space to different pools or returning it to the OS. Basically, since we have virtually mapped memory, it is much more efficient to unmap a piece of address space than to move the data and go back to correct the pointers.

I stumbled onto the subject in an article on Gamasutra, as I wanted to learn whether I should expect to run into memory management trouble on a (now abandoned but historically gorgeous) game console with 16 megabytes of main RAM, and how to avoid or solve it.

I might be wrong, but I seem to remember that the Phobos GC also uses a sort of size-discriminated pool allocator, and I would guess that the Boehm collector does as well; fast memory allocation is one of its main selling points.

I suggest you read an article on the subject, which says that with such an allocator, fragmentation basically ceases to bother anyone, usually even on a game console that doesn't have virtual memory enabled, for performance reasons.

	"The Memory Fragmentation Problem: Solved?"
http://www.cs.utexas.edu/users/wilson/papers/fragsolved.pdf

This document is also available from a few other locations on the internet.

Generally, precise garbage collection is a hard problem, and moving data around is a particularly hard sub-problem. It would force us to have a typed stack. That is naturally not impossible, at least as long as one could exclude foreign-language (e.g. C) calls, but it means a runtime overhead, I think. What we are effectively left with is a collector with conservative roots but precise object scanning, and such a collector cannot detect each and every pointer, so it cannot patch them all if a piece of address space moves.

-eye


Craig Black wrote:
<snip>

March 17, 2005
Thank you for your response.  You seem to be quite informed about memory allocation.  I was searching for a name for my approach.  Perhaps I can steal the term "size-discriminated pool allocator" from you?  You don't have a copyright on that, do you? :)

> Generally, precise garbage collection is a hard problem, and moving data around is a particularly hard sub-problem. It would force us to have a typed stack. That is naturally not impossible, at least as long as one could exclude foreign-language (e.g. C) calls, but it means a runtime overhead, I think. What we are effectively left with is a collector with conservative roots but precise object scanning, and such a collector cannot detect each and every pointer, so it cannot patch them all if a piece of address space moves.

I disagree.  Moving memory around is quite possible and could be implemented efficiently, given that we are dealing with GC references rather than pointers.  That is, after all, what GC is all about, right?  Not having to deal with those nasty, hateful little pointers?  Having to move memory around once in a while is well worth the effort in this case.  A slight drawback is that resizing a buffer takes some time, especially if the buffer is quite large.  However, this should happen very rarely, especially for large buffers.

I believe that with the right GC algorithm, we can outperform non-GC.

-Craig


March 17, 2005
Craig Black wrote:
> Thank you for your response.  You seem to be quite informed about memory allocation.  I was searching for a name for my approach.

I think in fact I know too little about almost any subject, including
memory allocation. However, if I were sure I knew a lot, I wouldn't come
here and learn quite as much.

> Perhaps I can steal the term "size-discriminated pool allocator" from
>  you?  You don't have a copyright on that, do you? :)

How could I? It is simply a bunch of technical terms thrown together.
I'm not even sure it's correct. I didn't even make one long word out of
it, which would be in fine German tradition.

> I disagree.  Moving memory around is quite possible and could be implemented efficiently, given that we are dealing with GC references
>  rather than pointers.  That is, after all, what GC is all about, right?  Not having to deal with those nasty, hateful little pointers?
>  Having to move memory around once in a while is quite worth the effort in this case.  A slight drawback is when the buffer is resized, this may take a few clock cycles, especially if the buffer is quite large.  However, this should very rarely happen, especially
>  for large buffers.

We are dealing with pointers. What do you mean by, or imagine as, a GC
reference? What should it look like? Take multiple words of space? Be
doubly indirect? Currently GC references are just pointers, which is a
direct advantage for execution performance, and I would be interested in
any idea of how to do better. Among the problems to consider is that the
D runtime environment is not encapsulated the way, say, Java's is: D is
intended to interact with low-level code directly, and in fact with
everything whose interface vaguely resembles C.

It would in fact be interesting to see an efficient and safe way to
relocate pointers. As it stands, we are paying a few percent of CPU
performance for the address indirection of virtual memory. Many embedded
systems thus don't enable it at all, which however makes the memory
fragmentation problem quite a bit more real, especially for
garbage-collected languages. Having a good way to reposition pointers
would thus be a helpful thing there. If, however, we have a system where
we already have to pay an overhead in memory and CPU, perhaps for
security or other reasons, I would rather make use of it than layer
anything else on top of it.

Resizing the buffer, on the contrary, is perfectly unproblematic: new
chunks need not be adjacent, and the buffer need not be contiguous. As
long as the buffer chunks are of one common size shared by many
differently sized types, neither too large nor too small, there should
be no fragmentation.
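A minimal sketch of that non-contiguous growth, under assumed numbers (64 KB chunks, sizes in bytes): the pool grows by appending a fresh chunk, so nothing ever has to move, and freed slots are recycled through a free list.

```python
# Hypothetical sketch: a pool that grows in fixed-size chunks.
# New chunks need not be adjacent to old ones, so growing the
# "buffer" never forces objects to move.

CHUNK_BYTES = 65536                    # assumed chunk size

class ChunkedPool:
    def __init__(self, obj_size_bytes):
        self.per_chunk = CHUNK_BYTES // obj_size_bytes
        self.chunks = []               # each chunk holds up to per_chunk slots
        self.free_list = []            # (chunk_index, slot_index) pairs

    def alloc(self, obj):
        if self.free_list:             # reuse a freed slot first
            ci, si = self.free_list.pop()
            self.chunks[ci][si] = obj
            return ci, si
        if not self.chunks or len(self.chunks[-1]) == self.per_chunk:
            self.chunks.append([])     # grow: just add a chunk, move nothing
        self.chunks[-1].append(obj)
        return len(self.chunks) - 1, len(self.chunks[-1]) - 1

    def free(self, ref):
        ci, si = ref
        self.chunks[ci][si] = None     # slot becomes reusable
        self.free_list.append(ref)
```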

> I believe that with the right GC algorithm, we can outperform non-GC.

I believe one cannot outperform non-GC allocation directly; we can
rather outperform it by reducing the costs of semi-manual bookkeeping.
Many C++ programs use shared_ptr<> to do their memory management, which
is both syntactically wordy and quite inefficient because of frequent
count manipulation and checking. Even worse is relying on an
undocumented reference-counted implementation of some standard
containers, when their specified semantics are value semantics. Besides,
we are slowly entering a multithreaded world, where using reference
counting is a *crime*, while GC works much better and is on the way to
further improvement.

-eye
March 17, 2005
> We are dealing with pointers. What do you mean by, or imagine as, a GC reference? What should it look like? Take multiple words of space? Be doubly indirect? Currently GC references are just pointers, which is a direct advantage for execution performance, and I would be interested in any idea of how to do better. Among the problems to consider is that the D runtime environment is not encapsulated the way, say, Java's is: D is intended to interact with low-level code directly, and in fact with everything whose interface vaguely resembles C.

No doubt about it, my approach would very much discourage the use of pointers.  I'm glad you have inquired about GC references.  Again, in my approach, all objects are allocated in object pools, or buffers, according to their size, up to a certain point.  Let's say the cutoff is 1024 words (it could also be 4096 or whatever).  All objects equal to or larger than 1024 words go in the last buffer.  Thus, the last buffer must be handled differently, but this shouldn't be a problem for performance because very few objects will be so large.

These buffers are stored as an array of buffers so that they can be quickly indexed.  Since all objects are allocated in these buffers, we can break a GC reference up into two indices: the first denotes the particular buffer, the second denotes the specific object within that buffer.  These indices could be stored in 32 bits, or 64 bits for computers with a lot of memory.
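One way to picture such a two-part reference, as a sketch with made-up bit widths (nothing here is an actual D or proposed layout): pack the buffer index into the high bits of a 32-bit word and the object index into the low bits.

```python
# Hypothetical 32-bit GC reference layout: high bits pick the
# size-class buffer, low bits pick the object within it.
BUFFER_BITS = 10                      # assumed: up to 1024 buffers
INDEX_BITS = 22                       # assumed: ~4M objects per buffer
INDEX_MASK = (1 << INDEX_BITS) - 1

def make_ref(buffer_idx, object_idx):
    assert buffer_idx < (1 << BUFFER_BITS)
    assert object_idx <= INDEX_MASK
    return (buffer_idx << INDEX_BITS) | object_idx

def split_ref(ref):
    # recover (buffer index, object index) from a packed reference
    return ref >> INDEX_BITS, ref & INDEX_MASK
```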

> It would in fact be interesting to see an efficient and safe way to resettle pointers. Now we are paying a few percent of CPU performance for the adress indirection.

True.  There is more indirection involved than with simple pointers, but access is still O(1).  However, allocation and deallocation are cheaper by an order of magnitude.

> Many embedded systems thus don't enable it
> at all - which however makes the memory fragmentation problem quite
> somewhat more real, especially for garbage-collected languages. Having a
> good way to reposition pointers would be thus a helpful thing there. If,
> however, we have a system where we already have to pay an overhead on
> memory and on CPU, perhaps for security or other reasons, i would be
> rather willing to make use of it than layer anything else on top of it.
>
> Resizing the buffer is on the contrary perfectly unproblematic - new chunks need not be adjacent, and the buffer need not be contiguous. As long as the buffer chunks are of the same size for many differently large types, neither large nor small, there should be no fragmentation.

Exactly.  A relatively simple solution.  No fragmentation within the buffers.

>> I believe that with the right GC algorithm, we can outperform non-GC.
>
> I believe one cannot outperform non-GC allocation directly -

I disagree.  If the GC has the capability to compact memory, then I believe non-GC can be outdone.  However, this has yet to be seen.

> we can
> rather outperform it by reducing the costs of semi-manual bookkeeping.
> Many C++ programs use shared_ptr<> to do their memory management, which
> is both syntactically wordy and quite inefficient because of often count
> manipulation and checking. Even worse is relying on undocumented
> reference-counted implementation of some standard containers, when their
> specified semantics is value. Besides, we are slowly entering a
> multithreaded world, where using reference counting is a *crime*,

What are the problems with reference counting and multithreading?  It would seem to me that as long as you "synchronize" the appropriate methods, there should be no problem.  I don't use shared object templates in C++, but I personally prefer reference counting to single references.  It seems more powerful.

> while GC works much better and is on the way to further improvement.

I hope so.

-Craig


March 18, 2005
Craig Black wrote:
> No doubt about it, my approach would very much discourage the use of pointers. 

!?!

They are even syntactically the same.

> I'm glad you have inquired about GC references.  Again, in my approach, all objects are allocated in object pools, or buffers, according to their size, up to a certain point.  Let's say the cutoff is 1024 words (it could also be 4096 or whatever).  All objects equal to or larger than 1024 words go in the last buffer.  Thus, the last buffer must be handled differently, but this shouldn't be a problem for performance because very few objects will be so large.
> 
> These buffers are stored as an array of buffers so that they can be quickly indexed. Since all objects are allocated in these buffers, we can break a GC reference up into two indices: the first denotes the particular buffer, the second denotes the specific object within that buffer. These indices could be stored in 32 bits, or 64 bits for computers with a lot of memory.

Ok, so you give up a minor amount of performance for the indirection. You can have, for example, 64-kilobyte pools and a table of up to 65536 pool pointers on a 32-bit memory architecture. So far so good.

There is not even a huge problem accommodating normal pointers: let them be simple garbage-collected objects from the one-word pool.

However, this leaves one question open: when scanning, your starting points are the stack and the globals. The stack does not contain any type information. When gliding over an element of the stack which you think might be a GC reference, but which is finally nothing more or less than a machine word, you have to decide unambiguously whether it is a pointer or not. There are many contexts where one could supply type information without a run-time disadvantage, but I don't know a way to do that for the stack.

The current policy is: we assume everything is a pointer. If we are right, then if it points into some actually allocated region of memory, we mark that region as currently used and thus know we cannot deallocate it. Naturally, some regions get marked that aren't actually pointed to by a pointer; the word was just an integer or some arbitrary collection of bits. But this happens rarely and hardly makes any significant difference.

But can we move a marked region? No! Because we would need to go back and change everything that we thought was a pointer to it, and what if that something was just a random collection of bits? Your proposal has the same problem, if I understood it correctly.
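The ambiguity can be made concrete with a toy sketch (addresses and block sizes are invented): every stack word that happens to fall inside an allocated block must be treated as a possible pointer, so the block is retained; and precisely because the word might really be an integer, the block can never be moved.

```python
# Toy model of conservative stack scanning. A word that merely
# looks like an address into the heap pins the block it "points"
# to: the block must be kept alive, and it cannot be moved,
# because patching the word might corrupt a plain integer.

heap_blocks = {0x1000: 64, 0x2000: 32}   # block start address -> size

def conservatively_marked(stack_words):
    marked = set()
    for word in stack_words:             # no type info: pointer or not?
        for base, size in heap_blocks.items():
            if base <= word < base + size:
                marked.add(base)         # maybe a pointer: must keep,
                break                    # and must NOT move the block
    return marked
```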

>>It would in fact be interesting to see an efficient and safe way to
>>resettle pointers. Now we are paying a few percent of CPU performance
>>for the adress indirection.
> 
> True.  There is more indirection involved than with simple pointers, but access is still O(1).  However, allocation and deallocation are cheaper by an order of magnitude.

Would memory management really become cheaper? Allocation already comes from size-typed pools. Deallocation with free involves searching for the pointer in a map to determine the size of the memory region to free, which is O(log n); alternatively, the size of the memory region can be written just before the pointed-to spot in the allocated region. The second variant is O(1), but it requires that the allocator always allocate one word more than requested and return a pointer to the location just past that word.
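The second variant can be sketched with a simulated flat memory (offsets stand in for addresses; this is an illustration, not how any real allocator lays things out in detail): the size is written one word before the pointer the program receives.

```python
# Sketch of the over-allocation trick: store the block size one word
# before the pointer handed to the program, so free() can recover the
# size in O(1) instead of via an O(log n) map lookup.

memory = []

def sim_malloc(size_words):
    header = len(memory)
    memory.append(size_words)       # hidden size word
    memory.extend([0] * size_words) # the user-visible block
    return header + 1               # "pointer" just past the header

def block_size(ptr):
    return memory[ptr - 1]          # O(1): read the word before ptr
```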

There is one problem in C which forces this whole design: a pointer can designate any sort of memory region, which may be a struct, an array, a struct with an array field at the end... anything! C++ and D enforce a stronger design: anything allocated with new must be deallocated with delete, or in D's case also by the GC. Furthermore, one cannot legally do many sorts of pointer conversions. Thus we can assume all pointers are well-typed at the point where deallocation happens. This has one of the following consequences:

(1) The type is of statically known size. Thus, the size need not be stored with the object, but can be hardcoded at the deallocation point and passed to the actual deallocator.

(2) The type is a class which can be substituted by a descendant when handled by pointer, which in C++ means that it has virtual member functions declared. Because the size is determined by which descendant we are holding, it can be hidden per-class in the classinfo or vtable, retrieved from there, and passed to the deallocator.

(3) The type is an array. We use the over-allocation technique to store the length, which is then passed to the deallocator.

The deallocator, having the size, can quickly select the pool to which the reclaimed memory goes. A pool contains a list of its free blocks. I can see how this could be optimized further if we could compact, though.
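A sketch of that deallocation path, with assumed size classes: given the size, picking the pool is a short search over a handful of classes (or a table lookup), and freeing is a constant-time push onto that pool's free list.

```python
# Hypothetical sketch of the deallocation path outlined above: the
# caller supplies the (statically known or stored) size, and the
# deallocator routes the block to the matching pool's free list.

POOL_SIZES = [8, 16, 32, 64, 128]      # assumed size classes, in words

free_lists = {s: [] for s in POOL_SIZES}

def pool_for(size):
    # smallest size class that fits; oversized blocks handled elsewhere
    for s in POOL_SIZES:
        if size <= s:
            return s
    return None

def deallocate(ptr, size):
    s = pool_for(size)
    if s is not None:
        free_lists[s].append(ptr)      # O(1): no searching needed
```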

I don't know whether actual C++ implementations employ such a technique, but I think they could, and thus C++ gives us every sort of freedom to make memory management almost optimal. As to D, I would not be so sure about the first point, because when a GC scans, it need not know the type; it cares more about the size. Definitely, this is one direction in which one should think and can probably get some improvement.

>>>I believe that with the right GC algorithm, we can outperform non-GC.
>>
>>I believe one cannot outperform non-GC allocation directly -
> 
> 
> I disagree.  If the GC has the capability to compact memory, then I believe non-GC can be outdone.  However, this has yet to be seen.

As I have shown above, only marginally. Deallocation usually happens in the GC, but it doesn't really take time. A much larger amount of time goes into determining which memory regions are alive. Currently, the GC goes recursively, starting with the stack, through the whole heap, and considers every word it encounters a potential pointer. Checking whether a word points into an allocated block of memory takes a map lookup which is logarithmic in the number of allocated blocks, so a complete scan is O(n log n) in the amount of active memory. Walter knows of enough ways to improve this; he just hasn't gotten around to implementing them, since there have always been more important issues.
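The per-word cost can be illustrated with a sorted-list variant of that lookup (the addresses are invented): each candidate word costs a binary search over the allocated blocks, hence O(log n) per word and O(n log n) for a full scan.

```python
# Toy version of the "does this word point into an allocated
# block?" check: binary search over sorted block start addresses.

import bisect

starts = [0x1000, 0x3000, 0x8000]      # sorted block start addresses
sizes  = [0x100, 0x200, 0x80]          # sizes of the same blocks

def find_block(word):
    # O(log n) per candidate word
    i = bisect.bisect_right(starts, word) - 1
    if i >= 0 and word < starts[i] + sizes[i]:
        return starts[i]               # word falls inside this block
    return None                        # not a heap pointer
```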

> What are the problems with reference counting and multithreading?  It would seem to me that as long as you "synchronize" the appropriate methods, there should be no problem.  I don't use shared object templates in C++, but I personally prefer reference counting to single references.  It seems more powerful.

"Synchronize" is the problem. You have to suspend all threads while you increment a counter - unless you can prove that your thread is the only one which could be manipulating the counter at the moment. This might not be trivial when we have multiple processors. And thus a simple small operation of incrementing an integer starts to take an unproportionally huge amount of time. It all just bogs down.

-eye
March 18, 2005
Ilya Minkov wrote:
<snip>
> -eye

Good points.

Pretty funny we both posted such similar responses in different groups (the other is on digitalmars.d) to the same thread.

Hehe!
-David
March 21, 2005
David Medlock wrote:
> Good points.
> 
> Pretty funny we both posted such similar responses in different groups (the other is on digitalmars.d) to the same thread.

Really? The other newsgroup is too fat for me to read. Of course, most of
what I say has already been stated by me or someone else, and comes
mostly from knowledge I gained through scientific papers posted on the
newsgroup back when I was an active participant years ago.

In fact I imagine Walter must be getting very tired of similar
proposals coming up over and over, just to be discarded or put off after
a few rounds of collective thinking and gathering of the arguments all
over again.

If I may do that sort of stupid thing and throw in a question for
consideration: is there a sensible way to separate foreign-language
execution from D, stack-wise? I think I have an idea which is worth
thinking about. Imagine you have a function alongside which its stack
layout information is stored. (*1) Then, when the GC kicks in, you just
figure out which function is running now and read its stack layout, from
which you can determine the pointers exactly. Go to the entry side of
the stack frame and find the return address there. Look up that
function. And so on. However, C functions interfere. A C function can
call a D function, which can call a C function, and so on. This might
mean we would need to introduce some restrictions into the language.
What kind of restrictions?

(*1) I know. There is no static stack layout per se, but one could do
something like over-allocate so that all locals fit, and separate
alloca from the execution stack. That would be a performance advantage
anyway, because the alloca "heapstack" need not be scanned for pointers
unless explicitly made a GC root. Additionally, for every function
called, the point at which the return address is stored must be the
same.
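The idea might look roughly like this toy walk (the function names, layouts, and simulated stack are all invented): each function carries a bitmap saying which of its stack slots hold pointers, and the collector visits only those slots, frame by frame.

```python
# Hypothetical sketch of per-function stack maps: each function
# records which of its stack slots hold pointers, and the collector
# walks frame by frame, reading only the typed pointer slots.

# function name -> bitmap of which slots are pointers (True = pointer)
stack_maps = {
    "main":   [False, True],
    "helper": [True, False, True],
}

# the simulated stack: a list of (function name, slot values) frames
call_stack = [
    ("main",   [42, 0x1000]),
    ("helper", [0x2000, 7, 0x3000]),
]

def precise_roots():
    roots = []
    for func, slots in call_stack:     # in reality: follow return addresses
        layout = stack_maps[func]
        for is_ptr, value in zip(layout, slots):
            if is_ptr:                 # only typed pointer slots are roots
                roots.append(value)
    return roots
```

Unlike the conservative scan, this walk never mistakes an integer like 42 for a pointer, which is what would make moving objects safe.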

I also wonder how symbolic debuggers work. Probably not an option, since
they slow the whole thing down. Another option is to use a mechanism
like stack unwinding, which means around each stack frame there would be
handlers, something that D has been avoiding even for exceptions to
maintain a performance edge over C++.

None of this is for immediate implementation, since for now we only
need to concern ourselves with crafting a minimal set of language
features which won't force us to break compatibility afterwards;
to cut all the weak branches now.

-eye
March 25, 2005
>>>I believe that with the right GC algorithm, we can outperform non-GC.
>>
>>I believe one cannot outperform non-GC allocation directly -
> 
> I disagree.  If the GC has the capability to compact memory, then I believe non-GC can be outdone.  However, this has yet to be seen.

This is like asking whether an automatic transmission or a stick shift is faster. It really depends on the circumstances. (I know, I know, mostly the stick shift is faster, but that was not the point, right?)

Whether a particular GC algorithm outperforms some code depends first of all on the purpose of the code itself, and on the particular programmer.

Want to find a GC outperforming Bjarne Stroustrup doing a simple linked list? OTOH, want to find a GC _not_ outperforming a newbie doing a simple linked list?

"This code is faster than C." "This code is faster than ASM."

Besides, a GC for web applications would probably be different than one for a text editor. For example.

-------------

Besides, the real advantage of GC is not speed. It is reliability of code, ease of programming, and separation of disparate issues (here, memory management and program logic).