Thread overview
Memory Allocation and Locking
Aug 22, 2008
dsimcha
Aug 22, 2008
BCS
Aug 22, 2008
Sean Kelly
Aug 22, 2008
dsimcha
Aug 22, 2008
Sean Kelly
August 22, 2008
I guess this really is more of a C++ question then a D question, but it's kind of both, so what the heck?

Anyhow, I had some old C++ code laying around that I've been using, and I had added some quick and dirty multithreading of embarrassingly parallel parts of it to it to speed it up.  Lo and behold, it actually worked.  A while later, I decided I needed to extend it in ways that couldn't be done nearly as easily in C++ as in D, and it wasn't much code, so I just translated it to D.

First iteration was slow as dirt because of resource contention for memory allocation.  Put in a few ugly hacks to avoid all memory allocation in inner loops (i.e. ~= ), and now the D version is faster than the C++ version.

Anyhow, the real quesiton is, why is it that I can get away w/ heap allocations in inner loops (i.e. vector.push_back() ) in multithreaded C++ code but not in multithreaded D code?  I'm assuming, since there didn't seem to be any resource contention in the C++ version, that there was no locking for the memory allocation, but all of the allocation was done through STL rather than explicitly, so I don't really know.  However, the code actually worked like this, so I never really gave it much thought until today.  Is it just a minor miracle that this worked at all w/o me explicitly using some kind of lock?

By the way, if it's implementation defined, my C++ compiler is MinGW GCC 3.45.
August 22, 2008
Reply to dsimcha,

> I guess this really is more of a C++ question then a D question, but
> it's kind of both, so what the heck?
> 
> Anyhow, I had some old C++ code laying around that I've been using,
> and I had added some quick and dirty multithreading of embarrassingly
> parallel parts of it to it to speed it up.  Lo and behold, it actually
> worked.  A while later, I decided I needed to extend it in ways that
> couldn't be done nearly as easily in C++ as in D, and it wasn't much
> code, so I just translated it to D.
> 
> First iteration was slow as dirt because of resource contention for
> memory allocation.  Put in a few ugly hacks to avoid all memory
> allocation in inner loops (i.e. ~= ), and now the D version is faster
> than the C++ version.
> 
> Anyhow, the real quesiton is, why is it that I can get away w/ heap
> allocations in inner loops (i.e. vector.push_back() ) in multithreaded
> C++ code but not in multithreaded D code?  I'm assuming, since there
> didn't seem to be any resource contention in the C++ version, that
> there was no locking for the memory allocation, but all of the
> allocation was done through STL rather than explicitly, so I don't
> really know.  However, the code actually worked like this, so I never
> really gave it much thought until today.  Is it just a minor miracle
> that this worked at all w/o me explicitly using some kind of lock?
> 
> By the way, if it's implementation defined, my C++ compiler is MinGW
> GCC 3.45.
> 

under C++ I'll guess that new is a thin skin on malloc, under D, new uses the GC. If one of these is thread safe and the other isn't, there's your issue. But I'm just guessing.


August 22, 2008
"dsimcha" wrote
>I guess this really is more of a C++ question then a D question, but it's kind
> of both, so what the heck?
>
> Anyhow, I had some old C++ code laying around that I've been using, and I
> had
> added some quick and dirty multithreading of embarrassingly parallel parts
> of
> it to it to speed it up.  Lo and behold, it actually worked.  A while
> later, I
> decided I needed to extend it in ways that couldn't be done nearly as
> easily
> in C++ as in D, and it wasn't much code, so I just translated it to D.
>
> First iteration was slow as dirt because of resource contention for memory
> allocation.  Put in a few ugly hacks to avoid all memory allocation in
> inner
> loops (i.e. ~= ), and now the D version is faster than the C++ version.
>
> Anyhow, the real quesiton is, why is it that I can get away w/ heap
> allocations in inner loops (i.e. vector.push_back() ) in multithreaded C++
> code but not in multithreaded D code?  I'm assuming, since there didn't
> seem
> to be any resource contention in the C++ version, that there was no
> locking
> for the memory allocation, but all of the allocation was done through STL
> rather than explicitly, so I don't really know.  However, the code
> actually
> worked like this, so I never really gave it much thought until today.  Is
> it
> just a minor miracle that this worked at all w/o me explicitly using some
> kind
> of lock?
>
> By the way, if it's implementation defined, my C++ compiler is MinGW GCC 3.45.

I think it may be the way the GC is implemented.  If it needs memory, it first tries running a collect cycle (not sure if there is some algorithm to limit running the GC) before trying to get more memory from the OS.  If you put the allocations in an inner loop, you are running collect cycles more frequently (you should be able to see this by profiling the code).  I'm not sure of the algorithm to decide when to run, but worst case, it is every time you allocate and there is no free space left.

Most memory allocation schemes that I've seen need to lock the heap, so it is always a point of contention.  However, with the new shared/unshared thing, it looks like D might be advancing past that obstacle.  But the GC really needs some work to be smarter about how many times it runs.  When implementing dcollections, I created a custom allocator that allocates pages of elements at once instead of individual elements.  This saw a huge speedup, because I'm not running the GC as frequently.

-Steve


August 22, 2008
dsimcha wrote:
> I guess this really is more of a C++ question then a D question, but it's kind
> of both, so what the heck?
> 
> Anyhow, I had some old C++ code laying around that I've been using, and I had
> added some quick and dirty multithreading of embarrassingly parallel parts of
> it to it to speed it up.  Lo and behold, it actually worked.  A while later, I
> decided I needed to extend it in ways that couldn't be done nearly as easily
> in C++ as in D, and it wasn't much code, so I just translated it to D.
> 
> First iteration was slow as dirt because of resource contention for memory
> allocation.  Put in a few ugly hacks to avoid all memory allocation in inner
> loops (i.e. ~= ), and now the D version is faster than the C++ version.
> 
> Anyhow, the real quesiton is, why is it that I can get away w/ heap
> allocations in inner loops (i.e. vector.push_back() ) in multithreaded C++
> code but not in multithreaded D code?  I'm assuming, since there didn't seem
> to be any resource contention in the C++ version, that there was no locking
> for the memory allocation, but all of the allocation was done through STL
> rather than explicitly, so I don't really know.  However, the code actually
> worked like this, so I never really gave it much thought until today.  Is it
> just a minor miracle that this worked at all w/o me explicitly using some kind
> of lock?

You can get away with it in C++ because there are no garbage collection cycles occurring when the allocator runs low on memory.  To easiest way to approach an apples-apples comparison would be to disable the GC at the start of your program.

Another performance issue for some programs is a result of how the GC allocator obtains memory from the OS.  It does so in bite-sized blocks, so if your app does a ton of allocation and then runs the GC is not only collecting periodically but then it's also obtaining a chunk of memory from the OS to store the latest allocation (with some extra room for additional blocks), then it's doing the same thing again, etc.  The Tango GC provides a reserve() method to optimize for this situation by allowing the user to have the GC pre-build a pool of N bytes for use by future allocations.  In my testing this increased app performance by 50% or more given certain program designs.


Sean
August 22, 2008
But D's malloc appears to require locking even when memory is available, just to allocate it.  Heck, it even synchronizes to check the capacity of an allocated block, which (I'm guessing) means every ~= requires a synch if the code is multithreaded, even if no realloc ends up being necessary.  I base this on reading the GC implementation.  I assume that allocating via builtin arrays is basically a thin layer on top of D's malloc.  Here's the Phobos implementation.  For those who haven't looked at this, the function thread_needLock just returns true if more than 1 thread is currently running.  Haven't looked at Tango b/c, unfortunately, it's not on D2 yet.

    void *malloc(size_t size)
    {
        if (!size)
        {
            return null;
        }

	if (!thread_needLock())
	{
	    return mallocNoSync(size);
	}
	else synchronized (gcLock)
	{
	    return mallocNoSync(size);
	}
    }

Bottom line is, it's mostly just a curiosity/future reference question, since with a few tweaks, the D version is now *faster* than the C++ version, but how can C++ get away w/ concurrent memory allocations while D can't, or should my C++ version never have worked in the first place?
August 22, 2008
dsimcha wrote:
> But D's malloc appears to require locking even when memory is available, just to
> allocate it.  Heck, it even synchronizes to check the capacity of an allocated
> block, which (I'm guessing) means every ~= requires a synch if the code is
> multithreaded, even if no realloc ends up being necessary.  I base this on reading
> the GC implementation.

You're right.  But it's likely that an allocator for C++ would have to do the same thing, unless you're talking about a special-purpose one like HOARD.  Also, the Tango GC will only lock on this mutex if the app is actually multithreaded.  Single-threaded apps skip the locking process.  The Phobos GC does this as well, but I recall finding some gaps in its coverage for this feature.

> I assume that allocating via builtin arrays is basically a
> thin layer on top of D's malloc.  Here's the Phobos implementation.  For those who
> haven't looked at this, the function thread_needLock just returns true if more
> than 1 thread is currently running.  Haven't looked at Tango b/c, unfortunately,
> it's not on D2 yet.
> 
>     void *malloc(size_t size)
>     {
>         if (!size)
>         {
>             return null;
>         }
> 
> 	if (!thread_needLock())
> 	{
> 	    return mallocNoSync(size);
> 	}
> 	else synchronized (gcLock)
> 	{
> 	    return mallocNoSync(size);
> 	}
>     }

thread_needLock() was actually cribbed from Tango a while back.  Phobos used to do something like "Thread.getAll.length > 1."  The effect is essentially the same though.

> Bottom line is, it's mostly just a curiosity/future reference question, since with
> a few tweaks, the D version is now *faster* than the C++ version, but how can C++
> get away w/ concurrent memory allocations while D can't, or should my C++ version
> never have worked in the first place?

I don't think it's the locking that's an issue but rather the collection cycles plus possibly the need to obtain additional pools from the OS.


Sean