Thread overview
TempAlloc
Dec 05, 2008
dsimcha
Dec 06, 2008
Fawzi Mohamed
Dec 06, 2008
dsimca
Dec 07, 2008
Fawzi Mohamed
Dec 07, 2008
dsimcha
Dec 07, 2008
Fawzi Mohamed
December 05, 2008
As per a discusson on digitalmars.D about a month ago, I've created a temp
space allocator for quickly allocating memory in a last in, first out manner.
 Its main use is for allocating scratch space to be used within a function
without introducing threading bottlenecks, etc.  I've released it as an alpha
on scrapple.  I'm working on a major update to my statistics library (also on
scrapple) and will dogfood this there when I'm done with the update.  I think
that it's pretty useful, at least in number crunching-type code, in its
current form as a library.  However, it might be even better if integrated
into druntime and the core language to make it safer (it's pretty dangerous if
used incorrectly) or cleaner to use (I've done the best I can here, but if I
had some compiler support, I could automate some stuff that I make the caller do.)

For the code, see http://dsource.org/projects/scrapple/browser/trunk/tempAlloc.  The code is for D2 only, but could probably be ported to D1 using Tango's thread-local storage.
December 06, 2008
On 2008-12-06 00:20:49 +0100, dsimcha <dsimcha@yahoo.com> said:

> As per a discusson on digitalmars.D about a month ago, I've created a temp
> space allocator for quickly allocating memory in a last in, first out manner.
>  Its main use is for allocating scratch space to be used within a function
> without introducing threading bottlenecks, etc.  I've released it as an alpha
> on scrapple.  I'm working on a major update to my statistics library (also on
> scrapple) and will dogfood this there when I'm done with the update.  I think
> that it's pretty useful, at least in number crunching-type code, in its
> current form as a library.  However, it might be even better if integrated
> into druntime and the core language to make it safer (it's pretty dangerous if
> used incorrectly) or cleaner to use (I've done the best I can here, but if I
> had some compiler support, I could automate some stuff that I make the caller do.)
> 
> For the code, see
> http://dsource.org/projects/scrapple/browser/trunk/tempAlloc.  The code is for
> D2 only, but could probably be ported to D1 using Tango's thread-local storage.

I think that it could be useful to add an argument to to frameInit/free, in any case not just to speed things up, but to quickly catch mismatched init/free calls.

using GC.malloc in malloc for large requested sizes kind of defeats the stated purpose (and what makes superstack more difficult to use) of not being GC scanned, even if I understand the problem of either making bookkeeping more complicated or create potentially big holes in the stack...

Fawzi


December 06, 2008
== Quote from Fawzi Mohamed (fmohamed@mac.com)'s article
> I think that it could be useful to add an argument to to frameInit/free, in any case not just to speed things up, but to quickly catch mismatched init/free calls.

Can you elaborate on this?  I'm not exactly sure what such an argument would look like.  Also, I wanted to make TempAlloc fairly simple even if it costs some performance so that I would actually want to use it.  If doing something like this adds a lot of complexity for the caller, I probably won't implement it.  Right now, I like the simple mixin API.

> using GC.malloc in malloc for large requested sizes kind of defeats the stated purpose (and what makes superstack more difficult to use) of not being GC scanned, even if I understand the problem of either making bookkeeping more complicated or create potentially big holes in the stack...

I see over 4MB in a single allocation to be kind of a corner case anyhow.  If you need to allocate this much memory for a temp variable that doesn't escape the current scope, in a single allocation (probably pretty unusual already), since the cost of an allocation is sublinear in bytes allocated, the allocation time will likely be dwarfed by the time it takes to use the memory for whatever you need it for.  Furthermore, whether by frameInit/frameFree or by TempAlloc.malloc/TempAlloc.free, the memory gets freed deterministically rather than waiting for the GC to run to free it, so it's still faster than using new, etc.  In fact, I'm thinking I should have made this limit ~1MB (about 1/4 the size of a TempAlloc block) since allocating 4MB will always require the creation of a new block anyhow.

I'm open to suggestions about how to deal with these large block cases, but I will prioritize handling the common case of much smaller allocations efficiently and cleanly over what I feel is somewhat of a corner case.

December 07, 2008
On 2008-12-06 15:55:44 +0100, dsimca <dsimcha@yahoo.com> said:

> == Quote from Fawzi Mohamed (fmohamed@mac.com)'s article
>> I think that it could be useful to add an argument to to
>> frameInit/free, in any case not just to speed things up, but to quickly
>> catch mismatched init/free calls.
> 
> Can you elaborate on this?  I'm not exactly sure what such an argument would look
> like.  Also, I wanted to make TempAlloc fairly simple even if it costs some
> performance so that I would actually want to use it.  If doing something like this
> adds a lot of complexity for the caller, I probably won't implement it.  Right
> now, I like the simple mixin API.

it could simply be an int storing the place in the stack...
So if someone forgets to remove the frame the next call will see it immediately.
By the way couldn't you avoid the thread local State by simply keeping a stack of State structures?
Then instead of the thread local State you would simply look at the top of the State Stack, it should be faster and more self contained.
Then the element to check could really be the size of the stack, and if it is simply an int you can make it optional, if present you check for mismatches... and you get rid of passing the State out...

int initFrame()// returns the number of frames of the stack
void freeFrame(int handler=-1)// frees the frame, if handler is >=0 then checks the number of frames

>> using GC.malloc in malloc for large requested sizes kind of defeats the
>> stated purpose (and what makes superstack more difficult to use) of not
>> being GC scanned, even if I understand the problem of either making
>> bookkeeping more complicated or create potentially big holes in the
>> stack...
> 
> I see over 4MB in a single allocation to be kind of a corner case anyhow.  If you
> need to allocate this much memory for a temp variable that doesn't escape the
> current scope, in a single allocation (probably pretty unusual already), since the
> cost of an allocation is sublinear in bytes allocated, the allocation time will
> likely be dwarfed by the time it takes to use the memory for whatever you need it
> for.  Furthermore, whether by frameInit/frameFree or by
> TempAlloc.malloc/TempAlloc.free, the memory gets freed deterministically rather
> than waiting for the GC to run to free it, so it's still faster than using new,
> etc.  In fact, I'm thinking I should have made this limit ~1MB (about 1/4 the size
> of a TempAlloc block) since allocating 4MB will always require the creation of a
> new block anyhow.
> 
> I'm open to suggestions about how to deal with these large block cases, but I will
> prioritize handling the common case of much smaller allocations efficiently and
> cleanly over what I feel is somewhat of a corner case.

no problem I understand, and I am not a user yet.

OT:
I just read (when looking for your previous messages) about your statistics library. Nice!
You might be interested in giving a look to tango's random package, it has fast generation of accurate random bumbers with gauss exp and gamma distributions...
D1.0, but bringing it to D2.0 shouldn't be difficult...

Fawzi


December 07, 2008
== Quote from Fawzi Mohamed (fmohamed@mac.com)'s article
> it could simply be an int storing the place in the stack...
> So if someone forgets to remove the frame the next call will see it
> immediately.
> By the way couldn't you avoid the thread local State by simply keeping
> a stack of State structures?
> Then instead of the thread local State you would simply look at the top
> of the State Stack, it should be faster and more self contained.

I really don't see how this can work.  AFAIK, there are only 4 ways to deal with global mutable state in a multithreaded program:

1.  Locking.
2.  Somehow do everything atomically, if you can.
3.  Make the "global" state thread-local.
4.  Eliminate it altogether.

The biggest reason for me writing TempAlloc in the first place was to avoid lock contention for GC malloc access.  TempAlloc is somewhat faster even in single-threaded tests, but the really huge, orders of magnitude speedups come when it's used to avoid this contention.

> Then the element to check could really be the size of the stack, and if
> it is simply an int you can make it optional, if present you check for
> mismatches... and you get rid of passing the State out...
> int initFrame()// returns the number of frames of the stack
> void freeFrame(int handler=-1)// frees the frame, if handler is >=0
> then checks the number of frames

I thought about doing something like this. The biggest problem is large blocks. Since frequent very large allocations are where GC performance really suffers and false pointer issues turn up, I wanted to make sure that these get deleted deterministically, too.

Also, just to note:  I realized that I had not taken into account alignment when I first wrote TempAlloc.  I've updated it to use 16-byte alignment (good for x86), and to fix tempdup to work right with const.


December 07, 2008
On 2008-12-07 18:44:30 +0100, dsimcha <dsimcha@yahoo.com> said:

> == Quote from Fawzi Mohamed (fmohamed@mac.com)'s article
>> it could simply be an int storing the place in the stack...
>> So if someone forgets to remove the frame the next call will see it
>> immediately.
>> By the way couldn't you avoid the thread local State by simply keeping
>> a stack of State structures?
>> Then instead of the thread local State you would simply look at the top
>> of the State Stack, it should be faster and more self contained.
> 
> I really don't see how this can work.  AFAIK, there are only 4 ways to deal with
> global mutable state in a multithreaded program:
> 
> 1.  Locking.
> 2.  Somehow do everything atomically, if you can.
> 3.  Make the "global" state thread-local.
> 4.  Eliminate it altogether.
> 
> The biggest reason for me writing TempAlloc in the first place was to avoid lock
> contention for GC malloc access.  TempAlloc is somewhat faster even in
> single-threaded tests, but the really huge, orders of magnitude speedups come when
> it's used to avoid this contention.

mmh sorry this comes from not having really understood how TmpAlloc was supposed to work.
Indeed if you want to hide completely the presence of a State, and not pass it explicitly to the allocation routines you cannot avoid thread local.

I was mislead by the name, State is actually the superstack object, what I meant was to keep an extra index in it that simply counts the number of frames, and return/ask it in initFrame freeFrame so that you can check that you did not have any mismatched frames.