April 29, 2011
On 28.04.2011 18:17, Steven Schveighoffer wrote:

> It would seem making the metadata be a constant offset from the data page would be better, but of course, it cannot be that way for multi-page blocks.

  Why can't it?

> If you have ideas on how to improve the performance, I encourage you to learn how the GC works and submit some patches!  It actually can be pretty fun.

  Probably I will; I'll take a look.

> Yes, the allocation and free performance could be improved, but it doesn't change the fact that delete manually is not a huge performance gain, if at all.

  I agree; it's just that in some cases it is a gain, if not in performance then in memory usage. Imagine a tight loop with an aggressive GC, where a collection is attempted on every allocation...
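
  A rough sketch of one workaround for such loops (assuming core.memory's GC.disable/GC.enable; the names come from druntime, not from any code in this thread):

	import core.memory : GC;

	void hotLoop()
	{
		GC.disable();              // no collections while we allocate heavily
		scope(exit) GC.enable();   // re-enable afterwards; a later allocation may then collect
		foreach (i; 0 .. 1_000_000)
		{
			auto buf = new ubyte[64];   // hypothetical allocation-heavy work
			buf[0] = cast(ubyte)i;
		}
	}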

> If an object is to be collected, there will be no references to that object.  This means you do not have to scan that object's memory to see if it points to something else.  Less memory needs to be scanned.

  Right, but I still have to scan the other objects and roots which are still alive, and there can be many of those. A good GC shouldn't attempt a collection unless memory is tight, though.

  Anyway, I'll take a look at the GC code, and will probably write a benchmark to find out its weaknesses...

/Alexander
April 29, 2011
On Thu, 28 Apr 2011 16:14:48 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:

> Steven Schveighoffer wrote:
>> I would need to see what your benchmark is before I'd say it was
>> conclusive.
>
> Ok. Version with delete is GC_hints (delete does not give many semantic
> guarantees, ergo it is a hint), version without delete is GC. It is also attached
> for convenience.
>
> This benchmark tests quite high loads on the GC heap, so for usual applications
> the GC won't perform all that bad. But then again, usual applications don't come
> with a memory allocation bottleneck.
>
> @Andrei: The benchmark also includes a comparison to malloc/free (version c_heap).
>
> Timon
>
>
> /* Benchmark GC vs GC & hints vs freelist vs C Heap
>  * My measurements on Intel Core Duo 2x2.4 GhZ on Ubuntu 10.04 (lucid) Amd64
>  * version=GC:       1m45s
>  * version=GC_hints: ~6.8s
>  * //using custom allocators:
>  * version=freelist: ~3.7s
>  * version=c_heap:   ~4.5s
>  */
>
>
> import std.stdio;
> import std.c.stdlib;
> import std.c.string;
> import std.algorithm:swap;
>
>
> version=GC;//default
> //version=GC_hints;
> //version=freelist;
> //version=c_heap;
>
>
> version(freelist) version=calldel;
> version(GC_hints) version=calldel;
> version(c_heap) version=calldel;
>
> class foo{
> 	char data[1020]=void;
> 	static foo freelist;
> 	foo next;
> 	version(freelist){
> 		new(size_t s){
> 			if(freelist !is null){
> 				void *r=cast(void*)freelist;
> 				freelist=freelist.next;
> 				return r;
> 			}
> 			return cast(void*)new char[s];//could use malloc(s); effects are similar
> 		}
> 		delete(void *p){
> 			if(p){
> 				foo oldfl=freelist;
> 				freelist=cast(foo)p;
> 				freelist.next=oldfl;
> 			}
> 		}
> 	}
> 	version(c_heap){
> 		new(size_t s){return malloc(s);}
> 		delete(void *p){if(p) free(p);}
> 	}
> };
>
> foo[100000] a;
>
> void main(){
> 	srand(100);
> 	int top;
> 	int dir;//0: expected growing, 1: expected shrinking
> 	foreach(t;0..10000000){
> 		if(!top){
> 			a[top++]=new foo;
> 			dir=0;
> 		}else if(top==100000){
> 			top--;
> 			version(calldel) delete a[top];
> 			dir=1;
> 		}else if(dir^!(rand()%3)){
> 			top--;
> 			version(calldel) delete a[top];
> 		}else a[top++]=new foo;
> 		if(!rand()%100) for(int i=0;i<100;i++) swap(a[0],a[rand()%top]);//mess around
> 	}
> }

I'll point out a couple of issues here:

1. you are never removing elements from the array when version(calldel) is not enabled.  That is, the GC might run, but does not clean up any objects when you just do top--.  Although this might not make a huge difference, it is not an equivalent comparison.  I'd suggest adding else clauses to the version(calldel) statements where you set a[top] to null (a quick sketch follows this list).
2. the GC is conservative, and you are allocating swaths of 1K objects.  It's quite possible that you are running into false pointer issues.  This might be helped with a more precise GC (which there is a patch for in bugzilla).  David Simcha has also checked into github improvements for allocation of large objects.  This may also help with your benchmark.
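
For point 1, the change I mean is roughly this (an untested sketch; it applies to both places in the loop that currently do version(calldel) delete a[top];):

	version(calldel) delete a[top];
	else             a[top]=null;   // drop the reference so a collection can actually reclaim the object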

I too have seen the GC perform horribly in situations like this, where you are allocating lots and lots of data.  All you need to do to prove it is keep appending to an array for enough elements; you will see huge pauses where the GC runs.
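
For instance, a loop as trivial as this (just a sketch, not tied to your benchmark) shows visible pauses as the array grows:

	int[] arr;
	foreach (i; 0 .. 50_000_000)
		arr ~= i;   // periodic collections show up as noticeable hitches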

These are situations where delete helps because of the conservative implementation of the GC.  Yes, it is a valid point, but it doesn't prove that delete is better against *any* GC.  What we need to do is improve the GC.

I tried setting a[top] to null in the GC system.  For comparison, on my system your original GC test took 2m42s.

With setting a[top] to null, the new run on my system is 2m36s.  So the time does not decrease significantly.

Some more data:

I added the following line to the foreach loop:

if(!(t % 10000)) writeln(t, "\t", top);

There is some interesting behavior, particularly for the GC and GC_hints versions.  Your code obviously goes through cycles in the array, first growing up to 100,000, then shrinking back down to 0, then repeating the cycle.  There is some noise with the rand % 3, but it is not enough to make the growth and shrinking truly random.  What is interesting in this display is, in both GC_hints and GC versions, the initial growth to 100,000 is somewhat slow, with the GC_hints version being a bit faster.

However, after that, the difference is drastic.  The GC_hints version shrinks very quickly, and the GC version continues its slow pace.  This is mildly surprising, because setting the top element to null should be much faster than delete, and both are using the GC to allocate new nodes.  However, what I find really interesting is the subsequent growth back up to 100,000 is still slow in the GC version, but is still very speedy in the GC_hints version.  Given that both are using the GC for allocation, I would expect even footing.  There is definitely something different going on when delete is called than when a collection cycle is called.  The GC version may leave more objects allocated than the GC_hints version, but it should speed up as it can collect more object blocks to reuse.  It's almost as if it's not reusing memory, but my system isn't running out of memory.  So I'm not sure exactly what's going on.

This clearly needs some investigation, thank you for sharing your benchmark.

-Steve
April 29, 2011
On Thu, 28 Apr 2011 19:59:52 -0400, Alexander <aldem+dmars@nk7.net> wrote:

> On 28.04.2011 18:17, Steven Schveighoffer wrote:
>
>> It would seem making the metadata be a constant offset from the data page would be better, but of course, it cannot be that way for multi-page blocks.
>
>   Why can't it?

Because in the second or subsequent page of a multi-page memory block, the previous page cannot be metadata, it must be data (this is one contiguous block).  So how do you find the metadata for such a block?

In fact, thinking about it, the metadata is what tells the GC how big the block is, and where it starts, so you can't store the metadata in a constant offset, it's just not possible.
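
Roughly, the lookup has to go through the pool's page table; here is a simplified sketch (not the actual druntime code, just the idea) of how the start of a multi-page block can be found from an interior pointer:

	enum PAGESIZE = 4096;
	enum PageKind : ubyte { SMALL, PAGE, PAGEPLUS }  // PAGE = first page of a big block, PAGEPLUS = continuation

	struct Pool
	{
		void*      baseAddr;   // first data page of the pool
		PageKind[] pageKind;   // one entry per page; this is where the metadata lives
	}

	// Find the start of the (possibly multi-page) block containing p.
	void* findBlockStart(ref Pool pool, void* p)
	{
		auto pn = (cast(size_t)p - cast(size_t)pool.baseAddr) / PAGESIZE;
		while (pool.pageKind[pn] == PageKind.PAGEPLUS)
			--pn;   // continuation pages carry no metadata of their own; walk back to the first page
		return cast(void*)(cast(size_t)pool.baseAddr + pn * PAGESIZE);
	}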

>> Yes, the allocation and free performance could be improved, but it doesn't change the fact that delete manually is not a huge performance gain, if at all.
>
>   I agree, just in some cases it is a gain, if not to performance, then to memory usage. Imagine a tight loop with aggressive GC , when collection is attempted on every allocation...

Yes, if you have specific memory requirements, it's best to use an alternative strategy.  That does not mean delete is the answer, and even if it is, the functionality that delete represents is not going away; only the keyword itself is.
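
The same functionality is available through the library; roughly something like this (a sketch only, using current druntime names, which may differ between releases):

	import core.memory : GC;

	// Roughly what delete does: run the destructor, then return the block to the GC.
	void releaseNow(T)(ref T obj) if (is(T == class))
	{
		if (obj !is null)
		{
			destroy(obj);              // run the destructor (this helper was called clear in older releases)
			GC.free(cast(void*)obj);   // hand the memory back to the GC immediately
			obj = null;
		}
	}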

>> If an object is to be collected, there will be no references to that object.  This means you do not have to scan that object's memory to see if it points to something else.  Less memory needs to be scanned.
>
>   Right, but I still have to scan other objects and roots, which are still alive, and there can be many.

Yes, but your assertion was that more dead objects means the scan takes longer.  It's actually the opposite (or at least, it should be).

The sweep phase probably takes longer, though, but I wouldn't expect this to be the bottleneck.

> Good GC shouldn't attempt a collection unless memory is tight, though.

I think D's GC only attempts a collection if memory is tight, but I'm not sure now (after looking into Timon's benchmarks a bit).

-Steve
April 29, 2011
On 29.04.2011 15:10, Steven Schveighoffer wrote:

> I think D's GC only attempts a collection if memory is tight, but I'm not sure now (after looking into Timon's benchmarks a bit).

  I am not sure either; running some code, I've noticed that memory usage stays quite low (even though there were a lot of allocations in a tight loop).

  PS: Quickly peeked into D's GC code - oh... it will take time :)

/Alexander
April 29, 2011
Steven Schveighoffer wrote:
> I'll point out a couple of issues here:
>
> 1. you are never removing elements from the array when version(calldel) is not enabled.  That is, the GC might run, but does not clean any objects when you just do top--.  Although this might not make a huge difference, it is not an equivalent comparison.  I'd suggest adding else clauses to the version(calldel) where you set a[top] to null.

Whoops, good catch, this is a bug. I get 1m27s when setting a[top] to null, which is a little bit faster.

> 2. the GC is conservative, and you are allocating swaths of 1K objects. It's quite possible that you are running into false pointer issues.  This might be helped with a more precise GC (which there is a patch for in bugzilla).  David Simcha has also checked into github improvements for allocation of large objects.  This may also help with your benchmark.

The data=void can be removed, but it actually slows down the benchmark a little bit on my machine.

>
> I too have seen the GC perform horribly in situations like this, where you are allocating lots and lots of data.  All you need to do to prove it is append an array for so many elements, you will see huge pauses where the GC runs.
>
> These are situations where delete helps because of the conservative implementation of the GC.  Yes, it is a valid point, but it doesn't prove that delete is better against *any* GC.  What we need to do is improve the GC
>
> I tried setting a[top] to null in the GC system.  For comparison, on my system your original GC test took 2m42s.
>
> With setting a[top] to null, the new run on my system is 2m36s.  So the time does not decrease significantly.
>
> Some more data:
>
> I added the following line to the foreach loop:
>
> if(!(t % 10000)) writeln(t, "\t", top);
>
> There is some interesting behavior, particularly for the GC and GC_hints versions.  Your code obviously goes through cycles in the array, first growing up to 100,000, then shrinking back down to 0, then repeating the cycle.  There is some noise with the rand % 3, but it is not enough to make the growth and shrinking truly random.  What is interesting in this display is, in both GC_hints and GC versions, the initial growth to 100,000 is somewhat slow, with the GC_hints version being a bit faster.

The growing and shrinking is intended to work exactly that way; it makes the allocations more, say, dynamic. ;) If it were truly random, the memory usage would stay more or less constant.

>
> However, after that, the difference is drastic.  The GC_hints version shrinks very quickly, and the GC version continues its slow pace.  This is mildly surprising, because setting the top element to null should be much faster than delete, and both are using the GC to allocate new nodes. However, what I find really interesting is the subsequent growth back up to 100,000 is still slow in the GC version, but is still very speedy in the GC_hints version.  Given that both are using the GC for allocation, I would expect even footing.  There is definitely something different going on when delete is called than when a collection cycle is called.  The GC version may leave more objects allocated than the GC_hints version, but it should speed up as it can collect more object blocks to reuse.  It's almost as if it's not reusing memory, but my system isn't running out of memory.  So I'm not sure exactly what's going on.
>
> This clearly needs some investigation, thank you for sharing your benchmark.
>
> -Steve
May 02, 2011
On Fri, 29 Apr 2011 17:02:55 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:

> Steven Schveighoffer wrote:
>
>> 2. the GC is conservative, and you are allocating swaths of 1K objects.
>> It's quite possible that you are running into false pointer issues.  This
>> might be helped with a more precise GC (which there is a patch for in
>> bugzilla).  David Simcha has also checked into github improvements for
>> allocation of large objects.  This may also help with your benchmark.
>
> The data=void can be removed, but it actually slows down the benchmark a little
> bit on my machine.

Yes, that's because you are using char data, whose default is 0xff :)

data = void doesn't actually make that data random, because it still has to initialize the other parts of the class data.  The runtime doesn't have a way to "skip" certain parts of the initializer data.  So it likely just sets it to 0.

If you changed your data type to ubyte[], it would become fast again, even without the =void, because it's memsetting to 0 vs copying an initializer with some bytes set to 0xff.
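
The difference is easy to see in isolation (a small sketch, separate from your benchmark):

	import std.stdio;

	class WithChars { char[1020]  data; }   // class initializer is mostly 0xFF bytes
	class WithBytes { ubyte[1020] data; }   // class initializer is all zeros

	void main()
	{
		writefln("char.init  = 0x%02X", cast(uint)char.init);    // 0xFF
		writefln("ubyte.init = 0x%02X", cast(uint)ubyte.init);   // 0x00
		// new WithChars has to copy a non-zero initializer into every instance;
		// new WithBytes can be satisfied with a plain memset to zero.
	}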

-Steve