Class size and performance
January 20, 2008
For some code I'm writing, I have a parser that creates instances of a class.  For a test case I have, it creates 728,050 instances.  That's a lot, and takes a lot of time, but this is life.

I've got a few ideas on optimizing it, but while testing some of them out I noticed something interesting.  If I comment out a single, unused member, I get double the speed.

The number and types of members do not matter, just their size.

I wrote a reduced test case.  To compile it, use either of these two commands:

dmd -version=fast -O -inline -run weird.d
dmd -version=slow -O -inline -run weird.d

The same effect happens with gdc.  It also happens with optimization and inlining on or off (they make no difference.)

import std.stdio, std.perf;

class TestClass
{
	ubyte[56] member1;

	version (slow)
		ubyte[4] member2;
}

void main()
{
	// This is just how many my real example used.
	const test_instances = 728_050;

	// Let's store them for accuracy to the problem.
	TestClass[] example = null;
	example.length = test_instances;

	auto counter = new PerformanceCounter();

	counter.start();
	for (int i = 0; i < test_instances; i++)
		example[i] = new TestClass();
	counter.stop();

	PerformanceCounter.interval_type ms = counter.microseconds();

	writefln("%f seconds", cast(double) ms / 1_000_000.0);
}

Also of note, if I put a custom allocator on to determine the size, it's slow until I remove another 4 bytes of members.

Is there some sort of threshold in the gc?  Why is this happening?

Thanks,
-[Unknown]
January 21, 2008
Unknown W. Brackets Wrote:

> For some code I'm writing, I have a parser that creates instances of a class.  For a test case I have, it creates 728,050 instances.  That's a lot, and takes a lot of time, but this is life.
> 
> I've got a few ideas on optimizing it, but while testing some of them out I noticed something interesting.  If I comment out a single, unused member, I get double the speed.
> 
> The number and types of members do not matter, just their size.
> 
> I wrote a reduced test case.  To compile it, use either of these two commands:
> 
> dmd -version=fast -O -inline -run weird.d
> dmd -version=slow -O -inline -run weird.d
> 
> The same effect happens with gdc.  It also happens with optimization and inlining on or off (they make no difference.)
> 
> import std.stdio, std.perf;
> 
> class TestClass
> {
> 	ubyte[56] member1;
> 
> 	version (slow)
> 		ubyte[4] member2;
> }
> 
> void main()
> {
> 	// This is just how many my real example used.
> 	const test_instances = 728_050;
> 
> 	// Let's store them for accuracy to the problem.
> 	TestClass[] example = null;
> 	example.length = test_instances;
> 
> 	auto counter = new PerformanceCounter();
> 
> 	counter.start();
> 	for (int i = 0; i < test_instances; i++)
> 		example[i] = new TestClass();
> 	counter.stop();
> 
> 	PerformanceCounter.interval_type ms = counter.microseconds();
> 
> 	writefln("%f seconds", cast(double) ms / 1_000_000.0);
> }
> 
> Also of note, if I put a custom allocator on to determine the size, it's slow until I remove another 4 bytes of members.
> 
> Is there some sort of threshold in the gc?  Why is this happening?
> 
> Thanks,
> -[Unknown]

Sir, I'm not sure about the GC, but I can say that your program probably doesn't need to create 728,000 objects to parse an input stream.

I'm writing a parser too at the moment; it's a rather experimental, oddly written one though.  It [should] perform tokenization and parsing in a single O(1) pass for the ECMAScript language.

Take a look if you're interested; http://dsource.org/projects/walnut/browser/branches/1.9/source/interpreter.d

Regards,
Dan
January 21, 2008
Unknown W. Brackets wrote:
> For some code I'm writing, I have a parser that creates instances of a class.  For a test case I have, it creates 728,050 instances.  That's a lot, and takes a lot of time, but this is life.
> 
> I've got a few ideas on optimizing it, but while testing some of them out I noticed something interesting.  If I comment out a single, unused member, I get double the speed.
> 
> The number and types of members do not matter, just their size.
> 
> I wrote a reduced test case.  To compile it, use either of these two commands:
> 
> dmd -version=fast -O -inline -run weird.d
> dmd -version=slow -O -inline -run weird.d
> 
> The same effect happens with gdc.  It also happens with optimization and inlining on or off (they make no difference.)
> 
> import std.stdio, std.perf;
> 
> class TestClass
> {
>     ubyte[56] member1;
> 
>     version (slow)
>         ubyte[4] member2;
> }

Without version(slow), TestClass will occupy 64 bytes, 4 bytes for the vtbl ptr, 4 bytes for the monitor ptr, and 56 bytes for the data.  With version(slow), the GC will reserve 128 bytes for the same class, because it will need 68 bytes of storage.
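
If you want to double-check the raw figure, something like this should print the instance size (a quick sketch; the 64 -> 128 jump comes from the GC handing out power-of-two sized bins):

import std.stdio;

class TestClass
{
	ubyte[56] member1;

	version (slow)
		ubyte[4] member2;
}

void main()
{
	// classinfo.init.length is the instance size in bytes, including
	// the hidden vtbl and monitor pointers.
	writefln("instance size: %d bytes", TestClass.classinfo.init.length);
	// Prints 64 without -version=slow and 68 with it; the allocation
	// is then rounded up to the GC's next bin size (64 -> 128).
}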


Sean
January 21, 2008
Daniel Lewis,

Actually, I made a mistake.  That's running the parsing 50 or 75 times (I can't remember now.)  So it was actually only like 14,500 objects per parsing.

Yes, it's true that most parsing does not require such a large number of objects (and a 200 kilobyte file as I'm talking about here really shouldn't need so many.)  However, I'm not parsing a configuration file or even a command file.  In my case it does really need that many objects (this is the purpose of parsing.)

My parser is actually designed to work on an incoming stream, parse on the fly in chunks, and according to my benchmarks against similar parsers, is quite fast already.

Even so, that is interesting.  My parser works similarly in some ways to yours (switches and parseXYZ functions.)

-[Unknown]


Daniel Lewis wrote:
> Sir, I'm not sure about the GC, but I can say that your program probably doesn't need to create 728,000 objects to parse an input stream.
> 
> I'm writing a parser too at the moment; it's a rather experimental, oddly written one though.  It [should] perform tokenization and parsing in a single O(1) pass for the ECMAScript language.
> 
> Take a look if you're interested;
> http://dsource.org/projects/walnut/browser/branches/1.9/source/interpreter.d
> 
> Regards,
> Dan
January 21, 2008
Ah, thank you.  That makes a lot of sense.  I thought it was interesting that it was near the 64 mark, but didn't realize there were both vtbl and monitor pointers.

Sounds like this would probably be resolved by creating a custom allocator that preallocates space for a number of instances in chunks, which was already an idea I wanted to test for optimizing performance.  (The objects themselves will most likely be deleted all at once, or not at all, so this should be a good optimization.)

Thanks.

-[Unknown]


Sean Kelly wrote:
> Without version(slow), TestClass will occupy 64 bytes, 4 bytes for the
> vtbl ptr, 4 bytes for the monitor ptr, and 56 bytes for the data.  With
> version(slow), the GC will reserve 128 bytes for the same class, because
> it will need 68 bytes of storage.
> 
> 
> Sean
January 21, 2008
> I'm writing a parser too at the moment; it's a rather experimental, oddly written one though.  It [should] perform tokenization and parsing in a single O(1) pass for the ECMAScript language.

I call shenanigans.  Your output will be O(n), no?  Your parser can't possibly generate O(n) output in O(1) time.

--bb
January 21, 2008
Bill Baxter Wrote:

> > I'm writing a parser too at the moment; it's a rather experimental, oddly written one though.  It [should] perform tokenization and parsing in a single O(1) pass for the ECMAScript language.
> 
> I call shenanigans.  Your output will be O(n), no?  Your parser can't possibly generate O(n) output in O(1) time.
> 
> --bb

I meant for each given token; same as LL(1) or LALR(k) notation, but yes, it's a misusage of big-O.

The idea I was trying to express is this;

In most parsers, you run a giant all-you-can-eat tokenizer switch, which returns a token, and then you run some autogenerated parser switch which generates the AST.  From there, D generates an IR tree which is much flatter than the original AST, and then interprets that.

Mine runs by performing context-sensitive tokenization switches which generate the AST directly from the input stream.  It's not working for a lot of different cases (there are 26 BUGS mentioned in the comments), but all of them seem solvable from within that paradigm.


January 21, 2008
Yes, this is how mine works and definitely a good solution imho. Faster, more efficient, easier to debug, and cleaner.

Personally I am no fan of auto-generated parsers.  But then again, it depends on what you're parsing... I'm sure you'll be able to get it to parse ECMA cleanly and efficiently.

-[Unknown]


Daniel Lewis wrote:
> Bill Baxter Wrote:
> 
>>> I'm writing a parser too at the moment; it's a rather experimental, oddly written one though.  It [should] perform tokenization and parsing in a single O(1) pass for the ECMAScript language.
>> I call shenanigans.  Your output will be O(n), no?  Your parser can't possibly generate O(n) output in O(1) time.
>>
>> --bb
> 
> I meant for each given token; same as LL(1) or LALR(k) notation, but yes, it's a misusage of big-O.
> 
> The idea I was trying to express is this;
> 
> In most parsers, you run a giant all-you-can-eat tokenizer switch, which returns a token, and then you run some autogenerated parser switch which generates the AST.  From there, D generates an IR tree which is much flatter than the original AST, and then interprets that.
> 
> Mine runs by performing context-sensitive tokenization switches which generate the AST directly from the input stream.  It's not working for a lot of different cases (there are 26 BUGS mentioned in the comments), but all of them seem solvable from within that paradigm.
> 
> 
January 21, 2008
Unknown W. Brackets wrote:
> Ah, thank you.  That makes a lot of sense.  I thought it was interesting that it was near the 64 mark, but didn't realize there were both vtbl and monitor pointers.
> 
> Sounds like this would probably be resolved by creating a custom allocator, which preallocates space for a number of instances in chunks, which was already an idea I had to test for optimizing performance (the objects themselves will most likely be deleted all at once, or not, so this should be a good optimization.)

One trick Walter uses in DMD is free lists; did you check that out?  Obviously it won't work if you need all of the objects allocated at the same time.

http://www.digitalmars.com/d/memory.html#freelists
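
The pattern on that page looks roughly like this (a from-memory sketch, not the exact code; see the link for Walter's version):

class Foo
{
	static Foo freelist;	// head of the free list
	Foo next;		// next object on the free list

	static Foo allocate()
	{
		Foo f;
		if (freelist !is null)
		{
			// Recycle a previously freed object.
			f = freelist;
			freelist = f.next;
		}
		else
			f = new Foo;
		return f;
	}

	static void deallocate(Foo f)
	{
		// Push the object back onto the free list instead of
		// letting the GC reclaim it.
		f.next = freelist;
		freelist = f;
	}
}

Callers then use Foo.allocate() and Foo.deallocate(f) instead of new and delete, so recycled objects never go back through the GC.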
January 21, 2008
That's a good trick, but it's probably not helpful in this case.  My problem is, I know that I'm going to have 1000 of these objects. There's no reason to allocate each one individually.  Rather I should create a pool for them, and shove the instances into the pool.

Free lists are great when I'm going to be trashing objects frequently.  My problem is that I have 1000 objects I need to create and then keep until the end.  And if I ever delete any of them, the most likely case is that I'm deleting all of them.

Since my problem (of this thread) appears to be the allocation size doubling once I pass 64 bytes, this should let me manage the memory better.  I'm not sure whether I want to pack the instances tightly within my buffer or space them every 128 bytes (which might give better alignment), but either way fewer allocations will, I'm hoping, improve speed.
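
Roughly what I have in mind (just a sketch; the chunk size and the all-ubyte member layout here are placeholders, not my real class):

class TestClass
{
	ubyte[60] members;	// placeholder for the real members

	// Chunked allocator: carve instances out of one big preallocated
	// block instead of doing one GC allocation per object.
	static ubyte[] pool;
	static size_t used;

	new(size_t sz)
	{
		const size_t instancesPerChunk = 1024;
		if (pool.length - used < sz)
		{
			// Note: ubyte[] memory isn't scanned for pointers, so
			// this only works while the class holds no GC references.
			pool = new ubyte[sz * instancesPerChunk];
			used = 0;
		}
		void* p = pool.ptr + used;
		used += sz;
		return p;
	}

	// No matching deallocator: drop the whole pool when done rather
	// than deleting individual objects.
}

Packing instances back to back wastes no space; rounding used up to a 16- or 128-byte boundary instead would trade a little memory for alignment.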

I'll have to play with a few different things.

-[Unknown]


torhu wrote:
> One trick Walter uses in DMD is free lists; did you check that out?  Obviously it won't work if you need all of the objects allocated at the same time.
> 
> http://www.digitalmars.com/d/memory.html#freelists