April 27, 2006
>Couldn't you make come of your parent nodes an allocators, so that all of their children get their memory from it?

I was in a hurry.  Let me revise.

Couldn't you make some of your parent nodes allocators, so that all of their children get their memory from their parent allocator nodes?



April 27, 2006
Craig Black wrote:
>>Couldn't you make come of your parent nodes an allocators, so that all of their children get their memory from it?
> 
> 
> I was in a hurry.  Let me revise.
> 
> Couldn't you make some of your parent nodes allocators, so that all of their children get their memory from their parent allocator nodes?
> 
> 

Hehe.  That's okay.  I understand text through typos.

I'm not entirely familiar with the concept of allocators as you describe.  Is this similar to memory pooling?  I always skipped over that portion of the C++ STL; way too much complexity for my small brain =P.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O M--@ V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e h>--->++ r+++ y+++
------END GEEK CODE BLOCK------

James Dunne
April 27, 2006
> I'm not entirely familiar with the concept of allocators as you describe. Is this similar to memory pooling?  I always skipped over that portion of the C++ STL; way too much complexity for my small brain =P.

Instead of calling malloc for each allocation unit, you could write an allocator class that would allocate memory in large chunks.  Then when you want an allocation unit, you request it from the allocator.  The allocator simply hands out the memory and marks it as used.

The hierarchical memory system you describe lends itself to using such custom allocators, because you would never have to call free for each unit. You could call free at the allocator level instead.  In this way, you would only call malloc and free for large chunks of memory.

You could also inform each allocator how much memory to reserve when it is created, since you may have a good idea of how much memory a given task will require.
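
A rough sketch in D of what I mean (the names and sizes are invented and it's untested -- just to show the chunk idea):

import core.stdc.stdlib : malloc, free;

// Hands out allocation units from large malloc'd chunks.  Units are
// never freed individually; the whole allocator is released at once.
struct ChunkAllocator
{
    enum chunkSize = 64 * 1024;
    ubyte*[] chunks;
    size_t used = chunkSize;    // forces a new chunk on first request

    // Assumes size <= chunkSize and ignores alignment, for brevity.
    void* allocate(size_t size)
    {
        if (used + size > chunkSize)
        {
            chunks ~= cast(ubyte*) malloc(chunkSize);
            used = 0;
        }
        void* p = chunks[$ - 1] + used;
        used += size;
        return p;
    }

    // malloc and free happen only here, at the allocator level.
    void release()
    {
        foreach (c; chunks)
            free(c);
        chunks = null;
        used = chunkSize;
    }
}

In your hierarchy, each parent node could own one of these, so tearing down a subtree is a single release() rather than a free() per child.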

Did that help?

-Craig


April 27, 2006
I ran some quick tests this morning using the word count example code and here are my results so far:


D version (wc):

Execution time: 0.179 s
Execution time: 0.183 s
Execution time: 0.149 s

C++ version 1 (wccpp2):

Execution time: 0.216 s
Execution time: 0.263 s
Execution time: 0.183 s

C++ version 2 (wccpp3):

Execution time: 0.221 s
Execution time: 0.168 s
Execution time: 0.185 s

C++ version 3 (wccpp4):

Execution time: 0.191 s
Execution time: 0.243 s
Execution time: 0.142 s


The apps were compiled according to the word count description and timed using ptime (I had some applications open during the tests so the results may be a tad inaccurate).  wccpp2 is straight from the web page, wccpp3 replaces the C++ IO code with a C version quite similar to the D code in std.file, and wccpp4 replaces std::string with slice<char>.

You'll notice that there is practically no difference between the std::string and slice tests (wccpp3 and wccpp4), and I think this is to be expected.  Typical C++ string implementations have a "small string optimization" to avoid dynamic allocations for strings of less than 16 characters.  And since there are likely very few words in the sample text that are larger than this, there are few if any dynamic allocations being performed by std::string.  Thus I think the one notable performance difference remaining between the C++ and D versions is that C++'s std::map is a balanced binary tree while D's associative array is a hash table.
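
Incidentally, slice<char> amounts to what a D array slice already is: a pointer/length pair into existing storage, so pulling a word out of the buffer allocates nothing.  A trivial D illustration (made-up buffer contents):

import std.stdio : writeln;

void main()
{
    char[] buffer = "the quick brown fox".dup;

    // A slice is just (pointer, length) into buffer; nothing is copied.
    char[] word = buffer[4 .. 9];    // "quick"
    assert(word.ptr == buffer.ptr + 4);
    writeln(word);
}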

I tried replacing std::map with a hash table implementation (technically an unordered_map implemented according to C++ TR1 that I did for a project a while back), but DMC choked on it and I don't have any more time to spend on this right now.  When I get a chance, I'll try to verify that the unordered_map code conforms to the C++ spec (it compiles cleanly at WL:4 in Visual Studio 2005) and will file a bug report for DMC if it does.  For reference, the bulk of the errors I saw were related to symbol lookup--a template class using static and non-static data from a parent class without fully qualifying the names; adding some 'using' statements took care of these.  The final one was a bit more subtle and offered me almost no useful information to go on, so that one may take some investigation.


Sean
April 27, 2006
Quick follow-up.  I decided to re-run the tests using the full text of the King James Bible (http://www.gutenberg.org/dirs/etext90/kjv10.txt) in an attempt to get better timing resolution, as it's 4,342 KB vs. Alice in Wonderland's 160 KB.  Here are my results:


D version (wc):

Execution time: 1.409 s
Execution time: 1.473 s
Execution time: 1.295 s

C++ version 1 (wccpp2):

Execution time: 1.779 s
Execution time: 1.762 s
Execution time: 1.792 s

C++ version 2 (wccpp3):

Execution time: 1.758 s
Execution time: 1.309 s
Execution time: 1.297 s

C++ version 3 (wccpp4):

Execution time: 0.693 s
Execution time: 0.767 s
Execution time: 0.704 s


Surprisingly, the slice code does substantially better than both the other C++ versions and the D code.  Perhaps this evening I'll re-run the C++ tests under Visual Studio to give the hash table a test drive.


Sean
April 27, 2006
> Surprisingly, the slice code does substantially better than both the other C++ versions and the D code.  Perhaps this evening I'll re-run the C++ tests under Visual Studio to give the hash table a test drive.

Wow!  It's over twice as fast!  How difficult would it be to use the same approach for wordcount in D by bypassing the GC?

So what do you think we should do about GC to make D faster?  Do you favor making GC optional or writing a more optimized, compacting GC?  I personally think that even optimized GC won't come close to code that has been optimized with manual memory management.

-Craig


April 28, 2006
Craig Black wrote:
>> Surprisingly, the slice code does substantially better than both the other C++ versions and the D code.  Perhaps this evening I'll re-run the C++ tests under Visual Studio to give the hash table a test drive.
> 
> Wow!  It's over twice as fast!  How difficult would it be to use the same approach for wordcount in D by bypassing the GC?

The C++ slice code should mirror what is already happening in D--as far as I know, the D code should allocate no memory at all for strings, only for the AA.  I suspect the problem may be GC scans of the rather large (4 megabyte) data area where the book contents are stored.  If the D code were modified to allocate this memory via malloc instead of new, the D program may speed up substantially.  I'll see about dropping the C++ file code into the D app this evening.
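
For example, the book contents could be read into memory the GC never sees.  A rough sketch, assuming a Phobos file API along these lines (hypothetical helper name, error handling omitted):

import core.stdc.stdlib : malloc;
import std.stdio : File;

// Reads the whole file into a malloc'd buffer, so the collector has
// no large data block to scan.  The caller must free(buf.ptr).
char[] readRaw(string name)
{
    auto f = File(name, "rb");
    auto len = cast(size_t) f.size;
    auto buf = (cast(char*) malloc(len))[0 .. len];
    if (len)
        f.rawRead(buf);
    return buf;
}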

> So what do you think we should do about GC to make D faster?  Do you favor making GC optional or writing a more optimized, compacting GC?  I personally think that even optimized GC won't come close to code that has been optimized with manual memory management.

I'm not sure either approach could be said to be "better" than the other insofar as performance is concerned, as it really comes down to application design.  I'll admit that, until D, I'd always dismissed garbage collection for one reason or another.  But after using D for a while I suddenly realized how much time and code I used to invest in managing data ownership issues.  I won't go so far as to say that data ownership is not important, but being freed from the yoke of smart pointer code and such, I've found both that I'm more productive and that the result is more elegant than its C++ equivalent.  My only remaining issue with GC is that it makes it too easy to ignore what's actually going on at a low level.  This probably isn't an issue for experienced programmers who can't help but think about these things, but I still feel that learning programmers should be forced to deal with this stuff simply so they're aware of it.


Sean
April 28, 2006
Craig Black wrote:
> 
> So what do you think we should do about GC to make D faster?  Do you favor making GC optional or writing a more optimized, compacting GC?  I personally think that even optimized GC won't come close to code that has been optimized with manual memory management.

By the way, I do think that GC performance can be improved significantly over where it is now, so don't take my other reply as an admission that GC is slow.  Also, I suspect that by the time D hits 1.0, GC performance will be much improved.  This can be achieved in a number of different ways, but the most obvious would be for the GC to ignore memory blocks that don't contain pointers.
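
For illustration, a sketch of how such an interface might look (assuming a GC entry point with a NO_SCAN block attribute, along the lines of druntime's core.memory):

import core.memory : GC;

void main()
{
    enum n = 4_096_000;

    // A block flagged NO_SCAN is guaranteed pointer-free, so the
    // collector can skip all 4 MB of it when tracing.
    auto buf = (cast(char*) GC.malloc(n, GC.BlkAttr.NO_SCAN))[0 .. n];

    // ... fill buf with file contents and slice words out of it ...
}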


Sean
April 28, 2006
Last follow-up.  I re-ran all tests with a bit less stuff going on in the background, and I ran the C++ tests with MSVC as well, including the hash table version (though I have no DMC run for this particular app). Here are the results.


D version 1 (wc) : the original example
D version 2 (wc2): using malloc for the buffer

C++ version 1 (wccpp2): the original example
C++ version 2 (wccpp3): replaced C++ stream code with C routines
C++ version 3 (wccpp4): replaced std::string with slice<char>
C++ version 4 (wccpp5): replaced std::map with unordered_map


D version 1 (wc):

Execution time: 0.380 s
Execution time: 0.423 s
Execution time: 0.446 s

D version 2 (wc2):

Execution time: 0.659 s
Execution time: 0.424 s
Execution time: 0.377 s

----------

DMC version 1 (wccpp2):

Execution time: 1.656 s
Execution time: 1.652 s
Execution time: 1.656 s

DMC version 2 (wccpp3):

Execution time: 1.230 s
Execution time: 1.257 s
Execution time: 1.290 s

DMC version 3 (wccpp4):

Execution time: 0.674 s
Execution time: 0.729 s
Execution time: 0.697 s

----------

MSVC version 1 (wccpp2):

Execution time: 1.093 s
Execution time: 1.120 s
Execution time: 1.042 s

MSVC version 2 (wccpp3):

Execution time: 0.797 s
Execution time: 0.833 s
Execution time: 0.799 s

MSVC version 3 (wccpp4):

Execution time: 0.761 s
Execution time: 0.744 s
Execution time: 0.726 s

MSVC version 4 (wccpp5):

Execution time: 0.438 s
Execution time: 0.436 s
Execution time: 0.406 s


The most amazing thing here is the dramatic performance difference between these D runs and the previous ones.  I won't speculate on why, but the exact same app appears to be running 3 times as fast as it did this morning.  Beyond that, we can see that the C++ app using slices and a hash table performs similarly to the D implementation, though it's difficult to draw any solid conclusions without a test run under DMC.

Also, I should note for the record that the hash table implementation doesn't rehash automatically (it wasn't a feature I needed at the time), so I preallocated 4096 buckets before running the word count.  There are 793875 words in the KJ bible, so this results in somewhat long chains, but I saw roughly similar results with double the number of buckets, so I think the bulk of processing at this speed may be taken up by IO.  The only other notable difference is that the C++ code uses standard C calls for the file operations while the D code uses Windows calls.  I can't see this resulting in a noticeable speed difference, but I figured it was worth mentioning.

Finally, as these tests were quite informal, I don't suggest reading too much into them.  However, I think a reasonable conclusion would be that D is faster than C++ for typical user applications (assuming a use pattern similar to the word count application), and while C++ can be quite competitive, making it so requires the use of hand-written code over standard library code.  C++ will fare a bit better in 2009 when unordered_map is an official part of the standard, but that still leaves std::string as the suggested way to store string data.  This may not be an issue if you're dealing with English text, where most words are under 16 characters, but it could be for other data sets where allocations will occur.
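
For reference, the "use pattern" I have in mind is essentially the following: slice words out of one big buffer and count them in a hash table.  This is a from-memory D sketch, not the exact benchmark source:

import std.file : readText;
import std.stdio : writefln;

void main(string[] args)
{
    string text = readText(args[1]);
    uint[string] counts;    // D's built-in hash table (the AA)

    size_t start;
    bool inWord;
    foreach (i, char c; text)
    {
        bool alpha = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        if (alpha && !inWord)
        {
            start = i;
            inWord = true;
        }
        else if (!alpha && inWord)
        {
            ++counts[text[start .. i]];    // key is a slice; no copy
            inWord = false;
        }
    }
    if (inWord)
        ++counts[text[start .. $]];

    foreach (word, n; counts)
        writefln("%8s %s", n, word);
}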


Sean
April 28, 2006
Craig Black wrote:
>>I'm not entirely familiar with the concept of allocators as you describe. Is this similar to memory pooling?  I always skipped over that portion of the C++ STL; way too much complexity for my small brain =P.
> 
> 
> Instead of calling malloc for each allocation unit, you could write an allocator class that would allocate memory in large chunks.  Then when you want an allocation unit, you request it from the allocator.  The allocator simply hands out the memory and marks it as used.
> 
> The hierarchical memory system you describe lends itself to using such custom allocators, because you would never have to call free for each unit. You could call free at the allocator level instead.  In this way, you would only call malloc and free for large chunks of memory.
> 
> You could also inform each allocator how much memory to reserve when it is created, since you may have a good idea of how much memory a given task will require.
> 
> Did that help?
> 
> -Craig 
> 
> 

Yes, I understand.  Thank you.  Very interesting concept.

I don't think that calling malloc is a definite bottleneck in my current situation.  If I see a performance concern later on, I'll definitely look into it.

-- 
Regards,
James Dunne