Locality and Custom Allocations

May 14, 2006

Walter (and other compiler geniuses)- I've got two questions for you at the bottom of the post, if you want to skip the bulk of it. ------- In order to decrease page faults (and thus increase performance) of my LinkedList, I implemented a block allocate scheme using the below snippet in a node struct: struct Node { static void[] block; /* If we allocate large blocks in advance for the nodes, then we increase our locality and thus cut down on page-faults and increase our cache-hit percentage */ new (size_t sz, int length) { if (block.length < sz) block = new void[sz*(length+1)]; void* p = block.ptr; block = block[sz..$]; return p; } ... } Then, I used some variants of the following benchmark code to test ArrayList, LinkedList with block allocation and LinkedList without block allocations. You can find ArrayList and LinkedList (w/o block allocations) in the mango.containers SVN. ArrayList is backed by a D dynamic array, and LinkedList uses struct nodes. void main() { MutableList!(int) list = new LinkedList!(int)(); for (int i=0; i<1_000_000; i++) { list ~= i; } for (int j=0; j<1; j++) { foreach (int index, int i; list) { if (i != index) Stdout("Error: "c)(i)(" != "c)(index)(CR); } } } I found the results quite puzzling. I used the "time" command in unix to get the results. Here they are: LinkedList, block allocations, compiler optimizations (-O -release -inline) , 1_000_000 elements, 1_000 read loops real 0m35.673s user 0m34.274s sys 0m0.240s LinkedList, no block allocations, compiler optimizations (-O -release -inline) , 1_000_000 elements, 1_000 read loops real 1m2.546s user 0m57.824s sys 0m0.412s ArrayList, compiler optimizations (-O -release -inline) , 1_000_000 elements, 1_000 read loops real 0m48.569s user 0m46.559s sys 0m0.288s LinkedList, block allocations, compiler optimizations (-O -release -inline) , 1_000_000 elements, 1 read loop real 0m2.597s user 0m2.500s sys 0m0.016s LinkedList, no block allocations, compiler optimizations (-O -release -inline) , 1_000_000 elements, 1 read loop real 0m0.988s user 0m0.944s sys 0m0.028s ArrayList, compiler optimizations (-O -release -inline) , 1_000_000 elements, 1 read loop real 0m0.203s user 0m0.148s sys 0m0.028s Interpretation: Clearly, the block allocations are very helpful in the reads. And surprise: the LinkedList performs better than the ArrayList!!! How? I suspect that it has to do with the method of iteration. mango.container Containers are iterated through using Iterator objects (the opApplys use them as well) and the generic ListIterator calls .get(index) on the array in addition to the .next() call to the iterator. The LinkedListIterator, however, is specific to LinkedList and uses a reference to the current node to avoid the obvious inefficiency of .get(index) on a LinkedList. Assuming that the extra call to .get(index) on the ArrayList was causing some strain, I built an ArrayListIterator to avoid it, and came up with the following results: ArrayList, compiler optimizations (-O -release -inline) , 1_000_000 elements, 1_000 read loops, using ArrayListIterator real 0m30.966s user 0m29.774s sys 0m0.180s OK, looks like I was right. What does this tell us? That virtual function calls suck, but we knew this. The LinkedList with block allocations only runs about 15% longer than the ArrayList- that's not too shabby. The bulk of my surpise came when only one read loop was done, so the allocation part of the test program has far more weight. The LinkedList without block allocations does better by an order of magnitude!!! This blew me away. I figured that in terms of allocations, the block allocations would do no harm, and probably do better. In fact, I expected them to beat out the ArrayList-- the whole point of a LinkedList! Here is what I surmise from this result: Custom Allocators are Slow. Is my reasoning faulty, or am I right? If I'm right, here's my question to Walter: why? Are calls to these allocators not inlined? If not, can they be? In addition, to what extent can virtual function calls be sped up via optimization? Under what circumstances can calls to virtual methods be inlined? Sorry for the long post.... I may just be compensating for not posting much lately. ~John Demme

May 14, 2006

Re: Locality and Custom Allocations

Posted by John Demme
in reply to John Demme

Permalink

John Demme

Posted in reply to John Demme

Permalink

John Demme wrote:

> Walter (and other compiler geniuses)- I've got two questions for you at the bottom of the post, if you want to skip the bulk of it.
> 
> -------
> 
> In order to decrease page faults (and thus increase performance) of my
> LinkedList, I implemented a block allocate scheme using the below snippet
> in a node struct:
> struct Node
> {
>         static void[] block;
>         /*
>           If we allocate large blocks in advance for the nodes, then
>           we increase our locality and thus cut down on page-faults
>           and increase our cache-hit percentage
>          */
>         new (size_t sz, int length)
>         {
>                 if (block.length < sz)
>                         block = new void[sz*(length+1)];
> 
>                 void* p = block.ptr;
>                 block = block[sz..$];
>                 return p;
>         }
> ...
> }
> 
> Then, I used some variants of the following benchmark code to test
> ArrayList, LinkedList with block allocation and LinkedList without block
> allocations.  You can find ArrayList and LinkedList (w/o block
> allocations)
> in the mango.containers SVN.  ArrayList is backed by a D dynamic array,
> and LinkedList uses struct nodes.
> 
> void main()
> {
>         MutableList!(int) list = new LinkedList!(int)();
> 
>         for (int i=0; i<1_000_000; i++)
>         {
>                 list ~= i;
>         }
> 
>         for (int j=0; j<1; j++)
>         {
>                 foreach (int index, int i; list)
>                 {
>                         if (i != index)
>                                 Stdout("Error: "c)(i)(" != "c)(index)(CR);
>                 }
>         }
> }
> 
> I found the results quite puzzling.  I used the "time" command in unix to get the results. Here they are:
> 
> LinkedList, block allocations, compiler optimizations
> (-O -release -inline) , 1_000_000 elements, 1_000 read loops
> real    0m35.673s
> user    0m34.274s
> sys     0m0.240s
> 
> LinkedList, no block allocations, compiler optimizations
> (-O -release -inline) , 1_000_000 elements, 1_000 read loops
> real    1m2.546s
> user    0m57.824s
> sys     0m0.412s
> 
> ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
> elements, 1_000 read loops
> real    0m48.569s
> user    0m46.559s
> sys     0m0.288s
> 
> LinkedList, block allocations, compiler optimizations
> (-O -release -inline) , 1_000_000 elements, 1 read loop
> real    0m2.597s
> user    0m2.500s
> sys     0m0.016s
> 
> LinkedList, no block allocations, compiler optimizations
> (-O -release -inline) , 1_000_000 elements, 1 read loop
> real    0m0.988s
> user    0m0.944s
> sys     0m0.028s
> 
> ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
> elements, 1 read loop
> real    0m0.203s
> user    0m0.148s
> sys     0m0.028s
> 
> 
> Interpretation:
> Clearly, the block allocations are very helpful in the reads.  And
> surprise:
> the LinkedList performs better than the ArrayList!!!  How?  I suspect that
> it has to do with the method of iteration.  mango.container Containers are
> iterated through using Iterator objects (the opApplys use them as well)
> and the generic ListIterator calls .get(index) on the array in addition to
> the .next() call to the iterator.  The LinkedListIterator, however, is
> specific to LinkedList and uses a reference to the current node to avoid
> the obvious inefficiency of .get(index) on a LinkedList.  Assuming that
> the extra call to .get(index) on the ArrayList was causing some strain, I
> built an ArrayListIterator to avoid it, and came up with the following
> results:
> 
> ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
> elements, 1_000 read loops, using ArrayListIterator
> real    0m30.966s
> user    0m29.774s
> sys     0m0.180s
> 
> OK, looks like I was right.  What does this tell us?  That virtual
> function
> calls suck, but we knew this.  The LinkedList with block allocations only
> runs about 15% longer than the ArrayList- that's not too shabby.
> 
> The bulk of my surpise came when only one read loop was done, so the allocation part of the test program has far more weight.  The LinkedList without block allocations does better by an order of magnitude!!!  This blew me away.  I figured that in terms of allocations, the block allocations would do no harm, and probably do better.  In fact, I expected them to beat out the ArrayList-- the whole point of a LinkedList!  Here is what I surmise from this result:
> 
> Custom Allocators are Slow.
> 
> Is my reasoning faulty, or am I right?  If I'm right, here's my question
> to
> Walter:  why?  Are calls to these allocators not inlined?  If not, can
> they
> be?  In addition, to what extent can virtual function calls be sped up via
> optimization?  Under what circumstances can calls to virtual methods be
> inlined?
> 
> Sorry for the long post.... I may just be compensating for not posting much lately.
> 
> ~John Demme

Sorry to reply to myself, but I though I kinda sounded like an ass in the post.  I'm not at all upset with DMD's performance- I'd rather see the debugging stuff we're getting now over optimizations... I'm just wondering what we might see in the future, and how to improve my code now.

~John Demme (again)

John Demme wrote: > > The bulk of my surpise came when only one read loop was done, so the > allocation part of the test program has far more weight. The LinkedList > without block allocations does better by an order of magnitude!!! This > blew me away. I figured that in terms of allocations, the block > allocations would do no harm, and probably do better. In fact, I expected > them to beat out the ArrayList-- the whole point of a LinkedList! Here is > what I surmise from this result: > > Custom Allocators are Slow. > > Is my reasoning faulty, or am I right? If I'm right, here's my question to > Walter: why? Are calls to these allocators not inlined? If not, can they > be? In addition, to what extent can virtual function calls be sped up via > optimization? Under what circumstances can calls to virtual methods be > inlined? This was the conclusion in some research papers I've read as well, the most relevant being "Reconsidering Custom Memory Allocation": http://www.cs.umass.edu/~emery/pubs/berger-oopsla2002.pdf I'll leave the paper to outline why. Sean

Forums