Asking about performance of std concatenation vs. Appender vs. custom class

Apr 01, 2021

ludo

Apr 01, 2021

Steven Schveighoffer

Apr 09, 2021

ludo

April 01, 2021

Asking about performance of std concatenation vs. Appender vs. custom class

Posted by ludo

Permalink

ludo

Permalink

Hi fellow Dlangers,

I am working on an old codebase, and found a re-implementation of standard dynamic arrays as a struct called ArrayBuilder, visible on my github

The comment says:

/**
 * Behaves the same as built-in arrays, except about 6x faster with concatenation at the expense of the base pointer being 4 system words instead of two (16 instead of 8 bytes on a 32-bit system).
 */

So I wrote a unittest to verify that claim, using for benchmark 10_000_000 concatenations :

trace("*** ArrayBuilder Concatenation benchmark ***");
import std.datetime.stopwatch : benchmark, Duration;
import std.array : appender;

		void concat_withStd()
		{
			int[] array;
			for (int j = 0; j < 1000; j++)
				array ~= j;
		}

		void concat_withStdReserve()
		{
			int[] array;
			array.reserve(1000);
			for (int j = 0; j < 1000; j++)
				array ~= j;
		}

		void concat_withStdLength()
		{
			int[] array;
			array.length = 1000;
			for (int j = 0; j < 1000; j++)
				array[j] = j;
		}

		void concat_withAppenderReserve()
		{
			auto array = appender!(int[]);
			//trace ("array is ", array);
			array.reserve(1000);
			for (int j = 0; j < 1000; j++)
				array ~= j;
		}

		void concat_withArrayBuilder()
		{
			ArrayBuilder!(int) array;
			for (int j = 0; j < 1000; j++)
				array ~= j;
		}

		auto r = benchmark!(concat_withStd,
		(...)//stuff
		concat_withArrayBuilder)(10_000);

		tracef("with std: %s", bench1);
		tracef("with stdReserve: %s", bench2);
		tracef("with stdLength: %s", bench3);
		tracef("with AppenderReserve: %s", bench4);
		tracef("with Arraybuilder: %s \n", bench5);

The results are below (you can also git clone the repo + dub test):

Concatenation method	benchmark in ms
with std:	385 ms
with stdReserve:	327 ms
with stdLength:	29 ms
with AppenderReserve:	256 ms
with Arraybuilder:	118 ms

The use of reserve does not seem to improve much the standard concatenation performance. Would you know why?
Increasing the length of the array before-hand is by far the best solution, as expected.
AppenderReserve outperforms a basic concatenation, but the gain is not completely obvious, even with a call to reserve. Is it expected behavior?
ArrayBuilder is 3 times faster than standard concatenation, and twice as fast as AppenderReserve! Which explanation would you give? I looked at the ArrayBuilder struct, it's a very basic code with no magic trick. I also looked at the phobos Appender code on github, in particular the code for put() line 3460, it's obviously more complex:

           import core.internal.lifetime : emplaceRef;

            ensureAddable(1);
            immutable len = _data.arr.length;

            auto bigData = (() @trusted => _data.arr.ptr[0 .. len + 1])();
            emplaceRef!(Unqual!T)(bigData[len], cast() item);
            //We do this at the end, in case of exceptions
            _data.arr = bigData;

There are some functions calls... Maybe the drag comes from there. I am really out of my league here, any help appreciated. I am not even sure which part of the Appender code is activated.

cheers

On 4/1/21 10:53 AM, ludo wrote: > The results are below (you can also git clone the repo + dub test): > > | Concatenation method | benchmark in ms| > |---------------------|---------------------| > |with std: | 385 ms| > |with stdReserve: | 327 ms| > |with stdLength: | 29 ms| > |with AppenderReserve:| 256 ms| > |with Arraybuilder: | 118 ms| > > - The use of reserve does not seem to improve much the standard concatenation performance. **Would you know why?** My guess: reserving cuts down on the reallocations, but that only takes some of the time. Appending a 1000-element int array is going to go from a 16-byte block, to a 32-byte block, etc. up to a 4096 byte block. This involves roughly 8 reallocations per test. But every append requires an opaque function call into the runtime to check if the allocated length is big enough, and reallocate if necessary. So if the reallocations aren't too expensive, then appending performance would not be too much different. > - AppenderReserve outperforms a basic concatenation, but the gain is not completely obvious, even with a call to reserve. **Is it expected behavior?** Yes, because Appender has direct access to the allocated length, rather than having to go through an opaque call. > > - ArrayBuilder is 3 times faster than standard concatenation, and twice as fast as AppenderReserve! **Which explanation would you give?** This I have no idea on. There are probably a lot of explanations that could be true. Have you tried profiling the code? > There are some functions calls... Maybe the drag comes from there. I am really out of my league here, any help appreciated. I am not even sure which part of the Appender code is activated. Are you compiling with -inline -O? -Steve

> reserving cuts down on the reallocations, but that only takes some of the time. Appending a 1000-element int array is going to go from a 16-byte block, to a 32-byte block, etc. up to a 4096 byte block. This involves roughly 8 reallocations per test. > > But every append requires an opaque function call into the runtime to check if the allocated length is big enough, and reallocate if necessary. > > So if the reallocations aren't too expensive, then appending performance would not be too much different. For ArrayBuilder, the function call to check length is there too, it is the grow() function. Sure reallocations take time. But the comment of the class talks about an increased performance thanks to *the base pointer being 4 system words instead of two* What did the dev mean by that? > Yes, because Appender has direct access to the allocated length, rather than having to go through an opaque call. > Alright, ArrayBuilder definitely benefits from the same. > > This I have no idea on. There are probably a lot of explanations that could be true. Have you tried profiling the code? I just tried to run dub test --build=profile on dterrent, but I get some obscure error message. Not working. Also tried to dmd -unittest -profile arraybuilder.d with an empty main(), not working, I don't get a trace.log. Am I supposed to create a main() test program to have access to profiling, that is to say there is no possible profiling of unittest? > > Are you compiling with -inline -O? > > -Steve -inline does not work (segfault) but optimize does not change anything except for stdLength (10ms) and ... ArrayBuilder, which gain even more performance (90ms). Conclusion, ArrayBuilder is still much more efficient, 15 years after this code was written :) It behaves closer to a regular array, maybe because of space reallocation strategy? Any comment appreciated

Forums