Hi,
As part of SAoC 2021, I changed the lowering of _d_arrayctor
from using the runtime hook to a template, via theses PRs:
We'll call the former approach the hook approach and the latter, the template approach.
This template approach required me to declare the returned array inside a union. This was needed because by declaring this array directly, the compiler was inserting a call to __ArrayDtor
before exiting the function. This call was in contradiction with the error handling already performed by _d_arrayctor
. I gave a few more details about this issue in this post. However, this came with the drawback of not using NRVO, because of the cast at end of the function. This additional copy causes a performance decrease, compared to the hook approach, as I'm going to show later in this post.
A third approach, which I call the hack approach, is very close to the one using the template. The difference is that the hack adds a third, unused, pointer-type parameter to _d_arrayctor
, in order to change the function's strong purity to weak purity. The reasons for this unorthodox approach are better explained in this post, and the PR for it is here. Due to this hack, there is no need to declare the returned array inside the scope of _d_arrayctor
, which also removes the need for the union trick. As a result, no additional copying is performed and this approach has increased performance when compared to the hook approach.
In order to measure the performance of each lowering, I designed this benchmark, which I compiled with dmd and ran, using all 3 approaches. It measures the time taken by the call to _d_arrayctor
for 3 struct sizes: small (empty), medium (64 bytes) and large (256 bytes), as well as 3 array lengths: small (1 element), medium (64 elements) and large (256 elements). 256 elements is a large enough array, because the lowering to _d_arrayctor
is only performed for static arrays, which are unlikely to be much larger. The numerical results can be seen below:
Running benchmark on branch master (_d_arrayctor as a hook):
_0B_Struct_1Elem @ 1000000 runs: average time = 13ms; std dev = 9.53674e-06
_0B_Struct_64Elems @ 1000000 runs: average time = 280.4ms; std dev = 0.489898
_0B_Struct_256Elems @ 1000000 runs: average time = 1062.18ms; std dev = 0.433128
_64B_Struct_1Elem @ 1000000 runs: average time = 13.01ms; std dev = 0.0994988
_64B_Struct_64Elems @ 1000000 runs: average time = 309.99ms; std dev = 0.0994992
_64B_Struct_256Elems @ 1000000 runs: average time = 1194.11ms; std dev = 0.31289
_256B_Struct_1Elem @ 1000000 runs: average time = 15ms; std dev = 1.43051e-05
_256B_Struct_64Elems @ 1000000 runs: average time = 482.23ms; std dev = 1.73698
_256B_Struct_256Elems @ 1000000 runs: average time = 2960.72ms; std dev = 1.20067
Running benchmark with _d_arrayctor as a template:
0B_Struct_1Elem @ 1000000 runs: average time = 9.6934ms; std dev = 0.019658
0B_Struct_64Elems @ 1000000 runs: average time = 219.086ms; std dev = 0.467551
0B_Struct_256Elems @ 1000000 runs: average time = 820.951ms; std dev = 1.88878
64B_Struct_1Elem @ 1000000 runs: average time = 15.6967ms; std dev = 0.046066
64B_Struct_64Elems @ 1000000 runs: average time = 287.54ms; std dev = 0.979795
64B_Struct_256Elems @ 1000000 runs: average time = 1177.63ms; std dev = 1.06447
256B_Struct_1Elem @ 1000000 runs: average time = 20.8763ms; std dev = 0.259867
256B_Struct_64Elems @ 1000000 runs: average time = 981.405ms; std dev = 0.67177
256B_Struct_256Elems @ 1000000 runs: average time = 4029.59ms; std dev = 0.511771
Running benchmark with _d_arrayctor as a template (with the hack-y 3rd parameter):
_0B_Struct_1Elem @ 1000000 runs: average time = 9.00001ms; std dev = 6.67572e-06
_0B_Struct_64Elems @ 1000000 runs: average time = 183.94ms; std dev = 0.525738
_0B_Struct_256Elems @ 1000000 runs: average time = 689.091ms; std dev = 0.286183
_64B_Struct_1Elem @ 1000000 runs: average time = 10.01ms; std dev = 0.0994987
_64B_Struct_64Elems @ 1000000 runs: average time = 227.23ms; std dev = 0.645833
_64B_Struct_256Elems @ 1000000 runs: average time = 925.25ms; std dev = 0.766486
_256B_Struct_1Elem @ 1000000 runs: average time = 11.07ms; std dev = 0.255147
_256B_Struct_64Elems @ 1000000 runs: average time = 417.92ms; std dev = 0.271293
_256B_Struct_256Elems @ 1000000 runs: average time = 2576.1ms; std dev = 8.36003
In order to better visualise the numbers above, I also plotted the bar charts below:
As you can see, the hack is faster than the hook. Its running times are about 15-25% lower than those of the hook for larger structs (64 and 256 bytes) and as much as 35% lower for the empty struct. On the other hand, the template approach, while being 20-25% faster than the hook when called on the empty struct, is as much as 35% slower than the hook, on larger (256-byte) structs.
Due to this discrepancy in performance, I think the best approach is the hack, despite adding an unused parameter to _d_arrayctor
. What do you think?
Thanks,
Teodor