Posted by wolframw
Chapter 12.15.2 of the spec explains that void initialization of a static array can be faster than default initialization. That seems logical, because the array entries don't need to be set to NaN first. However, when I ran some tests on my matrix implementation, the default-initialized version was consistently quite a bit faster.
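For reference, the two initialization styles I'm comparing look like this (a minimal sketch, not my actual matrix code):

```d
void example()
{
    // Default initialization: every element is set to float.nan,
    // which costs a store per element.
    float[4][4] a;

    // Void initialization: the memory is left uninitialized, so no
    // stores are emitted -- but reading before writing is undefined.
    float[4][4] b = void;
}
```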
The code (and the disassembly) is at https://gist.github.com/wolframw/73f94f73a822c7593e0a7af411fa97ac
I compiled with dmd -O -inline -release -noboundscheck -mcpu=avx2 and ran the tests twice: once with the m array default-initialized and once with it void-initialized.
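The timing loop itself is nothing special; it follows the usual StopWatch pattern, roughly like this (a sketch, simplified from the gist):

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writeln;

void runBenchmark()
{
    auto sw = StopWatch(AutoStart.yes);
    // ... repeatedly multiply the two matrices, storing each
    // result into the sink matrix ...
    sw.stop();
    writeln(sw.peek());  // Duration prints as "X ms, Y μs, and Z hnsecs"
}
```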
Default-initialized: 245 ms, 495 μs, and 2 hnsecs
Void-initialized: 324 ms, 697 μs, and 2 hnsecs
What the heck?
I've also inspected the disassembly and found an interesting difference in the benchmark loop (annotated with "start of loop" and "end of loop" in both disassemblies). It looks like the compiler partially unrolled the loop in both cases, but in the default-initialized case it discards every other result of the multiplication and stores only the remaining half to the sink matrix. In the void-initialized version, every result appears to be stored in the sink matrix.
I don't see how such a difference can be caused by the different initialization strategies. Is there something I'm not considering?
Also, if the compiler is smart enough to figure out that it can discard some of the results, why doesn't it just do away with the entire loop and run the multiplication only once? Since both input matrices are immutable and opBinary is pure, the result is guaranteed to be the same on every iteration, isn't it?
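To be concrete about the purity assumption, the multiplication has roughly this shape (simplified, hypothetical names, not my exact gist code):

```d
struct Mat4
{
    float[16] m;  // or "float[16] m = void;" in the void-initialized run

    // pure member function on const data: for the same immutable
    // operands, every call must produce the same result.
    Mat4 opBinary(string op : "*")(const ref Mat4 rhs) const pure
    {
        Mat4 r;
        foreach (i; 0 .. 4)
            foreach (j; 0 .. 4)
            {
                float s = 0;
                foreach (k; 0 .. 4)
                    s += m[i * 4 + k] * rhs.m[k * 4 + j];
                r.m[i * 4 + j] = s;
            }
        return r;
    }
}
```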