On Tuesday, 29 March 2022 at 12:06:54 UTC, deadalnix wrote:
>No, I picked the exact same toolchain on purpose so that the approaches themselves can be compared. Both of the examples above use THE EXACT SAME toolchain; the only difference is the approach to 128-bit integers in the frontend.
To me it looked like you were doing a C++ vs D comparison somewhere along the way. But instead you're just comparing the calls each frontend makes into the backend.

That also speaks to my own inexperience in these deeper matters, and to why I usually abstain from conversations I'm not well versed in.
>The CPU has a lot of instructions to help handle large integers. When you let the backend do its magic, it can leverage them. When you instead give it good old small-integer code and it has to infer the meaning from it, reconstruct the large-integer ops you meant to be doing, and optimize that, you introduce so many failure points that it's practically impossible to get a competitive result.
This reminds me of reading about the specific way you had to write bswap in C/C++ for GCC to recognize it and reduce it to a single instruction, without dropping into asm, while still working on architectures that don't have that specific instruction.
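For reference, this is roughly the shift-and-mask idiom that GCC (and Clang) pattern-match into a single bswap; a minimal sketch, with swap32 as a made-up name:

```c
#include <stdint.h>

/* Byte swap written as plain shifts and masks. GCC and Clang
 * recognize this pattern and emit a single bswap instruction on
 * x86, while it still compiles to ordinary shifts on targets
 * without such an instruction. */
uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) <<  8) |
           ((x & 0x00FF0000u) >>  8) |
           ((x & 0xFF000000u) >> 24);
}
```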
>There is no correctness vs speed tradeoff here; both versions are correct. One is going to be significantly faster, but, in addition, one is going to optimize better with its surroundings, so what you'll see in practice is an even wider gap than what is presented above.
Still, that example does suggest I'd get a bit of a boost by writing the 128-bit multiply by hand, which would cut out the overhead and an unneeded loop. I'd probably get similar results with a divide (with a long or smaller divisor). Of course, those are rather case-specific.
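For what it's worth, here's a sketch of what that hand-written multiply might look like in C (mul64x64 is a hypothetical name): where the compiler supports __int128, the backend lowers it to a single widening MUL on x86-64, and the fallback branch is the manual 32-bit-limb version with no loop.

```c
#include <stdint.h>

#ifdef __SIZEOF_INT128__
/* Full 64x64 -> 128 multiply; the backend lowers this to one
 * widening MUL instruction on x86-64. */
void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    *lo = (uint64_t)p;
}
#else
/* Manual version: split into 32-bit halves, form the four
 * partial products, and recombine with explicit carries. */
void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    uint64_t mid  = p1 + (p0 >> 32);     /* cannot overflow 64 bits */
    uint64_t mid2 = p2 + (uint32_t)mid;  /* cannot overflow either */

    *hi = p3 + (mid >> 32) + (mid2 >> 32);
    *lo = (mid2 << 32) | (uint32_t)p0;
}
#endif
```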
I wouldn't look forward to rewriting the same code for all the different compiler backends, though.
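At best that tends to collapse into one dispatch per operation; a sketch using the byte swap again (the builtin names are the real GCC/Clang and MSVC spellings, swap32 is the portable fallback from above):

```c
#if defined(__GNUC__) || defined(__clang__)
#  define BSWAP32(x) __builtin_bswap32(x)
#elif defined(_MSC_VER)
#  include <stdlib.h>
#  define BSWAP32(x) _byteswap_ulong(x)
#else
#  define BSWAP32(x) swap32(x)  /* portable shift-and-mask fallback */
#endif
```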