On Thursday, 8 April 2021 at 03:27:12 UTC, Max Haughton wrote:
> Are you making the compiler aware of your machine? Although the obvious point here is vector width (you have AVX-512 from what I can see, however I'm not sure if this is actually a win or not on Skylake W), the compiler is also forced to use a completely generic scheduling model which may or may not yield good code for your processor. For LDC, you'll want -mcpu=native. Also use cross-module inlining if you aren't already.
I haven't tried this yet but will give it a go. Ideally I'd want performance that doesn't depend on the compiler, but I could add this to the build config for when LDC is used.
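For the LDC-only case, something like the following in dub.json might do it. This is only a sketch: it assumes the project is built with dub, and the package name is a placeholder. The `-ldc` suffix limits the flags to LDC builds, and `-enable-cross-module-inlining` is the LDC switch that I take the cross-module inlining suggestion to mean.

```json
{
    "name": "myproject",
    "dflags-ldc": ["-mcpu=native", "-enable-cross-module-inlining"]
}
```

One caveat is that `-mcpu=native` targets the machine doing the build, so the resulting binary may not run on older CPUs.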
> I also notice in your hot code, you are using function contracts: When contracts are being checked, LDC actually does use them to optimize the code, however in release builds when they are turned off this is not the case. If you are happy to assume that they are true in release builds (I believe you should at least), you can give the compiler this additional information via the use of an intrinsic or similar (with LDC I just use inline IR, on GDC you have __builtin_unreachable()). LDC will assume asserts in release builds as far as I have tested, however this still invokes a branch (just not to a proper assert handler)...
Thanks for shining light on a level of optimization that I did not know existed. I've got a lot of learning to do in this area. Do you have any resources you'd recommend?
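To check my understanding of the contract point, here is a rough sketch of how I read it. The `assume` helper and the `at` function are names I made up for illustration. The GDC branch uses `__builtin_unreachable()` as you describe, and every other compiler just falls back to a plain `assert`. I have not tried the inline IR route for LDC, so it is left out here.

```d
// Hypothetical helper: promise the optimizer that `cond` always holds.
// If the promise is ever false, behaviour is undefined on GDC.
pragma(inline, true)
void assume(bool cond)
{
    version (GNU)
    {
        // GDC exposes GCC builtins through gcc.builtins.
        import gcc.builtins : __builtin_unreachable;
        if (!cond)
            __builtin_unreachable();
    }
    else
    {
        // Fallback for other compilers: an ordinary assert.
        assert(cond);
    }
}

// Hypothetical hot function whose `in` contract is restated via assume().
int at(const(int)[] xs, size_t i)
in (i < xs.length)
{
    assume(i < xs.length);
    return xs[i];
}

void main()
{
    assert(at([10, 20, 30], 1) == 20);
}
```

If that is the right idea, I would keep the `in` contract for checked builds and treat `assume` purely as a release-build hint.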