April 08, 2021

On Thursday, 8 April 2021 at 03:27:12 UTC, Max Haughton wrote:

>

Are you making the compiler aware of your machine? The obvious point here is vector width (you have AVX-512 from what I can see, though I'm not sure whether it is actually a win on Skylake W), but the compiler is also forced to use a completely generic scheduling model, which may or may not yield good code for your processor. For LDC, you'll want -mcpu=native. Also use cross-module inlining if you aren't already.

I haven't tried this yet but will give it a go. Ideally I'd want performance independent of compilers but could sprinkle this into the build config for when LDC is used.

>

I also notice that you are using function contracts in your hot code. When contracts are being checked, LDC actually does use them to optimize the code; in release builds, when they are turned off, it does not. If you are happy to assume that they hold in release builds (I believe you should, at least), you can give the compiler this additional information via an intrinsic or similar (with LDC I just use inline IR; on GDC you have __builtin_unreachable()). LDC will assume asserts in release builds as far as I have tested, but this still emits a branch (just not to a proper assert handler)...

Thanks for shining light on a level of optimization that I did not know existed. I've got a lot of learning to do in this area. Do you have any resources you'd recommend?

April 08, 2021

On Thursday, 8 April 2021 at 14:20:27 UTC, Guillaume Piolat wrote:

>

On Thursday, 8 April 2021 at 14:17:09 UTC, Guillaume Piolat wrote:

>

On Thursday, 8 April 2021 at 01:24:23 UTC, Kyle Ingraham wrote:

>

Is there anything else I can do to improve performance?

  • profiling?
  • you can use _mm_pow_ps in the intel-intrinsics package to compute four pows at once for the price of one.

Also, if you don't need the precision, always use powf instead of the double-precision pow, expf instead of the double-precision exp, and so on. Otherwise you pay extra for precision you don't use.
In particular, llvm_pow with a float argument is a lot faster.

Great tips here. I wasn't aware of powf and its cousins. I also like that intel-intrinsics works across compilers and CPU architectures. I'll see what I can do with them.
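
For anyone following along, here is a minimal sketch of the single-precision suggestion. The helper name applyGammaFloat and the gamma-curve use case are made up for illustration; the only library call is powf from core.stdc.math, which is the C function of the same name. Whether it is actually faster than pow for a given workload is something to measure.

```d
import core.stdc.math : powf;

/// Hypothetical helper: raises every value to the power `gamma`,
/// staying in single precision throughout.
float[] applyGammaFloat(const(float)[] values, float gamma)
{
    auto result = new float[values.length];
    foreach (i, v; values)
    {
        // powf takes and returns float, so nothing is promoted to double.
        result[i] = powf(v, gamma);
    }
    return result;
}
```

Calling applyGammaFloat([0.0f, 0.25f, 1.0f], 0.5f) takes the square root of each element. Swapping powf for pow would still compile, but every element would be converted to double and back.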

April 08, 2021

On Thursday, 8 April 2021 at 16:59:07 UTC, Bastiaan Veelo wrote:

>

On Thursday, 8 April 2021 at 16:37:57 UTC, Kyle Ingraham wrote:

>

Are compilers able to take loops and parallelize them?

No, but you can quite easily:
...
https://dlang.org/phobos/std_parallelism.html#.parallel

In keeping with my earlier comment about structuring nested loops optimally, the parallelism should be applied at the level of the "for each pixel" loop, so that each pixel's memory only needs to be touched by one CPU core.
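
A rough sketch of what that could look like with std.parallelism, assuming a flat array that holds a fixed number of samples per pixel. The function name sumSpectraPerPixel, the data layout, and the summing inner loop are placeholders for illustration, not the code from this thread.

```d
import std.parallelism : parallel;

/// Hypothetical stand-in for the per-pixel work: reduces each pixel's
/// run of samples to a single value.
float[] sumSpectraPerPixel(const(float)[] spectra, size_t samplesPerPixel)
{
    assert(samplesPerPixel > 0 && spectra.length % samplesPerPixel == 0);
    auto pixels = new float[spectra.length / samplesPerPixel];

    // The outer "for each pixel" loop runs in parallel. Each iteration
    // reads only its own slice of `spectra` and writes only its own
    // element of `pixels`, so no locking is needed.
    foreach (i, ref pixel; parallel(pixels))
    {
        float sum = 0;
        // The inner loop stays sequential within one pixel.
        foreach (s; spectra[i * samplesPerPixel .. (i + 1) * samplesPerPixel])
            sum += s;
        pixel = sum;
    }
    return pixels;
}
```

For very cheap per-pixel work, the optional work unit size argument of parallel is worth experimenting with, since handing out one pixel at a time can cost more than the work itself.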

April 08, 2021

On Thursday, 8 April 2021 at 17:00:31 UTC, Kyle Ingraham wrote:

>

On Thursday, 8 April 2021 at 03:27:12 UTC, Max Haughton wrote:

>

Are you making the compiler aware of your machine? The obvious point here is vector width (you have AVX-512 from what I can see, though I'm not sure whether it is actually a win on Skylake W), but the compiler is also forced to use a completely generic scheduling model, which may or may not yield good code for your processor. For LDC, you'll want -mcpu=native. Also use cross-module inlining if you aren't already.

I haven't tried this yet but will give it a go. Ideally I'd want performance independent of compilers but could sprinkle this into the build config for when LDC is used.

>

I also notice that you are using function contracts in your hot code. When contracts are being checked, LDC actually does use them to optimize the code; in release builds, when they are turned off, it does not. If you are happy to assume that they hold in release builds (I believe you should, at least), you can give the compiler this additional information via an intrinsic or similar (with LDC I just use inline IR; on GDC you have __builtin_unreachable()). LDC will assume asserts in release builds as far as I have tested, but this still emits a branch (just not to a proper assert handler)...

Thanks for shining light on a level of optimization that I did not know existed. I've got a lot of learning to do in this area. Do you have any resources you'd recommend?

GDC has equivalent options; DMD, however, does not.

As for resources in this particular area, there isn't really an exact science to it beyond not assuming that the compiler can read your mind. These optimizations can be quite ad hoc, i.e. the propagation of this information through the function is not always intuitive (or possible).

One trick is to declare a method with pragma(inline, true) and then assume something inside that method. Once it is inlined, the compiler will assume (say) that the return value is aligned, or whatever else you stated, at the call site. But proceed with caution.
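
A minimal sketch of that trick, using a plain assert as the assumption, since as noted above LDC appears to take asserts into account in release builds. The names quadCount and sumQuads are invented for illustration, and whether a given compiler really exploits the assumption has to be checked in the generated assembly.

```d
/// Hypothetical accessor: states an assumption about its argument and
/// is always inlined, so the assumption is visible at every call site.
pragma(inline, true)
size_t quadCount(const(float)[] buf)
{
    // The assumption: the length is a multiple of 4.
    assert(buf.length % 4 == 0, "length must be a multiple of 4");
    return buf.length / 4;
}

/// Caller: after inlining, the optimizer may use the assumption above,
/// for example when reasoning about leftover elements in this loop.
float sumQuads(const(float)[] buf)
{
    float total = 0;
    foreach (q; 0 .. quadCount(buf))
    {
        const base = q * 4;
        total += buf[base] + buf[base + 1] + buf[base + 2] + buf[base + 3];
    }
    return total;
}
```

The caution applies here too: if the assumption is ever false in a build where the check is compiled out, the behaviour is undefined rather than a clean failure.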

April 08, 2021

On Thursday, 8 April 2021 at 17:00:31 UTC, Kyle Ingraham wrote:

>

Thanks for shining light on a level of optimization that I did not know existed. I've got a lot of learning to do in this area. Do you have any resources you'd recommend?

I learned a lot about micro-optimization from studying the ASM output of my inner loop code in Godbolt's Compiler Explorer, which supports D:
https://godbolt.org/

Other complementary resources:
https://www.felixcloutier.com/x86/
https://www.agner.org/optimize/

Understanding the basics of how prefetchers and cache memory hierarchies work is important, too.
