August 08, 2022

For much of the code that I throw at it, LDC's auto-vectorization is comparable to GDC's (both are very good), but LDC lags in a few situations. The link below shows one of those, a gather of the two primaries from a simple Bayer mosaic pixel:

https://godbolt.org/z/sfd8e4hqe

The throughput difference on my (aging) 2.4 GHz Zen 1 is 21+ GB/sec/core for GDC vs ~9 GB/sec/core for LDC. The expected (hoped-for) auto-vectorized loop starts at .L6 in the GDC output. (I realize there are other ways to "skin" the Bayer pixel unpacking cat; this is just an example of where one approach breaks down.)
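
For readers who don't want to open the link, here is a minimal sketch of the general shape of such a gather. It is not the code behind the godbolt link: the `BayerCell` layout, the `RedBlue` output type, the choice of red and blue as the two primaries, and the function name are all assumptions made up for illustration, and I make no claim about how either compiler treats this exact loop.

```d
// Hypothetical 2x2 Bayer cell (RGGB order assumed), 16-bit samples.
struct BayerCell { ushort r, g0, g1, b; }

// Hypothetical output type holding the two gathered primaries.
struct RedBlue { ushort r, b; }

// Strided gather: pick two of the four samples out of every cell.
// A plain scalar loop like this is what the auto-vectorizer gets to work with.
void gatherRedBlue(const(BayerCell)[] src, RedBlue[] dst) pure nothrow @nogc @safe
{
    assert(src.length == dst.length);
    foreach (i, ref cell; src)
        dst[i] = RedBlue(cell.r, cell.b);
}
```

The point of writing it this way is that there is no __vector code and no target-specific source; whether the loop ends up as shuffles or as scalar loads is left entirely to the compiler.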

Additionally, I'll note that GDC can auto-vectorize in the face of multiple outputs (low-degree kernel fusion). I've not seen LDC do that in the few cases that I've tried.
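
To make "multiple outputs" concrete, a loop of roughly this shape is what I mean: one pass over the inputs that writes two result arrays. This is an illustrative sketch only; the function name and the sum/difference kernel are assumptions, not one of the cases I actually measured.

```d
// One fused pass producing two outputs from the same pair of inputs.
// Written as a plain scalar loop; vectorizing it is up to the compiler.
void sumAndDiff(const(float)[] a, const(float)[] b,
                float[] sum, float[] diff) pure nothrow @nogc @safe
{
    assert(a.length == b.length);
    assert(a.length == sum.length && a.length == diff.length);
    foreach (i; 0 .. a.length)
    {
        sum[i]  = a[i] + b[i];   // first output stream
        diff[i] = a[i] - b[i];   // second output stream
    }
}
```

The alternative is two separate single-output loops, which reads each input twice; keeping both stores in one loop is the low-degree fusion referred to above.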

Both LDC and GDC auto-vectorization are sufficiently advanced that I've been able to eliminate a good deal of __vector code and the attendant difficulties in target specialization, source code expansion, and testing. D's __vector capabilities help out with the remainder.

Thanks again for providing a very useful tool.