Thread overview
toy windowing auto-vec miss
Nov 07, 2022
Bruce Carneal
Nov 07, 2022
rikki cattermole
Nov 07, 2022
Bruce Carneal
Nov 07, 2022
Bruce Carneal
Nov 07, 2022
Johan
Nov 07, 2022
Bruce Carneal
Nov 07, 2022
Johan
November 07, 2022

Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: a simple but not trivial operand gather
https://godbolt.org/z/ox1vvxd8s

Compile-time, target-adaptive manual __vector-ization is an answer here if you have no access to SIMT, so it's not a show stopper, but the code is less readable.

I'm not sure what the data-parallel future should look like wrt language/IR, but I'm pretty sure we can do better than praying that the auto-vectorizer can dig patterns out of for loops, or throwing ourselves on the manual vectorization grenade, repeatedly.

November 07, 2022
This might be a bit naive, but ldc's output is about a quarter smaller and uses significantly fewer jumps.

Is gdc actually faster?
November 07, 2022

On Monday, 7 November 2022 at 09:56:13 UTC, rikki cattermole wrote:

> This might be a bit naive, but ldc's output is about a quarter smaller and uses significantly fewer jumps.
>
> Is gdc actually faster?

If you have long enough inputs, yes. A vectorized version quickly overcomes the instruction-stream overhead, after which the performance advantage trends toward N/1.

As you imply, measurement trumps in-one's-head modelling. I'll measure and report on the exact toy code later today, but real-world code with the same "simple but not trivial" operand pattern, involving Bayer/CFA data, has been measured and the performance gap verified. For that code the workaround was manual __vector-ization and use of a shuffle intrinsic.

November 07, 2022

On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:

> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: a simple but not trivial operand gather
> https://godbolt.org/z/ox1vvxd8s
>
> Compile-time, target-adaptive manual __vector-ization is an answer here if you have no access to SIMT, so it's not a show stopper, but the code is less readable.
>
> I'm not sure what the data-parallel future should look like wrt language/IR, but I'm pretty sure we can do better than praying that the auto-vectorizer can dig patterns out of for loops, or throwing ourselves on the manual vectorization grenade, repeatedly.

My "grenade" phrasing above was fun to write but overly dramatic. Manual __vector-ization is more tedious than dangerous, and D's ldc/gdc give you quite a bit of help there, including 1) __vector types and 2) compile-time max-vector-length introspection.

Also, auto-vectorization does work nicely on simple and/or-conditioned inputs/outputs.

I believe there is a lot more to be had in the programmer-friendly-data-parallelism department, perhaps involving a (major) pivot to MLIR, but I give my considered thanks to those involved in providing what is already the best option in that arena from my point of view. Introspection, __vector, auto-vec, dcompute, ... it's a potent toolkit.

November 07, 2022

On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:

> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: a simple but not trivial operand gather
> https://godbolt.org/z/ox1vvxd8s

I don't have time to dive deeper, but I found that removing @restrict results in vectorized instructions with LDC (I don't know if it's faster, just that they appear in the ASM).

-Johan

November 07, 2022

On Monday, 7 November 2022 at 16:49:24 UTC, Johan wrote:

> On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
>
>> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: a simple but not trivial operand gather
>> https://godbolt.org/z/ox1vvxd8s
>
> I don't have time to dive deeper, but I found that removing @restrict results in vectorized instructions with LDC (I don't know if it's faster, just that they appear in the ASM).
>
> -Johan

That's very interesting.

This is the first time I've heard of @restrict making things worse wrt auto-vectorization. From what I've seen in other experiments, @restrict frequently provides a minor benefit (code size reduction) and occasionally enables vectorization of otherwise complex dependency graphs.

Thanks for the heads up.

November 07, 2022

On Monday, 7 November 2022 at 18:14:44 UTC, Bruce Carneal wrote:

> On Monday, 7 November 2022 at 16:49:24 UTC, Johan wrote:
>
>> On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
>>
>>> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: a simple but not trivial operand gather
>>> https://godbolt.org/z/ox1vvxd8s
>>
>> I don't have time to dive deeper, but I found that removing @restrict results in vectorized instructions with LDC (I don't know if it's faster, just that they appear in the ASM).
>>
>> -Johan
>
> That's very interesting.
>
> This is the first time I've heard of @restrict making things worse wrt auto-vectorization. From what I've seen in other experiments, @restrict frequently provides a minor benefit (code size reduction) and occasionally enables vectorization of otherwise complex dependency graphs.

Yeah, this is an LLVM bug.

If you're interested in digging around a bit further, you can look at how the individual optimization passes change the IR code:
https://godbolt.org/z/e9nqPfeKn

The loop vectorization pass does nothing for the @restrict case. Note that the input for that pass is slightly different: the @restrict case has a more complex forbody.preheader and 3 phi nodes in the for body (compared to 1 in the non-restrict case).

-Johan