Thread overview
toy windowing auto-vec miss
Nov 07, 2022
Bruce Carneal
Nov 07, 2022
rikki cattermole
Nov 07, 2022
Bruce Carneal
Nov 07, 2022
Bruce Carneal
Nov 07, 2022
Johan
Nov 07, 2022
Bruce Carneal
Nov 07, 2022
Johan
November 07, 2022

Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: a simple but not trivial operand gather
https://godbolt.org/z/ox1vvxd8s

Compile-time, target-adaptive manual __vector-ization is an answer here if you have no access to SIMT, so it's not a show stopper, but the code is less readable.

I'm not sure what the data-parallel future should look like wrt language/IR, but I'm pretty sure we can do better than praying that the auto-vectorizer can dig patterns out of for loops, or throwing ourselves on the manual vectorization grenade, repeatedly.

November 07, 2022
This might be a bit naive, but ldc's output is about a quarter smaller and uses significantly fewer jumps.

Is gdc actually faster?
November 07, 2022

On Monday, 7 November 2022 at 09:56:13 UTC, rikki cattermole wrote:

> This might be a bit naive, but ldc's output is about a quarter smaller and uses significantly fewer jumps.
>
> Is gdc actually faster?

If you have long enough inputs, yes. A vectorized version quickly overcomes the instruction-stream overhead, after which the performance advantage trends toward N/1.

As you imply, measurement trumps in-one's-head modelling. I'll measure and report on the exact toy code later today, but real-world code with the same "simple but not trivial" operand pattern, involving Bayer/CFA data, has been measured and the performance gap verified. For that code the workaround was manual __vector-ization and use of a shuffle intrinsic.

November 07, 2022

On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:

> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: a simple but not trivial operand gather
> https://godbolt.org/z/ox1vvxd8s
>
> Compile-time, target-adaptive manual __vector-ization is an answer here if you have no access to SIMT, so it's not a show stopper, but the code is less readable.
>
> I'm not sure what the data-parallel future should look like wrt language/IR, but I'm pretty sure we can do better than praying that the auto-vectorizer can dig patterns out of for loops, or throwing ourselves on the manual vectorization grenade, repeatedly.

My "grenade" phrasing above was fun to write but overly dramatic. Manual __vector-ization is more tedious than dangerous, and D's ldc/gdc give you quite a bit of help there, including 1) __vector types and 2) compile-time max-vector-length introspection.

Also, auto-vectorization does work nicely on simple and/or-conditioned inputs/outputs.

I believe there is a lot more to be had in the programmer-friendly-data-parallelism department, perhaps involving a (major) pivot to MLIR, but I give my considered thanks to those involved in providing what is already the best option in that arena from my point of view. Introspection, __vector, auto-vec, dcompute, ... it's a potent toolkit.

November 07, 2022

On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:

> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: a simple but not trivial operand gather
> https://godbolt.org/z/ox1vvxd8s

I don't have time to dive deeper, but I found that removing @restrict results in vectorized instructions with LDC (I don't know if it's faster, just that they appear in the ASM).

-Johan

November 07, 2022

On Monday, 7 November 2022 at 16:49:24 UTC, Johan wrote:

> On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
>
>> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: a simple but not trivial operand gather
>> https://godbolt.org/z/ox1vvxd8s
>
> I don't have time to dive deeper, but I found that removing @restrict results in vectorized instructions with LDC (I don't know if it's faster, just that they appear in the ASM).
>
> -Johan

That's very interesting.

This is the first time I've heard of @restrict making things worse wrt auto-vectorization. From what I've seen in other experiments, @restrict frequently provides a minor benefit (code size reduction) and occasionally enables vectorization of otherwise complex dependency graphs.

Thanks for the heads up.

November 07, 2022

On Monday, 7 November 2022 at 18:14:44 UTC, Bruce Carneal wrote:

> On Monday, 7 November 2022 at 16:49:24 UTC, Johan wrote:
>
>> On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
>>
>>> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: a simple but not trivial operand gather
>>> https://godbolt.org/z/ox1vvxd8s
>>
>> I don't have time to dive deeper, but I found that removing @restrict results in vectorized instructions with LDC (I don't know if it's faster, just that they appear in the ASM).
>>
>> -Johan
>
> That's very interesting.
>
> This is the first time I've heard of @restrict making things worse wrt auto-vectorization. From what I've seen in other experiments, @restrict frequently provides a minor benefit (code size reduction) and occasionally enables vectorization of otherwise complex dependency graphs.

Yeah, this is an LLVM bug.

If you're interested in digging around a bit further, you can look at how the individual optimization passes change the IR code:
https://godbolt.org/z/e9nqPfeKn

The loop vectorization pass does nothing for the @restrict case. Note that the input for that pass is slightly different: the @restrict case has a more complex forbody.preheader and 3 phi nodes in the for body (compared to 1 in the non-restrict case).

-Johan