June 07, 2022

On Tuesday, 7 June 2022 at 22:22:22 UTC, max haughton wrote:

>

Which instructions are they supposed to lower to? Wouldn't that require the intrinsic-in-name-only intel math library stuff?

I've only seen that with the Intel C++ compiler => https://godbolt.org/z/E39rc77fj

It will replace transcendental functions with vectorized versions that do 4 log/exp/pow/sin/cos operations at once, and yes, that is faster.

June 07, 2022

On Tuesday, 7 June 2022 at 23:36:20 UTC, Guillaume Piolat wrote:

> On Tuesday, 7 June 2022 at 22:22:22 UTC, max haughton wrote:
>> Which instructions are they supposed to lower to? Wouldn't that require the intrinsic-in-name-only intel math library stuff?
>
> I've only seen that with the Intel C++ compiler => https://godbolt.org/z/E39rc77fj
>
> It will replace transcendental functions with vectorized versions that do 4 log/exp/pow/sin/cos operations at once, and yes, that is faster.

In LLVM, builtins for instructions that can be expressed with normal code are eventually removed.

However, some of them stay forever in ldc.gccbuiltins_x86 (it has 894 builtins) because there is no other way to make the codegen generate them.

For example, it is impossible to get PMADDWD emitted any other way in LLVM:
https://d.godbolt.org/z/vvqP5vzvo

But, Intel C++ compiler can do it:
https://godbolt.org/z/f37dzKzT3

That's why intel-intrinsics exists: not all autovectorizers are created equal.

June 07, 2022
On 6/7/2022 11:36 AM, matheus wrote:
> What was the magazine? I'd like to search and read them.

I wish I remembered. I was so angry about it I didn't save a copy, and eventually forgot which one it was :-(
June 07, 2022
On 6/7/2022 2:16 PM, Guillaume Piolat wrote:
> - The usage of D_SIMD and core.simd is not yet enabled in intel-intrinsics with DMD because it disrupts Core and PR, so until now I've been doing it in several attempts to test the water and not distract too many people. But it will probably be enabled next time, probably in August.
> 
> - Once core.simd is enabled for good, D_SIMD usage can be ramped up in intel-intrinsics (currently, only SSE1 instructions use the D_SIMD _builtins_).
>    I expect doing it for all the other SIMD instructions will take a few months of spare-time work, and will expose about 4-5 new bugs, and that will be it. If the DMD test suite follows, it should hopefully be set in stone. wishful.gif
> 
>    Until now, most of the issues came from regular vector type usage, not really the D_SIMD builtins, so I expect that last part to be less contentious.
> 
> I'm curious how close we can get to optimized LDC codegen with DMD codegen in an entirely SIMD codebase; it will be somewhere between 0.5x and 1.0x. In small godbolt examples it can get pretty close.

I'm looking forward to the progress you're making!

As an aside, I am thinking about making D's array operations work more like APL does:

https://en.wikipedia.org/wiki/APL_(programming_language)

APL seems a good fit for SIMD. Look, Ma, no loops!
June 08, 2022

On Tuesday, 7 June 2022 at 21:48:10 UTC, Guillaume Piolat wrote:

> On Tuesday, 7 June 2022 at 21:09:08 UTC, Bruce Carneal wrote:
>> 1. It's measurably (slightly) faster in many instances (it helps that I can shape the operand flows for this app)
>
> Depends on the loop; for example, LLVM can vectorize llvm_sqrt
> https://d.godbolt.org/z/6xaTKnn9z
>
> but not llvm_exp / llvm_cos / llvm_sin / llvm_log
>
> https://d.godbolt.org/z/Pc34967vc
>
> for such loops I have to go __vector

Yes. I too definitely want __vector as a backstop when autovec isn't up to the job and SIMT math is not available. I also reach for __vector when conditionals become problematic.

June 08, 2022

On Wednesday, 8 June 2022 at 00:07:14 UTC, Walter Bright wrote:

> On 6/7/2022 2:16 PM, Guillaume Piolat wrote:
>> • The usage of D_SIMD and core.simd is not yet enabled in intel-intrinsics with DMD because it disrupts Core and PR, so until now I've been doing it in several attempts to test the water and not distract too many people. But it will probably be enabled next time, probably in August.
>>
>> • Once core.simd is enabled for good, D_SIMD usage can be ramped up in intel-intrinsics (currently, only SSE1 instructions use the D_SIMD builtins).
>>    I expect doing it for all the other SIMD instructions will take a few months of spare-time work, and will expose about 4-5 new bugs, and that will be it. If the DMD test suite follows, it should hopefully be set in stone. wishful.gif
>>
>>    Until now, most of the issues came from regular vector type usage, not really the D_SIMD builtins, so I expect that last part to be less contentious.
>>
>> I'm curious how close we can get to optimized LDC codegen with DMD codegen in an entirely SIMD codebase; it will be somewhere between 0.5x and 1.0x. In small godbolt examples it can get pretty close.
>
> I'm looking forward to the progress you're making!
>
> As an aside, I am thinking about making D's array operations work more like APL does:
>
> https://en.wikipedia.org/wiki/APL_(programming_language)
>
> APL seems a good fit for SIMD. Look, Ma, no loops!

... also has no loops.

June 07, 2022
On 6/7/2022 2:15 PM, max haughton wrote:
> They should already be filed. I can find the issues if required, but they should be just a bugzilla search away.

Generally speaking, when one writes about bugs, having a bugzilla reference handy says a lot.


> Realistic for anyone other than you then ;)

I've never looked at the code generator for gdc or ldc. I doubt either is easy to contribute to. The x86 line is *extremely* complex and making a change that will work with all the various models is never ever going to be simple.


> With the register allocator as a specific example: my attention span is rather short, but it was pretty impenetrable when I tried to work out what algorithm it actually uses.

It uses the usual graph coloring algorithm for register allocation. The kludginess in it comes from the first 8 registers all being special cases.


>> Triples *are* SSA, in that each node can be considered an assignment to a temporary.
> 
> Meanwhile I can assign to a variable in a loop? Or maybe you aren't supposed to be able to and that was a bug in some IR I generated? The trees locally are SSA but do you actually enforce SSA in overall dataflow?
> 
> Similarly SSA applies across control flow boundaries, phi nodes?
> 
> Anyway my point about SSA is more that one of the reasons it's prevalent is that it makes writing passes quite a lot easier.

I don't know how you're trying to do things, but the optimizer generates variables all the time. Each node in the elem tree gets assigned exactly once and used exactly once.

> Implementing address sanitizer should be a doddle but it was not with the dmd backend so I had to give up.
> 
> This is actually a really useful feature that we already link with at the moment on Linux so I can tell you exactly the lowering needed if required. You'll be able to do it within minutes I'm sure but I can't devote the time to working out and testing all the bitvectors for the DFA etc.

I'm sure it's not trivial, but I'm also sure it isn't trivial in ldc, either.


> In the same way that structured programming has nothing to do with avoiding bugs. I'm not saying it's impossible to do without SSA it's just indicative of an IR that isn't very good compared to even simple modern ones.

I looked into doing SSA, and realized it was pointless because the binary tree does the same thing.

> They look like special cases, and maybe they are fundamentally, but other compilers simply do not have this rate of mistakes when it comes to SIMD instruction selection or code generation.

Other compilers have a massive test suite that detects problems up front. We don't. I also do not work full time on the back end. This is not a structural problem, it's a resource problem.

> I want to help with this and just today I have fixed a backend issue but I just wanted to say some things about it.

That's fine, that's what the forums are for! And it gives me an opportunity to help out.

The only structural problem with the backend I've discovered is that it cannot embed goto nodes in the tree structure, so it can't inline loops. I doubt that is a big problem, because loops are intrinsically hardly worth inlining.
June 11, 2022
On Wednesday, 8 June 2022 at 00:02:16 UTC, Walter Bright wrote:
> On 6/7/2022 11:36 AM, matheus wrote:
>> What was the magazine? I'd like to search and read them.
>
> I wish I remembered. I was so angry about it I didn't save a copy, and eventually forgot which one it was :-(

Oh I see. Do you happen to know at least what year this was?

Matheus.
June 12, 2022
On 6/10/2022 5:41 PM, matheus wrote:
> On Wednesday, 8 June 2022 at 00:02:16 UTC, Walter Bright wrote:
>> On 6/7/2022 11:36 AM, matheus wrote:
>>> What was the magazine? I'd like to search and read them.
>>
>> I wish I remembered. I was so angry about it I didn't save a copy, and eventually forgot which one it was :-(
> 
> Oh I see. Do you happen to know at least what year this was?
> 
> Matheus.

Pretty sure 1985.
June 12, 2022
On 6/8/22 03:30, Walter Bright wrote:
> 
>> I want to help with this and just today I have fixed a backend issue but I just wanted to say some things about it.
> 
> That's fine, that's what the forums are for! And it gives me an opportunity to help out.
> 
> The only structural problem with the backend I've discovered is that it cannot embed goto nodes in the tree structure, so it can't inline loops. I doubt that is a big problem, because loops are intrinsically hardly worth inlining.

Inlining a function (even with a loop) can enable further optimizations that depend on additional information that's available at the call site.