June 07, 2022
On 6/7/2022 6:16 AM, Ali Çehreli wrote:
> Anything more to the story? You should have gotten more sales when they (hopefully) published the explanation with an apology. (?)

They refused to publish a correction. They didn't give a fig, because I wasn't Microsoft and a big advertiser.
June 07, 2022
On 6/7/2022 1:44 AM, max haughton wrote:
> Unfortunately the dmd optimizer and inliner are somewhat buggy, so most "enterprise" users actually avoid them like the plague or have fairly hard-won scars. At least one of our libraries doesn't work with -O -inline, and many other cases have similar issues.

I am not aware of this. Bugzilla issues, please?


> optimization isn't just about adding special cases: the sheer amount of optimizations dmd would have to learn on top of an already horrific codebase in the backend (note that to start with the register allocator is pretty bad, but isn't easily replaced) means that, while it's technically possible, it's practically just not realistic.

Everybody says that, but I just go and add the optimizations people want anyway.


> The IR used by the backend is also quite annoying to work with. SSA is utterly dominant in compilers (and has been since the late 90s) for a reason. It's not its fault, because it's so old, but the infrastructure and invariants you get from it are nothing even close to being easy to work on versus even GCC before they realized LLVM was about to completely destroy them (almost no one does new research with GCC anymore, mostly because LLVM is easier to work on).

Triples *are* SSA, in that each node can be considered an assignment to a temporary.


> This IR and the lack of structure around operations on it is why dmd has so many bugs wrt things like SIMD code generation.

SSA has nothing whatsoever to do with SIMD code generation.

The SIMD problems have all turned out to be missed special cases (SIMD is very complex). If you look at the PRs I made to fix them, they're all small. Not indicative of any fundamental problem.

June 07, 2022
I thought I fixed the SIMD issues you were having? What is left undone?
June 07, 2022
On Tuesday, 7 June 2022 at 18:22:55 UTC, Walter Bright wrote:
> On 6/7/2022 6:16 AM, Ali Çehreli wrote:
>> Anything more to the story? You should have gotten more sales when they (hopefully) published the explanation with an apology. (?)
>
> They refused to publish a correction. They didn't give a fig, because I wasn't Microsoft and a big advertiser.

What was the magazine? I'd like to search for and read those reviews.

By the way, today those reviewers would get a harsh lesson from the internet, as we see from time to time when people make dumb claims/mistakes.

Matheus.
June 07, 2022

On Tuesday, 7 June 2022 at 18:21:57 UTC, Walter Bright wrote:
> On 6/7/2022 2:23 AM, Bruce Carneal wrote:
> ...
>
> I've never much liked autovectorization:

Same here, which is why my initial CPU-side implementation was all explicit __vector/intrinsics code (with corresponding static arrays to get a sane unaligned load/store capability).

> 1. you never know if it is going to vectorize or not. The vector instruction sets vary all over the place, and whether they line up with your loops or not is not determinable in general - you have to look at the assembler dump.

I now take this as an argument for auto vectorization.

> 2. when autovectorization doesn't happen, the compiler reverts to non-vectorized slow code. Often, you're not aware this has happened, and the expected performance doesn't happen. You can usually refactor the loop so it will autovectorize, but that's something only an expert programmer can accomplish, and he can't do it if he doesn't realize the autovectorization didn't happen. You said it yourself: "if perf drops"!

Well, presumably you're "unittesting" performance to know where the hot spots are so... It's always nicer to know things at compile time but for me it's acceptable at "unittest time" since the measurements will be part of any performance code development setup.

> 3. it's fundamentally a backwards thing. The programmer writes low level code (explicit loops) and the compiler tries to work backwards to create high level code (vectors) for it! This is completely backwards to how compilers normally work - specify a high level construct, and the compiler converts it into low level.

I see it as a choice on the "time to develop" <==> "performance achieved" axis. Fortunately autovectorization can be a win here: develop simple/correct code with an eye to compiler-visible indexing and hand-vectorize if there's a problem. (I actually went the other way, starting with hand optimized core functions, and discovered that auto-vectorization worked as well or better for many of those functions).

> 4. with vector code, the compiler will tell you when the instruction set won't map onto it, so you have a chance to refactor it so it will.

Yes, better to know things at compile time but OK to know them at perf "unittest" time.

Here are some of the reasons I'm migrating much of my code to auto-vectorization with perf regression tests from the initial __vector/intrinsic implementation:

  1. It's more readable.

  2. It is auto-upgradeable (with @target metaprogramming for multi-target deployability)

  3. It's measurably (slightly) faster in many instances (it helps that I can shape the operand flows for this app)

  4. It fits more readily with upcoming CPU-centric vector architectures (SVE, SVE2, RVV, ...). Cray vectors ride again! :-)

  5. It aligns stylistically with SIMT (I think in terms of index spaces and memory subsystem blocking rather than HW details). SIMT is where I believe we should be looking for future, significant performance gains (the PCIe bottleneck is a stumbling block but SoCs and consoles have the right idea).

The mid-range goal is to develop in an it-just-works, no-big-deal SIMT environment where the traditional SIMD awkwardness is in the rear view mirror and where we can surf the improving HW performance wave (clock increases were nice while they lasted but ...). dcompute is already a good ways down that road but it can be friendlier and more capable. As I've mentioned elsewhere, I already prefer it to CUDA.

Finally, thanks for creating D. It's great.

June 07, 2022
On Tuesday, 7 June 2022 at 18:27:32 UTC, Walter Bright wrote:
> On 6/7/2022 1:44 AM, max haughton wrote:
>> Unfortunately the dmd optimizer and inliner are somewhat buggy, so most "enterprise" users actually avoid them like the plague or have fairly hard-won scars. At least one of our libraries doesn't work with -O -inline, and many other cases have similar issues.
>
> I am not aware of this. Bugzilla issues, please?

The main one we hit is NRVO issues with -inline. Mathias Lang also showed me an issue with it, but I can't remember which project it was.

They should already be filed. I can find the issues if required, but they should be just a Bugzilla search away.

>
>> optimization isn't just about adding special cases: the sheer amount of optimizations dmd would have to learn on top of an already horrific codebase in the backend (note that to start with the register allocator is pretty bad, but isn't easily replaced) means that, while it's technically possible, it's practically just not realistic.
>
> Everybody says that, but I just go and add the optimizations people want anyway.

Realistic for anyone other than you then ;)

With the register allocator as a specific example: my attention span is rather short, but it was pretty impenetrable when I tried to work out for myself what algorithm it actually uses.

>
>> The IR used by the backend is also quite annoying to work with. SSA is utterly dominant in compilers (and has been since the late 90s) for a reason. It's not its fault, because it's so old, but the infrastructure and invariants you get from it are nothing even close to being easy to work on versus even GCC before they realized LLVM was about to completely destroy them (almost no one does new research with GCC anymore, mostly because LLVM is easier to work on).
>
> Triples *are* SSA, in that each node can be considered an assignment to a temporary.

Meanwhile, I can assign to a variable in a loop? Or maybe you aren't supposed to be able to, and that was a bug in some IR I generated? The trees are locally SSA, but do you actually enforce SSA in the overall dataflow?

Similarly, SSA applies across control-flow boundaries: what about phi nodes?

Anyway my point about SSA is more that one of the reasons it's prevalent is that it makes writing passes quite a lot easier.

Implementing AddressSanitizer should be a doddle, but it was not with the dmd backend, so I had to give up.

This is actually a really useful feature that we already link with at the moment on Linux, so I can tell you exactly the lowering needed if required. You'll be able to do it within minutes, I'm sure, but I can't devote the time to working out and testing all the bitvectors for the DFA etc.

>
>> This IR and the lack of structure around operations on it is why dmd has so many bugs wrt things like SIMD code generation.
>
> SSA has nothing whatsoever to do with SIMD code generation.

In the same way that structured programming has nothing to do with avoiding bugs. I'm not saying it's impossible to do without SSA; it's just indicative of an IR that isn't very good compared to even simple modern ones.

Within the backend I actually do think there is an elegant codebase trying to escape; it's just that, to make it feasible to extract, it would need a few snips in the right place.

> The SIMD problems have all turned out to be missed special cases (SIMD is very complex). If you look at the PRs I made to fix them, they're all small. Not indicative of any fundamental problem.

They look like special cases, and maybe they are fundamentally, but other compilers simply do not have this rate of mistakes when it comes to SIMD instruction selection or code generation.

It just smells really bad compared to GCC. LLVM tries to be fancy in this area, so I wouldn't copy it.

I do want to help with this (just today I fixed a backend issue), but I wanted to say some things about it first.
June 07, 2022

On Tuesday, 7 June 2022 at 18:28:29 UTC, Walter Bright wrote:
> I thought I fixed the SIMD issues you were having? What is left undone?

I checked and my issues have indeed been fixed, thanks for that.
Because it's an ongoing process, I'll come back later with a new set of issues.

The current status:

  • The usage of D_SIMD and core.simd is not yet enabled in intel-intrinsics with DMD, because it disrupts the core and PRs; so until now I've been doing it in several attempts, to test the water and not distract too many people. But it will be enabled next time, probably in August.

  • Once core.simd is enabled for good, D_SIMD usage can be ramped up in intel-intrinsics (currently, only the SSE1 instructions use the D_SIMD builtins).
    I expect doing it for all the other SIMD instructions will take a few months of spare-time work and will expose about 4-5 new bugs, and that will be it. If the DMD test suite follows, it should hopefully be set in stone. wishful.gif

    Until now, most of the issues came from regular vector type usage, not really the D_SIMD builtins, so I expect that last part to be less contentious.

I'm curious how close we can get to optimized LDC codegen with DMD codegen in an entirely SIMD codebase; it will be somewhere between 0.5x and 1.0x. In small godbolt examples it can get pretty close.

June 07, 2022

On Tuesday, 7 June 2022 at 21:16:22 UTC, Guillaume Piolat wrote:
> On Tuesday, 7 June 2022 at 18:28:29 UTC, Walter Bright wrote:
>> [...]
>
> I checked and my issues have been fixed indeed, thanks for that.
> Because it's an ongoing process, I'll come back later with a new set of issues.
>
> [...]

It should be very close, as long as the register allocator doesn't get in your way.

June 07, 2022

On Tuesday, 7 June 2022 at 21:09:08 UTC, Bruce Carneal wrote:
> 3. It's measurably (slightly) faster in many instances (it helps that I can shape the operand flows for this app)

Depends on the loop, for example LLVM can vectorize llvm_sqrt
https://d.godbolt.org/z/6xaTKnn9z

but not llvm_exp / llvm_cos / llvm_sin / llvm_log

https://d.godbolt.org/z/Pc34967vc

for such loops I have to go __vector

June 07, 2022

On Tuesday, 7 June 2022 at 21:48:10 UTC, Guillaume Piolat wrote:
> On Tuesday, 7 June 2022 at 21:09:08 UTC, Bruce Carneal wrote:
>> 3. It's measurably (slightly) faster in many instances (it helps that I can shape the operand flows for this app)
>
> Depends on the loop, for example LLVM can vectorize llvm_sqrt
> https://d.godbolt.org/z/6xaTKnn9z
>
> but not llvm_exp / llvm_cos / llvm_sin / llvm_log
>
> https://d.godbolt.org/z/Pc34967vc
>
> for such loops I have to go __vector

Which instructions are they supposed to lower to? Wouldn't that require the intrinsic-in-name-only intel math library stuff?