December 30, 2020
On Wednesday, 30 December 2020 at 16:53:41 UTC, Iain Buclaw wrote:
> On Wednesday, 30 December 2020 at 16:03:56 UTC, Max Haughton wrote:
>> On Wednesday, 30 December 2020 at 15:56:49 UTC, Ola Fosheim Grøstad wrote:
>>> On Wednesday, 30 December 2020 at 15:00:15 UTC, Max Haughton wrote:
>>>> Cranelift already has basic ARM support too; I can't comment on the quality of code generated.
>>>
>>> Are you thinking porting?
>>
>> No, just guessing how much work it would be. I would quite like to get a basic backend going, but it's much easier said than done (i.e. most optimisations are fairly simple, but generating proper code and debug info at the end takes ages to test, let alone write).
>
> The dmd back-end is surprisingly more segregated than you might think; there are a bunch of entrypoint methods you can override (class Obj, if I remember correctly), and from there, modules can act as their own encapsulation for emitting code for different CPUs: just swap the x87 modules for ARM modules in a hypothetical build.
>
> Someone already did 70% of the work in untangling the back-end and made a toy ARMv4 backend several years ago, though it may not be in a salvageable state now, especially if the ultimate aim is to generate code for the Apple M1.

I was more thinking of how (say) cgsched.d seems to basically assume either a basic out-of-order core or the Pentium (P5) U/V pipeline. I don't know empirically how sensitive a modern core is to instruction scheduling, given that a full Tomasulo-style out-of-order core actually makes it relatively hard to force data hazards, as opposed to an in-order pipeline.

Making that work with a generic machine model is probably harder than writing a simple scheduler from scratch (I'm not sure whether Cranelift actually has a proper instruction scheduler).

If the actual object-code and debug output could be reused, it would save a huge amount of work.

It would also be interesting to try to enforce a hard upper bound on CPU instructions per IR instruction for debug builds.
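A minimal sketch of how such a bound could be enforced: expand each IR instruction through a fixed-size template, so the emitted instruction count per IR instruction can never exceed the size of the largest template. The IR opcodes and pseudo-machine instructions here are invented for illustration; they are not DMD's.

```python
# Sketch: debug-build emitter with a hard per-IR-instruction upper bound.
# The IR ops and machine templates below are hypothetical, not DMD's.

TEMPLATES = {
    # each IR opcode expands to a fixed sequence of machine ops
    "add": ["load r0, {a}", "load r1, {b}", "add r0, r1", "store {dst}, r0"],
    "mul": ["load r0, {a}", "load r1, {b}", "mul r0, r1", "store {dst}, r0"],
    "ret": ["load r0, {a}", "ret"],
}

# the hard upper bound: no IR instruction can expand beyond the largest template
MAX_PER_INSN = max(len(t) for t in TEMPLATES.values())

def emit(ir):
    """Expand each IR instruction via its template; cost is bounded per insn."""
    out = []
    for op, operands in ir:
        out.extend(line.format(**operands) for line in TEMPLATES[op])
    assert len(out) <= MAX_PER_INSN * len(ir)
    return out

code = emit([
    ("add", {"a": "x", "b": "y", "dst": "t"}),
    ("ret", {"a": "t"}),
])
```

The trade-off is the usual one for baseline compilers: templates waste registers and reload operands, but code generation time becomes strictly linear in IR size.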



December 30, 2020
On Wednesday, 30 December 2020 at 16:22:27 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 30 December 2020 at 16:18:34 UTC, Max Haughton wrote:
>> Space optimizing still requires quite a lot of time eliminating work, although LLVM is still pretty bad at it specifically: if you give a recursive factorial implementation to LLVM, it can't see that the overflow is going to happen, so it will (at O3) give you about 100 SIMD instructions rather than a simple loop.
>
> :-D I wasn't aware of that.
>
> Yes, but I guess one could start with a non-optimizing SSA-based backend with the intent of improving it later, but with a focus on space optimization. At least it could be competitive in a niche rather than being a lesser version of LLVM...

SSA basically gets you a lot of optimisations for "free" (amortized) if you maintain the property, but other than that, a simple backend is hugely faster than LLVM, as evidenced by languages moving away from LLVM for debug builds.
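To illustrate the "free (amortized)" point: because every SSA value has exactly one definition, a single forward pass over the instructions suffices for constant folding, with no iteration to a fixed point. A toy sketch, using an invented three-address IR rather than any real compiler's:

```python
# Minimal sketch: constant folding is near-free on SSA because every value
# has exactly one definition, so one forward pass resolves everything.
# The tiny IR here is invented for illustration.

def fold_constants(ssa):
    """ssa: list of (name, op, args); args are SSA names or int literals."""
    known = {}          # SSA name -> constant value, filled in one pass
    out = []
    for name, op, args in ssa:
        # substitute already-known constants into the operand list
        vals = [known.get(a, a) if isinstance(a, str) else a for a in args]
        if op == "const":
            known[name] = vals[0]
            continue
        if all(isinstance(v, int) for v in vals):
            known[name] = vals[0] + vals[1] if op == "add" else vals[0] * vals[1]
            continue
        out.append((name, op, vals))  # not constant; keep, operands simplified
    return out, known

out, known = fold_constants([
    ("a", "const", [2]),
    ("b", "const", [3]),
    ("c", "add", ["a", "b"]),   # folds to 5
    ("d", "mul", ["c", "x"]),   # 'x' unknown: kept, with c replaced by 5
])
```

Without the single-definition guarantee, the same analysis would need reaching-definitions information and possibly several passes.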
December 30, 2020
On Wednesday, 30 December 2020 at 17:49:43 UTC, Max Haughton wrote:
> SSA basically gets you a lot of optimisations for "free" (amortized) if you maintain the property, but other than that, a simple backend is hugely faster than LLVM, as evidenced by languages moving away from LLVM for debug builds.

Maybe it is possible to design an SSA form that is very close to WASM? I think that tight code gen for WASM could be competitive, as there is a desire to keep downloads small on the web.


December 30, 2020
On Wednesday, 30 December 2020 at 18:04:25 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 30 December 2020 at 17:49:43 UTC, Max Haughton wrote:
>> SSA basically gets you a lot of optimisations for "free" (amortized) if you maintain the property, but other than that, a simple backend is hugely faster than LLVM, as evidenced by languages moving away from LLVM for debug builds.
>
> Maybe it is possible to design an SSA form that is very close to WASM? I think that tight code gen for WASM could be competitive, as there is a desire to keep downloads small on the web.

WASM is a stack machine, IIRC, whereas most compiler optimisations are traditionally formulated for register machines, so it would be more productive to have WASM as a traditional backend target.

I think any algorithms that rely on DAGs are also more efficient with register IRs, so again it would be easy to have a traditional SSA IR (probably with block arguments instead of phi nodes, because it's 2020) and convert to a stack machine afterwards.
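A sketch of that "convert to a stack machine afterwards" step, assuming a toy expression IR: the `local.get` spelling mimics WASM, but the lowering is deliberately naive (it re-pushes operands rather than reusing values already on the stack, which a real lowering would try to do).

```python
# Sketch of lowering a register-style expression IR to a WASM-like stack
# machine. The IR shape and opcode names here are illustrative only.

def to_stack(ir, result):
    """ir: dict mapping name -> (op, left, right); absent names are leaves
    (function locals). Emits postorder: operands first, then the operator."""
    code = []
    def gen(name):
        node = ir.get(name)
        if node is None:                 # leaf: a function local
            code.append(f"local.get {name}")
        else:
            op, left, right = node
            gen(left)
            gen(right)
            code.append(op)
    gen(result)
    return code

# t = a + b; u = t * c   -->   push a, push b, add, push c, mul
code = to_stack({"t": ("add", "a", "b"), "u": ("mul", "t", "c")}, "u")
```

Since stack code is just a postorder walk of the expression DAG, going register IR to stack is cheap; reconstructing a register IR from stack code is the direction that takes real work.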
December 30, 2020
On Wednesday, 30 December 2020 at 18:25:41 UTC, Max Haughton wrote:
> WASM is a stack machine, IIRC, whereas most compiler optimisations are traditionally formulated for register machines, so it would be more productive to have WASM as a traditional backend target.

I was more thinking of having the same SSA instructions/limitations. If one can do fast SSA-to-WASM translation, then one might be able to use whatever WASM execution engine Chrome uses for JIT? Just a thought...

> I think any algorithms that rely on DAGs are also more efficient with register IRs, so again it would be easy to have a traditional SSA IR (probably with block arguments instead of phi nodes, because it's 2020) and convert to a stack machine afterwards.

Oh, yes, for sure use a simple register IR.

December 30, 2020
On Wednesday, 30 December 2020 at 18:35:03 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 30 December 2020 at 18:25:41 UTC, Max Haughton wrote:
>> WASM is a stack machine, IIRC, whereas most compiler optimisations are traditionally formulated for register machines, so it would be more productive to have WASM as a traditional backend target.
>
> I was more thinking of having the same SSA instructions/limitations. If one can do fast SSA-to-WASM translation, then one might be able to use whatever WASM execution engine Chrome uses for JIT? Just a thought...

https://v8.dev/blog/liftoff

It seems to trade code quality for fast code-gen. Maybe there are some ideas in there.

December 30, 2020
On Tuesday, 15 December 2020 at 02:10:26 UTC, RSY wrote:
> As for LDC as the default, I disagree: compilation time with LDC is very slow, even in debug mode, so DMD should stay the default for the sake of quick iteration during development.

At least until LLVM's code generation is as fast as DMD's.
December 30, 2020
On 12/30/2020 7:00 AM, Max Haughton wrote:
> Re: the cost of a DMD backend for ARM, the existing backend is loaded with implementation details from the Pentium (P5) and Pentium Pro (P6), and is generally not very nice to read or write

The overall design of it is pretty simple. The complexity comes from the complexity of the instruction set. There's no way to wish that complexity away. You'll even see it in the inline assembler. I'm actually rather surprised that the initial design of it has proven effective for nearly 40 years despite vast changes in the Intel architecture.

The CPU architecture that didn't fit in too well with the code generator design was the wacky x87 FPU, which used a primitive stack architecture. I never did a good job with that, but the point is moot these days as the x87 is effectively dead.


> - it would probably be easier to do a basic retargetable code generator from scratch

Everyone thinks that. But not a chance. The last 1% will take 500% of the time, and you'll be chasing bugs for years that the old one already solved.


> but keep the existing backend for x86 in the meantime.
> 
> For a material estimate of size, the Cranelift backend Rust has is about 87k lines (inc. tests, IIRC), so somewhere on that order (and we can generate huge amounts of code for free because D) - I think our backend is a bit bigger than that.

The backend of DMD is 121,000 lines, but that includes the optimizer, symbolic debug info, exception handling table generation, a lot for various object file formats, etc., all of which is pretty generically written. The actual code gen is around 39,000 lines.
December 30, 2020
On 12/30/2020 9:47 AM, Max Haughton wrote:
> I was more thinking of how (say) cgsched.d seems to basically assume either a basic out-of-order core or the Pentium (P5) U/V pipeline.

Not only does it seem to, it was specifically designed for the Pentium and later the P6 architectures. However, scheduling is not so important for modern CPUs. The scheduler can also simply be turned off; it is entirely optional.
December 31, 2020
On Thursday, 31 December 2020 at 01:11:10 UTC, Walter Bright wrote:
>> - it would probably be easier to do a basic retargetable code generator from scratch
>
> Everyone thinks that. But not a chance. The last 1% will take 500% of the time, and you'll be chasing bugs for years that the old one already solved.


That was specifically referring to properly adding ARM support, rather than replacing the backend for the sake of it. I'm not in any way under the impression that either would be easy.