Update on the D-to-Jai guy: We have real problems with the language

November 27, 2022

Posted by FeepingCreature

I've had an extended Discord call taking a look at the codebase. Now, these are only my own thoughts, but I'll share them anyway:

This is a fairly pedestrian codebase. No CTFE craziness, restrained "normal" use of templates. It's exactly the sort of code that D is supposed to be fast at.
To be fair, his computer isn't the fastest. But it's an 8-core AMD, so DMD's lack of internal parallelization hurts it here. This will only get worse in the future.
And sure, there's a bunch of somewhat quadratic templates that explode a bit. But!

But!

It's all "pedestrian" use. Containers with lots of members instantiated with lots of types.
The compiler doesn't surface what is fast and what is slow, and doesn't give you a way to notice it. And no, -vtemplates isn't enough: we need a way to tell the time taken, not just the number of instantiations.
But also, if we're talking about number of instantiations, hasUDA and getUDA lead the pack. I think the way these work is just bad - instantiating a unique copy for every combination of field and UDA is borderline quadratic. I've rewritten all my own hasUDA/getUDA code to be of the form udaIndex!(U, __traits(getAttributes, T)), but that didn't help much even though -vtemplates hinted that it should. -vtemplates needs to attribute compile time to templates recursively.
LLVM is painful. Unavoidable, but painful. Probably twice the compile time of the ldc2 run was in the LLVM backend.
There was no smoking gun. It's not like "ah yeah, this thing, just don't do it." It's a lot of code that instantiates a lot of genuine workhorse templates (99% "function with type" or "struct with type"), and it was okay for a long time and then it wasn't.
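
For readers who haven't seen the udaIndex!(U, __traits(getAttributes, T)) form, here is a minimal sketch of what such a helper might look like. The post only shows the call form, so the body below is a guess, and Column, PrimaryKey and User are made-up names for illustration. The idea is that the instantiation is keyed on the attribute list rather than on the field symbol, so fields with identical attribute lists can share one instance.

```d
// Hypothetical UDAs, for illustration only.
struct Column { string name; }
struct PrimaryKey {}

// Assumed implementation: index of the first attribute that either is the
// type U or is a value of type U; -1 if there is none.
template udaIndex(U, attrs...)
{
    enum ptrdiff_t udaIndex = {
        ptrdiff_t result = -1;
        static foreach (i, attr; attrs)
        {
            static if (is(attr == U) || is(typeof(attr) == U))
            {
                if (result == -1) result = i;
            }
        }
        return result;
    }();
}

struct User
{
    @PrimaryKey @Column("id") int id;
    @Column("name") string name;
}

// The attribute list is queried at the site of instantiation.
static assert(udaIndex!(PrimaryKey, __traits(getAttributes, User.id)) == 0);
static assert(udaIndex!(Column, __traits(getAttributes, User.id)) == 1);
static assert(udaIndex!(PrimaryKey, __traits(getAttributes, User.name)) == -1);

void main() {}
```

Whether this actually saves compile time depends on the codebase; as noted above, it made little difference in this case.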

I really think the primary issue here is just that D gives you a hundred tools to dig yourself into a hole, and has basically no tools to dig yourself out of it, and if you do dig yourself out you have to go "against the grain" of how the language wants to be used. And like, as an experienced dev I know the tricks of how to optimize templates, and I've sunk probably a hundred hours into this for my two libs at work alone, but this is folk knowledge: it's not part of the stdlib, or the spec, or documented anywhere at all. Like if (__ctfe) return;. Like udaIndex!(__traits). Like is(T : U*, U) instead of isPointer. Like making struct methods templates so they're only compiled when needed. Like moving recursive types out of templates to reduce the compilation time. Like keeping your unique instantiations as low as possible by querying information with traits at the site of instantiation. Like -v to see where time is spent. Like ... and so on. This goes for every part of the language, not just templates.
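
Two of the tricks in that list can be sketched in a few lines. This is only an illustration under assumptions: describe and Matrix are hypothetical names, and the real savings depend on how often each pattern occurs in a codebase.

```d
// Trick: an inline `is` expression instead of std.traits.isPointer.
// No helper template gets instantiated per type, and U is bound to the
// pointee type inside the static if branch.
string describe(T)()
{
    static if (is(T : U*, U))
        return "pointer to " ~ U.stringof;
    else
        return T.stringof;
}

static assert(describe!(int*)() == "pointer to int");
static assert(describe!int() == "int");

// Trick: give a struct method an empty template parameter list so that its
// body is only analysed and compiled if something actually calls it.
struct Matrix
{
    double[4] cells = 0;

    // Note the extra (): this makes trace a template.
    double trace()() const
    {
        return cells[0] + cells[3];
    }
}

void main()
{
    Matrix m;
    m.cells = [1.0, 2.0, 3.0, 4.0];
    assert(m.trace() == 5.0); // first use instantiates trace
}
```

Note that is(T : U*, U) tests implicit convertibility to a pointer, so it is not an exact drop-in for isPointer in every case.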

DMD is fast. DMD is even fast for what it does. But DMD is not as fast as it implicitly promises when templates are advertised, and DMD does not expose enough good ways to make your code fast again when you've fallen in a hole.

November 27, 2022

Posted by rikki cattermole
in reply to FeepingCreature

It is significantly worse than just a few missing tools.

The common solution in the native world to improve compilation time is shared libraries.

Incremental compilation does help, but shared libraries are how you isolate and hide away a ton of details regarding template instantiations that don't need to be exposed.

Only... we can't do shared libraries cleanly, and where it is possible it is pretty limiting (such as requiring a specific compiler).

Yesterday I tried to get the .di generator to produce a .di file for a project. It has somehow started to produce a ton of garbage at the bottom of the file that certainly isn't valid D code. Even if that wasn't there, how on earth is the compiler going to -I them when they are not in directories? Yikes.

Needless to say, we have a ton of implementation details that are both low hanging and high value which have no alternatives.

November 27, 2022

Posted by Hipreme
in reply to FeepingCreature

Totally agreed, especially with the part about there being "basically no tools to dig yourself out of it".

I would like to point to some PRs which I think could be game changers for D.

WIP in DMD that both Per and Stefan have done for better build-time profiling: https://github.com/dlang/dmd/pull/14635

Having talked with Stefan, there isn't much hope of this getting merged, though it's so important.

CTFECache, caching the CTFE
https://github.com/dlang/dmd/pull/7843

This one has been a bit more inactive recently; I think it may need some help.

Those 2 PRs should get more attention than other things right now, especially since I think there has been an increasing number of people dissatisfied with D compilation times (see reggae).

I have been having problems with compilation times for some time now.

I have almost wiped stdlib usage from my project due to its immense imports, template usage and some choices that break compilation speed (looking at you, to!string(float)).
From the ldc build profile, importing core.sys.windows took too much time, so I rewrote only the part that I needed to make the build times slightly faster (I think I gained about 0.3 seconds).
My projects have been completely modularized, and there are only about 2 modules that are imported by all other modules, yet modularizing or not didn't make much difference.

November 27, 2022

Posted by ryuukk_
in reply to FeepingCreature

So he is using std.meta and std.traits? Then no wonder; he should nuke these two imports.

These two modules should be removed from the language, plain and simple, and __traits should be improved to accommodate that.

November 27, 2022

Posted by ryuukk_
in reply to ryuukk_

Speaking of Jai, it's not fast either when you start to do a lot of logic at compile time.

Here, as you can see, is a Jai project that takes 5.5 seconds to compile:

https://i.imgur.com/weC9ejD.png

November 27, 2022

Posted by Steven Schveighoffer
in reply to FeepingCreature

On 11/27/22 4:29 AM, FeepingCreature wrote:

But also if we're talking about number of instantiations, hasUDA and getUDA lead the pack. I think the way these work is just bad - I've rewritten all my own hasUDA/getUDA code to be of the form udaIndex!(U, __traits(getAttributes, T)) - instantiating a unique copy for every combination of field and UDA is borderline quadratic - but that didn't help much even though -vtemplates hinted that it should.

I avoid hasUDA and getUDA unless it's a simple project. If I'm doing any complex attribute mechanisms, I use an introspection blueprint, i.e. loop over all the attributes once and build a struct that has all the information I need once. There's not a simple abstraction for this, you just have to build it.
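
A rough sketch of what such an introspection blueprint could look like, assuming a small serialization-style use case. FieldInfo, blueprint, Name, Optional and Config are all hypothetical names, not anything from an existing library. Each field's attributes are walked exactly once, and the results are stored in plain data that later code can consult without further template instantiations.

```d
import std.stdio : writeln;

// Hypothetical UDAs, for illustration only.
struct Name { string value; }
struct Optional {}

// Everything we want to know about one field, gathered in a single pass.
struct FieldInfo
{
    string name;   // serialized name, defaults to the field identifier
    bool optional; // true if the field carries @Optional
}

// Build the blueprint: one FieldInfo per field of T.
FieldInfo[] blueprint(T)()
{
    FieldInfo[] result;
    static foreach (i; 0 .. T.tupleof.length)
    {{
        FieldInfo info;
        info.name = __traits(identifier, T.tupleof[i]);
        // Single loop over this field's attributes.
        static foreach (attr; __traits(getAttributes, T.tupleof[i]))
        {
            static if (is(attr == Optional))
                info.optional = true;
            else static if (is(typeof(attr) == Name))
                info.name = attr.value;
        }
        result ~= info;
    }}
    return result;
}

struct Config
{
    @Name("host_name") string host;
    @Optional int port;
    string path;
}

// Evaluated once at compile time, then reused as ordinary data.
enum configFields = blueprint!Config();
static assert(configFields.length == 3);
static assert(configFields[0] == FieldInfo("host_name", false));
static assert(configFields[1] == FieldInfo("port", true));

void main()
{
    foreach (field; configFields)
        writeln(field.name, field.optional ? " (optional)" : "");
}
```

The trade-off is that the struct has to be designed per project, which matches the point that there is no simple ready-made abstraction for this.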

But really this is kind of how you have to deal with D templates. I think we are missing a guide on this, because it's easy to write D code that looks nice and doesn't compile with horrible performance on its own, but adds up to something that is unworkable.

There are bad ways to implement many algorithms. There are also ways to implement algorithms that assist the optimizer, or that help performance by considering the hardware being used. For sure there's a lot less attention paid to what is "bad" in a template and CTFE, and what performs well. The wisdom there is not conventional and is not the same as regular code wisdom. I think we can do better here.

Yes, I think we need more tools to inspect what is taking the time, and we need more guides on how to avoid those. Understanding where the cost goes when instantiating a template is kind of key knowledge if you are going to use a lot of them.

Phobos does not make this easy either. Things like std.format are so insanely complex because you can just reach for a bunch of sub-templates. It's easy to write the code, but it increases compile times significantly.

I still have some hope that there are ways to decrease the template cost that will just improve performance across the board. Maybe that needs a new frontend compiler, I don't know.

-Steve

November 27, 2022

Posted by ryuukk_
in reply to ryuukk_

Sorry, Jai takes not 5.5 sec but actually up to 7.8 sec as you do more compile-time logic:

https://i.imgur.com/SbF2lP1.png

No language is immune to bad code.

However, I agree with the posts above that the tracing PR is important. I remarked on the lack of tracing/benchmarking in the DMD codebase a few months ago; it is important to have.

Integrating Tracy should be useful: https://github.com/wolfpld/tracy

November 28, 2022

Posted by rikki cattermole
in reply to rikki cattermole

Today I got some timings after (somehow?) fixing dmd builds for my code.

1) ldc is ~45s
2) ldc --link-internally is ~30s
3) dmd is ~16s

Note it takes about ~3s to ``dub run dub@~master -- build`` due to needing latest.

So what is interesting about this is MSVC link is taking about ~15s by itself, LLVM is 15s which means that the frontend is actually taking only like 1s at most.

Pretty rough estimates, but all my attempts to speed up my codebase had very little effect as it turns out (including removing hasUDA!).

November 28, 2022

Posted by Basile B.
in reply to rikki cattermole

For better compile times with LDC, people should also always use the undocumented option --disable-verify (for a DUB recipe this would go in the dflags-ldc2 array, for example).

By default ldc2 verifies the IR it produces, but that verification is mostly useful for detecting bugs in the AST-to-IR translation, so it is unlikely to detect any problems for a well-settled project like LDC. Its main drawback is that it is very slow, especially for functions with bad cyclomatic complexity. For example, for my old iz library, 12 KSLOC of D (per D-Scanner's criteria), the gain measured with --disable-verify ranges from 150 to 300 ms, depending on the run.

November 29, 2022

Posted by rikki cattermole
in reply to Basile B.

Okay that is a pretty impressive speed boost.

ldc2: --disable-verify ~35s

Except:

ldc2: --disable-verify --link-internally ~30s

A cost that is still worth paying, given that it's within the margin of error for my case anyway.
