On Sunday, 24 September 2023 at 21:11:34 UTC, deadalnix wrote:
> On Friday, 22 September 2023 at 07:12:02 UTC, Prajwal S N wrote:
> Hi everyone,
For SAOC 2023, I'm working on refactoring dfmt to use the AST from DMD-as-a-library instead of libdparse.
The past week has been very interesting. I got up to speed with the dfmt codebase, and managed to do a 1-to-1 port of the lexer dependency from libdparse to DMD-as-a-library. Most parts were pretty straightforward, and the bulk of the work was replacing every tok!"<token>"
instance with TOK.<token>
and making sure the token coming from DMD was the same as what was previously being used. So far so good!
You can see the draft PR tracking the work here.
Going forward, my mentor and I have decided that it would be impractical to try and replace the parser directly, for multiple reasons:
- It's a lot of work to replace the parser and use the DMD AST instead of libdparse's, and all of this work will happen without a working version of dfmt. If, at the end of this, dfmt is broken or refuses to compile, it could very well mean that all that effort went down the drain.
- Doing a brute force replacement of the parser will prevent us from testing the transformation passes in dfmt individually, and also brings us back to the point above.
Hence, we've decided to do an incremental rewrite of the files that use the parser, initially with no passes (just to ensure the AST is being built in the first place), and then adding each pass along with relevant unit tests.
I don't want to sound alarming or anything, but an AST is not really what you want to work with as a formatter.
The main reason is that you want to carry around a lot of information that the AST generally doesn't care about (comments, informations about layout, etc...). Consider the following example:
int a; // this is an int.
int b;
We immediately recognize that the comment refers to a. However:
int a;
// this is an int.
int b;
Now we recognize that the comment refers to b.
There is a lot of subtle semantic in there that is very hard to convey through an AST and are very hard to work with in that form.
But dmd attaches comments to declarations in the parsing phase. You can get the comments from the AST. Why is that not a good way to represent the data?
> There is a lot of prior art on the matter of code formatting, and the best explanation is probably the one from dartfmt's author: https://journal.stuffwithstuff.com/2015/09/08/the-hardest-program-ive-ever-written/ . clang-format and many others do use this approach.
The article states the following:
"At this point, you’re probably thinking, “Wait. What’s so hard about formatting?” After you’ve parsed, can’t you just walk the AST and pretty-print it with some whitespace?
If every statement fit within the column limit of the page, yup. It’s a piece of cake. (I think that’s what gofmt does.) But our formatter also keeps your code within the line length limit. That means adding line breaks (or “splits” as the formatter calls them), and determining the best place to add those is famously hard."
So the fundamental problem comes from the fact that the author wants to do smart stuff with line/column limits. dfmt does not even have a line limit option, so that problem does not apply here. Even if we were to add it, that could be viewed as a separate project of transitioning dfmt to a better representation.
> Shameless plug: sdfmt uses that approach. You can get it there: https://code.dlang.org/packages/sdc%3Asdfmt .
I understand this is probably out of scope to turn things around at this time, but holly hell, do we really need, as a community, to redo all the mistake other communities have done instead of learning from them, and, to add insult to injury, involve junior devs in that madness?
As things stand dfmt already exists and is implemented by using an AST to output the formatted code. Maybe that is not the ideal approach and could be improved, but that has nothing to do with this project. This project is all about replacing a dependency (libdparse) with another one (dmd-as-a-lib) so that when dmd is upgraded, dfmt ideally would not be affected.
Re-engineering the formatter such that it uses a different technique represents a separate project that is orthogonal to the current effort.