Mihaela Chirea - SAOC 2020 Milestone 2 Update 4 - Improving DMD as a Library

Nov 19, 2020

Mihaela Chirea

Nov 19, 2020

Mihaela Chirea

Nov 20, 2020

Jacob Carlborg

Nov 26, 2020

Mihaela Chirea

Nov 27, 2020

Jacob Carlborg

Hello! Here is the status of this project at the end of Milestone 2 of #SAOC202. Since the last update, I worked on adding location to the StorageClassDeclaration[1] and UserAttributeDeclaration[2] nodes (the only ones remaining if I didn't miss anything during the last check). Doing this I noticed a few of issues with both these nodes and a couple others and I need some advice on how to proceed in this regard: - The location should start at the first byte of the AST node to preserve the context. Some nodes already had locations before I started working on this project, but I noticed that not all of them follow this rule. VarDeclaration and FuncDeclaration, for example, start at the name of the variable/function, skipping the type and the storage classes. As these locations are used for errors, I was thinking of adding another variable to these classes instead of replacing the value of this one. - In the case of UserAttributeDeclaration, consecutive expressions are all put into the same node. While the expressions already have their location set, none contain the `@` symbol. In a situation like `@uda1 @uda2 @uda3` moving this location just a bit to the left wouldn't be a problem. However, `@(uda1, uda2, uda3)` is represented in the exact same way by the parser: one UserAttributeDeclaration node with 3 expressions. So who gets the `@` in this case? My idea would be to set the start location of UserAttributeDeclaration to the first `@` symbol in the group, and, later, set the end location to wherever the last symbol is. The problem with this approach is that all attributes, both storage classes and user defined, can be applied on the same variable/function, so the UserAttributeDeclaration range can contain more than just user defined attributes. - A StorageClassDeclaration node is not always created when meeting a storage class. Most nodes have a StorageClass field where this information is stored in the form of a ulong variable, and therefore there is no node to attach its location to. For example, `const a = 1;` generates a VarDeclaration node with the location at `a` and the storage class with the const bit set. Putting consecutive storage classes into the same node happens here as well. To deal with these situations, I need to know a bit more about the desired usecases. What information should we be able to get from these nodes? For now, I will put these issues on hold. For the next milestone, I will focus on: - Adding the possibility of analyzing source code that is only in memory - Stopping the generation of TypeInfo nodes when not needed - Adding end location where possible I will get back to the start locations as soon as I find a suitable solution. The tasks planned for milestone 2 can be found here: [3] [1] https://github.com/AsterMiha/dmd/tree/StorageClassDeclarationLoc [2] https://github.com/AsterMiha/dmd/tree/uadLoc [3] https://forum.dlang.org/post/kfcxcjauafpcrmfuajam@forum.dlang.org

November 20, 2020

Re: Mihaela Chirea - SAOC 2020 Milestone 2 Update 4 - Improving DMD as a Library

Posted by Jacob Carlborg
in reply to Mihaela Chirea

Permalink

Jacob Carlborg

Posted in reply to Mihaela Chirea

Permalink

On Thursday, 19 November 2020 at 22:07:23 UTC, Mihaela Chirea wrote:

> - The location should start at the first byte of the AST node to preserve the context. Some nodes already had locations before I started working on this project, but I noticed that not all of them follow this rule. VarDeclaration and FuncDeclaration, for example, start at the name of the variable/function, skipping the type and the storage classes. As these locations are used for errors, I was thinking of adding another variable to these classes instead of replacing the value of this one.

Yes, I think the simplest would be to add another field. Ideally the compiler would only store only one field, the actual start of the declaration, then it would extract the necessary location of the identifiers if it needs to report an error.

> - In the case of UserAttributeDeclaration, consecutive expressions are all put into the same node.

Does that apply to these forms as well:

@uda1
{
    int a = 3;
}

@uda2:

int b = 4;

> While the expressions already have their location set, none contain the `@` symbol. In a situation like `@uda1 @uda2 @uda3` moving this location just a bit to the left wouldn't be a problem. However, `@(uda1, uda2, uda3)` is represented in the exact same way by the parser: one UserAttributeDeclaration node with 3 expressions. So who gets the `@` in this case? My idea would be to set the start location of UserAttributeDeclaration to the first `@` symbol in the group, and, later, set the end location to wherever the last symbol is. The problem with this approach is that all attributes, both storage classes and user defined, can be applied on the same variable/function, so the UserAttributeDeclaration range can contain more than just user defined attributes.

First I have to say, I do not fully understand how the AST works for UDAs. When I tried to solve this, I noticed that `UserAttributeDeclaration` contains an array of expression. The natural thing would be to store each UDA in an element of that array. But that's not how it works, IIRC. Instead the UDAs are stored in a tuple expression (which itself contains an array of expressions) as the first element. Then, if enough UDAs are declared on a declaration it will eventually start using the second element of the array in `UserAttributeDeclaration`. I never understood the logic around this.

Yes, that's problematic. I think the what needs to happen is to add another AST node. I'm thinking one node type that covers all UDA's for a single declaration, this is what `UserAttributeDeclaration` is now, as far as I understand. Then each individual UDA needs to have its own node type as well, let's call it `UserAttributeDeclarationItem`. The outer node would not need a location, but the inner nodes would.

For your above examples:

`@uda1 @uda2 @uda3 void foo();`

Would be represented as one `UserAttributeDeclaration`, containing three `UserAttributeDeclarationItem`. Each `UserAttributeDeclarationItem` containing one `IdentifierExp`.

`@(uda1, uda2, uda3) nothrow @(uda4, uda5, uda6) void foo();`

Would be represented as one `UserAttributeDeclaration`, containing two `UserAttributeDeclarationItem`. Each `UserAttributeDeclarationItem` containing three `IdentifierExp`.

> - A StorageClassDeclaration node is not always created when meeting a storage class. Most nodes have a StorageClass field where this information is stored in the form of a ulong variable, and therefore there is no node to attach its location to. For example, `const a = 1;` generates a VarDeclaration node with the location at `a` and the storage class with the const bit set. Putting consecutive storage classes into the same node happens here as well.

Hmm, I see. In my opinion, either the parser needs to be modified to output a `StorageClassDeclaration` for that variable declaration, or the location needs to move for the variable declaration to start from `c`.

> To deal with these situations, I need to know a bit more about the desired usecases. What information should we be able to get from these nodes?

I can't say in detail of what exactly which information should be available. I'm thinking more on a high level on which tools need to be possible to build. For example, I think it's reasonable of a refactoring tool to be able to automatically change all local variables to `const`:

auto a = 3;

Would be turned into:

const a = 3;

If `a` is never modified.

As for the UDAs. A refactoring tool that can rename symbols (these do exist in several languages today):

enum uda1;

@uda1 int a = 3;
@(uda1) void foo();

Say that for the above code, you want to rename the enum. Then the tool needs to rename all usages of that symbol. In this case, the UDAs.

> For now, I will put these issues on hold.

Understandable.

--
/Jacob Carlborg

Thanks for the advice and examples! I'll get back to these issues as soon as I finish the first tasks for milestone 3. On Friday, 20 November 2020 at 14:24:14 UTC, Jacob Carlborg wrote: > On Thursday, 19 November 2020 at 22:07:23 UTC, Mihaela Chirea wrote: >> - In the case of UserAttributeDeclaration, consecutive expressions are all put into the same node. > > Does that apply to these forms as well: > > @uda1 > { > int a = 3; > } > > @uda2: > > int b = 4; No. If there is a ':' symbol or a '{' in the way, each will get their own node. In your example there is 1 node for @uda1 and another one for @uda2. @uda1 @uda2 { int a = 3; } Here @uda1 and @uda2 share the same node. @uda1 { @uda2 { int a = 3; } } But not here. This case generates one node for @uda1 and one for @uda2. @uda1: @uda2: int a = 3; Same here, two nodes. > For your above examples: > > `@uda1 @uda2 @uda3 void foo();` > > Would be represented as one `UserAttributeDeclaration`, containing three `UserAttributeDeclarationItem`. Each `UserAttributeDeclarationItem` containing one `IdentifierExp`. > > `@(uda1, uda2, uda3) nothrow @(uda4, uda5, uda6) void foo();` > > Would be represented as one `UserAttributeDeclaration`, containing two `UserAttributeDeclarationItem`. Each `UserAttributeDeclarationItem` containing three `IdentifierExp`. That's a great solution! However since it will introduce some overhead I think I'll only make this available under version DMDLIB. >> - A StorageClassDeclaration node is not always created when meeting a storage class. Most nodes have a StorageClass field where this information is stored in the form of a ulong variable, and therefore there is no node to attach its location to. For example, `const a = 1;` generates a VarDeclaration node with the location at `a` and the storage class with the const bit set. Putting consecutive storage classes into the same node happens here as well. > > Hmm, I see. In my opinion, either the parser needs to be modified to output a `StorageClassDeclaration` for that variable declaration, or the location needs to move for the variable declaration to start from `c`. The main problem here is that there are quite a lot of node types using stc this way so both of these options would require significant changes. I'll check the locations of the rest of the nodes as well to see if any of them include the storage classes.

On 2020-11-26 20:04, Mihaela Chirea wrote: > The main problem here is that there are quite a lot of node types using stc this way so both of these options would require significant changes. I'll check the locations of the rest of the nodes as well to see if any of them include the storage classes. I did some tests with the parser. For code like this: `const a = 3;`, the parser creates a `VarDeclaration` node, with the storage class set in `storage_class`. If the code looks like this: `const int a = 3;`, it creates the exact same node, but it also wraps that in a `StorageClassDeclaration` node. If the parse is already doing that, the semantic analyzer need to be able to handle both cases. So you should be able to modify the first case so that it wraps the `VarDeclaration` inside a `StorageClassDeclaration`. Pass the same storage class to `StorageClassDeclaration` as the existing code passes to `VarDeclaration`. It only requires to add couple of lines here [1]. I've tested this. It works to parse and run the semantic analyzer on both of the above code snippets with the change to the parser applied. But I haven't run the full test suite. [1] https://github.com/dlang/dmd/blob/860d0b2d2f24f3d0ef5cc12e4f43f12aa0e406c8/src/dmd/parse.d#L1270 -- /Jacob Carlborg

Forums