May 20, 2019 Re: [spec] Phases of translation
Posted in reply to Max Haughton | On Monday, 20 May 2019 at 20:22:24 UTC, Max Haughton wrote:
> You're right, that is definitely an issue with the current specification. One of the really annoying issues is that dmd is understandable locally (e.g. I was able to add a little feature without looking much up) but the global structure of the code is quite disjoint, e.g. lots of the analysis code is lumped together and fairly difficult to grok without watching a debugger go through it (unless you already know it).

As a newcomer here, I agree. After the parsing stage, things start to get a little unclear. While I agree that the spec is not supposed to be a tutorial, there isn't any tutorial either. I'm not suggesting making the spec a tutorial, but making some tutorial.

Compilers are of great interest to me, and I think that DMD is great study material if you want to look at a non-trivial compiler. But at the same time, I think it is not attractive for new users. Maybe it's just that I'm not a compiler expert, but I think it's not easy for most people. The source code guide [1] did not help much; it was way too high-level for me to understand any important parts. I think the video referenced above [2] was way more helpful.

In that video, you said that you could talk for a month about the compiler. Well, I would be glad to do that for you with some help. :) After GSoC, I was planning to start diving into the compiler and writing about it. But I may write a lot of incorrect stuff, so if any experienced compiler dev wants to help review it, I would be very happy. And I think it would help other not-compiler-jedis understand it.

[1] https://wiki.dlang.org/DMD_Source_Guide
[2] https://www.youtube.com/watch?v=l_96Crl998E
May 20, 2019 Re: [spec] Phases of translation
Posted in reply to Walter Bright | On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:
> The crucial thing to know is that the tokenizing is independent of parsing, and parsing is independent of semantic analysis.

I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. Would be great if you could do a session on how things work - maybe a video cast?

For example, the mixin declaration has to convert a string to an AST, I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build an AST while already in the semantic stage?

> I.e. the definition of a token does not change depending on what construct is being parsed, and the AST generated by the parser can be created without doing any semantic analysis (unlike C++).
>
> These consequences fall out of the rest of the spec, hence they should be more of a clarification in the introduction. The idea is to head off attempts to add changes to D that introduce dependencies. Such proposals do crop up from time to time, for example, user-defined operator tokens.

I agree - hence I think we need to be explicit about what D requires of each phase. That way any change to the language can be subjected to a test: does it break some of the fundamental requirements for parsing or semantic analysis, etc.?

Thanks and Regards
Dibyendu
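The mixin question above can be illustrated with a toy sketch. This is hypothetical illustration code, not DMD's implementation: Python's `ast.parse` stands in for the D lexer and parser. The point is that the semantic phase may *call* the earlier phases on a generated string, while the lexer and parser themselves remain independent of semantics.

```python
import ast

def semantic_expand_mixin(code_string):
    # Semantic analysis has computed a string (e.g. from a mixin
    # declaration) and now re-enters the ordinary front end on it.
    # The lexer/parser need no semantic information to do this job.
    subtree = ast.parse(code_string, mode="exec")
    # The fresh AST is handed back to be spliced into the program tree.
    return subtree

tree = semantic_expand_mixin("x = 1 + 2")
print(type(tree.body[0]).__name__)  # Assign
```

So the answer to "does it invoke the lexer while already in the semantic stage?" can be yes without breaking phase independence: the dependency runs from later phases to earlier ones, never the other way around.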
May 20, 2019 Re: [spec] Phases of translation
Posted in reply to Dibyendu Majumdar | On Monday, 20 May 2019 at 22:17:07 UTC, Dibyendu Majumdar wrote:
> On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:
>
>> The crucial thing to know is that the tokenizing is independent of parsing, and parsing is independent of semantic analysis.
>>
>
> I am trying to understand this aspect - I found the small write up in the Intro section not very clear. Would be great if you could do a session on how things work - maybe video cast?

For the first part, "tokenizing is independent of parsing": one way to think of it is that to write a lexer (the sub-program that does the tokenizing), you don't need a parser. You could write it as an independent program. That makes sense: to split any token from the input ('int', '+', '123', etc.) you don't ever need to build any AST. Put differently, you don't need to know any info about the structure of the program.

For the second part, a parser can build an AST (or, more correctly, can decide which grammatical rule the input falls into) just from the kinds of tokens (e.g. you have a number token, like 123, then '+', so you must be parsing a binary expression).

I think it was mentioned somewhere that C++ parsing is _dependent_ on semantic analysis. To understand this better, think of semantic analysis as the act of giving meaning to the entities of the program (hence the term "semantic"). For example, if I write 'int a;', then 'a' is a token whose kind we could say is an identifier, but it doesn't have any meaning by itself. Knowing that 'a' is a variable with type 'int' gives us more info. We can now know whether it can be part of an expression such as 'a + 1': if 'a' were a string, that would be invalid, and we would find the error because we knew the meaning of 'a'.

Now, what is an example of parsing being dependent on the meaning of tokens?
Well, C (and C++), of course. Consider the expression '(a)*b'. If 'a' is a variable, then that is really 'a*b'. But if I have written 'typedef int a;' somewhere above, then the expression suddenly means "cast the dereference of 'b' to the type 'a' (which is int in this case)". This is the problem addressed by the lexer hack [1] (strictly speaking, the lexer hack is the solution to the problem, not the problem itself). The important thing to understand here is that we can't parse the expression just by knowing what kind of token 'a' is. We have to know additional info about it, info that is provided by semantic analysis (in the form of the symbol table, a table which contains info about the symbols of the program). Such grammars are known as context-sensitive (i.e. not context-free) because you need some context to deduce the grammatical rule. Note that now there is suddenly no clear distinction between the parsing phase and the semantic phase.

Last but not least, clearly separated phases have other important implications beyond comprehensibility. For example, the compiler is more easily parallelized: while the lexer tokenizes file A, the parser could be parsing file B, while semantic analysis runs on file C.

[1] https://en.wikipedia.org/wiki/The_lexer_hack
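The '(a)*b' ambiguity above can be sketched in a few lines. This is a hypothetical toy, not real C parser code: the token stream is identical in both cases, and only a symbol-table lookup decides which grammar rule applies.

```python
def classify(tokens, typedef_names):
    # tokens is the lexer output for "(a)*b". The lexer alone cannot
    # say whether this is a cast or a multiplication.
    name = tokens[1]
    if name in typedef_names:
        # 'a' names a type, so "(a)*b" is a cast of the dereference *b.
        return "cast"
    # 'a' is an ordinary identifier, so this is just a * b.
    return "multiplication"

tokens = ["(", "a", ")", "*", "b"]             # same tokens either way
print(classify(tokens, typedef_names=set()))   # multiplication
print(classify(tokens, typedef_names={"a"}))   # cast
```

The lexer hack mentioned above feeds this symbol-table knowledge back into the lexer, which is exactly the kind of phase coupling the D spec is trying to rule out.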
May 21, 2019 Re: [spec] Phases of translation
Posted in reply to Dibyendu Majumdar | On Monday, 20 May 2019 at 22:17:07 UTC, Dibyendu Majumdar wrote:
> On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:
>
>> The crucial thing to know is that the tokenizing is independent of parsing, and parsing is independent of semantic analysis.
>>
>
> I am trying to understand this aspect - I found the small write up in the Intro section not very clear. Would be great if you could do a session on how things work - maybe video cast?
>
> For example, the mixin declaration has to convert a string to AST I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build AST while already in the semantic stage?
>
Currently the Intro says:
The process of compiling is divided into multiple phases. Each phase has no dependence on subsequent phases. For example, the scanner is not perturbed by the semantic analyzer. This separation of the passes makes language tools like syntax directed editors relatively easy to produce. It also is possible to compress D source by storing it in ‘tokenized’ form.
I feel this description is unclear, and it might just reflect how DMD is implemented. I haven't implemented a C++ parser, but in the parsers I have worked with - such as for C - it is always the case that the lexer doesn't get impacted by semantic analysis. The standard process is to get a stream of tokens from the lexer and work with that. It is also conceivable that someone could create a "dumb" AST first for C++, and as a subsequent phase add semantic meaning to the AST, just as is done for D.
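The "stream of tokens" workflow described above can be sketched as a standalone lexer. This is a toy for illustration only; the token names and patterns are invented, not D's or C's actual grammar. Note that it consults no parser and no symbol table:

```python
import re

# Each token kind is defined purely by local character patterns.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=;()]"),
    ("SKIP",   r"\s+"),          # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{k}>{p})" for k, p in TOKEN_SPEC))

def tokenize(src):
    # Produce (kind, text) pairs; no AST and no semantic info involved.
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(src)
            if m.lastgroup != "SKIP"]

print(tokenize("a = 123 + b;"))
```

Any number of downstream tools (a parser, a syntax-directed editor, a source compressor storing the tokenized form) can then consume this stream without the lexer ever needing to know about them.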
For now I propose to remove this paragraph until there is a better description available. Please would you review my pull request, as it is blocking me from doing further work.
Thanks and Regards
Dibyendu
May 21, 2019 Re: [spec] Phases of translation
Posted in reply to Dibyendu Majumdar | On Tuesday, 21 May 2019 at 13:24:11 UTC, Dibyendu Majumdar wrote:
> It is also conceivable that someone could create a "dumb" AST first for C++, and as a subsequent phase add semantic meaning to the AST, just as is done for D.
AFAIK, you should be able to parse C++ with a GLR parser.
Ola.
June 01, 2019 Re: [spec] Phases of translation
Posted in reply to Dibyendu Majumdar | On Tuesday, 21 May 2019 at 13:24:11 UTC, Dibyendu Majumdar wrote:
>> On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:
>>
>>> The crucial thing to know is that the tokenizing is independent of parsing, and parsing is independent of semantic analysis.
>>>
>>
>> I am trying to understand this aspect - I found the small write up in the Intro section not very clear. Would be great if you could do a session on how things work - maybe video cast?
>>
>> For example, the mixin declaration has to convert a string to AST I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build AST while already in the semantic stage?
>>
>
> Currently the Intro says:
>
> The process of compiling is divided into multiple phases. Each phase has no dependence on subsequent phases. For example, the scanner is not perturbed by the semantic analyzer. This separation of the passes makes language tools like syntax directed editors relatively easy to produce. It also is possible to compress D source by storing it in ‘tokenized’ form.
>
> I feel this description is unclear, and it might just reflect how DMD is implemented. I haven't implemented a C++ parser but parsers I have worked with - such as for C - it is always the case that lexer doesn't get impacted by the semantic analysis. The standard process is to get a stream of tokens from the lexer and work with that. It is also conceivable that someone could create a "dumb" AST first for C++, and as a subsequent phase add semantic meaning to the AST, just as is done for D.
>
> For now I propose to remove this paragraph until there is a better description available. Please would you review my pull request, as it is blocking me from doing further work.
>
Hi Walter, please would you share any insights regarding the above?
Thanks and Regards
Dibyendu
June 10, 2019 Re: [spec] Phases of translation
Posted in reply to Dibyendu Majumdar | On Saturday, 1 June 2019 at 13:34:07 UTC, Dibyendu Majumdar wrote:
> On Tuesday, 21 May 2019 at 13:24:11 UTC, Dibyendu Majumdar wrote:
>>> On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:
>>>
>>>> The crucial thing to know is that the tokenizing is independent of parsing, and parsing is independent of semantic analysis.
>>>>
>>>
>>> I am trying to understand this aspect - I found the small write up in the Intro section not very clear. Would be great if you could do a session on how things work - maybe video cast?
>>>
>>> For example, the mixin declaration has to convert a string to AST I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build AST while already in the semantic stage?
>>>
>>
>> Currently the Intro says:
>>
>> The process of compiling is divided into multiple phases. Each phase has no dependence on subsequent phases. For example, the scanner is not perturbed by the semantic analyzer. This separation of the passes makes language tools like syntax directed editors relatively easy to produce. It also is possible to compress D source by storing it in ‘tokenized’ form.
>>
>> I feel this description is unclear, and it might just reflect how DMD is implemented. I haven't implemented a C++ parser but parsers I have worked with - such as for C - it is always the case that lexer doesn't get impacted by the semantic analysis. The standard process is to get a stream of tokens from the lexer and work with that. It is also conceivable that someone could create a "dumb" AST first for C++, and as a subsequent phase add semantic meaning to the AST, just as is done for D.
>>
>> For now I propose to remove this paragraph until there is a better description available. Please would you review my pull request, as it is blocking me from doing further work.
>>
>
> Hi Walter, please would you share any insights regarding the above?
>
Hi, I am still waiting to hear your views on the above. The pull request I submitted for revising the intro is stuck because of this.
Regards
Copyright © 1999-2021 by the D Language Foundation