May 20, 2019
On Monday, 20 May 2019 at 20:22:24 UTC, Max Haughton wrote:
>
> You're right, that is definitely an issue with the current specification. One of the really annoying issues is that dmd is understandable locally (e.g. I was able to add a little feature without looking much up), but the global structure of the code is quite disjoint: e.g. lots of the analysis code is lumped together and fairly difficult to grok without watching a debugger go through it (unless you already know it).

As a newcomer here, I agree. After the parsing stage, things start to get
a little bit unclear.

While I agree that the spec is not supposed to be a tutorial, well, there isn't
any tutorial either. I'm not suggesting turning the spec into a tutorial, but that
some tutorial should exist.

Compilers are of great interest to me and I think that DMD is great study material
if you want to look at a non-trivial compiler. But at the same time, I think
it is not very approachable for newcomers. Maybe it's just that I'm not a compiler
expert, but I think it's not easy for most people.

The source code guide [1] did not help much; it was way too high-level for me
to understand any of the important parts. I found the video referenced above [2]
way more helpful. In that video, you said that you could talk for a month
about the compiler. Well, I would be glad to do that for you, with some help. :)

After GSoC, I was planning to start diving into the compiler and writing about
it. But I may write a lot of incorrect stuff, so if any experienced compiler dev
wants to help review those posts, I would be very happy. I think it would also
help other not-compiler-jedis understand it.


[1] https://wiki.dlang.org/DMD_Source_Guide
[2] https://www.youtube.com/watch?v=l_96Crl998E
May 20, 2019
On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:

> The crucial thing to know is that the tokenizing is independent of parsing, and parsing is independent of semantic analysis.
>

I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. It would be great if you could do a session on how things work - maybe a video cast?

For example, the mixin declaration has to convert a string to AST I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build AST while already in the semantic stage?
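
For concreteness, something like this made-up snippet is what I have in mind - the string only turns into declarations when the compiler expands the mixin, so presumably it has to be lexed and parsed after semantic analysis of the module has already started:

// The string below only becomes the declarations of x and y once the
// compiler expands the mixin, i.e. it must be lexed and parsed during
// semantic analysis of this module.
mixin("int x = 40; int y = 2;");

void main()
{
    import std.stdio : writeln;
    writeln(x + y); // prints 42
}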

> I.e. the definition of a token does not change depending on what construct is being parsed, and the AST generated by the parser can be created without doing any semantic analysis (unlike C++).
>
> These consequences fall out of the rest of the spec, hence they should be more of a clarification in the introduction. The idea is to head off attempts to add changes to D that introduce dependencies. Such proposals do crop up from time to time, for example, user-defined operator tokens.

I agree - hence I think we need to be explicit about what D requires of each phase. That way any change to the language can be subjected to a test - does it break any of the fundamental requirements for parsing, semantic analysis, etc.?

Thanks and Regards
Dibyendu


May 20, 2019
On Monday, 20 May 2019 at 22:17:07 UTC, Dibyendu Majumdar wrote:
> On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:
>
>> The crucial thing to know is that the tokenizing is independent of parsing, and parsing is independent of semantic analysis.
>>
>
> I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. It would be great if you could do a session on how things work - maybe a video cast?
>

For the first part, "tokenizing is independent of parsing", one way to think of it
is that to write a lexer (the sub-program that does the tokenizing), you
don't need a parser; you could write it as a standalone program. That makes sense: to split the input into tokens ('int', '+', '123', etc.) you don't ever need to build an AST. Put differently, you don't need to know anything about the structure of the program.
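
For example, here is a tiny standalone tokenizer sketch (toy code of mine, nothing like dmd's real lexer) that splits a source string into tokens without any notion of grammar rules or symbols:

import std.ascii : isAlpha, isDigit, isWhite;
import std.stdio : writeln;

// Toy tokenizer: identifiers, integer literals and single-character
// punctuation. It never consults grammar rules or a symbol table.
string[] tokenize(string src)
{
    string[] tokens;
    size_t i = 0;
    while (i < src.length)
    {
        if (isWhite(src[i])) { ++i; continue; }
        size_t start = i;
        if (isAlpha(src[i]) || src[i] == '_')
        {
            while (i < src.length && (isAlpha(src[i]) || isDigit(src[i]) || src[i] == '_'))
                ++i;
        }
        else if (isDigit(src[i]))
        {
            while (i < src.length && isDigit(src[i]))
                ++i;
        }
        else
        {
            ++i; // single-character token such as '+', '=', ';'
        }
        tokens ~= src[start .. i];
    }
    return tokens;
}

void main()
{
    writeln(tokenize("int a = b + 123;")); // ["int", "a", "=", "b", "+", "123", ";"]
}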

For the second part, a parser can build an AST (or, more correctly, can decide
which grammatical rule it is in) just from the kinds of the tokens (e.g. you see a number token like 123, then '+', so you must be parsing a binary expression).
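
And a toy sketch of that second part (again my own illustration, not dmd's parser): the choice of rule depends only on the kinds of the tokens, never on what the identifiers mean:

enum TokKind { identifier, number, plus, semicolon }

struct Token { TokKind kind; string text; }

// Decide the grammar rule purely from token kinds: a number or an
// identifier followed by '+' must start a binary (add) expression.
bool startsAddExpression(const Token[] toks)
{
    return toks.length >= 2
        && (toks[0].kind == TokKind.number || toks[0].kind == TokKind.identifier)
        && toks[1].kind == TokKind.plus;
}

void main()
{
    import std.stdio : writeln;
    auto toks = [Token(TokKind.number, "123"),
                 Token(TokKind.plus, "+"),
                 Token(TokKind.number, "1")];
    writeln(startsAddExpression(toks)); // true
}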

I think it was mentioned somewhere above that C++'s parsing is _dependent_ on semantic
analysis. To understand this better, think of semantic analysis as the act
of giving meaning to the entities of the program (hence the term "semantic").

For example, if I write: int a;
Well, 'a' is a token whose kind we could say is "identifier", but on its own it
doesn't have any meaning. Knowing that 'a' is a variable of type 'int' gives
us more info: we can now tell whether it can be part of an expression such as: a + 1.
If 'a' were a string, that expression would be invalid, and we could only catch
the error because we knew the meaning of 'a'.
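
As a tiny sketch of that (my own example, not from the compiler): the parser is perfectly happy with both additions below; it is semantic analysis, which knows the types involved, that rejects the second one:

void main()
{
    int a;
    auto ok = a + 1;    // fine once semantic analysis knows 'a' is an int

    // The parser has no problem with "s + 1"; it is semantic analysis
    // that rejects adding an int to a string.
    static assert(!__traits(compiles, { string s; auto bad = s + 1; }));
}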

Now, what is an example of parsing being dependent on the meaning of tokens?
Well, C (and C++) of course.
Consider this expression: (a)*b
If 'a' is a variable, then that is really just the multiplication: a*b
But if I have written:
typedef int a;
somewhere above, then the same expression suddenly means "cast the dereference
of 'b' to the type 'a' (which is int in this case)".

This is known as the lexer hack [1] (actually, the lexer hack is the solution
to the problem).

The important thing to understand here is that we can't parse the expression
just by knowing what kind of token 'a' is. We have to know additional info
about it, info that is provided by semantic analysis (in the form of the
symbol table, a table which contains info about the symbols of the program).
Such grammars are known as context-sensitive (i.e. not context-free) because
you need some context to deduce which grammatical rule applies.
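
Here is a toy sketch of that situation (written in D for illustration, not how a real C front end looks): the "parser" has to consult a symbol table, filled in by earlier analysis, before it can decide what (a)*b means:

import std.stdio : writeln;

bool[string] typedefNames;   // toy symbol table filled by earlier "semantic" work

// Decide what '(a)*b' means. The decision cannot be made from token
// kinds alone: it depends on whether 'a' was declared as a type name.
string classify(string a)
{
    if (a in typedefNames)
        return "cast of *b to the type " ~ a;
    return "multiplication " ~ a ~ " * b";
}

void main()
{
    writeln(classify("a"));       // multiplication a * b
    typedefNames["a"] = true;     // as if we had seen: typedef int a;
    writeln(classify("a"));       // cast of *b to the type a
}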

Note that there is suddenly no clear distinction between the parsing
phase and the semantic phase.

Last but not least, keeping the phases clearly separated has other
important implications beyond comprehensibility. For example, the compiler
can be parallelized more easily: while the lexer tokenizes file A, the parser
can be parsing file B, and semantic analysis can be running on file C.

[1] https://en.wikipedia.org/wiki/The_lexer_hack
May 21, 2019
On Monday, 20 May 2019 at 22:17:07 UTC, Dibyendu Majumdar wrote:
> On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:
>
>> The crucial thing to know is that the tokenizing is independent of parsing, and parsing is independent of semantic analysis.
>>
>
> I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. It would be great if you could do a session on how things work - maybe a video cast?
>
> For example, the mixin declaration has to convert a string to AST I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build AST while already in the semantic stage?
>

Currently the Intro says:

The process of compiling is divided into multiple phases. Each phase has no dependence on subsequent phases. For example, the scanner is not perturbed by the semantic analyzer. This separation of the passes makes language tools like syntax directed editors relatively easy to produce. It also is possible to compress D source by storing it in ‘tokenized’ form.

I feel this description is unclear, and it might just reflect how DMD happens to be implemented. I haven't implemented a C++ parser, but for the parsers I have worked with - such as for C - it is always the case that the lexer is not impacted by semantic analysis. The standard process is to get a stream of tokens from the lexer and work with that. It is also conceivable that someone could create a "dumb" AST first for C++, and as a subsequent phase add semantic meaning to the AST, just as is done for D.

For now I propose to remove this paragraph until a better description is available. Please would you review my pull request, as it is blocking me from doing further work.

Thanks and Regards
Dibyendu


May 21, 2019
On Tuesday, 21 May 2019 at 13:24:11 UTC, Dibyendu Majumdar wrote:
> It is also conceivable that someone could create a "dumb" AST first for C++, and as a subsequent phase add semantic meaning to the AST, just as is done for D.

AFAIK, you should be able to parse C++ with a GLR parser.

Ola.

June 01, 2019
On Tuesday, 21 May 2019 at 13:24:11 UTC, Dibyendu Majumdar wrote:
>> On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:
>>
>>> The crucial thing to know is that the tokenizing is independent of parsing, and parsing is independent of semantic analysis.
>>>
>>
>> I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. It would be great if you could do a session on how things work - maybe a video cast?
>>
>> For example, the mixin declaration has to convert a string to AST I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build AST while already in the semantic stage?
>>
>
> Currently the Intro says:
>
> The process of compiling is divided into multiple phases. Each phase has no dependence on subsequent phases. For example, the scanner is not perturbed by the semantic analyzer. This separation of the passes makes language tools like syntax directed editors relatively easy to produce. It also is possible to compress D source by storing it in ‘tokenized’ form.
>
> I feel this description is unclear, and it might just reflect how DMD happens to be implemented. I haven't implemented a C++ parser, but for the parsers I have worked with - such as for C - it is always the case that the lexer is not impacted by semantic analysis. The standard process is to get a stream of tokens from the lexer and work with that. It is also conceivable that someone could create a "dumb" AST first for C++, and as a subsequent phase add semantic meaning to the AST, just as is done for D.
>
> For now I propose to remove this paragraph until a better description is available. Please would you review my pull request, as it is blocking me from doing further work.
>

Hi Walter, please would you share any insights regarding above?

Thanks and Regards
Dibyendu

June 10, 2019
On Saturday, 1 June 2019 at 13:34:07 UTC, Dibyendu Majumdar wrote:
> On Tuesday, 21 May 2019 at 13:24:11 UTC, Dibyendu Majumdar wrote:
>>> On Monday, 20 May 2019 at 16:13:39 UTC, Walter Bright wrote:
>>>
>>>> The crucial thing to know is that the tokenizing is independent of parsing, and parsing is independent of semantic analysis.
>>>>
>>>
>>> I am trying to understand this aspect - I found the small write-up in the Intro section not very clear. It would be great if you could do a session on how things work - maybe a video cast?
>>>
>>> For example, the mixin declaration has to convert a string to AST I guess? When does this happen? Does it not need to invoke the lexer on the generated string and build AST while already in the semantic stage?
>>>
>>
>> Currently the Intro says:
>>
>> The process of compiling is divided into multiple phases. Each phase has no dependence on subsequent phases. For example, the scanner is not perturbed by the semantic analyzer. This separation of the passes makes language tools like syntax directed editors relatively easy to produce. It also is possible to compress D source by storing it in ‘tokenized’ form.
>>
>> I feel this description is unclear, and it might just reflect how DMD happens to be implemented. I haven't implemented a C++ parser, but for the parsers I have worked with - such as for C - it is always the case that the lexer is not impacted by semantic analysis. The standard process is to get a stream of tokens from the lexer and work with that. It is also conceivable that someone could create a "dumb" AST first for C++, and as a subsequent phase add semantic meaning to the AST, just as is done for D.
>>
>> For now I propose to remove this paragraph until a better description is available. Please would you review my pull request, as it is blocking me from doing further work.
>>
>
> Hi Walter, please would you share any insights regarding above?
>

Hi, I am still waiting to hear your views on the above. The pull request I submitted for revising the intro is stuck because of this.

Regards
