April 01, 2013 Official D Grammar | ||||
---|---|---|---|---|
| ||||
I've pretty much finished up my work on the std.d.lexer module. I am waiting for the review queue to make some progress on the other (three?) modules being reviewed before starting a thread on it.
In the meantime I've started some work on an AST module for Phobos that contains the data types necessary to build up a parser module, so that we can have a standard set of code to build D dev tools off of. I decided to work directly from the standard on dlang.org for this to make sure that my module is correct and that the standard is actually correct.
I've seen several threads on this newsgroup complaining about the state of the standard and unfortunately this will be another one.
1) Grammar defined in terms of things that aren't tokens. Take, for example, PropertyDeclaration. It's defined as an "@" token followed by... what? "safe"? It's not a real token. It's an identifier. You can't parse this based on checking the token type. You have to check the type and the value.
2) Grammar references rules that don't exist. UserDefinedAttribute is defined in terms of CallExpression, but CallExpression doesn't exist elsewhere in the grammar. BaseInterfaceList is defined in terms of InterfaceClasses, but that rule is never defined.
3) Unnecessary rules. KeyExpression, ValueExpression, ScopeBlockStatement, DeclarationStatement, ThenStatement, ElseStatement, Test, Increment, Aggregate, LwrExpression, UprExpression, FirstExp, LastExp, StructAllocator, StructDeallocator, EnumTag, EnumBaseType, EmptyEnumBody, ConstraintExpression, MixinIdentifier, etc... are all defined in terms of only one other rule.
I think that we need to be able to create a grammar description that:
* Fits into a single file, so that a tool implementer does not need to collect bits of the grammar from the various pages on dlang.org.
* Can be verified to be correct by an existing tool such as Bison, Goldie, JavaCC, <your favorite here> with a small number of changes.
* Is part of the dmd/dlang repositories on github and gets updated every time the language changes.
I'm willing to work on this if there's a good chance it will actually be implemented. Thoughts?
|
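Points 2) and 3) above are the kind of thing a small script can flag once the grammar lives in one machine-readable file. The following is only a sketch in Python under stated assumptions: the dictionary format, the `TOY_GRAMMAR` contents and the helper names are all hypothetical, and the productions are simplified stand-ins rather than the real dlang.org rules.

```python
from typing import Dict, List, Set

# Toy notation (an assumption of this sketch, not the dlang.org format):
# rule name -> list of alternatives, each alternative a list of symbols.
# Quoted symbols are literal tokens, capitalised bare names are rules,
# lower-case bare names are token classes produced by the lexer.
Grammar = Dict[str, List[List[str]]]

# Hypothetical, heavily simplified productions that only mirror the
# shape of the problems described in the post.
TOY_GRAMMAR: Grammar = {
    "UserDefinedAttribute": [['"@"', "CallExpression"], ['"@"', "identifier"]],
    "BaseInterfaceList": [['":"', "InterfaceClasses"]],
    "KeyExpression": [["AssignExpression"]],
    "AssignExpression": [["identifier"],
                         ["identifier", '"="', "AssignExpression"]],
}


def is_rule_name(symbol: str) -> bool:
    """Bare capitalised names are rule references in this toy notation."""
    return symbol[:1].isupper()


def undefined_rules(grammar: Grammar) -> Set[str]:
    """Rules that are referenced somewhere but never defined."""
    referenced = {
        symbol
        for alternatives in grammar.values()
        for alternative in alternatives
        for symbol in alternative
        if is_rule_name(symbol)
    }
    return referenced - set(grammar)


def alias_rules(grammar: Grammar) -> Dict[str, str]:
    """Rules whose whole definition is a single reference to another rule."""
    return {
        name: alternatives[0][0]
        for name, alternatives in grammar.items()
        if len(alternatives) == 1
        and len(alternatives[0]) == 1
        and is_rule_name(alternatives[0][0])
    }


print(sorted(undefined_rules(TOY_GRAMMAR)))  # ['CallExpression', 'InterfaceClasses']
print(alias_rules(TOY_GRAMMAR))              # {'KeyExpression': 'AssignExpression'}
```

A check like this says nothing about whether the grammar matches DMD; it only catches dangling references and one-to-one aliases mechanically.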
April 02, 2013 Re: Official D Grammar | ||||
---|---|---|---|---|
| ||||
Posted in reply to Brian Schott | On 4/1/2013 4:18 PM, Brian Schott wrote:
> I've pretty much finished up my work on the std.d.lexer module. I am waiting for
> the review queue to make some progress on the other (three?) modules being
> reviewed before starting a thread on it.
>
> In the meantime I've started some work on an AST module for Phobos that contains
> the data types necessary to build up a parser module so that we can have a
> standard set of code to build D dev tools off of. I decided to work directly from
> the standard on dlang.org for this to make sure that my module is correct and
> that the standard is actually correct.
>
> I've seen several threads on this newsgroup complaining about the state of the
> standard and unfortunately this will be another one.
>
> 1) Grammar defined in terms of things that aren't tokens. Take, for example,
> PropertyDeclaration. It's defined as an "@" token followed by... what? "safe"?
> It's not a real token. It's an identifier. You can't parse this based on
> checking the token type. You have to check the type and the value.
True, do you have a suggestion?
>
> 2) Grammar references rules that don't exist. UserDefinedAttribute is defined in
> terms of CallExpression, but CallExpression doesn't exist elsewhere in the
> grammar. BaseInterfaceList is defined in terms of InterfaceClasses, but that
> rule is never defined.
Yes, this needs to be fixed.
>
> 3) Unnecessary rules. KeyExpression, ValueExpression, ScopeBlockStatement,
> DeclarationStatement, ThenStatement, ElseStatement, Test, Increment, Aggregate,
> LwrExpression, UprExpression, FirstExp, LastExp, StructAllocator,
> StructDeallocator, EnumTag, EnumBaseType, EmptyEnumBody, ConstraintExpression,
> MixinIdentifier, etc... are all defined in terms of only one other rule.
Using these makes documentation easier, and I don't think it harms anything.
> I think that we need to be able to create a grammar description that:
> * Fits into a single file, so that a tool implementer does not need to collect
> bits of the grammar from the various pages on dlang.org.
> * Can be verified to be correct by an existing tool such as Bison, Goldie,
> JavaCC, <your favorite here> with a small number of changes.
> * Is part of the dmd/dlang repositories on github and gets updated every time
> the language changes.
>
> I'm willing to work on this if there's a good chance it will actually be
> implemented. Thoughts?
I suggest doing this as a sequence of pull requests, not doing just one big one.
|
April 02, 2013 Re: Official D Grammar | ||||
---|---|---|---|---|
| ||||
Posted in reply to Brian Schott | On 02/04/2013 00:18, Brian Schott wrote:
<snip>
> I think that we need to be able to create a grammar description that:
> * Fits into a single file, so that a tool implementer does not need to
> collect bits of the grammar from the various pages on dlang.org.
> * Can be verified to be correct by an existing tool such as Bison,
> Goldie, JavaCC, <your favorite here> with a small number of changes.
> * Is part of the dmd/dlang repositories on github and gets updated every
> time the language changes.
<snip>
Indeed, the published grammar needs to be thoroughly checked against what DMD is actually doing, and any discrepancies fixed (or filed in Bugzilla to be fixed in due course). And then they need to be kept in sync.
Has the idea of using a parser generator to build D's parsing code been rejected in the past, or is hand-coding just the way Walter decided to do it? Is the code any more efficient than what a typical parser generator would generate?
And all disambiguation rules (such as "if it's parseable as a DeclarationStatement, it's a DeclarationStatement") need to be made explicit as part of the grammar. I suppose this is where using Bison or similar would help, as it would point out any ambiguities in the grammar that need rules to resolve them.
Stewart.
|
April 02, 2013 Re: Official D Grammar | ||||
---|---|---|---|---|
| ||||
Posted in reply to Stewart Gordon | On 2013-04-02 15:21, Stewart Gordon wrote:
> Indeed, the published grammar needs to be thoroughly checked against
> what DMD is actually doing, and any discrepancies fixed (or filed in
> Bugzilla to be fixed in due course). And then they need to be kept in
> sync.
>
> Has the idea of using a parser generator to build D's parsing code been
> rejected in the past, or is hand-coding just the way Walter decided to
> do it? Is the code any more efficient than what a typical parser
> generator would generate?
>
> And all disambiguation rules (such as "if it's parseable as a
> DeclarationStatement, it's a DeclarationStatement") need to be made
> explicit as part of the grammar. I suppose this is where using Bison or
> similar would help, as it would point out any ambiguities in the grammar
> that need rules to resolve them.
I'm wondering if it's possible to mechanically check that what's in the grammar is how DMD behaves.
--
/Jacob Carlborg
|
April 02, 2013 Re: Official D Grammar | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jacob Carlborg |
> I'm wondering if it's possible to mechanically check that what's in the grammar is how DMD behaves.
Take the grammar, (randomly) generate strings with it, and check whether DMD complains. You'd need a "parse only, don't check semantics" flag, though.
This will not check if the strings are parsed correctly by DMD nor if invalid strings are rejected. But it would be a start.
|
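The generate-and-check idea above can be sketched in a few lines. This is a hedged illustration in Python, not a tool for D: `TOY_GRAMMAR` is a hypothetical arithmetic grammar, and Python's own parser (`ast.parse`) stands in for DMD only because the toy grammar happens to produce Python-compatible expressions. The depth cut-off that picks the shortest alternative is a simplification that terminates for this grammar but is not guaranteed to for arbitrary ones.

```python
import ast
import random
from typing import Dict, List

Grammar = Dict[str, List[List[str]]]

# Hypothetical toy grammar (NOT the D grammar): any symbol that is not a
# key of the dictionary is treated as a terminal token.
TOY_GRAMMAR: Grammar = {
    "Expr": [["Term"], ["Term", "+", "Expr"]],
    "Term": [["Factor"], ["Factor", "*", "Term"]],
    "Factor": [["x"], ["1"], ["(", "Expr", ")"]],
}


def generate(grammar: Grammar, symbol: str, rng: random.Random,
             depth: int = 0, max_depth: int = 8) -> List[str]:
    """Expand `symbol` into a random list of terminal tokens."""
    if symbol not in grammar:
        return [symbol]  # terminal: emit as-is
    alternatives = grammar[symbol]
    if depth >= max_depth:
        # Past the depth limit, force the shortest alternative so the
        # expansion terminates (sufficient for this toy grammar only).
        alternatives = [min(alternatives, key=len)]
    tokens: List[str] = []
    for part in rng.choice(alternatives):
        tokens.extend(generate(grammar, part, rng, depth + 1, max_depth))
    return tokens


def accepted_by_reference_parser(text: str) -> bool:
    """Stand-in for 'does the compiler's parser accept this?'."""
    try:
        ast.parse(text, mode="eval")  # parses only, evaluates nothing
        return True
    except SyntaxError:
        return False


rng = random.Random(2013)
for _ in range(5):
    sentence = " ".join(generate(TOY_GRAMMAR, "Expr", rng))
    print(accepted_by_reference_parser(sentence), sentence)
```

As the post says, this only checks one direction: every generated string should be accepted. It does not show that accepted strings are parsed into the right tree, nor that invalid strings are rejected.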
April 06, 2013 Re: Official D Grammar | ||||
---|---|---|---|---|
| ||||
Posted in reply to Brian Schott | On 02/04/2013 00:18, Brian Schott wrote:
> I've pretty much finished up my work on the std.d.lexer module. I am
> waiting for the review queue to make some progress on the other (three?)
> modules being reviewed before starting a thread on it.
>
> In the meantime I've started some work on an AST module for Phobos that
> contains the data types necessary to build up a parser module so that we
> can have a standard set of code to build D dev tools off of. I decided to
> work directly from the standard on dlang.org for this to make sure that
> my module is correct and that the standard is actually correct.
>
> I've seen several threads on this newsgroup complaining about the state
> of the standard and unfortunately this will be another one.
>
> 1) Grammar defined in terms of things that aren't tokens. Take, for
> example, PropertyDeclaration. It's defined as an "@" token followed
> by... what? "safe"? It's not a real token. It's an identifier. You can't
> parse this based on checking the token type. You have to check the type
> and the value.
>
> 2) Grammar references rules that don't exist. UserDefinedAttribute is
> defined in terms of CallExpression, but CallExpression doesn't exist
> elsewhere in the grammar. BaseInterfaceList is defined in terms of
> InterfaceClasses, but that rule is never defined.
>
> 3) Unnecessary rules. KeyExpression, ValueExpression,
> ScopeBlockStatement, DeclarationStatement, ThenStatement, ElseStatement,
> Test, Increment, Aggregate, LwrExpression, UprExpression, FirstExp,
> LastExp, StructAllocator, StructDeallocator, EnumTag, EnumBaseType,
> EmptyEnumBody, ConstraintExpression, MixinIdentifier, etc... are all
> defined in terms of only one other rule.
>
> I think that we need to be able to create a grammar description that:
> * Fits into a single file, so that a tool implementer does not need to
> collect bits of the grammar from the various pages on dlang.org.
> * Can be verified to be correct by an existing tool such as Bison,
> Goldie, JavaCC, <your favorite here> with a small number of changes.
> * Is part of the dmd/dlang repositories on github and gets updated every
> time the language changes.
>
> I'm willing to work on this if there's a good chance it will actually be
> implemented. Thoughts?
Interesting thread. I've been working on a hand-written D parser (in Java, for the DDT IDE) and I too have found a slew of grammar spec issues, some of them more serious than the ones you mentioned above. In some cases it's actually not clear, or downright wrong, what the grammar spec says. For example, here's one off of my notes:
void func(int foo() { } );
The spec says that is parsable (basically a function declaration in the parameter list), which makes no sense, and DMD doesn't accept it.
Some cases are a bit trickier, since it's not clear if the syntax should be accepted or not (sometimes they might make sense but not be allowed).
These issues make things a bit harder for the development of tools that require D language parsers. But the whole grammar spec is so messy, I've been unsure whether it's worth filing bug reports or not (would they be addressed?). There is also the problem that even if those issues are fixed now, the spec could very easily fall out of date in the future, unless we have some system to test the spec.
Like you mentioned, ideally we would have a grammar spec for a grammar/PG tool so that correctness could more easily be verified. (It doesn't guarantee no spec bugs, but it makes it much harder for them to be there.)
--
Bruno Medeiros - Software Engineer
|
April 06, 2013 Re: Official D Grammar | ||||
---|---|---|---|---|
| ||||
Posted in reply to Brian Schott | On 02/04/2013 00:18, Brian Schott wrote:
> I've pretty much finished up my work on the std.d.lexer module. I am
> waiting for the review queue to make some progress on the other (three?)
> modules being reviewed before starting a thread on it.
>
BTW, even in the lexer spec I've found an issue. How does this parse:
5.blah
According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the latter is the correct behavior, so you can write stuff like 123.init, but that should be clarified.
--
Bruno Medeiros - Software Engineer
|
April 06, 2013 Re: Official D Grammar | ||||
---|---|---|---|---|
| ||||
Posted in reply to Bruno Medeiros | On Saturday, April 06, 2013 16:21:12 Bruno Medeiros wrote:
> On 02/04/2013 00:18, Brian Schott wrote:
> > I've pretty much finished up my work on the std.d.lexer module. I am waiting for the review queue to make some progress on the other (three?) modules being reviewed before starting a thread on it.
>
> BTW, even in the lexer spec I've found an issue. How does this parse:
> 5.blah
> According to the spec (maximal munch technique), it should be FLOAT then
> IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming
> the latter is the correct behavior, so you can write stuff like
> 123.init, but that should be clarified.
It would definitely have to be INTEGER DOT IDENTIFIER due to UFCS, so it sounds like the spec wasn't updated like it should have been.
- Jonathan M Davis
|
April 06, 2013 Re: Official D Grammar | ||||
---|---|---|---|---|
| ||||
Posted in reply to Bruno Medeiros | On 04/06/13 17:21, Bruno Medeiros wrote:
> On 02/04/2013 00:18, Brian Schott wrote:
>> I've pretty much finished up my work on the std.d.lexer module. I am waiting for the review queue to make some progress on the other (three?) modules being reviewed before starting a thread on it.
>>
>
> BTW, even in the lexer spec I've found an issue. How does this parse:
> 5.blah
> According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the latter is the correct behavior, so you can write stuff like 123.init, but that should be clarified.
"1..2", "1.ident" and a float literal with '_' after the '.' are the DecimalFloat cases that I immediately ran into when doing a lexer based on the dlang grammar. It's obvious to a human how these should be handled, but code generators aren't that smart... But they are good at catching mistakes like these.
Actually, that last case is even more "interesting"; http://dlang.org/lex.html has "1_2_3_4_5_6_._5_6_7_8" as a valid example, which of course it's not ("_5_6_7_8" is a valid identifier), but there is no reason to disallow "1_2_3_4_5_6_.5_6_7_8".
> that should be clarified.
These are just grammar bugs that could easily be fixed.
Then there are some things that can be less obvious, but shouldn't really be controversial, like allowing empty HexString literals.
Then there's the enhancement category. Looking through my comments, I think the only deliberate change from dlang.org that I have is in DelimitedString -- there is no reason to forbid q"/abc/def/"; there are no back-compat issues, as it couldn't have existed in legacy D code.
artur
|
April 06, 2013 Re: Official D Grammar | ||||
---|---|---|---|---|
| ||||
Posted in reply to Artur Skawina | On 06/04/2013 20:52, Artur Skawina wrote:
> On 04/06/13 17:21, Bruno Medeiros wrote:
>> On 02/04/2013 00:18, Brian Schott wrote:
>>> I've pretty much finished up my work on the std.d.lexer module. I am
>>> waiting for the review queue to make some progress on the other (three?)
>>> modules being reviewed before starting a thread on it.
>>>
>>
>> BTW, even in the lexer spec I've found an issue. How does this parse:
>> 5.blah
>> According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the latter is the correct behavior, so you can write stuff like 123.init, but that should be clarified.
>
> "1..2", "1.ident" and a float literal with '_' after the '.' are the
> DecimalFloat cases that I immediately ran into when doing a lexer based on
> the dlang grammar. It's obvious to a human how these should be handled, but
> code generators aren't that smart... But they are good at catching mistakes
> like these.
The "1..2" is actually mentioned in the spec:
"An exception to this rule is that a .. embedded inside what looks like two floating point literals, as in 1..2, is interpreted as if the .. was separated by a space from the first integer."
so it's there, even if it can be missed. But unless I missed it, the spec is incorrect for the "1.ident" or "1_2_3_4_5_6_._5_6_7_8" cases as there is no exception mentioned there... and it's not always 100% obvious to a human how these should be handled. Or maybe that's just me :)
--
Bruno Medeiros - Software Engineer
|
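The cases discussed in this sub-thread ("5.blah", "1..2" and the underscore examples) all come down to one lookahead decision: a '.' after an integer belongs to the literal only if a digit follows it. The Python sketch below illustrates that single rule as the thread describes the intended behaviour; it is not DMD's lexer. The token names, the regular expressions and the `lex` function are assumptions of the sketch, and exponents, suffixes, hex literals and other float forms are deliberately left out.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text)

# Simplified patterns (assumptions of this sketch, not the D spec).
_INTEGER = re.compile(r"[0-9][0-9_]*")
_FRACTION = re.compile(r"\.[0-9][0-9_]*")  # '.' must be followed by a digit
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def lex(source: str) -> List[Token]:
    """Tokenise integers, simple floats, identifiers, '.' and '..'."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        number = _INTEGER.match(source, pos)
        if number:
            # Lookahead: only absorb the '.' when a digit follows it, so
            # "1..2", "5.blah" and "1_._5" keep the integer separate.
            fraction = _FRACTION.match(source, number.end())
            if fraction:
                tokens.append(("FLOAT", source[pos:fraction.end()]))
                pos = fraction.end()
            else:
                tokens.append(("INTEGER", number.group()))
                pos = number.end()
            continue
        identifier = _IDENTIFIER.match(source, pos)
        if identifier:
            tokens.append(("IDENTIFIER", identifier.group()))
            pos = identifier.end()
        elif source.startswith("..", pos):
            tokens.append(("DOTDOT", ".."))
            pos += 2
        elif source[pos] == ".":
            tokens.append(("DOT", "."))
            pos += 1
        else:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
    return tokens


print(lex("5.blah"))  # [('INTEGER', '5'), ('DOT', '.'), ('IDENTIFIER', 'blah')]
print(lex("1..2"))    # [('INTEGER', '1'), ('DOTDOT', '..'), ('INTEGER', '2')]
print(lex("1_2_3_4_5_6_.5_6_7_8"))  # [('FLOAT', '1_2_3_4_5_6_.5_6_7_8')]
```

Under this rule "1_2_3_4_5_6_._5_6_7_8" comes out as INTEGER, DOT, IDENTIFIER, which matches the reading given earlier in the thread that "_5_6_7_8" is an identifier.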