April 07, 2013
On 04/07/13 00:35, Bruno Medeiros wrote:
> On 06/04/2013 20:52, Artur Skawina wrote:
>> On 04/06/13 17:21, Bruno Medeiros wrote:
>>> On 02/04/2013 00:18, Brian Schott wrote:
>>>> I've pretty much finished up my work on the std.d.lexer module. I am waiting for the review queue to make some progress on the other (three?) modules being reviewed before starting a thread on it.
>>>>
>>>
>>> BTW, even in the lexer spec I've found an issue. How does this parse:
>>>    5.blah
>>> According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the latter is the correct behavior, so you can write stuff like 123.init, but that should be clarified.
>>
>> "1..2", "1.ident" and a float literal with '_' after the '.' are the DecimalFloat cases that I immediately ran into when doing a lexer based on the dlang grammar. It's obvious to a human how these should be handled, but code generators aren't that smart... But they are good at catching mistakes like these.
> 
> The "1..2" is actually mentioned in the spec:
> "An exception to this rule is that a .. embedded inside what looks like two floating point literals, as in 1..2, is interpreted as if the .. was separated by a space from the first integer."
> so it's there, even if it can be missed.

I know, but documenting a (grammar) bug does not make it go away.

> But unless I missed it, the spec is incorrect for the "1.ident" or "1_2_3_4_5_6_._5_6_7_8" cases as there is no exception mentioned there... and it's not always 100% obvious to a human how these should be handled. Or maybe that's just me :)

What does the "spec" currently say about ".001"?..

It's been a while since I did a D lexer based on the dlang grammar - it (the lexer) was supposed to be DMD-compatible. I took a closer look at the actual dlang.org rules today while writing this message...


I'll try to find some time to clean up a working D lexical grammar, convert it to PEG (what I have should be 1:1 translatable, except for one rule, DelimitedString), and put it on the wiki. Maybe it will help someone avoid these issues.

artur
April 08, 2013
On Tuesday, 2 April 2013 at 19:00:21 UTC, Tobias Pankrath wrote:
>
>> I'm wondering if it's possible to mechanically check that what's in the grammar is how DMD behaves.
>
> Take the grammar, (randomly) generate strings from it, and check whether DMD complains. You'd need a parse-only (don't check semantics) flag, though.
>
> This will not check whether the strings are parsed correctly by DMD, nor whether invalid strings are rejected. But it would be a start.

An alternative idea for ensuring that documentation and implementation are in sync might be to list the full grammar definition as a data structure that can be used both as input for the parser and as input for a tool that generates the documentation. Theoretically possible :) - just look at Philippe Sigaud's Pegged.
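
For example (a minimal sketch, assuming Pegged's mixin(grammar(...)) interface; the Mini grammar and the doc-generator half are made up for illustration):

  import pegged.grammar;

  // The grammar is ordinary data: a string. The same string drives the
  // compile-time parser generation below and could, hypothetically, also be
  // fed to a documentation generator, so spec and parser can't drift apart.
  enum grammarSource = `
  Mini:
      Decl < Type identifier ';'
      Type < "int" / "bool"
  `;

  mixin(grammar(grammarSource));   // generates the Mini parser at compile time

  void main()
  {
      assert(Mini("int x;").successful);
  }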
April 08, 2013
On Monday, 8 April 2013 at 21:50:12 UTC, Christopher Bergqvist wrote:
> On Tuesday, 2 April 2013 at 19:00:21 UTC, Tobias Pankrath wrote:
>>
>>> I'm wondering if it's possible to mechanically check that what's in the grammar is how DMD behaves.
>>
>> Take the grammar, (randomly) generate strings from it, and check whether DMD complains. You'd need a parse-only (don't check semantics) flag, though.
>>
>> This will not check whether the strings are parsed correctly by DMD, nor whether invalid strings are rejected. But it would be a start.
>
> An alternative idea for ensuring that documentation and implementation are in sync might be to list the full grammar definition as a data structure that can be used both as input for the parser and as input for a tool that generates the documentation. Theoretically possible :) - just look at Philippe Sigaud's Pegged.

I know, but the parser is currently hand-written, and I think Walter will only accept an auto-generated parser if it is as fast as the current solution.

However, in an old discussion someone said that the D grammar isn't LALR(1) or LR(1), so I don't think that is possible with the current D parser generators. Do we have a Pegged grammar for D?

Another thing is that for documentation purposes you might want a more readable but ambiguous grammar than your generator of choice will accept.


April 09, 2013
> However in an old discussion someone said that the D grammar isn't LALR(1)
> or LR(1), so I don't think that is possible with current D parser
> generators. Do we have a pegged grammar for D?
>

Yes, it comes with the project. But it's still buggy (sometimes due to my own mistakes, sometimes due to plain errors in the online D grammar). And the generated parser is quite slow, alas.


April 09, 2013
On 02/04/2013 03:13, Walter Bright wrote:
>>
>> 1) Grammar defined in terms of things that aren't tokens. Take, for
>> example,
>> PropertyDeclaration. It's defined as an "@" token followed by... what?
>> "safe"?
>> It's not a real token. It's an identifier. You can't parse this based on
>> checking the token type. You have to check the type and the value.
>
> True, do you have a suggestion?

I don't think that kind of grammar issue is too annoying, since it's easy to understand what the intended behavior is (in this case at least).
But to fix it, well, we can have just that: have the grammar say it should parse an identifier after the @, and then issue a semantic error of sorts if the value is not one of the expected special values (safe, etc.).
Parsing an identifier here is the best error-recovery strategy anyway.
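
For illustration, a minimal sketch of that check (the names are made up; this is not DMD's code) - the parser accepts any identifier after '@', and a separate check rejects values that are not known attributes:

  bool isKnownAtAttribute(string ident)
  {
      switch (ident)
      {
          case "safe", "trusted", "system", "property", "disable":
              return true;        // known special value
          default:
              return false;       // triggers the "semantic error of sorts"
      }
  }

  unittest
  {
      assert(isKnownAtAttribute("safe"));
      assert(!isKnownAtAttribute("blah"));
  }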

A bit more annoying is the case with the extern declaration, with the C++ parameter:
  extern(C++)
here you have to look at a special identifier (the C, D, PASCAL part) and see if there is a ++ token ahead, so it's a bit more special-casing in the parser. Here I think it would have been better to change the language itself and use "CPP" instead of "C++". A minor simplification.

-- 
Bruno Medeiros - Software Engineer
April 09, 2013
On 07/04/2013 16:14, Artur Skawina wrote:
>> The "1..2" is actually mentioned in the spec:
>> >"An exception to this rule is that a .. embedded inside what looks like two floating point literals, as in 1..2, is interpreted as if the .. was separated by a space from the first integer."
>> >so it's there, even if it can be missed.
> I know, but documenting a (grammar) bug does not make it go away.
>

Who says it's a bug? From my understanding, this exception is there on purpose, to make it easier to use the DOT_DOT operator in slice expressions:

  foo[1..2] // It would be silly to have to put a space after the 1

At most you could make a case that "1." shouldn't ever parse as a float - that the fractional part should be required if the dot is present.

-- 
Bruno Medeiros - Software Engineer
April 09, 2013
On 04/09/13 12:24, Bruno Medeiros wrote:
> On 07/04/2013 16:14, Artur Skawina wrote:
>>> The "1..2" is actually mentioned in the spec:
>>> >"An exception to this rule is that a .. embedded inside what looks like two floating point literals, as in 1..2, is interpreted as if the .. was separated by a space from the first integer."
>>> >so it's there, even if it can be missed.
>> I know, but documenting a (grammar) bug does not make it go away.
>>
> 
> Who says it's a bug? From my understanding, this exception is there on purpose, to make it easier to use the DOT_DOT operator in slice expressions:
> 
>   foo[1..2] // It would be silly to have to put a space after the 1
> 
> At most you could make a case that "1." shouldn't ever parse as a float - that the fractional part should be required if the dot is present.

It's a bug, because the grammar does not correctly describe the rules. A parser/lexer based on the grammar alone will not work. Documenting the exception(s) helps the human, but doesn't make the grammar correct.

I've started the PEG conversion of my lexer rules and the relevant one looks like this:

   DecimalFloat:
             (LeadingDecimal "." !"." !IdentifierStart DecimalDigitsNoSingleUS DecimalExponent)
           / (LeadingDecimal "." !"." !IdentifierStart DigitUS*)
           / ("." LeadingDecimal DecimalExponent?)

This works as-is, w/o any extra info - the working lexer is mechanically generated from, among other things, this rule.
(It differs from the dlang definition in at least four ways - "1..2",
"1.ident", "1.000_1" and ".001")

I'll try to finish the conversion and post the whole lexical grammar in a couple days (have almost no D-time right now).

artur
April 09, 2013
On 2013-04-02 04:13, Walter Bright wrote:

>> 1) Grammar defined in terms of things that aren't tokens. Take, for
>> example,
>> PropertyDeclaration. It's defined as an "@" token followed by... what?
>> "safe"?
>> It's not a real token. It's an identifier. You can't parse this based on
>> checking the token type. You have to check the type and the value.
>
> True, do you have a suggestion?

Just define that @safe should be a token?

-- 
/Jacob Carlborg
April 20, 2013
I've moved my work on the grammar to the following location on Github:

https://github.com/Hackerpilot/DGrammar

This uses ANTLR, as the other parser generators can't handle D's grammar. Several rules from the official grammar were removed, and several others were added (such as an actual rule for a function declaration...). I also tried to fix any inaccuracies or omissions I came across in the online documentation.

Comments, issues, and pull requests welcome.
April 20, 2013
On 20-Apr-2013 12:31, Brian Schott wrote:
> I've moved my work on the grammar to the following location on Github:
>
> https://github.com/Hackerpilot/DGrammar
>
> This uses ANTLR, as the other parser generators can't handle D's
> grammar.

Great. IMHO ANTLR is one of the sanest.

> Several rules from the official grammar were removed, and
> several others were added (such as an actual rule for a function
> declaration...). I also tried to fix any inaccuracies or omissions I came
> across in the online documentation.
>
> Comments, issues, and pull requests welcome.

Bookmarked for now ;)

-- 
Dmitry Olshansky