November 05, 2013
On 11/03/2013 02:45 AM, Timothee Cour wrote:
> 1)
> The main issue I see with Pegged is that PEG grammars don't support left
> recursion, so for example they will fail on foo[1].bar(2).fun().
> Unless there's a plan to accommodate those, I sense a dead end.
> One can eliminate left recursion, but this has issues.
>
> 2)
> There is some material on extending PEGs to support those, e.g. "Left
> Recursion in Parsing Expression Grammars", or code at
> https://github.com/orlandohill/peg-left-recursion, but I don't know how
> well they work in practice.
>
Scala extended their PEG generator to handle left recursion.
But there is also a Scala GLL implementation:
https://github.com/djspiewak/gll-combinators
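For reference, the standard PEG workaround for such postfix chains is to fold the left recursion into repetition. A minimal sketch in Pegged syntax (rule names invented; the resulting tree is a flat list of postfix operations rather than a left-leaning one):

```d
import pegged.grammar;

// A left-recursive rule such as  Postfix <- Postfix '.' identifier
// would recurse forever in a PEG, so the chain is expressed as
// a primary followed by repeated postfix operations:
mixin(grammar(`
Chain:
    Postfix <- Primary ('.' identifier / '(' Args? ')' / '[' Number ']')*
    Args    <- Postfix (',' Postfix)*
    Primary <- identifier / Number
    Number  <~ [0-9]+
`));

unittest
{
    // foo[1].bar(2).fun() now parses fine.
    assert(Chain("foo[1].bar(2).fun()").successful);
}
```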

November 05, 2013
On 10/30/2013 11:39 PM, jgetner wrote:
> Is there a function in D for evaluating expressions before compile time?

I want to throw two more ideas into the parser generator topic that
I have been carrying around for quite some time. I think that, combined, they'd enable AST generation from clean and simple grammars. Maybe someone finds this interesting enough to give it a try; I will hardly ever get to it myself.

There is an interesting paper [MTP] that describes how a slight modification of EBNF grammars can be used to generate a usable AST
from the grammar definition. Each AST node is named like its production. This is possible because the modified grammar disallows mixing alternates and sequences, which forces one to name all productions.
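A sketch of what that restriction means, in the same notation as the example below (rule names invented):

```
// Disallowed: alternates mixed into a sequence leave the branches nameless.
Expr = Term ("+" Term | "-" Term);

// Required: each alternate is its own named production, so every AST node
// (AddExpr, SubExpr) gets its name straight from the grammar.
AddExpr = Term "+" Term;
SubExpr = Term "-" Term;
Expr    = AddExpr | SubExpr | Term;
```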

The other paper is about [GLL] parsing, a new general parser generation technique. Its main advantage over GLR is that it's a top-down parser. This allows grammars to be modularized; e.g. you can define a grammar that reuses/imports an existing grammar for expressions.

The reason to choose generalized parsers instead of, say, LALR(1)
is maintainability of the grammars. The modifications needed to make
a context-free grammar LR(1)- or LALR(1)-compatible often make it extremely hard to read and change.
Parse forests (many parse trees) are not a problem, I think, because most CFGs only have local ambiguities, so they resolve to a single parse tree.

### A motivating example (desiderata)

grammar.d
```
import number;

Identifier = "[_a-zA-Z][_a-zA-Z0-9]*";

// This still needs some syntax to annotate precedence and associativity
MulExpr = Expr "*" Expr;
DivExpr = Expr "/" Expr;
AddExpr = Expr "+" Expr;
SubExpr = Expr "-" Expr;
PowExpr = Expr "^^" Expr;

Expr = MulExpr | DivExpr | AddExpr | SubExpr | PowExpr | Number | Identifier;

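// '%' is assumed to mean "separated by": one or more Expr separated by ","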
Arguments = Expr % ",";

ExprStmt = Expr ";";
CallStmt = Identifier "(" Arguments ")" ";";

Stmt = ExprStmt | CallStmt;
Stmts = Stmt*;
```
```d
auto p = genParser(import("grammar.d"));
auto stmts = p.parseStmts(input);
auto expr = p.parseExpr(input);
foreach (arg; p.parseArguments(input))
{}

final switch (expr.tag)
{
case MulExpr.tag:
case DivExpr.tag:
case AddExpr.tag:
case SubExpr.tag:
case PowExpr.tag:
    break;
}

static assert(is(MulExpr == Tuple!(Expr, "expr0", Expr, "expr1")));
static assert(is(Stmt == Variant!(ExprStmt, CallStmt)));
static assert(is(Identifier == string));
```

[MTP]: http://babel.ls.fi.upm.es/research/mtp/
       http://babel.ls.fi.upm.es/~angel/papers/finalcopy-2005prole.pdf

[GLL]: http://www.rhul.ac.uk/computerscience/research/csle/gllparsers.aspx
       http://dotat.at/tmp/gll.pdf
       for implementers: "Modelling GLL Parser Implementations" (http://dx.doi.org/10.1007/978-3-642-19440-5_4)
November 05, 2013
Martin Nowak wrote:
> AFAIK parsing expression grammars handle ambiguity through prioritization.

My statement about ambiguity is taken from the first sentence of chapter 7 of "Parsing Expression Grammars: A Recognition-Based Syntactic Foundation" by Bryan Ford (http://pdos.csail.mit.edu/papers/parsing%3Apopl04.pdf):

| Parsing expression grammars provide a [...] foundation for [...]
| languages that are designed to be unambiguous.

-manfred
November 05, 2013
On 11/04/2013 06:52 AM, Philippe Sigaud wrote:
> Do you know what part of the D grammar makes it non-LALR(1)?

At least function and template function declarations are not even LR(1).
You need to look for a second left parenthesis to distinguish the two.

It's fairly easy to solve this, though:
1. Add a new token r")[ ]*(".
2. Add an alternate rule that never matches but forces the parser to look beyond the closing paren of the function arguments.
...
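To make the ambiguity concrete, a small sketch (declarations invented for illustration):

```d
// After the parser has consumed "foo(", both productions are still live;
// only a second "(" after the matching ")" reveals the template:
int foo(int x)  { return x; }   // plain function: one parameter list
int foo(T)(T x) { return x; }   // function template: template parameters
                                // first, runtime parameters second

// Deciding between the two means scanning past the whole first parameter
// list, which exceeds LR(1)'s single token of lookahead; the r")[ ]*("
// token packages exactly that peek into the lexer.
```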
November 05, 2013
On 05-Nov-2013 20:55, Philippe Sigaud wrote:
> On Tue, Nov 5, 2013 at 3:54 PM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
>
>
>     I was also toying with the idea of exposing a Builder interface for
>     std.regex. But push/pop IMHO are better implicitly designed out:
>
>     auto re =
>     atom('x').star(charClass(unicode.Letter),atom('y')).build();
>
>     ... and letting the nesting be explicit.
>
>     Is the same as:
>     auto re = regex(`x(?:\p{L}y)*`);
>
>     Aimed at apps/libs that build regular expressions anyway and have
>     no need for the textual parser.
>
> Another possible advantage is to reference external names inside your
> construction, thus naming other regexen or referencing external variables
> to deposit backreferences inside them.

Actually, it's a bad, bad idea. It has real potential to destroy all optimization opportunities and performance guarantees (like being linear in time, which today only holds when no funky extensions are used).

After all, I'm in the curious position of having to do some work at run time as well, where you can't always just generate some D code ;)

What would be really nice, though, is to let users register their own dictionary of 'tokens'. Then things like an IPv4 pattern or a domain-name pattern become as simple as the `\d` pieces they use today (say, with \i{user-defined-name}).
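A minimal sketch of what that could look like; the dictionary overload is purely hypothetical, and only the manual splicing below works with today's std.regex:

```d
import std.regex;

unittest
{
    // User-registered named sub-patterns:
    auto tokens = ["ipv4" : `(?:\d{1,3}\.){3}\d{1,3}`];

    // Imagined API (does not exist): reference a token like a built-in class.
    // auto re = regex(`^\i{ipv4}(?::\d+)?$`, tokens);

    // Today the same effect needs manual string splicing:
    auto re = regex(`^` ~ tokens["ipv4"] ~ `(?::\d+)?$`);
    assert(!match("127.0.0.1:8080", re).empty);
}
```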

> All in all, to get a regex
> construct that can interact with the external world.

Well, I can think of some rather interesting ways to do it even without tying in external stuff as building blocks. It's more about making std.regex itself less rigid and leaner (as in cheap to invoke). Then external modules may slice and dice its primitives as they see fit.

>
>     What ANTLR does is a similar technique - a regular lookahead to
>     resolve ambiguity in the grammar (implicitly). A lot like LL(k) but
>     with unlimited length (so-called LL(*)). Of course, it generates
>     LL(k) disambiguation where possible, then LL(*), failing that the
>     usual backtracking.
>
> I liked that idea since the author added it to ANTLR, but I never used
> it since.
> I wonder whether that can be implemented inside another parser generator
> or if it uses some specific-to-ANTLR internal machinery.

I don't think there is much ANTLR-specific in it. You would, though, have to accept that it's no longer a PEG but rather some hybrid top-down EBNF parser that resolves ambiguities.

>         I worry that the greater threat to good AST manipulation tools
>         in D is a
>         lack of free time, and not the DMD bugs as much.
>
>
>     Good for you I guess, my developments in related area are blocked
>     still :(
>
> Walter is far from convinced that AST manipulation is a good thing. You
> would have to convince him first.

I thought it was about tools that work with D code, like, say, linters, refactoring, etc.


-- 
Dmitry Olshansky
November 05, 2013
On 11/4/13 11:27 PM, Brian Schott wrote:
> On Monday, 4 November 2013 at 13:20:01 UTC, Timothee Cour wrote:
>> for lexing there's already dscanner we could use (while we wait for
>> perhaps
>> a autogenerated lexer);
>> so I think priority is on the autogenerated parser (dscanner has one but
>> hand designed), where it's still unknown what will work well.
>
> Yes, that tool has two properties:
> 1) It works now. Not Soon(tm). You can download it, compile it, and use
> it to dump the AST of your D code in just a minute or two.
> 2) It wasn't built THE ONE TRUE WAY.
>
> But we should take a step back first. Before we try to implement a
> parser for D's grammar, we need to figure out what exactly D's grammar is.
>
> Seriously. We don't have a real grammar for D. We have the language spec
> on dlang.org, but it isn't complete, consistent, or updated when the
> language changes. Want examples? I have a tracker for them here:
> http://d.puremagic.com/issues/show_bug.cgi?id=10233
>
> There's also my project here: https://github.com/Hackerpilot/DGrammar,
> but it's not official and I keep finding differences between it and the
> behavior of DMD.
>
> Why am I the only one who thinks this is a problem?

I agree it's a problem, in fact three problems in one. In decreasing difficulty order:

1. Semantic changes to working code (e.g. order of evaluation, etc.) are subtle enough to be very difficult to track; they require constant attention and careful manual verification and maintenance of documentation.

2. Semantic analysis changes (i.e. compiles/doesn't compile) are also difficult and require attention, but at least they can to a good extent be verified automatically (by means of test suites and runnable examples). In TDPL I have two categories of examples - visible and invisible. The invisible ones do not occur in the printed text but are present in the book source and are used to check whether the claims made by the book are true. It would be really cool if we had something like that for the online documentation. We should be able to freely intersperse documentation text with invisible unittests that ensure the documentation is correct.
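A minimal sketch of how this can be approximated today (names invented): an ordinary, undocumented unittest placed next to a documented declaration runs under -unittest but is not rendered by Ddoc, which is roughly the "invisible" category:

```d
/// Returns the absolute value of x; the docs claim abs(-3) == 3.
int abs(int x) { return x < 0 ? -x : x; }

// "Invisible" check: no doc comment, so Ddoc skips it, yet it still runs
// with -unittest and keeps the documented claim honest.
unittest
{
    assert(abs(-3) == 3);
}
```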

3. Grammar changes are the simplest ones and in a way the most embarrassing if they happen. The best solution I see to that is deriving the documentation and the actual parser from the same source. This is part of why I'm so keen on parser generators.



Andrei

P.S. I haven't forgotten about the lexer - it's still on the back burner but I will publish it as soon as I get a chance.
November 05, 2013
On Tuesday, 5 November 2013 at 17:23:23 UTC, Martin Nowak wrote:
> On 11/01/2013 09:59 PM, Philippe Sigaud wrote:
>> The examples directory shows different grammars, from JSON to XML to C to D.
>
> Sounds promising.
> I think I found a way to implement a D REPL but am lacking a reliable way to classify code snippets (expressions, declarations, statements) and to get the introduced symbol names.
> Do you think it will be able to handle that (performance is not an issue)?

I did something similar using Pegged a little while back
(https://github.com/callumenator/dabble). I tuned the PEG grammar
quite a lot, but it worked OK.
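For what it's worth, a toy sketch of the classification idea in Pegged syntax (rule names and keyword lists invented; real D snippets need far more care than this):

```d
import pegged.grammar;

// PEG choice is prioritized, so the most specific category is tried first;
// the name of the matching subrule (e.g. "Classify.Stmt") tags the snippet.
mixin(grammar(`
Classify:
    Input <- Decl / Stmt / Expr
    Decl  <- ("auto" / "struct" / "class" / "alias") .*
    Stmt  <- ("if" / "while" / "foreach" / "return") .*
    Expr  <- .*
`));
```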
November 06, 2013
On Tuesday, 5 November 2013 at 14:54:34 UTC, Dmitry Olshansky wrote:
>
> I was also toying with the idea of exposing a Builder interface for std.regex. But push/pop IMHO are better implicitly designed out:
>
> auto re = atom('x').star(charClass(unicode.Letter),atom('y')).build();
>
> ... and letting the nesting be explicit.
>
> Is the same as:
> auto re = regex(`x(?:\p{L}y)*`);
>
> Aimed at apps/libs that build regular expressions anyway and have no need for the textual parser.
>

Interesting.  I like how it induces some amount of static verification, though I worry that it could harm procedural generation of grammars.  It would be difficult, for instance, to use that API to do the equivalent of pushing an atom in one function and popping it in another.

I wonder if we are at different levels of abstraction.  The example you give seems like it requires the API to remember, in a structured way, all of the information presented by the call-chaining and call-nesting.  I might implement something like that with a stateful "builder" object under the hood (a sketch follows below).  However, the example also seems closer to what a regex engine would morph an expression into, thus making it a higher abstraction.
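For contrast, a minimal sketch (all names hypothetical) of the stateful push/pop style, where group nesting lives in builder state and can therefore span function boundaries:

```d
// Hypothetical stateful builder: pushGroup/popGroup make nesting explicit
// API state, so one function can open a group and another can close it.
struct ReBuilder
{
    private string pat;
    private int depth;

    ref ReBuilder atom(char c) { pat ~= c; return this; }
    ref ReBuilder pushGroup()  { pat ~= "(?:"; ++depth; return this; }
    ref ReBuilder popGroup()   { assert(depth > 0); pat ~= ")"; --depth; return this; }
    ref ReBuilder star()       { pat ~= "*"; return this; }

    string build()
    {
        assert(depth == 0, "unbalanced pushGroup/popGroup");
        return pat;
    }
}

void finish(ref ReBuilder b) { b.popGroup().star(); }

unittest
{
    ReBuilder b;
    b.atom('x').pushGroup().atom('y');  // group opened here...
    finish(b);                          // ...closed elsewhere
    assert(b.build() == "x(?:y)*");
}
```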

>> That snippet would create a parser that recognizes the grammar 'x' (
>> 'y'? ).
>> The current fledgling implementation creates this parser:
>> http://pastebin.com/MgSqWXE2
>>
>> Of course, no one would be expected to write grammars like that. It
>> would be the job of tools like Pegged or std.regex to package it up in
>> nice syntax that is easy to use.
>
> I thought to provide some building blocks for that with new std.uni. Not quite everything I wanted, but now at least there is one set of wheels less to reinvent.
>

I haven't looked at std.uni earnestly yet, but if it succeeds at making that Unicode/UTF jungle manageable, then I will be incredibly thankful.

> [snip]
>
>> Another fun thought: PEGs can have look-behind that includes any regular
>> elements without any additional algorithmic complexity. Just take all of
>> the look-behinds in the grammar, mash them together into one big
>> regular-expression using regular alternation (|), and then have the
>> resulting automaton consume in lock-step with the PEG parser.  Whenever
>> the PEG parser needs to do a lookbehind, it just checks to see if the
>> companion automaton is in a matching state for the capture it needs.
>
> Sounds quite slow to do it "just in case". Also, complete DFAs tend to be quite big.
>

I was envisioning it being done lazily and strictly as-needed, if I even got around to doing it at all.

> What ANTLR does is a similar technique - a regular lookahead to resolve ambiguity in the grammar (implicitly). A lot like LL(k) but with unlimited length (so-called LL(*)). Of course, it generates LL(k) disambiguation where possible, then LL(*), failing that the usual backtracking.
>

Neat.

>>
>> *sigh*, I feel like I could write a paper on this stuff if I were in
>> grad school right now.  Alas, I am stuck doing 50-60 hours a week of
>> soul-sucking business programming.
>
> I heard Sociomantic is hiring D programmers for coding some awesome stuff, you may as well apply :)
>

Tempting.

And it seems even Facebook is joining the market now, which is news to me.

>> Well, then again, my understanding
>> is that even though I can think of things that seem like they would make
>> interesting topics for publishable papers, reality would have the profs
>> conscript me to do completely different things that are possibly just as
>> inane as the business programming.
>>
> Speaking from my limited experience - at times it's like that.
>

Yay, someone's observations corroborate mine!
...
Crap, someone's observations corroborate mine :(

;)

>> I worry that the greater threat to good AST manipulation tools in D is a
>> lack of free time, and not the DMD bugs as much.
>
> Good for you I guess, my developments in related area are blocked still :(
>

Well, hopefully you've got the wind at your back now.
November 06, 2013
On Tuesday, 5 November 2013 at 16:31:52 UTC, Philippe Sigaud wrote:
> On Tue, Nov 5, 2013 at 5:45 AM, Chad Joan <chadjoan@gmail.com> wrote:
>
>> Use the repetition operator(s) and turn the resulting array into
>> whatever tree you like.  In practice, I have never had a problem with this.
>>
>> I have used both Pegged and have written an SQL parser at work (not
>> exhaustive, just what's needed) that uses C macros as PEG elements.
>> Tangent: Using C macros for this worked surprisingly well and allowed me
>> to avoid an extra step in the build process (I do not have access to D for
>> that project).  Pegged is still much more scalable, optimizable, and the
>> grammars are a lot less ugly/finicky.
>
> That made my day, thanks!

You're very welcome!

Thank you for spending all that time to make a neat thing.

Maybe I'll rub it in:
Pegged taught me about PEGs and how to build reasonably powerful parsers with extremely limited tools and support.  It's like being able to make shelter in the woods when no one gives you a tent.  And my company's codebase is a jungle.  So even when D isn't around, it is still /useful/.  Thanks.
November 06, 2013
On 2013-11-05 17:55, Philippe Sigaud wrote:

> Walter is far from convinced that AST manipulation is a good thing. You
> would have to convince him first. His fear is that it will lead to
> unreadable code, and everyone using her own personal version of D.
> AFAICT, nothing of the sort happened in Lisp (I mean, Lispers have
> balkanization, but *not* due to AST manipulation).

You know, Walter gave a talk about AST macros at the first D conference. The idea back then was to add AST macros, hence the "macro" keyword.

-- 
/Jacob Carlborg