November 04, 2013
On Mon, Nov 4, 2013 at 1:58 AM, Timothee Cour <thelastmammoth@gmail.com> wrote:

>
> On Sat, Nov 2, 2013 at 1:19 AM, Philippe Sigaud <philippe.sigaud@gmail.com> wrote:
>
>>
>> Note that Pegged has no problem with the official C grammar or the official XML specification...
>>
>
> how do you handle left recursion in C? e.g. a[0][1], etc.?
>
>
I don't remember. Maybe the grammar I use has no left-recursion? I don't remember whether I sued the official C grammar or one created for PEGs by someone else. I know I tested it on many different programs, with no problem.


November 04, 2013
On Sun, Nov 3, 2013 at 7:08 PM, Timothee Cour <thelastmammoth@gmail.com> wrote:

>
> On Sun, Nov 3, 2013 at 1:13 AM, Philippe Sigaud <philippe.sigaud@gmail.com> wrote:
>
>>
>> My current plan is to write different engines, and letting either the
>> user select them at compile-time, or to have the parser decide which one to
>> use, depending on the grammar. I'm pretty sure the 'Type 3' parts of a
>> grammar (regular expressions) could be done by using std.regex, for example.
>>
>
> even lexing can't be done with regex, e.g. nesting comments: /+ ... +/. Also, although it may seem cleaner at first to combine lexing and parsing in one big grammar (as done in Pegged), it is usually faster to feed a (separate) lexer's output into the parser.
>

Lexing, yes. I was imprecise: even in a context-free grammar, some rules are regular and could use std.regex (the ct part) as the underlying engine, just for that rule.
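As a sketch of that idea (the dispatch between engines is hypothetical, but std.regex's compile-time engine is real), a purely regular rule such as Integer <- [0-9]+ could be handled by ctRegex instead of the recursive PEG machinery:

```d
import std.regex;

void main()
{
    // A purely regular rule, e.g. Integer <- [0-9]+, compiled at
    // compile time via ctRegex rather than parsed with PEG recursion.
    enum integerLit = ctRegex!(`^[0-9]+`);

    auto m = matchFirst("123abc", integerLit);
    assert(!m.empty && m.hit == "123"); // matches the leading digits only
}
```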


November 04, 2013
I *used* the grammar, obviously. Suing could be an option to disallow left-recursion, but let's explore other ways for now.


On Mon, Nov 4, 2013 at 6:46 AM, Philippe Sigaud <philippe.sigaud@gmail.com> wrote:

> On Mon, Nov 4, 2013 at 1:58 AM, Timothee Cour <thelastmammoth@gmail.com> wrote:
>
>>
>> On Sat, Nov 2, 2013 at 1:19 AM, Philippe Sigaud <philippe.sigaud@gmail.com> wrote:
>>
>>>
>>> Note that Pegged has no problem with the official C grammar or the official XML specification...
>>>
>>
>> how do you handle left recursion in C? e.g. a[0][1], etc.?
>>
>>
> I don't remember. Maybe the grammar I use has no left-recursion? I don't remember whether I sued the official C grammar or one created for PEGs by someone else. I know I tested it on many different programs, with no problem.
>
>
>


November 04, 2013
On Mon, Nov 4, 2013 at 1:55 AM, Timothee Cour <thelastmammoth@gmail.com> wrote:

>
>> I guess I'll have to write a CT-compatible LALR(1) engine...
>>
> D does not fit into LALR(1), you need glr.
>
>
Well, too bad. GLR is also on my plate, but I fear its cubic behavior
(IIRC, it degrades gracefully, though).
Do you know what part of the D grammar makes it non-LALR(1)?


November 04, 2013
On 11/04/2013 06:48 AM, Philippe Sigaud wrote:
> On Sun, Nov 3, 2013 at 7:08 PM, Timothee Cour <thelastmammoth@gmail.com <mailto:thelastmammoth@gmail.com>> wrote:
>
>
>     On Sun, Nov 3, 2013 at 1:13 AM, Philippe Sigaud
>     <philippe.sigaud@gmail.com <mailto:philippe.sigaud@gmail.com>> wrote:
>
>
>         My current plan is to write different engines, and letting
>         either the user select them at compile-time, or to have the
>         parser decide which one to use, depending on the grammar. I'm
>         pretty sure the 'Type 3' parts of a grammar (regular
>         expressions) could be done by using std.regex, for example.
>
>
>     even lexing can't be done with regex, e.g. nesting comments: /+ ... +/.
>     Also, although it may seem cleaner at first to combine lexing and
>     parsing in one big grammar (as done in Pegged), it is usually faster
>     to feed a (separate) lexer's output into the parser.
>
>
> Lexing, yes. I was imprecise: even in a context-free grammar, some rules are regular and could use std.regex (the ct part) as the underlying engine, just for that rule.
Lexing cannot be done with regex. Think of myArray[1. ]: what comes next, a dot or a number?


November 04, 2013
On 11/04/2013 06:52 AM, Philippe Sigaud wrote:
>
> On Mon, Nov 4, 2013 at 1:55 AM, Timothee Cour <thelastmammoth@gmail.com <mailto:thelastmammoth@gmail.com>> wrote:
>
>
>>         I guess I'll have to write a CT-compatible LALR(1) engine...
>> 
>         D does not fit into LALR(1), you need glr.
>
>
> Well, too bad. GLR is also on my plate, but I fear its cubic behavior (IIRC, it degrades gracefully, though).
That's one part, and even worse, you need a decider function if more than one rule accepts.
> Do you know what part of the D grammar makes it non-LALR(1)?
I had big trouble with the IdentifierList rules.


November 04, 2013
On 04-Nov-2013 12:28, Robert Schadek wrote:
> On 11/04/2013 06:48 AM, Philippe Sigaud wrote:
>> On Sun, Nov 3, 2013 at 7:08 PM, Timothee Cour
>> <thelastmammoth@gmail.com <mailto:thelastmammoth@gmail.com>> wrote:
>>
>>
>>     On Sun, Nov 3, 2013 at 1:13 AM, Philippe Sigaud
>>     <philippe.sigaud@gmail.com <mailto:philippe.sigaud@gmail.com>> wrote:
>>
>>
>>         My current plan is to write different engines, and letting
>>         either the user select them at compile-time, or to have the
>>         parser decide which one to use, depending on the grammar. I'm
>>         pretty sure the 'Type 3' parts of a grammar (regular
>>         expressions) could be done by using std.regex, for example.
>>
>>
>>     even lexing can't be done with regex, e.g. nesting comments: /+ ... +/.
>>     Also, although it may seem cleaner at first to combine lexing and
>>     parsing in one big grammar (as done in Pegged), it is usually faster
>>     to feed a (separate) lexer's output into the parser.
>>
>>
>> Lexing, yes. I was imprecise: even in a context-free grammar, some
>> rules are regular and could use std.regex (the ct part) as the
>> underlying engine, just for that rule.
> Lexing cannot be done with regex. Think of myArray[1. ]: what comes
> next, a dot or a number?

You could use the lookahead extension ;)

In general, I would not recommend using std.regex for a performance-critical lexer at the moment, if only because it hasn't been tuned for that use case yet.

I have plans for extending std.regex capabilities in many directions, lexing being one important use case.
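As a toy illustration of the lookahead point (assuming the myArray[1. ] example stands for the float-literal vs. slice ambiguity, i.e. 1..2 versus 1.5), a float-literal pattern can refuse to consume the dot when a second dot follows:

```d
import std.regex;

void main()
{
    // Float literal: digits, then a dot NOT followed by a second dot,
    // so "1..2" lexes as int, "..", int rather than starting a float.
    auto floatLit = regex(`^[0-9]+\.(?![.])[0-9]*`);

    assert(matchFirst("1.5]", floatLit).hit == "1.5"); // float literal
    assert(matchFirst("1..2", floatLit).empty);        // not a float
}
```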

-- 
Dmitry Olshansky


November 04, 2013
for lexing there's already dscanner we could use (while we wait for,
perhaps, an autogenerated lexer); so I think the priority is the
autogenerated parser (dscanner has one, but hand-designed), where it's
still unknown what will work well.


On Mon, Nov 4, 2013 at 2:43 AM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:

> On 04-Nov-2013 12:28, Robert Schadek wrote:
>
>> On 11/04/2013 06:48 AM, Philippe Sigaud wrote:
>>
>>> On Sun, Nov 3, 2013 at 7:08 PM, Timothee Cour <thelastmammoth@gmail.com <mailto:thelastmammoth@gmail.com>> wrote:
>>>
>>>
>>>     On Sun, Nov 3, 2013 at 1:13 AM, Philippe Sigaud
>>>     <philippe.sigaud@gmail.com <mailto:philippe.sigaud@gmail.com>> wrote:
>>>
>>>
>>>
>>>         My current plan is to write different engines, and letting
>>>         either the user select them at compile-time, or to have the
>>>         parser decide which one to use, depending on the grammar. I'm
>>>         pretty sure the 'Type 3' parts of a grammar (regular
>>>         expressions) could be done by using std.regex, for example.
>>>
>>>
>>>     even lexing can't be done with regex, e.g. nesting comments: /+ ... +/.
>>>     Also, although it may seem cleaner at first to combine lexing and
>>>     parsing in one big grammar (as done in Pegged), it is usually faster
>>>     to feed a (separate) lexer's output into the parser.
>>>
>>>
>>> Lexing, yes. I was imprecise: even in a context-free grammar, some rules are regular and could use std.regex (the ct part) as the underlying engine, just for that rule.
>>>
>> Lexing cannot be done with regex. Think of myArray[1. ]: what comes next, a dot or a number?
>>
>
> You could use lookahead extension ;)
>
> In general, I would not recommend using std.regex for a performance-critical lexer at the moment, if only because it hasn't been tuned for that use case yet.
>
> I have plans for extending std.regex capabilities in many directions, lexing being one important use case.
>
> --
> Dmitry Olshansky
>


November 05, 2013
On Sunday, 3 November 2013 at 01:45:23 UTC, Timothee Cour wrote:
> 1)
> The main issue I see with Pegged is that PEG grammars don't support left
> recursion, so for example it will fail on foo[1].bar(2).fun().
> Unless there's a plan to accommodate those, I sense a dead end.
> One can eliminate left recursion but this has issues.
>

Use the repetition operator(s) and turn the resulting array into whatever tree you like.  In practice, I have never had a problem with this.
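In Pegged's grammar DSL that looks something like this (a toy sketch assuming the usual mixin(grammar(...)) entry point; the rule names are made up, and folding the flat match back into a left-leaning tree is an ordinary post-processing step):

```d
import pegged.grammar;

// Instead of the left-recursive  Postfix <- Postfix '[' [0-9]+ ']',
// use repetition and rebuild the tree from the flat match afterwards.
mixin(grammar(`
Toy:
    Postfix <- Primary ( Index / Call )*
    Index   <- '[' [0-9]+ ']'
    Call    <- '(' ')'
    Primary <- identifier
`));

void main()
{
    // a[0][1] parses as Primary followed by two Index repetitions.
    assert(Toy("a[0][1]").successful);
}
```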

I have used both Pegged and have written an SQL parser at work (not exhaustive, just what's needed) that uses C macros as PEG elements.  Tangent: Using C macros for this worked surprisingly well and allowed me to avoid an extra step in the build process (I do not have access to D for that project).  Pegged is still much more scalable, optimizable, and the grammars are a lot less ugly/finicky.
November 05, 2013
On Friday, 1 November 2013 at 21:22:46 UTC, Andrei Alexandrescu wrote:
> ...
> Bugs stopping Pegged from going forward should receive high priority. I also encourage others to participate to that and similar work.
>
>
> Andrei

Ack!  I share Philippe's sense of timing.  This would have been even nicer to hear a year ago when both of us were actively committing ;)

I was going to close a project or two that I care deeply about and had started a long time ago, but I see that this might be a hard decision.

Nonetheless, I am really glad that you are showing this interest!  I like to hear stuff like this, since I too really like Pegged.


Andrei and Philippe,

I feel compelled to share some of my thoughts that I never had time to finish back then.

I was working on a parser-engine-as-a-library that could be used as the optimized internals for Pegged, as well as by any other tools that need to recognize these common patterns.

The idea was to expose an API like this:

    string makeParser()
    {
        auto builder = new ParserBuilder!char;
        builder.pushSequence();
            builder.literal('x');
            builder.pushMaybe();
                builder.literal('y');
            builder.pop();
        builder.pop();
        return builder.toDCode("callMe");
    }

    const foo = makeParser();

    pragma(msg, foo);

    mixin(foo);

That snippet would create a parser that recognizes the grammar 'x' ( 'y'? ).
The current fledgling implementation creates this parser:
http://pastebin.com/MgSqWXE2

Of course, no one would be expected to write grammars like that.  It would be the job of tools like Pegged or std.regex to package it up in nice syntax that is easy to use.

The code already takes a somewhat different strategy than Pegged's original strategy.  Rather than generating a bunch of templates that dmd then has to instantiate to actualize the parser, it just emits a bunch of very primitive procedural D code.  I suspect that this approach would mixin far faster with current dmd, because the deeply nested templates generated by Pegged seemed to be a bottleneck.  I have to hand it to Philippe though for coming up with a very clever way to bootstrap the thing: once I saw how his templates assembled together, I realized just how convenient that was!

(And I think my parser generator still has to be taught how to avoid making redundant branches in its output: there's some hash-table action that belongs in there somewhere.)

The small amount of code that I have for it is here:
https://github.com/chadjoan/xdc/blob/master/src/xdc/parser_builder/parser_builder.d

I wanted to eventually make it generic enough to recognize patterns in things besides strings.  Being able to write grammars that recognize patterns in ASTs is /useful/.  That leads into the whole xdc project: automate all of the tedious crud in semantic analysis, and thus make compiler writing, and possibly other AST manipulations in user code, much easier.

The other thing I wanted to do was to optimize it.
- I intended it to do no allocations unless the caller asks for it.
- I intended to do a bunch of PEG/regex hybridization.

I noticed some mathematical properties of PEGs and regular expressions that should allow you to mix the two as much as possible.  All you have to do is tell it how to behave at the boundaries where they meet.  And given that PEGs already define their own behavior pretty well, it would become possible to lower a lot of a PEG into regular expressions connected with a minimal set of PEG rules.

This would be some awesome lowering: if you first do a pass that inlines as many rules as possible, and then do a second pass that converts PEG elements into regular elements whenever possible, then I feel like the thing will be damned near optimal.

If you are wondering what these mathematical properties are, then I encourage you to look at this snippet where I define "unitary" and "non-unitary" expressions, for lack of prior terms:
http://pastebin.com/iYBypRc5

Another fun thought: PEGs can have look-behind that includes any regular elements without any additional algorithmic complexity.  Just take all of the look-behinds in the grammar, mash them together into one big regular-expression using regular alternation (|), and then have the resulting automaton consume in lock-step with the PEG parser.  Whenever the PEG parser needs to do a lookbehind, it just checks to see if the companion automaton is in a matching state for the capture it needs.
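A naive way to picture just the check itself (this version re-scans the consumed prefix; the whole point above is that a lock-step automaton avoids that re-scan, so take it purely as an illustration of the question being asked, with made-up patterns):

```d
import std.regex;

void main()
{
    // Two hypothetical look-behind patterns from a grammar, mashed
    // together with regular alternation and anchored ('$') to the end
    // of the input consumed so far.
    auto behind = regex(`(?:[0-9]+|[a-z]+)$`);

    string input = "abc123";
    size_t pos = 3; // the PEG parser is about to match at index 3

    // "Is some look-behind in a matching state just before pos?"
    assert(!matchFirst(input[0 .. pos], behind).empty);
}
```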

*sigh*, I feel like I could write a paper on this stuff if I were in grad school right now.  Alas, I am stuck doing 50-60 hours a week of soul-sucking business programming.  Well, then again, my understanding is that even though I can think of things that seem like they would make interesting topics for publishable papers, reality would have the profs conscript me to do completely different things that are possibly just as inane as the business programming.

I worry that the greater threat to good AST manipulation tools in D is a lack of free time, and not the DMD bugs as much.

I hope this is useful to someone!