Jump to page: 1 26  
Page
Thread overview
Pegged, a Parsing Expression Grammar (PEG) generator in D
Mar 10, 2012
Philippe Sigaud
Mar 10, 2012
Andrej Mitrovic
Mar 11, 2012
bearophile
Mar 11, 2012
Philippe Sigaud
Mar 11, 2012
Philippe Sigaud
Mar 11, 2012
Philippe Sigaud
Mar 11, 2012
Philippe Sigaud
Mar 11, 2012
Philippe Sigaud
Mar 11, 2012
Philippe Sigaud
Mar 11, 2012
Philippe Sigaud
Mar 11, 2012
Philippe Sigaud
Mar 11, 2012
Philippe Sigaud
Mar 13, 2012
Jay Norwood
Mar 13, 2012
Jay Norwood
Mar 11, 2012
Jonathan M Davis
Mar 11, 2012
Philippe Sigaud
Mar 11, 2012
Jonathan M Davis
Mar 11, 2012
Philippe Sigaud
Mar 11, 2012
Philippe Sigaud
Aug 09, 2013
Tom Compton
Mar 11, 2012
kraybourne
Mar 11, 2012
Philippe Sigaud
Mar 11, 2012
Philippe Sigaud
Mar 11, 2012
Philippe Sigaud
Re: Pegged, From EBNF to PEG
Mar 13, 2012
bls
Mar 13, 2012
Dmitry Olshansky
Mar 13, 2012
bls
Mar 13, 2012
Dmitry Olshansky
Mar 13, 2012
Philippe Sigaud
Mar 13, 2012
Philippe Sigaud
Mar 14, 2012
Dmitry Olshansky
Mar 17, 2012
Philippe Sigaud
Mar 17, 2012
Dmitry Olshansky
Mar 17, 2012
Philippe Sigaud
Mar 17, 2012
Dmitry Olshansky
Mar 17, 2012
Philippe Sigaud
Mar 13, 2012
Philippe Sigaud
Mar 13, 2012
Tobias Pankrath
Mar 14, 2012
Jay Norwood
Mar 15, 2012
Timon Gehr
Mar 23, 2012
Martin Nowak
Aug 07, 2013
Elie Morisse
March 10, 2012
Hello,

I created a new Github project, Pegged, a Parsing Expression Grammar (PEG) generator in D.

https://github.com/PhilippeSigaud/Pegged

docs: https://github.com/PhilippeSigaud/Pegged/wiki

PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

The idea is to give the generator a PEG with the standard syntax. From this grammar definition, a set of related parsers will be created, to be used at runtime or compile time.

Usage
-----

To use Pegged, just call the `grammar` function with a PEG and mix it in. For example:


import pegged.grammar;

mixin(grammar("
    Expr     <- Factor AddExpr*
    AddExpr  <- ('+'/'-') Factor
    Factor   <- Primary MulExpr*
    MulExpr  <- ('*'/'/') Primary
    Primary  <- Parens / Number / Variable / '-' Primary

    Parens   <- '(' Expr ')'
    Number   <~ [0-9]+
    Variable <- Identifier
"));



This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for basic arithmetic expressions with operator precedence ('*' and '/' bind stronger than '+' or '-'). `Identifier` is a pre-defined parser recognizing your basic C-style identifier. Recursive or mutually recursive rules are OK (no left recursion for now).

To use a parser, use the `.parse` method. It will return a parse tree containing the calls to the different rules:

// Parsing at compile-time:
enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

pragma(msg, parseTree1.capture);
writeln(parseTree1);

// And at runtime too:
auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



Features
--------

* The complete set of PEG operators are implemented
* Pegged can parse its input at compile time and generate a complete parse tree at compile time. In a word: compile-time string (read: D code) transformation and generation.
* You can parse at runtime also, you lucky you.
* Use a standard and readable PEG syntax as a DSL, not a bunch of templates that hide the parser in noise.
* But you can use expression templates if you want, as parsers are all available as such. Pegged is implemented as an expression template, and what's good for the library writer is sure OK for the user too.
* Some useful additional operators are there too: a way to discard matches (thus dumping them from the parse tree), to push captures on a stack, to accept matches that are equal to another match
* Adding new parsers is easy.
* Grammars are composable: you can put different `mixin(grammar(rules));` in a module and then grammars and rules can refer to one another. That way, you can have utility grammars providing their functionalities to other grammars.
* That's why Pegged comes with some pre-defined grammars (JSON, etc).
* Grammars can be dumped in a file to create a D module.

More advanced features, outside the standard PEG perimeter are there to bring more power in the mix:

* Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The previous rule defines a parametrized parser taking two other parsers (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
* Named captures: any parser can be named with the `=` operator. The parse tree generated by the parser (so, also its matches) is delivered to the user in the output. Other parsers in the grammar see the named captures too.
* Semantic actions can be added to any rule in a grammar. Once a rule has matched, its associated action is called on the rule output and passed as final result to other parsers further up the grammar. Do what you want to the parse tree. If the passed actions are delegates, they can access external variables.


Philippe

March 10, 2012
On 11-03-2012 00:28, Philippe Sigaud wrote:
> Hello,
>
> I created a new Github project, Pegged, a Parsing Expression Grammar
> (PEG) generator in D.
>
> https://github.com/PhilippeSigaud/Pegged
>
> docs: https://github.com/PhilippeSigaud/Pegged/wiki
>
> PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar
>
> The idea is to give the generator a PEG with the standard syntax. From
> this grammar definition, a set of related parsers will be created, to be
> used at runtime or compile time.
>
> Usage
> -----
>
> To use Pegged, just call the `grammar` function with a PEG and mix it
> in. For example:
>
>
> import pegged.grammar;
>
> mixin(grammar("
> Expr <- Factor AddExpr*
> AddExpr <- ('+'/'-') Factor
> Factor <- Primary MulExpr*
> MulExpr <- ('*'/'/') Primary
> Primary <- Parens / Number / Variable / '-' Primary
>
> Parens <- '(' Expr ')'
> Number <~ [0-9]+
> Variable <- Identifier
> "));
>
>
>
> This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for
> basic arithmetic expressions with operator precedence ('*' and '/' bind
> stronger than '+' or '-'). `Identifier` is a pre-defined parser
> recognizing your basic C-style identifier. Recursive or mutually
> recursive rules are OK (no left recursion for now).
>
> To use a parser, use the `.parse` method. It will return a parse tree
> containing the calls to the different rules:
>
> // Parsing at compile-time:
> enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");
>
> pragma(msg, parseTree1.capture);
> writeln(parseTree1);
>
> // And at runtime too:
> auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
> assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);
>
>
>
> Features
> --------
>
> * The complete set of PEG operators are implemented
> * Pegged can parse its input at compile time and generate a complete
> parse tree at compile time. In a word: compile-time string (read: D
> code) transformation and generation.
> * You can parse at runtime also, you lucky you.
> * Use a standard and readable PEG syntax as a DSL, not a bunch of
> templates that hide the parser in noise.
> * But you can use expression templates if you want, as parsers are all
> available as such. Pegged is implemented as an expression template, and
> what's good for the library writer is sure OK for the user too.
> * Some useful additional operators are there too: a way to discard
> matches (thus dumping them from the parse tree), to push captures on a
> stack, to accept matches that are equal to another match
> * Adding new parsers is easy.
> * Grammars are composable: you can put different
> `mixin(grammar(rules));` in a module and then grammars and rules can
> refer to one another. That way, you can have utility grammars providing
> their functionalities to other grammars.
> * That's why Pegged comes with some pre-defined grammars (JSON, etc).
> * Grammars can be dumped in a file to create a D module.
>
> More advanced features, outside the standard PEG perimeter are there to
> bring more power in the mix:
>
> * Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The
> previous rule defines a parametrized parser taking two other parsers
> (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
> * Named captures: any parser can be named with the `=` operator. The
> parse tree generated by the parser (so, also its matches) is delivered
> to the user in the output. Other parsers in the grammar see the named
> captures too.
> * Semantic actions can be added to any rule in a grammar. Once a rule
> has matched, its associated action is called on the rule output and
> passed as final result to other parsers further up the grammar. Do what
> you want to the parse tree. If the passed actions are delegates, they
> can access external variables.
>
>
> Philippe
>

Admittedly I have not heard of PEGs before, so I'm curious: Is this powerful enough to parse a language such as C?

-- 
- Alex
March 10, 2012
I see you are not the only one who started writing string array literals like this:

enum PEGCode = grammarCode!(
     "Grammar <- S Definition+ EOI"
    ,"Definition <- RuleName Arrow Expression"
    ,"RuleName   <- Identifier>(ParamList?)"
    ,"Expression <- Sequence (OR Sequence)*"
);

IOW comma on the left side. I know it's not a style preference but
actually a (unfortunate but needed) technique for avoiding bugs. :)

So can you use this to actually parse D code, extract syntax trees and stuff like that? I'm clueless about parsing but it looks very neat. Nice work!
March 11, 2012
Andrej Mitrovic:

> IOW comma on the left side. I know it's not a style preference but
> actually a (unfortunate but needed) technique for avoiding bugs. :)

To avoid that bug: http://d.puremagic.com/issues/show_bug.cgi?id=3827

Bye,
bearophile
March 11, 2012
On Sun, Mar 11, 2012 at 00:34, Alex Rønne Petersen <xtzgzorex@gmail.com> wrote:

> Admittedly I have not heard of PEGs before, so I'm curious: Is this powerful enough to parse a language such as C?

I think so. But you'd have to do add some semantic action to deal with typedefs and macros.

People parsed Java and Javascript with them. I personnally never used Pegged for more than JSON and custom formats, but I intend to try the D grammar.
March 11, 2012
On 3/10/12 5:56 PM, Andrej Mitrovic wrote:
> I see you are not the only one who started writing string array
> literals like this:
>
> enum PEGCode = grammarCode!(
>       "Grammar<- S Definition+ EOI"
>      ,"Definition<- RuleName Arrow Expression"
>      ,"RuleName<- Identifier>(ParamList?)"
>      ,"Expression<- Sequence (OR Sequence)*"
> );
>
> IOW comma on the left side. I know it's not a style preference but
> actually a (unfortunate but needed) technique for avoiding bugs. :)
>
> So can you use this to actually parse D code, extract syntax trees and
> stuff like that? I'm clueless about parsing but it looks very neat.
> Nice work!

I, too, think this is very significant work! Suggestion for Philippe: instead of this:

enum PEGCode = grammarCode!(
     "Grammar <- S Definition+ EOI"
    ,"Definition <- RuleName Arrow Expression"
    ,"RuleName   <- Identifier>(ParamList?)"
    ,"Expression <- Sequence (OR Sequence)*"
);

how about this:

enum PEGCode = grammarCode!("
    Grammar <- S Definition+ EOI;
    Definition <- RuleName Arrow Expression;
    RuleName   <- Identifier>(ParamList?);
    Expression <- Sequence (OR Sequence)*;
");

Splitting on ";" is trivial and makes client code considerably easier to play with.

I'll get back to this.


Andrei
March 11, 2012
On Sun, Mar 11, 2012 at 00:56, Andrej Mitrovic <andrej.mitrovich@gmail.com> wrote:
> I see you are not the only one who started writing string array literals like this:
>
> enum PEGCode = grammarCode!(
>     "Grammar <- S Definition+ EOI"
>    ,"Definition <- RuleName Arrow Expression"
>    ,"RuleName   <- Identifier>(ParamList?)"
>    ,"Expression <- Sequence (OR Sequence)*"
> );
>
> IOW comma on the left side. I know it's not a style preference but
> actually a (unfortunate but needed) technique for avoiding bugs. :)

Yes, I use what I call "Haskell-comma" (I saw it first on Haskell code). But in the previous sample, that's old code. Now, that should only be;

enum code = grammar("
    Grammar <- S Definition+ EOI
    Definition <- RuleName Arrow Expression
    RuleName   <- Identifier>(ParamList?)
    Expression <- Sequence (OR Sequence)*
");

I see I didn't update bootstrap.d, I'll do that.


> So can you use this to actually parse D code, extract syntax trees and stuff like that? I'm clueless about parsing but it looks very neat.

I can partially parse D code, because I only wrote part of the D grammar (see the pegged.examples.dgrammar module), just enough to parse most of template constraints. I intend to complete it and see if that floats.

Yes, it extracts syntax trees, at compile-time or runtime.

> Nice work!

Thanks!
March 11, 2012
On 3/11/12 1:22 AM, Philippe Sigaud wrote:
> On Sun, Mar 11, 2012 at 00:34, Alex Rønne Petersen<xtzgzorex@gmail.com>  wrote:
>
>> Admittedly I have not heard of PEGs before, so I'm curious: Is this powerful
>> enough to parse a language such as C?
>
> I think so. But you'd have to do add some semantic action to deal with
> typedefs and macros.
>
> People parsed Java and Javascript with them. I personnally never used
> Pegged for more than JSON and custom formats, but I intend to try the
> D grammar.

Any chance you consider adding AST generator actions as discussed in the main forum a while ago?

Andrei
March 11, 2012
On Sunday, March 11, 2012 00:28:42 Philippe Sigaud wrote:
> Hello,
> 
> I created a new Github project, Pegged, a Parsing Expression
> Grammar (PEG) generator in D.

Cool! PEGs aren't used all that much for language grammars - primarily because BNF and EBNF are more traditional I think - but they're definitely cool. And it's great to see a D implementation of a PEG parser.

- Jonathan M Davis
March 11, 2012
On Sun, Mar 11, 2012 at 08:26, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> I, too, think this is very significant work! Suggestion for Philippe: instead of this:
>
>
> enum PEGCode = grammarCode!(
>     "Grammar <- S Definition+ EOI"
>    ,"Definition <- RuleName Arrow Expression"
>    ,"RuleName   <- Identifier>(ParamList?)"
>    ,"Expression <- Sequence (OR Sequence)*"
> );
>
> how about this:
>
> enum PEGCode = grammarCode!("
>    Grammar <- S Definition+ EOI;
>
>    Definition <- RuleName Arrow Expression;
>    RuleName   <- Identifier>(ParamList?);
>    Expression <- Sequence (OR Sequence)*;
> ");
>
> Splitting on ";" is trivial and makes client code considerably easier to play with.

It's already implemented! No need for ';'
The docs are up to date, the internal code also. It's only an old
bootstrapping file that Andrej cited that still contains multi-string
definitions. I'll correct that, as that means I didn"t push dogfooding
far enough.

Now, just use one string:

"Grammar <- ...
 Definition <- ...

"
« First   ‹ Prev
1 2 3 4 5 6