Lexer and parser generators using CTFE (page 3) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » Lexer and parser generators using CTFE (page 3)

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Nick Sabalausky
in reply to Nick Sabalausky

Nick Sabalausky

Posted in reply to Nick Sabalausky

"Nick Sabalausky" <a@a.a> wrote in message news:jikcca$1vq7$1@digitalmars.com...
> "H. S. Teoh" <hsteoh@quickfur.ath.cx> wrote in message news:mailman.215.1330472867.24984.digitalmars-d@puremagic.com...
>> On Tue, Feb 28, 2012 at 10:03:42PM +0100, Martin Nowak wrote: [...]
>>> I won't deny that the combination of CTFE text processing and static introspection could improve on this. It could be made more feasible by some conventions, e.g. parse result always uses structs or classes and built-in arrays.
>>
>> Excellent idea, I like this.
>>
>>
>>> class Module
>>> {
>>>     this(Declaration[]);
>>> }
>>>
>>> class StructDeclaration : Declaration
>>> {
>>>     enum _enbf = "struct $1=Identifier { $2=Declaration* }";
>>>
>>>     this(Identifier, Declaration[]);
>>> }
>>>
>>> ...
>>>
>>> Parser!(Module, StructDeclaration, ...) parser;
>>> Module m = parser.parse(read("file.d"));
>>
>> I like this! Definitely an improvement over yacc syntax.
>>
>
> In Goldie, I've taken an inverted approach, which IMHO is easier to use: The types are automatically generated from the grammar, not the other way around. So applying that approach to the above code, it'd be more like this:
>
> mixin genGrammar!("myGrammar", `
>    Identifier = [a-zA-Z_][a-zA-Z_0-9]*
>    Module = Declaration+
>    Declaration = StructDeclaration
>    StructDeclaration = 'struct' Identifier '{' Declaration* '}'
> `);
>
> Which generates these classes:
>
> Parser!"myGrammar"
> Symbol!("myGrammar.Identifier")
> Symbol!("myGrammar.Module")
> Symbol!("myGrammar.Declaration")
> Symbol!("myGrammar.StructDeclaration")
>
> and/or these:
>
> Parser_myGrammar
> Symbol_myGrammar!"Identifier"
> Symbol_myGrammar!"Module"
> Symbol_myGrammar!"Declaration"
> Symbol_myGrammar!"StructDeclaration"
>
> would could then be aliased by the user however they wanted:
>
> alias Symbol_myGrammar MySym;
>
> And there can still be hooks (delegates, subclassing, whatever) to add customized behavior/functionality.
>

Hmm, maybe I need to think about what it would take to make Goldie able to parse at compile-time...

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by bcs
in reply to Andrei Alexandrescu

bcs

Posted in reply to Andrei Alexandrescu

On 02/27/2012 11:59 PM, Andrei Alexandrescu wrote:
> I'm starting a new thread on this because I think the matter is of
> strategic importance.
>
> We all felt for a long time that there's a lot of potential in CTFE, and
> potential applications have been discussed more than a few times,
> ranging from formatting strings parsed to DSLs and parser generators.
>
> Such feats are now approaching fruition because a number of factors
> converge:
>
> * Dmitry Olshansky's regex library (now in Phobos) generates efficient D
> code straight from regexen.
>
> * The scope and quality of CTFE has improved enormously, making more
> advanced uses possible and even relatively easy (thanks Don!)
>
> * Hisayuki Mima implemented a parser generator in only 3000 lines of
> code (sadly, no comments or documentation yet :o))
>

A bit outdated but....

http://dsource.org/projects/scrapple/browser/trunk/dparser/dparse.d

BTW, that uses very little in the way of CTFE and even less in the way of string mixins. Those are, IMHO, very powerful tools that should only be combined as weapons of last resort.

> * With the occasion of that announcement we also find out Philippe
> Sigaud has already a competing design and implementation of a parser
> generator.
>
> This is the kind of stuff I've had an eye on for the longest time. I'm
> saying it's of strategic importance because CTFE technology, though not
> new and already available with some languages, has unique powers when
> combined with other features of D. With CTFE we get to do things that
> are quite literally impossible to do in other languages.
>
> We need to have a easy-to-use, complete, seamless, and efficient
> lexer-parser generator combo in Phobos, pronto. The lexer itself could
> use a character-level PEG or a classic automaton, and emit tokens for
> consumption by a parser generator. The two should work in perfect tandem
> (no need for glue code). At the end of the day, defining a complete
> lexer+parser combo for a language should be just a few lines longer than
> the textual representation of the grammar itself.
>
> What do you all think? Let's get this project off the ground!
>
>
> Thanks,
>
> Andrei

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by bcs
in reply to mist

bcs

Posted in reply to mist

On 02/28/2012 08:23 AM, mist wrote:
> Something similar to Boost::Spirit2 but with sane syntax and better
> debugging would have been awesome.

How about not having to encode the syntax as arithmetic expressions? D can do string parsing of eBNF at compile time if you want that.

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Nick Sabalausky
in reply to Nick Sabalausky

Nick Sabalausky

Posted in reply to Nick Sabalausky

"Nick Sabalausky" <a@a.a> wrote in message news:jikcit$201o$1@digitalmars.com...
>
> Hmm, maybe I need to think about what it would take to make Goldie able to parse at compile-time...
>

Just gave it a quick shot. It was looking like it might not be too bad, but then I hit:

Assertion failure: 'ctfeStack.stackPointer() == 0' on line 4823 in file 'interpret.c'

Bleh.

(This was with DMD 2.058)

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Philippe Sigaud

Philippe Sigaud

Attachments:

text/html part

>> > On Wednesday, February 29, 2012 02:16:12 Christopher Bergqvist wrote:
>> > > I agree that the current direction of D in this area is impressive. However, I fail to see a killer-feature in generating a lexer-parser generator at compile-time instead of run-time.
>> > >
>
>
> CTFE parsing is especially useful for DSEL (Domain Specific Embedded
Languages) or internal DSLs. The advantages are:
>
> 1. Syntactic errors (in the parsed constructs)  are given out at compile
time.
> 2. D reflections are available only at compile time. Referencing the
variables/identifiers in the parsed subset of DSL with the mainstream D code is impossible without reflections in place.

One of my  goals while writing a CT grammar generator was to get a compile-time parse-tree. Since it contains strings, it's easy to walk the tree, assembling strings as you go and generating the code you want (if//when you want to write code, that is)

Strings are a D way to represent code, so any way to get structured strings at compile-time opens whole vistas for code generation.

As for semantic actions, I added them in my code yesterday. I had hopes for using D's new anonymous syntax (p => p), but by being anonymous, they cannot be inserted easily in string mixins (other modules do not now about __lambda1 and co).

Anyway, I now have semantic actions at compile-time, I used them to write a small (as in, veeery simple) XML parser: I use semantic actions to push node names while encountering them and pop the last tag while encountering a closing tag. It seems to work OK.

That looks a bit like this (sorry, writing on a pad)

mixin(Grammar!("Doc <- Node*"
                "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
                "OpeningTag <- '<' Identifier '>'", OpeningAction,
                "ClosingTag <-  `</` Identifier '>'", ClosingAction,
                "Text <- (!(OpeningTag / ClosingTag) _)+"));

The PEG for Text just means: any char, as long as it's not an OpeningTag
nor a ClosingTag. PEG use '.' to say 'Any char', but I wanted to be able to
deal with qualified names, so I chose '_' instead.
When there is no action, it default to NoOp, as is the case for  Doc, Node
and Text.

I also added named captures (and capture comparison), to be able to say: "I want a sequence of equal chars":

"Equal <- _@first (_=first)*"

That is: any char, store it as "first", then take any number of char, long as their match is equal to first's match.

All this work at CT.

I'm  afraid being in holidays right now means I do not have easy access to GitHub (no git on a pad, and the computer I use to code right now does not have any network connection). I'll put all this online in a few days, because that must seems like the ramblings of a madman right now...

Philippe

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Dmitry Olshansky
in reply to H. S. Teoh

Dmitry Olshansky

Posted in reply to H. S. Teoh

On 29.02.2012 0:19, H. S. Teoh wrote:
> On Tue, Feb 28, 2012 at 11:52:52PM +0400, Dmitry Olshansky wrote:
>> On 28.02.2012 22:46, Martin Nowak wrote:
>>> On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu
>>> <SeeWebsiteForEmail@erdani.org>  wrote:
> [...]
>>>> We need to have a easy-to-use, complete, seamless, and efficient
>>>> lexer-parser generator combo in Phobos, pronto. The lexer itself
>>>> could use a character-level PEG or a classic automaton, and emit
>>>> tokens for consumption by a parser generator. The two should work in
>>>> perfect tandem (no need for glue code). At the end of the day,
>>>> defining a complete lexer+parser combo for a language should be just
>>>> a few lines longer than the textual representation of the grammar
>>>> itself.
> [...]
>> To be truly successful it should have some distinct properties like:
>> - being faster or identical to handwritten stuff we already have
>> (like e.g. std.json ), reminds us of recent INI parser proposal,
>> should be an easy pick for general parser.
>
> We probably have to use string mixins to achieve this.
>
> OTOH, a generic parser generator is unlikely to be able to handle
> special case optimizations such as operator-precedence parsers
> (http://en.wikipedia.org/wiki/Operator-precedence_parser), because
> operator precedence is only implied by grammar rules, and to optimize
> for this case requires explicitly specifying each operator's precedence.
> There is some price to be paid for generality, unfortunately.
>

Well I see no problem in defining a separate OperatorGrammar constructor, that will take base Unit production and a bunch of operators with arity, plus brackets with their priority is plenty. Semantic then actions are trivially defined for each of operators respectively.

Someway of stacking grammars is bonus points for the project. As Andrei started with lexer+parser, I see no problem with it being lexer+parser(+parser)*.

>
>> - be flexible, the result need not be a predefined AST tree nodes,
>> Hisayuki Mima's parser shows some nice movement in this direction
>
> In my mind, D's delegates are ideal for this sort of thing. Each grammar
> rule will have an associated callback. We can provide default callbacks
> that generates AST nodes, but the user can override them with his own
> delegates to do whatever they want (evaluate expressions/commands on the
> fly, build only AST subtrees of interest, etc.).
>

Yup delegates and mixins are the tools. More over lambdas shine here, less clutter + type deduction.

>
>> - have reasonable compile times and memory consumption (though it will
>> only improve over time)
>
> This will be good motivation to improve dmd's CTFE implementation. ;-)
>
>
> [...]
>> Error reporting is the weak spot, still no good ideas on solving that.
>
> This is not specific to D, though, right?

Such a framework relying on mixins is bound to produce awful error messages at compile-time, unless explicitly stuffed up to watterline with some kind of static assert(__traits(compiles,...),"Error x + info");

>
>
> [...]
>>> I do not agree that it's the right moment to write a parser though.
>>> It hits the first of phobos two biggest shortcomings, the lack of a
>>> good I/O system and the missing Allocators.  Any parser written now
>>> will either risk to not play nice with ranges or has to come up with
>>> it's own buffering again.
>>
>> All good points. There is prototype of interactive regex matcher
>> that works directly on stream (buried in std.regex), it even passed
>> dry-run unittests back then. Though I had to postpone till I/O is
>> sorted out. I really loved Steven's design with it's easy access to
>> buffer and well thought out primitives, hope it will come about
>> sometime soon.
> [...]
>
> Are you talking about std.io? I glanced over that code briefly recently;
> it looks very promising. Hope it will make it into Phobos soon. We
> *need* to integrate streams, ranges, and the stdio writeln, etc., stuff.
> The current situation to an outsider looks quite bad (streams not
> compatible with stdio, no easy way to substitute string buffer for
> files, etc.).

Aye, that's it. Not to mention DIP9, though that's another topic in itself.

>
> We also need automatic UTF encoding detection for all input
> streams/files/ranges/whatever. Including mismatching endianness (i.e.
> auto byte-swapping).  In a recent personal project to write a D lexer
> (as an exercise to help me learn D better), I ended up having to roll my
> own input stream abstraction in order to handle automatic encoding
> detection. Which is quite poorly written, I've to admit. Something like
> this needs to be standardized in Phobos with a well-written
> implementation.
>
>
> T
>


-- 
Dmitry Olshansky

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Philippe Sigaud
in reply to Nick Sabalausky

Philippe Sigaud

Posted in reply to Nick Sabalausky

Attachments:

text/html part

Le 29 févr. 2012 07:20, "Nick Sabalausky" <a@a.a> a écrit :
>
> "Nick Sabalausky" <a@a.a> wrote in message news:jikcit$201o$1@digitalmars.com...
> >
> > Hmm, maybe I need to think about what it would take to make Goldie able
to
> > parse at compile-time...
> >
>
> Just gave it a quick shot. It was looking like it might not be too bad,
but
> then I hit:
>
> Assertion failure: 'ctfeStack.stackPointer() == 0' on line 4823 in file
> 'interpret.c'
>
> Bleh.
>
> (This was with DMD 2.058)

Yeah, I had the very same yesterday :(

Also, another one on line 94 in interpret.c 'v->ctfeSomethin' failing.

Too bad.

In my case, I found a workaround: I was doing

array[] ~= SomeStruct(args);

which asserts at CTFE.

But:

auto s = SomeStructs(args);
array[] ~= s;

works.

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by d coder

d coder

Attachments:

text/html part

>
> I'm  afraid being in holidays right now means I do not have easy access to GitHub (no git on a pad, and the computer I use to code right now does not have any network connection). I'll put all this online in a few days, because that must seems like the ramblings of a madman right now...
>

I am eagerly waiting for you to upload the stuff on GitHub.

Regards
- Puneet

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Hisayuki Mima
in reply to Hisayuki Mima

Hisayuki Mima

Posted in reply to Hisayuki Mima

(2012/02/28 23:58), Hisayuki Mima wrote:
> (2012/02/28 16:59), Andrei Alexandrescu wrote:
>> I'm starting a new thread on this because I think the matter is of
>> strategic importance.
>>
>> We all felt for a long time that there's a lot of potential in CTFE, and
>> potential applications have been discussed more than a few times,
>> ranging from formatting strings parsed to DSLs and parser generators.
>>
>> Such feats are now approaching fruition because a number of factors
>> converge:
>>
>> * Dmitry Olshansky's regex library (now in Phobos) generates efficient D
>> code straight from regexen.
>>
>> * The scope and quality of CTFE has improved enormously, making more
>> advanced uses possible and even relatively easy (thanks Don!)
>>
>> * Hisayuki Mima implemented a parser generator in only 3000 lines of
>> code (sadly, no comments or documentation yet :o))
>>
>> * With the occasion of that announcement we also find out Philippe
>> Sigaud has already a competing design and implementation of a parser
>> generator.
>>
>> This is the kind of stuff I've had an eye on for the longest time. I'm
>> saying it's of strategic importance because CTFE technology, though not
>> new and already available with some languages, has unique powers when
>> combined with other features of D. With CTFE we get to do things that
>> are quite literally impossible to do in other languages.
>>
>> We need to have a easy-to-use, complete, seamless, and efficient
>> lexer-parser generator combo in Phobos, pronto. The lexer itself could
>> use a character-level PEG or a classic automaton, and emit tokens for
>> consumption by a parser generator. The two should work in perfect tandem
>> (no need for glue code). At the end of the day, defining a complete
>> lexer+parser combo for a language should be just a few lines longer than
>> the textual representation of the grammar itself.
>>
>> What do you all think? Let's get this project off the ground!
>>
>>
>> Thanks,
>>
>> Andrei
>
> Hello Andrei,
>
> Certainly, I don't write yet the documentation of my library, ctpg.
> (But a few examples available here:
> https://github.com/youkei/ctpg/tree/master/example)
> So, I'd like to describe it here.
>
> To be honest, ctpg is inspired by one module of Scala's standard
> library, Parser Combinators.
> One of the differences between Parser Combinators and ctpg is that
> Parser Combinators generates parsers in run-time, but ctpg generates
> parsers in compile-time by the power of CTFE and mixin.
>
> A parser generated by ctpg is a recursive descent parser, so it does
> lexical analysis and parsing at a time. And the type of input which it
> can accept is string, wstring, dstring and ForwardRange whose element
> type is char, wchar or dchar.
> So, dividing lexical analysis and parsing into two processes is
> difficult in ctpg.
>
> Importantly, a parser generated by ctpg is statically typed as one of
> the examples, "parsing simple arithmetic expression" shows.
> Generally a parser generated by other tool and accepting tokens returns
> the abstract syntax tree, but it return the evaluated value in the example.
> In other words, it does lexical analysis, parsing and (type) converting
> at a time.
> If you want simply abstract syntax tree, it may be a little pain to use
> ctpg.
>
> That's all I'd like to say about ctpg here.
>
> By the way, I wholeheartedly agree with Andrei's opinion. But currently,
> I think, CTFE is limited because of this issue:
> http://d.puremagic.com/issues/show_bug.cgi?id=6498 .
> Without dealing with this issue, We couldn't bring out the full
> potential of CTFE.
>
> Thanks,
> Hisayuki Mima

A minimum of documentation is now available here:
https://github.com/youkei/ctpg/wiki/Home-en

Hisayuki Mima

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Andrei Alexandrescu
in reply to Hisayuki Mima

Andrei Alexandrescu

Posted in reply to Hisayuki Mima

On 2/28/12 8:58 AM, Hisayuki Mima wrote:
> Certainly, I don't write yet the documentation of my library, ctpg.
> (But a few examples available here:
> https://github.com/youkei/ctpg/tree/master/example)

Nice! I have a few comments. I should say that they entail a fair amount of work.

The current setup makes things very difficult for a common case - generating a parse tree. The user would need to insert a lot of custom code to create the appropriate nodes, but that's very easy for the parser because it already has the components. The parser should have a "build AST" option, in which case it should build all nodes without much effort from the user. That's what ANTLR does, and it has a special operator "^" to indicate where the root of the tree should be (http://www.antlr2.org/doc/trees.html).

So here's what I think your examples should look like:

The syntax could be changed a bit - it should be less like D and more like PEG. The "$" is implicit and shouldn't be needed. The ";" is a useful terminator for large productions spanning more than one line, so let's keep it. I don't understand the use of "!", for example the PEG for expression parsing at http://en.wikipedia.org/wiki/Parsing_expression_grammar is:

Value   ← [0-9]+ / '(' Expr ')'
Product ← Value (('*' / '/') Value)*
Sum     ← Product (('+' / '-') Product)*
Expr    ← Sum

but your grammar has:

int primary = !"(" addExp !")" / intLit_p;

whereas it seems to me it should be

int primary = "(" addExp ")" / intLit_p;

No?

Here's how I think a parser that also builds an AST looks like:

mixin generateParsers!(
  Options.createAST,
  q{
    root    <- addExp;
    addExp  <- mulExp (("+" / "-")^ addExp)?
    mulExp  <- primary (("*" / "/")^ mulExp)?
    primary <- "(" addExp ")"% / intLit_p;
  }
);

I used the following notation: a node suffixed with "^" becomes the root of the produced AST, and a node suffixed with "%" will be dropped altogether (that's because ")" is not needed in the AST after the expression has been parsed). With only three characters I informed the parser what the shape of the AST will be.

At this point calling parse!root("1+2*3") would return an object of type ASTNode!"addExp", which in turn has two children, lhs and rhs. The lhs node has type ASTNode!"intLit_p" which has inside a value "1". The rhs node has type ASTNode!"mulExp", and that in turn has two children nodes lhs and rhs, both of type ASTNode!"intLit_p", each with its own value ("2" and "3", respectively). All these nodes are dynamically created by the parsing process and inherit a common type, e.g. ASTNode!"" or some type ASTRootNode.

> So, I'd like to describe it here.
>
> To be honest, ctpg is inspired by one module of Scala's standard
> library, Parser Combinators.
> One of the differences between Parser Combinators and ctpg is that
> Parser Combinators generates parsers in run-time, but ctpg generates
> parsers in compile-time by the power of CTFE and mixin.
>
> A parser generated by ctpg is a recursive descent parser, so it does
> lexical analysis and parsing at a time.

Oh, I forgot this property of PEGs - integrated lexing and parsing. So no need for a separate lexer!

> And the type of input which it
> can accept is string, wstring, dstring and ForwardRange whose element
> type is char, wchar or dchar.

Great. I think we should actually define the notion of BufferedRange. I'll get to that topic later.

> So, dividing lexical analysis and parsing into two processes is
> difficult in ctpg.

Yes, sorry I forgot about that!

> Importantly, a parser generated by ctpg is statically typed as one of
> the examples, "parsing simple arithmetic expression" shows.
> Generally a parser generated by other tool and accepting tokens returns
> the abstract syntax tree, but it return the evaluated value in the example.
> In other words, it does lexical analysis, parsing and (type) converting
> at a time.
> If you want simply abstract syntax tree, it may be a little pain to use
> ctpg.
>
> That's all I'd like to say about ctpg here.

Why would it be difficult to integrate AST creation with parsing? I'm not sure I understand this. My understanding is that you should be able to add a couple of simple operators ("^" and "%" as described above) to inform AST creation, and you're done!

Thanks,

Andrei

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation