February 29, 2012
On 2/28/12 9:57 AM, H. S. Teoh wrote:
> On Tue, Feb 28, 2012 at 01:59:21AM -0600, Andrei Alexandrescu wrote:
>> This is the kind of stuff I've had an eye on for the longest time.
>> I'm saying it's of strategic importance because CTFE technology,
>> though not new and already available with some languages, has unique
>> powers when combined with other features of D. With CTFE we get to do
>> things that are quite literally impossible to do in other languages.
>
> For example?

For example integrated lexer and parser generators!
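To make that concrete, here is a minimal, self-contained sketch of the technique (all names are illustrative, not an existing API): a CTFE-able function builds D source for a recognizer from a token list, and a string mixin compiles the result into the program exactly like handwritten code.

```d
import std.stdio;

// Runs at compile time: builds D source for a recognizer from a token list.
string makeMatcher(string[] tokens)
{
    string code = "bool isToken(string s) {\n    switch (s) {\n";
    foreach (t; tokens)
        code ~= `    case "` ~ t ~ `": return true;` ~ "\n";
    code ~= "    default: return false;\n    }\n}\n";
    return code;
}

// The generated function is compiled in like any handwritten one.
mixin(makeMatcher(["struct", "class", "enum"]));

void main()
{
    writeln(isToken("struct")); // prints "true"
    writeln(isToken("foo"));    // prints "false"
}
```

A real lexer generator would emit a DFA rather than a string switch, but the pipeline - grammar in, compiled recognizer out, no external build step - is the same.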


Andrei
February 29, 2012
On 2/28/12 10:23 AM, mist wrote:
> Something similar to Boost::Spirit2 but with sane syntax and better
> debugging would have been awesome.

Exactly. Boost::Spirit2 is a perfect illustration of a dancing bear - using inadequate technology for a specific purpose.

Andrei
February 29, 2012
On 2/28/12 12:46 PM, Martin Nowak wrote:
> I wrote a generic lexer generator some time ago.
> It already led to some compiler O(N^2) optimizations, because the token
> declarations sneak into the mangling :(.
> I also finally added a workaround for a remaining CTFE bug (#6815).
>
> https://gist.github.com/1255439 - lexer generator
> https://gist.github.com/1262321 - complete and fast D lexer

Wow, now we have an embarrassment of riches!

> I've ditched an attempt to write a parser combinator. It was based on
> expression templates and ended up at Spirit-level craziness.

I'm not sure what a parser combinator is. I think grammar inheritance could actually be interesting to explore.
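For reference, a parser combinator is a higher-order function that composes small parsers into larger ones. A minimal D sketch, with all names hypothetical:

```d
import std.algorithm : startsWith;

// A parser consumes input and reports success plus the remaining input.
struct Result { bool ok; string rest; }
alias Parser = Result delegate(string);

// Primitive parser: match a literal string.
Parser lit(string s)
{
    return (string input) => input.startsWith(s)
        ? Result(true, input[s.length .. $])
        : Result(false, input);
}

// Combinator: run a, then b on what remains.
Parser seq(Parser a, Parser b)
{
    return (string input) {
        auto r = a(input);
        return r.ok ? b(r.rest) : r;
    };
}

// Combinator: ordered choice - try a, fall back to b.
Parser either(Parser a, Parser b)
{
    return (string input) {
        auto r = a(input);
        return r.ok ? r : b(input);
    };
}

void main()
{
    auto p = seq(lit("foo"), either(lit("bar"), lit("baz")));
    assert(p("foobaz").ok);
    assert(!p("foobix").ok);
}
```

Boost::Spirit is essentially this idea pushed through C++ expression templates, which is where the craziness comes from.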

> <PERSONAL OPINION
> The hassle of providing good error messages and synthesizing parse results
> in a generic parser outweighs the benefit of a declarative grammar.
> /PERSONAL OPINION>

I think synthesizing ASTs is a solved problem, see my previous post. Error messages are still a hassle, but I must have missed the point at which they became good in hand-written parsers :o).

> A lot becomes feasible from the CTFE perspective;
> apart from some bugfixes, I currently only miss exp and log.
>
> I do not agree that it's the right moment to write a parser though.
> It hits the first of Phobos' two biggest shortcomings: the lack of a good
> I/O system and the missing allocators.
> Any parser written now will either risk not playing nice with ranges
> or have to come up with its own buffering again.

Agreed, but that doesn't seem like the largest hurdle to me.


Andrei
February 29, 2012
On 2/28/12 1:52 PM, Dmitry Olshansky wrote:
> To be truly successful it should have some distinct properties like:
> - being faster than or identical to the handwritten stuff we already have
> (like e.g. std.json); this recalls the recent INI parser proposal, which
> should be an easy pick for a general parser.

Yes.

> - be flexible; the result need not be predefined AST nodes.
> Hisayuki Mima's parser shows some nice movement in this direction.

Yes. But AST creation should be supported without any custom code.

> - have reasonable compile times and memory consumption (though it will
> only improve over time)

Yes. I guess PEGs have problems there.

> Recalling the EBNF parser idea that I ran with before finally being dragged
> down by real life: roughly, I thought to follow a hybrid LL(*) approach.
> While I had a solid plan for doing a DFA for lexer and parser lookahead,
> the other things were more or less floating.

Well, maybe you could integrate that within your up-and-coming research. Grammars have been considered a solved problem every five years since 1970, and there's always new work coming :o).

> There is a prototype of an interactive regex matcher that
> works directly on a stream (buried in std.regex); it even passed dry-run
> unittests back then, though I had to postpone it till I/O is sorted out. I
> really loved Steven's design with its easy access to the buffer and
> well-thought-out primitives; I hope it will come about sometime soon.

An interactive regex would be a dream come true to me...


Andrei

February 29, 2012
On 02/28/2012 07:46 PM, Martin Nowak wrote:
>
> https://gist.github.com/1255439 - lexer generator
> https://gist.github.com/1262321 - complete and fast D lexer
>

Well, it is slower at lexing than DMD at parsing. What is the bottleneck?
February 29, 2012
On 2/28/12 7:16 PM, Christopher Bergqvist wrote:
> I agree that the current direction of D in this area is impressive.
> However, I fail to see a killer-feature in generating a lexer-parser
> generator at compile-time instead of run-time.
>
> A run-time generator would benefit from not having to execute within the
> limited CTFE environment and would always be on-par in that respect. A
> compile time generator would internalize the generation and compilation
> of the result (with possible glue-code), simplifying the build process
> somewhat.
>
> What am I failing to pick up on?

Barrier of entry and granularity of approach, I think.

Currently if one wants to parse some simple grammar, there are options such as (a) do it by hand, (b) use boost::spirit, or (c) use lex/yacc.

Parsing by hand has the obvious disadvantages. Using boost::spirit has a steep learning curve and tends to create very contorted grammar representations, full of representation noise, and scales very poorly. Using lex/yacc is hamfisted - there's an additional build step, generated files to deal with, and the related logistics, which make lex/yacc a viable choice only for "big" grammars.

An efficient, integrated parser generator would lower the barrier of entry dramatically - if we play our cards right, even a sprintf specifier string could be parsed more simply and faster with an embedded grammar than by painfully writing the recognizer by hand. Parsing config files, XML, JSON, CSV, and various custom file formats would all be a few lines away. Ideally, a user with a basic understanding of grammars should find it easier to parse simple custom formats with a small grammar than to write the parsing code by hand.
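To make the "painfully by hand" point concrete, here is what even a cut-down specifier recognizer costs when written manually (only %[width]s and %[width]d, with illustrative names) - the kind of code a one-line grammar rule, roughly `Spec <- '%' [0-9]* [sd]`, would replace:

```d
struct Spec { int width; char kind; }

// Hand-rolled recognizer for a tiny subset of printf-style specifiers.
bool parseSpec(string s, out Spec spec)
{
    if (s.length < 2 || s[0] != '%') return false;
    size_t i = 1;
    int w = 0;
    // Accumulate an optional decimal width.
    while (i < s.length && s[i] >= '0' && s[i] <= '9')
        w = w * 10 + (s[i++] - '0');
    // Require a trailing conversion character, then end of input.
    if (i >= s.length || (s[i] != 's' && s[i] != 'd')) return false;
    spec = Spec(w, s[i]);
    return i + 1 == s.length;
}

void main()
{
    Spec sp;
    assert(parseSpec("%10d", sp) && sp.width == 10 && sp.kind == 'd');
    assert(parseSpec("%s", sp) && sp.kind == 's');
    assert(!parseSpec("10d", sp));
}
```

Every new flag or conversion multiplies this bookkeeping, while a grammar grows by a token or two.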


Andrei
February 29, 2012
On 2/28/12 11:15 PM, Nick Sabalausky wrote:
> In Goldie, I've taken an inverted approach, which IMHO is easier to use: The
> types are automatically generated from the grammar, not the other way
> around. So applying that approach to the above code, it'd be more like this:
>
> mixin genGrammar!("myGrammar", `
>      Identifier = [a-zA-Z_][a-zA-Z_0-9]*
>      Module = Declaration+
>      Declaration = StructDeclaration
>      StructDeclaration = 'struct' Identifier '{' Declaration* '}'
> `);
>
> Which generates these classes:
>
> Parser!"myGrammar"
> Symbol!("myGrammar.Identifier")
> Symbol!("myGrammar.Module")
> Symbol!("myGrammar.Declaration")
> Symbol!("myGrammar.StructDeclaration")
>
> and/or these:
>
> Parser_myGrammar
> Symbol_myGrammar!"Identifier"
> Symbol_myGrammar!"Module"
> Symbol_myGrammar!"Declaration"
> Symbol_myGrammar!"StructDeclaration"
>
> which could then be aliased by the user however they wanted:
>
> alias Symbol_myGrammar MySym;
>
> And there can still be hooks (delegates, subclassing, whatever) to add
> customized behavior/functionality.

I think this is the right approach.

Andrei


February 29, 2012
On 2/29/12 12:07 AM, bcs wrote:
> On 02/28/2012 08:23 AM, mist wrote:
>> Something similar to Boost::Spirit2 but with sane syntax and better
>> debugging would have been awesome.
>
> How about not having to encode the syntax as arithmetic expressions? D
> can do string parsing of eBNF at compile time if you want that.

I think that's what he meant by "sane syntax".

Andrei
February 29, 2012
On 2/29/12 3:45 AM, Philippe Sigaud wrote:
> mixin(Grammar!("Doc <- Node*"
> "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
> "OpeningTag <- '<' Identifier '>'", OpeningAction,
> "ClosingTag <-  `</` Identifier '>'", ClosingAction,
> "Text <- (!(OpeningTag / ClosingTag) _)+"));

That looks about right, but still has a fair amount of noise. I think the approach of putting the entire grammar in one string is best.

Andrei

February 29, 2012
On 2/29/12 3:48 AM, Dmitry Olshansky wrote:
> Some way of stacking grammars would be bonus points for the project. As Andrei
> started with lexer+parser, I see no problem with it being
> lexer+parser(+parser)*.

What does stacking grammars mean?

Anyhow, one thing that would become important is tree walkers - e.g. you have the tree and you define code for evaluating it etc.
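A minimal sketch of the tree-walker idea (all names hypothetical): the parser synthesizes a tree once, and independent walkers evaluate, pretty-print, or check it.

```d
// One AST node kind; op is "num", "+", or "*".
class Node
{
    string op;
    int value;          // used when op == "num"
    Node left, right;   // used otherwise
    this(int v) { op = "num"; value = v; }
    this(string o, Node l, Node r) { op = o; left = l; right = r; }
}

// One walker: evaluation. A pretty-printer or type checker would be
// another walker over the same tree.
int eval(Node n)
{
    switch (n.op)
    {
        case "num": return n.value;
        case "+":   return eval(n.left) + eval(n.right);
        case "*":   return eval(n.left) * eval(n.right);
        default:    assert(0, "unknown node kind");
    }
}

void main()
{
    // The tree for (2 + 3) * 4, as a parser might synthesize it.
    auto tree = new Node("*",
                         new Node("+", new Node(2), new Node(3)),
                         new Node(4));
    assert(eval(tree) == 20);
}
```

The point is the separation: the grammar produces the tree once, and each downstream concern is just another traversal.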

> Such a framework relying on mixins is bound to produce awful error
> messages at compile time, unless explicitly stuffed up to the waterline
> with some kind of static assert(__traits(compiles,...),"Error x + info");

We need to figure out ways to make that simpler.


Andrei