February 28, 2012
Lexer and parser generators using CTFE
I'm starting a new thread on this because I think the matter is of 
strategic importance.

We all felt for a long time that there's a lot of potential in CTFE, and 
potential applications have been discussed more than a few times, 
ranging from parsed format strings to DSLs and parser generators.

Such feats are now approaching fruition because a number of factors 
converge:

* Dmitry Olshansky's regex library (now in Phobos) generates efficient D 
code straight from regexen.

* The scope and quality of CTFE has improved enormously, making more 
advanced uses possible and even relatively easy (thanks Don!)

* Hisayuki Mima implemented a parser generator in only 3000 lines of 
code (sadly, no comments or documentation yet :o))

* On the occasion of that announcement we also found out that Philippe 
Sigaud already has a competing design and implementation of a parser 
generator.

This is the kind of stuff I've had an eye on for the longest time. I'm 
saying it's of strategic importance because CTFE technology, though not 
new and already available with some languages, has unique powers when 
combined with other features of D. With CTFE we get to do things that 
are quite literally impossible to do in other languages.

We need to have an easy-to-use, complete, seamless, and efficient 
lexer-parser generator combo in Phobos, pronto. The lexer itself could 
use a character-level PEG or a classic automaton, and emit tokens for 
consumption by a parser generator. The two should work in perfect tandem 
(no need for glue code). At the end of the day, defining a complete 
lexer+parser combo for a language should be just a few lines longer than 
the textual representation of the grammar itself.
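
To make the goal concrete, here is a purely hypothetical usage sketch; none of the names in it exist in Phobos today, and the imagined generator appears only in comments:

// Hypothetical sketch: "Grammar", "Ini" and "parse" are invented names,
// not an existing Phobos API; only the grammar string below is real code.
enum iniGrammar = `
    Ini     <- Section*
    Section <- '[' Name ']' Pair*
    Pair    <- Name '=' Value
    Name    <- [a-zA-Z_]+
    Value   <- (!'\n' .)*
`;

// The proposal, roughly, is that this should be all the user writes:
//
//     mixin Grammar!iniGrammar;              // lexer+parser generated at compile time
//     enum doc = Ini.parse("[a]\nx=1\n");    // usable in CTFE as well as at run time
//
// i.e. a few lines of boilerplate around the textual grammar, and nothing else.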

What do you all think? Let's get this project off the ground!


Thanks,

Andrei
February 28, 2012
Re: Lexer and parser generators using CTFE
On Tuesday, 28 February 2012 at 07:59:16 UTC, Andrei Alexandrescu
wrote:
> I'm starting a new thread on this because I think the matter is 
> of strategic importance.
>
> We all felt for a long time that there's a lot of potential in 
> CTFE, and potential applications have been discussed more than 
> a few times, ranging from formatting strings parsed to DSLs and 
> parser generators.
>
> Such feats are now approaching fruition because a number of 
> factors converge:
>
> * Dmitry Olshansky's regex library (now in Phobos) generates 
> efficient D code straight from regexen.
>
> * The scope and quality of CTFE has improved enormously, making 
> more advanced uses possible and even relatively easy (thanks 
> Don!)
>
> * Hisayuki Mima implemented a parser generator in only 3000 
> lines of code (sadly, no comments or documentation yet :o))
>
> * With the occasion of that announcement we also find out 
> Philippe Sigaud has already a competing design and 
> implementation of a parser generator.
>
> This is the kind of stuff I've had an eye on for the longest 
> time. I'm saying it's of strategic importance because CTFE 
> technology, though not new and already available with some 
> languages, has unique powers when combined with other features 
> of D. With CTFE we get to do things that are quite literally 
> impossible to do in other languages.
>
> We need to have a easy-to-use, complete, seamless, and 
> efficient lexer-parser generator combo in Phobos, pronto. The 
> lexer itself could use a character-level PEG or a classic 
> automaton, and emit tokens for consumption by a parser 
> generator. The two should work in perfect tandem (no need for 
> glue code). At the end of the day, defining a complete 
> lexer+parser combo for a language should be just a few lines 
> longer than the textual representation of the grammar itself.
>
> What do you all think? Let's get this project off the ground!
>
>
> Thanks,
>
> Andrei

Definitely, I applaud this initiative! I've long been of the
opinion that CTFE parsing is D's killer feature, one which would
allow me to "sneak" D into a "nameless above-average-size
company". ;)
February 28, 2012
Re: Lexer and parser generators using CTFE
(2012/02/28 16:59), Andrei Alexandrescu wrote:
> I'm starting a new thread on this because I think the matter is of
> strategic importance.
>
> We all felt for a long time that there's a lot of potential in CTFE, and
> potential applications have been discussed more than a few times,
> ranging from formatting strings parsed to DSLs and parser generators.
>
> Such feats are now approaching fruition because a number of factors
> converge:
>
> * Dmitry Olshansky's regex library (now in Phobos) generates efficient D
> code straight from regexen.
>
> * The scope and quality of CTFE has improved enormously, making more
> advanced uses possible and even relatively easy (thanks Don!)
>
> * Hisayuki Mima implemented a parser generator in only 3000 lines of
> code (sadly, no comments or documentation yet :o))
>
> * With the occasion of that announcement we also find out Philippe
> Sigaud has already a competing design and implementation of a parser
> generator.
>
> This is the kind of stuff I've had an eye on for the longest time. I'm
> saying it's of strategic importance because CTFE technology, though not
> new and already available with some languages, has unique powers when
> combined with other features of D. With CTFE we get to do things that
> are quite literally impossible to do in other languages.
>
> We need to have a easy-to-use, complete, seamless, and efficient
> lexer-parser generator combo in Phobos, pronto. The lexer itself could
> use a character-level PEG or a classic automaton, and emit tokens for
> consumption by a parser generator. The two should work in perfect tandem
> (no need for glue code). At the end of the day, defining a complete
> lexer+parser combo for a language should be just a few lines longer than
> the textual representation of the grammar itself.
>
> What do you all think? Let's get this project off the ground!
>
>
> Thanks,
>
> Andrei

Hello Andrei,

Indeed, I haven't yet written the documentation for my library, ctpg
(but a few examples are available here: 
https://github.com/youkei/ctpg/tree/master/example).
So, I'd like to describe it here.

To be honest, ctpg is inspired by a module of Scala's standard 
library, its parser combinators.
One of the differences between Scala's parser combinators and ctpg is 
that the former generate parsers at run time, whereas ctpg generates 
parsers at compile time through the power of CTFE and mixins.

A parser generated by ctpg is a recursive descent parser, so it does 
lexical analysis and parsing in a single pass. The input it accepts can 
be a string, wstring, dstring, or a ForwardRange whose element type is 
char, wchar, or dchar.
As a result, splitting lexical analysis and parsing into two separate 
stages is difficult in ctpg.

Importantly, a parser generated by ctpg is statically typed, as the 
"parsing a simple arithmetic expression" example shows.
Typically, a parser generated by another tool accepts tokens and returns 
an abstract syntax tree, but in that example the ctpg parser returns the 
evaluated value directly.
In other words, it does lexical analysis, parsing, and (type) conversion 
in one pass.
If all you want is an abstract syntax tree, ctpg may be a little painful 
to use.
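
To illustrate what "in one pass" means, here is a small hand-written sketch (not ctpg's actual syntax; please see the linked examples for that) of a recursive descent parser that lexes, parses and evaluates in a single pass, and that runs both at compile time via CTFE and at run time:

// Hand-written sketch (not ctpg's API): digits, '+' and '*' are lexed,
// parsed and evaluated in one recursive descent pass.
int parseExpr(ref string s)
{
    int value = parseTerm(s);
    while (s.length && s[0] == '+')
    {
        s = s[1 .. $];
        value += parseTerm(s);
    }
    return value;
}

int parseTerm(ref string s)
{
    int value = parseNumber(s);
    while (s.length && s[0] == '*')
    {
        s = s[1 .. $];
        value *= parseNumber(s);
    }
    return value;
}

int parseNumber(ref string s)
{
    int value = 0;
    while (s.length && s[0] >= '0' && s[0] <= '9')
    {
        value = value * 10 + (s[0] - '0');
        s = s[1 .. $];
    }
    return value;
}

int evaluate(string s) { return parseExpr(s); }

enum ctValue = evaluate("1+2*3");   // evaluated during compilation
static assert(ctValue == 7);

unittest { assert(evaluate("10*2+3") == 23); }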

That's all I'd like to say about ctpg here.

By the way, I wholeheartedly agree with Andrei's opinion. But currently, 
I think, CTFE is limited by this issue: 
http://d.puremagic.com/issues/show_bug.cgi?id=6498 .
Without addressing it, we can't bring out the full potential of CTFE.

Thanks,
Hisayuki Mima
February 28, 2012
Re: Lexer and parser generators using CTFE
On 02/28/2012 12:36 AM, CTFE-4-the-win wrote:
> We need to have a easy-to-use, complete, seamless, and efficient
> lexer-parser generator combo in Phobos, pronto. The lexer itself could
> use a character-level PEG or a classic automaton, and emit tokens for
> consumption by a parser generator. The two should work in perfect tandem
> (no need for glue code). At the end of the day, defining a complete
> lexer+parser combo for a language should be just a few lines longer than
> the textual representation of the grammar itself.

Are you aware of Philippe Sigaud's PEG work? (It may become part of his 
template book.)

Quoting Philippe:
I recently wrote a parsing expression grammar module in D, also to 
create grammars and parsers at compile-time and parse inputs at CT.

(PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar)

Usage is like this:

mixin Grammar(
      "JSON <- '{' ( Pair ( ',' Pair )* )? '}'"
    , "Pair <- String ':' Value"
    , "Value <- String / Number / JSON / Array / True / False / Null"
    , `True <- "true"`
(..., rest of JSON grammar)
);

enum result = JSON.parse(
`{ "Hello":42,
    "World!":
    {
        "a":[0,1,2,3]
    }
}`);


End quote.
Not that bad :), I think.

Well, I am still a fan of EBNF-based parser generators (recursive 
descent), but that's another story. If I recall correctly, BCS has 
created something working (BNF, so limited).
February 28, 2012
Re: Lexer and parser generators using CTFE
On Tue, Feb 28, 2012 at 01:59:21AM -0600, Andrei Alexandrescu wrote:
> I'm starting a new thread on this because I think the matter is of
> strategic importance.

+1.


> We all felt for a long time that there's a lot of potential in CTFE,
> and potential applications have been discussed more than a few times,
> ranging from formatting strings parsed to DSLs and parser
> generators.

CTFE is one of the big features that attracted me to D.


[...]
> This is the kind of stuff I've had an eye on for the longest time.
> I'm saying it's of strategic importance because CTFE technology,
> though not new and already available with some languages, has unique
> powers when combined with other features of D. With CTFE we get to do
> things that are quite literally impossible to do in other languages.

For example?


> We need to have a easy-to-use, complete, seamless, and efficient
> lexer-parser generator combo in Phobos, pronto. The lexer itself could
> use a character-level PEG or a classic automaton, and emit tokens for
> consumption by a parser generator. The two should work in perfect
> tandem (no need for glue code). At the end of the day, defining a
> complete lexer+parser combo for a language should be just a few lines
> longer than the textual representation of the grammar itself.

Definitely agree. I frequently work with code that requires some amount
of lexing/parsing; having something like this in Phobos would be
absolutely awesome.


> What do you all think? Let's get this project off the ground!
[...]

+1.

On that note, I'd like to mention that CTFE currently has some parts
that need more work:

(1) We need compile-time alternatives to the functions in std.math, so
that functions that are currently only implemented in asm can be used in
CTFE (a sketch of the usual workaround follows below).

(2) It should be possible to generate AA literals from CTFE and use them
to initialize things like parsing/lexing lookup tables, etc. While this
in itself is not a critical issue, it would make D look bad if we tout
D's CTFE capabilities yet AAs still have to be created by static this()
at runtime.
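
Here is a sketch of the workarounds we currently resort to for both points; the truncated Taylor series is only a stand-in for a proper CTFE-capable implementation:

import std.math : exp;

// (1) A CTFE-usable exp(): fall back to a truncated Taylor series when
//     interpreted at compile time, and to the asm-backed std.math.exp at
//     run time.  The accuracy of the series here is only illustrative.
real ctfeExp(real x)
{
    if (__ctfe)
    {
        real term = 1.0L, sum = 1.0L;
        foreach (n; 1 .. 25)
        {
            term *= x / n;
            sum += term;
        }
        return sum;
    }
    else
        return exp(x);
}

enum e1 = ctfeExp(1.0L);                    // computed during compilation
static assert(e1 > 2.718L && e1 < 2.719L);

// (2) AAs today: a lookup table still has to be filled in at run time by
//     a static module constructor instead of being a compile-time constant.
int[string] keywords;

static this()
{
    keywords = ["if": 1, "else": 2, "while": 3];
}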


T

-- 
Which is worse: ignorance or apathy? Who knows? Who cares? -- Erich Schubert
February 28, 2012
Re: Lexer and parser generators using CTFE
Something similar to Boost::Spirit2, but with sane syntax and 
better debugging, would be awesome.
February 28, 2012
Re: Lexer and parser generators using CTFE
On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu  
<SeeWebsiteForEmail@erdani.org> wrote:

> I'm starting a new thread on this because I think the matter is of  
> strategic importance.
>
> We all felt for a long time that there's a lot of potential in CTFE, and  
> potential applications have been discussed more than a few times,  
> ranging from formatting strings parsed to DSLs and parser generators.
>
> Such feats are now approaching fruition because a number of factors  
> converge:
>
> * Dmitry Olshansky's regex library (now in Phobos) generates efficient D  
> code straight from regexen.
>
> * The scope and quality of CTFE has improved enormously, making more  
> advanced uses possible and even relatively easy (thanks Don!)
>
> * Hisayuki Mima implemented a parser generator in only 3000 lines of  
> code (sadly, no comments or documentation yet :o))
>
> * With the occasion of that announcement we also find out Philippe  
> Sigaud has already a competing design and implementation of a parser  
> generator.
>
> This is the kind of stuff I've had an eye on for the longest time. I'm  
> saying it's of strategic importance because CTFE technology, though not  
> new and already available with some languages, has unique powers when  
> combined with other features of D. With CTFE we get to do things that  
> are quite literally impossible to do in other languages.
>
> We need to have a easy-to-use, complete, seamless, and efficient  
> lexer-parser generator combo in Phobos, pronto. The lexer itself could  
> use a character-level PEG or a classic automaton, and emit tokens for  
> consumption by a parser generator. The two should work in perfect tandem  
> (no need for glue code). At the end of the day, defining a complete  
> lexer+parser combo for a language should be just a few lines longer than  
> the textual representation of the grammar itself.
>
> What do you all think? Let's get this project off the ground!
>
>
> Thanks,
>
> Andrei

I wrote a generic lexer generator some time ago.
It has already led to some compiler O(N^2) optimizations, because the token
declarations sneak into the mangling :(.
I also finally added a workaround for a remaining CTFE bug (#6815).

https://gist.github.com/1255439 - lexer generator
https://gist.github.com/1262321 - complete and fast D lexer

I've ditched an attempt to write a parser combinator. It was based on
expression templates and ended up at Boost::Spirit-level craziness.

<PERSONAL OPINION
The hassle of providing good error messages and synthesizing parse results
in a generic parser outweighs the benefit of a declarative grammar.
/PERSONAL OPINION>

A lot becomes feasible from the CTFE perspective; apart from some pending
bugfixes, I currently only miss exp and log.

I do not agree that it's the right moment to write a parser, though.
It hits two of Phobos's biggest shortcomings: the lack of a good I/O
system and the missing allocators.
Any parser written now will either risk not playing nice with ranges
or have to come up with its own buffering again.
February 28, 2012
Re: Lexer and parser generators using CTFE
On 28.02.2012 22:46, Martin Nowak wrote:
> On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> I'm starting a new thread on this because I think the matter is of
>> strategic importance.
>>
>> We all felt for a long time that there's a lot of potential in CTFE,
>> and potential applications have been discussed more than a few times,
>> ranging from formatting strings parsed to DSLs and parser generators.
>>
>> Such feats are now approaching fruition because a number of factors
>> converge:
>>
>> * Dmitry Olshansky's regex library (now in Phobos) generates efficient
>> D code straight from regexen.
>>
>> * The scope and quality of CTFE has improved enormously, making more
>> advanced uses possible and even relatively easy (thanks Don!)
>>
>> * Hisayuki Mima implemented a parser generator in only 3000 lines of
>> code (sadly, no comments or documentation yet :o))
>>
>> * With the occasion of that announcement we also find out Philippe
>> Sigaud has already a competing design and implementation of a parser
>> generator.
>>
>> This is the kind of stuff I've had an eye on for the longest time. I'm
>> saying it's of strategic importance because CTFE technology, though
>> not new and already available with some languages, has unique powers
>> when combined with other features of D. With CTFE we get to do things
>> that are quite literally impossible to do in other languages.
>>
>> We need to have a easy-to-use, complete, seamless, and efficient
>> lexer-parser generator combo in Phobos, pronto. The lexer itself could
>> use a character-level PEG or a classic automaton, and emit tokens for
>> consumption by a parser generator. The two should work in perfect
>> tandem (no need for glue code). At the end of the day, defining a
>> complete lexer+parser combo for a language should be just a few lines
>> longer than the textual representation of the grammar itself.
>>
>> What do you all think? Let's get this project off the ground!
>>
>>

To be truly successful it should have some distinct properties, like:
- being as fast as (or faster than) the handwritten stuff we already have 
(e.g. std.json); the recent INI parser proposal comes to mind, and it 
should be an easy pick for a general parser;
- being flexible: the result need not be a predefined set of AST nodes; 
Hisayuki Mima's parser shows some nice movement in this direction;
- having reasonable compile times and memory consumption (though these 
will only improve over time).


This recalls the EBNF parser idea that I ran with before finally being 
dragged down by real life. Roughly, I thought to follow a hybrid LL(*) 
approach: I had a solid plan for doing DFAs for the lexer and for parser 
lookahead, while the other parts were more or less up in the air.

>> Thanks,
>>
>> Andrei
>
> I wrote a generic lexer generator some time ago.
> It already let to some compiler O(N^2) optimizations, because the token
> declarations sneak into the mangling :(.
> I also finally added a workaround for a remaining CTFE bug (#6815).
>
> https://gist.github.com/1255439 - lexer generator


Cool stuff.

> https://gist.github.com/1262321 - complete and fast D lexer
>
> I've ditched an attempt to write a parser combinator. It was based on
> expression templates and ended up at spirit craziness.
>
> <PERSONAL OPINION
> The hassle of providing good error messages and synthesizing parse results
> in a generic parser outweigh the benefit of a declarative grammar.
> /PERSONAL OPINION>
>

Error reporting is the weak spot; I still have no good ideas on how to solve that.

> A lot becomes feasible from the CTFE perspective,
> despite some bugfixes I only miss exp and log currently.
>

That reminds me that I have to dig up a wonderful CTFE bug in std.regex 
that has somehow been worked around by duping arrays since August.

> I do not agree that it's the right moment to write a parser though.
> It hits the first of phobos two biggest shortcomings, the lack of a good
> I/O
> system and the missing Allocators.
> Any parser written now will either risk to not play nice with ranges
> or has to come up with it's own buffering again.

All good points. There is a prototype of an interactive regex matcher 
that works directly on a stream (buried in std.regex); it even passed 
dry-run unittests back then. Though I had to postpone it until I/O is 
sorted out. I really loved Steven's design, with its easy access to the 
buffer and well-thought-out primitives; I hope it will come about 
sometime soon.

-- 
Dmitry Olshansky
February 28, 2012
Re: Lexer and parser generators using CTFE
On Tue, Feb 28, 2012 at 07:46:04PM +0100, Martin Nowak wrote:
[...]
> I wrote a generic lexer generator some time ago.
> It already let to some compiler O(N^2) optimizations, because the token
> declarations sneak into the mangling :(.
> I also finally added a workaround for a remaining CTFE bug (#6815).
> 
> https://gist.github.com/1255439 - lexer generator
> https://gist.github.com/1262321 - complete and fast D lexer

Cool! I'll have to take a look at this sometime.


[...]
> <PERSONAL OPINION
> The hassle of providing good error messages and synthesizing parse results
> in a generic parser outweigh the benefit of a declarative grammar.
> /PERSONAL OPINION>

But things like lex/yacc have been useful throughout the years. With D's
delegates, lexer/parser action rules should be even cleaner, no?


> A lot becomes feasible from the CTFE perspective, despite some
> bugfixes I only miss exp and log currently.

All of std.math should have CTFE versions so that we can perform complex
calculations at compile time (e.g., create a table of scaled sin/cos
values for use in high-performance 3D apps -- no need to waste time
generating these tables at startup).
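
For instance, something along these lines already works today if you supply your own CTFE-able approximation (the truncated Taylor series below is only a stand-in for a real std.math.sin):

import std.math : PI;

// Stand-in for a CTFE-capable sin(); accuracy is only illustrative.
real taylorSin(real x)
{
    real term = x, sum = x;
    foreach (n; 1 .. 16)
    {
        term *= -x * x / ((2 * n) * (2 * n + 1));
        sum += term;
    }
    return sum;
}

// Builds a table of scaled sin values; the static initializer below forces
// this to run at compile time, so there is no startup cost.
short[] makeSinTable(size_t entries, real scale)
{
    auto table = new short[entries];
    foreach (i; 0 .. entries)
        table[i] = cast(short)(scale * taylorSin(i * 2.0L * PI / entries));
    return table;
}

immutable short[] sinTable = makeSinTable(256, 32767.0L);

unittest
{
    assert(sinTable.length == 256 && sinTable[0] == 0);
}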


> I do not agree that it's the right moment to write a parser though.
> It hits the first of phobos two biggest shortcomings, the lack of a
> good I/O system and the missing Allocators.  Any parser written now
> will either risk to not play nice with ranges or has to come up with
> it's own buffering again.

OTOH, perhaps once we start writing a lexer/parser generator, that will
force us to fix the I/O and allocator system. Without a good project to
motivate us to fix these tedious issues, it seems that we just lose
inspiration to do it after a while.


T

-- 
The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5
February 28, 2012
Re: Lexer and parser generators using CTFE
On Tue, Feb 28, 2012 at 11:52:52PM +0400, Dmitry Olshansky wrote:
> On 28.02.2012 22:46, Martin Nowak wrote:
> >On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu
> ><SeeWebsiteForEmail@erdani.org> wrote:
[...]
> >>We need to have a easy-to-use, complete, seamless, and efficient
> >>lexer-parser generator combo in Phobos, pronto. The lexer itself
> >>could use a character-level PEG or a classic automaton, and emit
> >>tokens for consumption by a parser generator. The two should work in
> >>perfect tandem (no need for glue code). At the end of the day,
> >>defining a complete lexer+parser combo for a language should be just
> >>a few lines longer than the textual representation of the grammar
> >>itself.
[...]
> To be truly successful it should have some distinct properties like:
> - being faster or identical to handwritten stuff we already have
> (like e.g. std.json ), reminds us of recent INI parser proposal,
> should be an easy pick for general parser.

We probably have to use string mixins to achieve this.
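
A tiny illustration of the string-mixin approach: the code for a rule is built as a string at compile time and pasted into the program, so the resulting parser is ordinary compiled D code rather than an interpreter walking a grammar at run time.

// The generated code here is hard-coded for brevity; a real generator
// would build the string from the grammar.
enum string generatedRule = q{
    int add(int a, int b) { return a + b; }
};
mixin(generatedRule);

static assert(add(2, 3) == 5);   // the mixed-in code is even CTFE-able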

OTOH, a generic parser generator is unlikely to be able to handle
special case optimizations such as operator-precedence parsers
(http://en.wikipedia.org/wiki/Operator-precedence_parser), because
operator precedence is only implied by grammar rules, and to optimize
for this case requires explicitly specifying each operator's precedence.
There is some price to be paid for generality, unfortunately.
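
For illustration, here is a minimal precedence-climbing sketch with an explicit precedence table, which is exactly the information such a specialized parser needs beyond the raw grammar rules:

// Precedence climbing over single-character operators; all operators are
// treated as left-associative.
int prec(char op)
{
    switch (op)
    {
        case '+': case '-': return 1;
        case '*': case '/': return 2;
        default:  return -1;   // not a binary operator
    }
}

int parsePrimary(ref string s)
{
    int value = 0;
    while (s.length && s[0] >= '0' && s[0] <= '9')
    {
        value = value * 10 + (s[0] - '0');
        s = s[1 .. $];
    }
    return value;
}

// Parse operators whose precedence is at least minPrec.
int parseExpr(ref string s, int minPrec = 1)
{
    int lhs = parsePrimary(s);
    while (s.length && prec(s[0]) >= minPrec)
    {
        immutable char op = s[0];
        s = s[1 .. $];
        immutable int rhs = parseExpr(s, prec(op) + 1);
        switch (op)
        {
            case '+': lhs += rhs; break;
            case '-': lhs -= rhs; break;
            case '*': lhs *= rhs; break;
            case '/': lhs /= rhs; break;
            default:  assert(0);
        }
    }
    return lhs;
}

unittest
{
    string input = "2+3*4-5";
    assert(parseExpr(input) == 9);   // 2 + 12 - 5
}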


> - be flexible, the result need not be a predefined AST tree nodes,
> Hisayuki Mima's parser shows some nice movement in this direction

In my mind, D's delegates are ideal for this sort of thing. Each grammar
rule will have an associated callback. We can provide default callbacks
that generate AST nodes, but users can override them with their own
delegates to do whatever they want (evaluate expressions/commands on the
fly, build only the AST subtrees of interest, etc.).
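
Something along these lines, perhaps (all names below are invented for illustration, not an existing API):

// Each rule carries a delegate action; the generator supplies a default
// that builds AST nodes, and users can swap in their own.
struct Node { string rule; string text; Node[] children; }

alias Node delegate(string text, Node[] subResults) Action;

struct Rule
{
    string name;
    Action action;
}

// Default action: build a generic AST node for the rule.
Action defaultAction(string ruleName)
{
    return (string text, Node[] subs) { return Node(ruleName, text, subs); };
}

unittest
{
    // User override: do something on the fly instead of building a tree,
    // e.g. count how often the rule matched.
    int additions;
    auto addExpr = Rule("AddExpr", (string text, Node[] subs) {
        ++additions;
        return Node("AddExpr", text, subs);
    });

    // The generated parser would invoke addExpr.action on each match.
    addExpr.action("1+2", null);
    assert(additions == 1);
}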


> - have reasonable compile times and memory consumption (though it will
> only improve over time)

This will be good motivation to improve dmd's CTFE implementation. ;-)


[...]
> Error reporting is the weak spot, still no good ideas on solving that.

This is not specific to D, though, right?


[...]
> >I do not agree that it's the right moment to write a parser though.
> >It hits the first of phobos two biggest shortcomings, the lack of a
> >good I/O system and the missing Allocators.  Any parser written now
> >will either risk to not play nice with ranges or has to come up with
> >it's own buffering again.
> 
> All good points. There is prototype of interactive regex matcher
> that works directly on stream (buried in std.regex), it even passed
> dry-run unittests back then. Though I had to postpone till I/O is
> sorted out. I really loved Steven's design with it's easy access to
> buffer and well thought out primitives, hope it will come about
> sometime soon.
[...]

Are you talking about std.io? I glanced over that code briefly recently;
it looks very promising. I hope it will make it into Phobos soon. We
*need* to integrate streams, ranges, and the stdio writeln etc. machinery.
The current situation looks quite bad to an outsider (streams not
compatible with stdio, no easy way to substitute a string buffer for a
file, etc.).

We also need automatic UTF encoding detection for all input
streams/files/ranges/whatever, including mismatched endianness (i.e.
automatic byte-swapping).  In a recent personal project to write a D lexer
(as an exercise to help me learn D better), I ended up having to roll my
own input stream abstraction in order to handle automatic encoding
detection, which is quite poorly written, I have to admit. Something like
this needs to be standardized in Phobos with a well-written
implementation.
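
For the BOM-based part of the detection the logic is simple enough to sketch (heuristics for BOM-less input are the harder part and are omitted here):

enum Encoding { utf8, utf16le, utf16be, utf32le, utf32be, unknown }

Encoding detectEncoding(const(ubyte)[] data)
{
    // Check the longer BOMs first: FF FE 00 00 (UTF-32LE) would otherwise
    // be misread as the FF FE of UTF-16LE.
    if (data.length >= 4 && data[0] == 0xFF && data[1] == 0xFE
            && data[2] == 0x00 && data[3] == 0x00)
        return Encoding.utf32le;
    if (data.length >= 4 && data[0] == 0x00 && data[1] == 0x00
            && data[2] == 0xFE && data[3] == 0xFF)
        return Encoding.utf32be;
    if (data.length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Encoding.utf8;
    if (data.length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Encoding.utf16le;
    if (data.length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Encoding.utf16be;
    return Encoding.unknown;   // no BOM: assume UTF-8 or apply heuristics
}

unittest
{
    immutable ubyte[] utf8Text  = [0xEF, 0xBB, 0xBF, 'h', 'i'];
    immutable ubyte[] utf16Text = [0xFF, 0xFE, 'h', 0x00, 'i', 0x00];
    assert(detectEncoding(utf8Text)  == Encoding.utf8);
    assert(detectEncoding(utf16Text) == Encoding.utf16le);
}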


T

-- 
"You know, maybe we don't *need* enemies." "Yeah, best friends are about
all I can take." -- Calvin & Hobbes