Jump to page: 1 29  
Page
Thread overview
Lexer and parser generators using CTFE
Feb 28, 2012
CTFE-4-the-win
Feb 28, 2012
bls
Feb 29, 2012
Jonathan M Davis
Feb 29, 2012
Piotr Szturmaj
Mar 01, 2012
Marco Leise
Feb 29, 2012
James Miller
Feb 29, 2012
H. S. Teoh
Feb 29, 2012
d coder
Feb 29, 2012
Philippe Sigaud
Mar 01, 2012
deadalnix
Mar 01, 2012
Philippe Sigaud
May 27, 2012
John Belmonte
May 28, 2012
Philippe Sigaud
May 28, 2012
John Belmonte
May 29, 2012
Philippe Sigaud
Mar 01, 2012
H. S. Teoh
Mar 01, 2012
Philippe Sigaud
Mar 01, 2012
Jacob Carlborg
Feb 29, 2012
d coder
Feb 29, 2012
H. S. Teoh
Mar 01, 2012
Andrej Mitrovic
Feb 28, 2012
Hisayuki Mima
Feb 29, 2012
Hisayuki Mima
Mar 04, 2012
Hisayuki Mima
Mar 05, 2012
Ben Hanson
May 27, 2012
d coder
Jun 01, 2012
Hisayuki Mima
May 27, 2012
d coder
Feb 28, 2012
H. S. Teoh
Feb 29, 2012
Nick Sabalausky
Feb 28, 2012
mist
Feb 29, 2012
bcs
Feb 28, 2012
Martin Nowak
Feb 28, 2012
Dmitry Olshansky
Feb 28, 2012
H. S. Teoh
Feb 29, 2012
Dmitry Olshansky
Feb 29, 2012
Dmitry Olshansky
Feb 29, 2012
Nick Sabalausky
Feb 29, 2012
Nick Sabalausky
Feb 29, 2012
Nick Sabalausky
Mar 05, 2012
Ben Hanson
Feb 29, 2012
Dmitry Olshansky
Feb 28, 2012
H. S. Teoh
Feb 28, 2012
deadalnix
Feb 28, 2012
Martin Nowak
Feb 28, 2012
H. S. Teoh
Feb 29, 2012
Nick Sabalausky
Feb 29, 2012
Nick Sabalausky
Feb 29, 2012
Nick Sabalausky
Feb 29, 2012
Philippe Sigaud
Feb 29, 2012
Nick Sabalausky
Mar 07, 2012
Nick Sabalausky
Feb 29, 2012
Nick Sabalausky
Feb 29, 2012
Timon Gehr
Feb 29, 2012
Martin Nowak
Feb 29, 2012
H. S. Teoh
Feb 29, 2012
d coder
Feb 29, 2012
Nick Sabalausky
Feb 29, 2012
H. S. Teoh
Feb 29, 2012
Timon Gehr
Feb 29, 2012
Martin Nowak
Feb 29, 2012
Timon Gehr
Feb 29, 2012
Martin Nowak
Feb 29, 2012
H. S. Teoh
Feb 29, 2012
bcs
May 27, 2012
F i L
May 28, 2012
Jacob Carlborg
Jun 01, 2012
Ken
February 28, 2012
I'm starting a new thread on this because I think the matter is of strategic importance.

We all felt for a long time that there's a lot of potential in CTFE, and potential applications have been discussed more than a few times, ranging from formatting strings parsed to DSLs and parser generators.

Such feats are now approaching fruition because a number of factors converge:

* Dmitry Olshansky's regex library (now in Phobos) generates efficient D code straight from regexen.

* The scope and quality of CTFE has improved enormously, making more advanced uses possible and even relatively easy (thanks Don!)

* Hisayuki Mima implemented a parser generator in only 3000 lines of code (sadly, no comments or documentation yet :o))

* With the occasion of that announcement we also find out Philippe Sigaud has already a competing design and implementation of a parser generator.

This is the kind of stuff I've had an eye on for the longest time. I'm saying it's of strategic importance because CTFE technology, though not new and already available with some languages, has unique powers when combined with other features of D. With CTFE we get to do things that are quite literally impossible to do in other languages.

We need to have a easy-to-use, complete, seamless, and efficient lexer-parser generator combo in Phobos, pronto. The lexer itself could use a character-level PEG or a classic automaton, and emit tokens for consumption by a parser generator. The two should work in perfect tandem (no need for glue code). At the end of the day, defining a complete lexer+parser combo for a language should be just a few lines longer than the textual representation of the grammar itself.

What do you all think? Let's get this project off the ground!


Thanks,

Andrei
February 28, 2012
On Tuesday, 28 February 2012 at 07:59:16 UTC, Andrei Alexandrescu
wrote:
> I'm starting a new thread on this because I think the matter is of strategic importance.
>
> We all felt for a long time that there's a lot of potential in CTFE, and potential applications have been discussed more than a few times, ranging from formatting strings parsed to DSLs and parser generators.
>
> Such feats are now approaching fruition because a number of factors converge:
>
> * Dmitry Olshansky's regex library (now in Phobos) generates efficient D code straight from regexen.
>
> * The scope and quality of CTFE has improved enormously, making more advanced uses possible and even relatively easy (thanks Don!)
>
> * Hisayuki Mima implemented a parser generator in only 3000 lines of code (sadly, no comments or documentation yet :o))
>
> * With the occasion of that announcement we also find out Philippe Sigaud has already a competing design and implementation of a parser generator.
>
> This is the kind of stuff I've had an eye on for the longest time. I'm saying it's of strategic importance because CTFE technology, though not new and already available with some languages, has unique powers when combined with other features of D. With CTFE we get to do things that are quite literally impossible to do in other languages.
>
> We need to have a easy-to-use, complete, seamless, and efficient lexer-parser generator combo in Phobos, pronto. The lexer itself could use a character-level PEG or a classic automaton, and emit tokens for consumption by a parser generator. The two should work in perfect tandem (no need for glue code). At the end of the day, defining a complete lexer+parser combo for a language should be just a few lines longer than the textual representation of the grammar itself.
>
> What do you all think? Let's get this project off the ground!
>
>
> Thanks,
>
> Andrei

Definitely, I applaud this initiative! I've long been of the
opinion that CTFE parsing is D's killer-feature, which would
allow me to "sneak" D into a "nameless above average size
company". ;)

February 28, 2012
(2012/02/28 16:59), Andrei Alexandrescu wrote:
> I'm starting a new thread on this because I think the matter is of
> strategic importance.
>
> We all felt for a long time that there's a lot of potential in CTFE, and
> potential applications have been discussed more than a few times,
> ranging from formatting strings parsed to DSLs and parser generators.
>
> Such feats are now approaching fruition because a number of factors
> converge:
>
> * Dmitry Olshansky's regex library (now in Phobos) generates efficient D
> code straight from regexen.
>
> * The scope and quality of CTFE has improved enormously, making more
> advanced uses possible and even relatively easy (thanks Don!)
>
> * Hisayuki Mima implemented a parser generator in only 3000 lines of
> code (sadly, no comments or documentation yet :o))
>
> * With the occasion of that announcement we also find out Philippe
> Sigaud has already a competing design and implementation of a parser
> generator.
>
> This is the kind of stuff I've had an eye on for the longest time. I'm
> saying it's of strategic importance because CTFE technology, though not
> new and already available with some languages, has unique powers when
> combined with other features of D. With CTFE we get to do things that
> are quite literally impossible to do in other languages.
>
> We need to have a easy-to-use, complete, seamless, and efficient
> lexer-parser generator combo in Phobos, pronto. The lexer itself could
> use a character-level PEG or a classic automaton, and emit tokens for
> consumption by a parser generator. The two should work in perfect tandem
> (no need for glue code). At the end of the day, defining a complete
> lexer+parser combo for a language should be just a few lines longer than
> the textual representation of the grammar itself.
>
> What do you all think? Let's get this project off the ground!
>
>
> Thanks,
>
> Andrei

Hello Andrei,

Certainly, I don't write yet the documentation of my library, ctpg.
(But a few examples available here: https://github.com/youkei/ctpg/tree/master/example)
So, I'd like to describe it here.

To be honest, ctpg is inspired by one module of Scala's standard library, Parser Combinators.
One of the differences between Parser Combinators and ctpg is that Parser Combinators generates parsers in run-time, but ctpg generates parsers in compile-time by the power of CTFE and mixin.

A parser generated by ctpg is a recursive descent parser, so it does lexical analysis and parsing at a time. And the type of input which it can accept is string, wstring, dstring and ForwardRange whose element type is char, wchar or dchar.
So, dividing lexical analysis and parsing into two processes is difficult in ctpg.

Importantly, a parser generated by ctpg is statically typed as one of the examples, "parsing simple arithmetic expression" shows.
Generally a parser generated by other tool and accepting tokens returns the abstract syntax tree, but it return the evaluated value in the example.
In other words, it does lexical analysis, parsing and (type) converting at a time.
If you want simply abstract syntax tree, it may be a little pain to use ctpg.

That's all I'd like to say about ctpg here.

By the way, I wholeheartedly agree with Andrei's opinion. But currently, I think, CTFE is limited because of this issue: http://d.puremagic.com/issues/show_bug.cgi?id=6498 .
Without dealing with this issue, We couldn't bring out the full potential of CTFE.

Thanks,
Hisayuki Mima
February 28, 2012
On 02/28/2012 12:36 AM, CTFE-4-the-win wrote:
> We need to have a easy-to-use, complete, seamless, and efficient
> lexer-parser generator combo in Phobos, pronto. The lexer itself could
> use a character-level PEG or a classic automaton, and emit tokens for
> consumption by a parser generator. The two should work in perfect tandem
> (no need for glue code). At the end of the day, defining a complete
> lexer+parser combo for a language should be just a few lines longer than
> the textual representation of the grammar itself.

Are you aware of Philippe Sigaud's PEG work (may become part of his template book)

Quote  Philippe ..
I recently wrote a parsing expression grammar module in D, also to create grammars and parsers at compile-time and parse inputs at CT.

(PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar)

Usage is like this:

mixin Grammar(
      "JSON <- '{' ( Pair ( ',' Pair )* )? '}'"
    , "Pair <- String ':' Value"
    , "Value <- String / Number / JSON / Array / True / False / Null"
    , `True <- "true"`
(..., rest of JSON grammar)
);

enum result = JSON.parse(
`{ "Hello":42,
    "World!":
    {
        "a":[0,1,2,3]
    }
}`);


End Quote
No that bad :) , I think.

Well, I am still a fan of EBNF based parser generators (recursive decent) but that's an other story. If I recall correctly BCS has created something working. ( BNF , so limited)
February 28, 2012
On Tue, Feb 28, 2012 at 01:59:21AM -0600, Andrei Alexandrescu wrote:
> I'm starting a new thread on this because I think the matter is of strategic importance.

+1.


> We all felt for a long time that there's a lot of potential in CTFE, and potential applications have been discussed more than a few times, ranging from formatting strings parsed to DSLs and parser generators.

CTFE is one of the big features that attracted me to D.


[...]
> This is the kind of stuff I've had an eye on for the longest time. I'm saying it's of strategic importance because CTFE technology, though not new and already available with some languages, has unique powers when combined with other features of D. With CTFE we get to do things that are quite literally impossible to do in other languages.

For example?


> We need to have a easy-to-use, complete, seamless, and efficient lexer-parser generator combo in Phobos, pronto. The lexer itself could use a character-level PEG or a classic automaton, and emit tokens for consumption by a parser generator. The two should work in perfect tandem (no need for glue code). At the end of the day, defining a complete lexer+parser combo for a language should be just a few lines longer than the textual representation of the grammar itself.

Definitely agree. I frequently work with code that requires some amount of lexing/parsing; having something like this in Phobos would be absolutely awesome.


> What do you all think? Let's get this project off the ground!
[...]

+1.

On that note, I'd like to mention that CTFE currently has some parts that need more work:

(1) We need compile-time alternatives to the functions in std.math so that functions that are currently only implemented in asm can be used in CTFE.

(2) It should be possible to generate AA literals from CTFE and use them to initialize things like parsing/lexing lookup tables, etc.. While this in itself is not a major critical issue, it would make D look bad if we tout D's CTFE capabilities yet at the same time AA's have to be created by static this() at runtime.


T

-- 
Which is worse: ignorance or apathy? Who knows? Who cares? -- Erich Schubert
February 28, 2012
Something similar to Boost::Spirit2 but with sane syntax and better debugging would have been awesome.
February 28, 2012
On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> I'm starting a new thread on this because I think the matter is of strategic importance.
>
> We all felt for a long time that there's a lot of potential in CTFE, and potential applications have been discussed more than a few times, ranging from formatting strings parsed to DSLs and parser generators.
>
> Such feats are now approaching fruition because a number of factors converge:
>
> * Dmitry Olshansky's regex library (now in Phobos) generates efficient D code straight from regexen.
>
> * The scope and quality of CTFE has improved enormously, making more advanced uses possible and even relatively easy (thanks Don!)
>
> * Hisayuki Mima implemented a parser generator in only 3000 lines of code (sadly, no comments or documentation yet :o))
>
> * With the occasion of that announcement we also find out Philippe Sigaud has already a competing design and implementation of a parser generator.
>
> This is the kind of stuff I've had an eye on for the longest time. I'm saying it's of strategic importance because CTFE technology, though not new and already available with some languages, has unique powers when combined with other features of D. With CTFE we get to do things that are quite literally impossible to do in other languages.
>
> We need to have a easy-to-use, complete, seamless, and efficient lexer-parser generator combo in Phobos, pronto. The lexer itself could use a character-level PEG or a classic automaton, and emit tokens for consumption by a parser generator. The two should work in perfect tandem (no need for glue code). At the end of the day, defining a complete lexer+parser combo for a language should be just a few lines longer than the textual representation of the grammar itself.
>
> What do you all think? Let's get this project off the ground!
>
>
> Thanks,
>
> Andrei

I wrote a generic lexer generator some time ago.
It already let to some compiler O(N^2) optimizations, because the token
declarations sneak into the mangling :(.
I also finally added a workaround for a remaining CTFE bug (#6815).

https://gist.github.com/1255439 - lexer generator
https://gist.github.com/1262321 - complete and fast D lexer

I've ditched an attempt to write a parser combinator. It was based on
expression templates and ended up at spirit craziness.

<PERSONAL OPINION
The hassle of providing good error messages and synthesizing parse results
in a generic parser outweigh the benefit of a declarative grammar.
/PERSONAL OPINION>

A lot becomes feasible from the CTFE perspective,
despite some bugfixes I only miss exp and log currently.

I do not agree that it's the right moment to write a parser though.
It hits the first of phobos two biggest shortcomings, the lack of a good I/O
system and the missing Allocators.
Any parser written now will either risk to not play nice with ranges
or has to come up with it's own buffering again.
February 28, 2012
On 28.02.2012 22:46, Martin Nowak wrote:
> On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> I'm starting a new thread on this because I think the matter is of
>> strategic importance.
>>
>> We all felt for a long time that there's a lot of potential in CTFE,
>> and potential applications have been discussed more than a few times,
>> ranging from formatting strings parsed to DSLs and parser generators.
>>
>> Such feats are now approaching fruition because a number of factors
>> converge:
>>
>> * Dmitry Olshansky's regex library (now in Phobos) generates efficient
>> D code straight from regexen.
>>
>> * The scope and quality of CTFE has improved enormously, making more
>> advanced uses possible and even relatively easy (thanks Don!)
>>
>> * Hisayuki Mima implemented a parser generator in only 3000 lines of
>> code (sadly, no comments or documentation yet :o))
>>
>> * With the occasion of that announcement we also find out Philippe
>> Sigaud has already a competing design and implementation of a parser
>> generator.
>>
>> This is the kind of stuff I've had an eye on for the longest time. I'm
>> saying it's of strategic importance because CTFE technology, though
>> not new and already available with some languages, has unique powers
>> when combined with other features of D. With CTFE we get to do things
>> that are quite literally impossible to do in other languages.
>>
>> We need to have a easy-to-use, complete, seamless, and efficient
>> lexer-parser generator combo in Phobos, pronto. The lexer itself could
>> use a character-level PEG or a classic automaton, and emit tokens for
>> consumption by a parser generator. The two should work in perfect
>> tandem (no need for glue code). At the end of the day, defining a
>> complete lexer+parser combo for a language should be just a few lines
>> longer than the textual representation of the grammar itself.
>>
>> What do you all think? Let's get this project off the ground!
>>
>>

To be truly successful it should have some distinct properties like:
- being faster or identical to handwritten stuff we already have (like e.g. std.json ), reminds us of recent INI parser proposal, should be an easy pick for general parser.
- be flexible, the result need not be a predefined AST tree nodes, Hisayuki Mima's parser shows some nice movement in this direction
- have reasonable compile times and memory consumption (though it will only improve over time)


Recalling EBNF parser idea that I run with before finally being dragged down by real life. Roughly I thought to follow hybrid LL(*) aproach, while I had a solid plan on doing DFA for lexer and parser lookahead, the other things were more or less floating.

>> Thanks,
>>
>> Andrei
>
> I wrote a generic lexer generator some time ago.
> It already let to some compiler O(N^2) optimizations, because the token
> declarations sneak into the mangling :(.
> I also finally added a workaround for a remaining CTFE bug (#6815).
>
> https://gist.github.com/1255439 - lexer generator


Cool stuff.

> https://gist.github.com/1262321 - complete and fast D lexer
>
> I've ditched an attempt to write a parser combinator. It was based on
> expression templates and ended up at spirit craziness.
>
> <PERSONAL OPINION
> The hassle of providing good error messages and synthesizing parse results
> in a generic parser outweigh the benefit of a declarative grammar.
> /PERSONAL OPINION>
>

Error reporting is the weak spot, still no good ideas on solving that.

> A lot becomes feasible from the CTFE perspective,
> despite some bugfixes I only miss exp and log currently.
>

Reminds me that I have to dig up a wonderful CTFE bug in std.regex that's being somehow worked around by duping arrays since August.

> I do not agree that it's the right moment to write a parser though.
> It hits the first of phobos two biggest shortcomings, the lack of a good
> I/O
> system and the missing Allocators.
> Any parser written now will either risk to not play nice with ranges
> or has to come up with it's own buffering again.

All good points. There is prototype of interactive regex matcher that works directly on stream (buried in std.regex), it even passed dry-run unittests back then. Though I had to postpone till I/O is sorted out. I really loved Steven's design with it's easy access to buffer and well thought out primitives, hope it will come about sometime soon.

-- 
Dmitry Olshansky
February 28, 2012
On Tue, Feb 28, 2012 at 07:46:04PM +0100, Martin Nowak wrote: [...]
> I wrote a generic lexer generator some time ago.
> It already let to some compiler O(N^2) optimizations, because the token
> declarations sneak into the mangling :(.
> I also finally added a workaround for a remaining CTFE bug (#6815).
> 
> https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexer

Cool! I'll have to take a look at this sometime.


[...]
> <PERSONAL OPINION
> The hassle of providing good error messages and synthesizing parse results
> in a generic parser outweigh the benefit of a declarative grammar.
> /PERSONAL OPINION>

But things like lex/yacc have been useful throughout the years. With D's delegates, lexer/parser action rules should be even cleaner, no?


> A lot becomes feasible from the CTFE perspective, despite some bugfixes I only miss exp and log currently.

All of std.math should have CTFE versions so that we can perform complex calculations at compile-time (e.g., create table of scaled sin/cos values for use in high-performance 3D apps -- no need to waste time to generate these tables at startup time).


> I do not agree that it's the right moment to write a parser though. It hits the first of phobos two biggest shortcomings, the lack of a good I/O system and the missing Allocators.  Any parser written now will either risk to not play nice with ranges or has to come up with it's own buffering again.

OTOH, perhaps once we start writing a lexer/parser generator, that will force us to fix the I/O and allocator system. Without a good project to motivate us to fix these tedious issues, it seems that we just lose inspiration to do it after a while.


T

-- 
The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5
February 28, 2012
On Tue, Feb 28, 2012 at 11:52:52PM +0400, Dmitry Olshansky wrote:
> On 28.02.2012 22:46, Martin Nowak wrote:
> >On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
[...]
> >>We need to have a easy-to-use, complete, seamless, and efficient lexer-parser generator combo in Phobos, pronto. The lexer itself could use a character-level PEG or a classic automaton, and emit tokens for consumption by a parser generator. The two should work in perfect tandem (no need for glue code). At the end of the day, defining a complete lexer+parser combo for a language should be just a few lines longer than the textual representation of the grammar itself.
[...]
> To be truly successful it should have some distinct properties like: - being faster or identical to handwritten stuff we already have (like e.g. std.json ), reminds us of recent INI parser proposal, should be an easy pick for general parser.

We probably have to use string mixins to achieve this.

OTOH, a generic parser generator is unlikely to be able to handle special case optimizations such as operator-precedence parsers (http://en.wikipedia.org/wiki/Operator-precedence_parser), because operator precedence is only implied by grammar rules, and to optimize for this case requires explicitly specifying each operator's precedence. There is some price to be paid for generality, unfortunately.


> - be flexible, the result need not be a predefined AST tree nodes, Hisayuki Mima's parser shows some nice movement in this direction

In my mind, D's delegates are ideal for this sort of thing. Each grammar rule will have an associated callback. We can provide default callbacks that generates AST nodes, but the user can override them with his own delegates to do whatever they want (evaluate expressions/commands on the fly, build only AST subtrees of interest, etc.).


> - have reasonable compile times and memory consumption (though it will
> only improve over time)

This will be good motivation to improve dmd's CTFE implementation. ;-)


[...]
> Error reporting is the weak spot, still no good ideas on solving that.

This is not specific to D, though, right?


[...]
> >I do not agree that it's the right moment to write a parser though. It hits the first of phobos two biggest shortcomings, the lack of a good I/O system and the missing Allocators.  Any parser written now will either risk to not play nice with ranges or has to come up with it's own buffering again.
> 
> All good points. There is prototype of interactive regex matcher that works directly on stream (buried in std.regex), it even passed dry-run unittests back then. Though I had to postpone till I/O is sorted out. I really loved Steven's design with it's easy access to buffer and well thought out primitives, hope it will come about sometime soon.
[...]

Are you talking about std.io? I glanced over that code briefly recently; it looks very promising. Hope it will make it into Phobos soon. We *need* to integrate streams, ranges, and the stdio writeln, etc., stuff. The current situation to an outsider looks quite bad (streams not compatible with stdio, no easy way to substitute string buffer for files, etc.).

We also need automatic UTF encoding detection for all input streams/files/ranges/whatever. Including mismatching endianness (i.e. auto byte-swapping).  In a recent personal project to write a D lexer (as an exercise to help me learn D better), I ended up having to roll my own input stream abstraction in order to handle automatic encoding detection. Which is quite poorly written, I've to admit. Something like this needs to be standardized in Phobos with a well-written implementation.


T

-- 
"You know, maybe we don't *need* enemies." "Yeah, best friends are about all I can take." -- Calvin & Hobbes
« First   ‹ Prev
1 2 3 4 5 6 7 8 9