Lexer and parser generators using CTFE (page 6) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » Lexer and parser generators using CTFE (page 6)

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by H. S. Teoh
in reply to Martin Nowak

H. S. Teoh

Posted in reply to Martin Nowak

On Wed, Feb 29, 2012 at 07:28:48PM +0100, Martin Nowak wrote:
> On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr@gmx.ch> wrote:
> 
> >On 02/28/2012 07:46 PM, Martin Nowak wrote:
> >>
> >>https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexer
> >>
> >
> >Well, it is slower at lexing than DMD at parsing. What is the bottleneck?
> 
> No, it's as fast as dmd's lexer.
> 
> Writing the tokens to stdout takes a lot of time though.
> Just disable the "writeln(tok);" in the main loop.
> There is also an --echo option which recreates the complete
> source from the tokens.

This sounds like something I ran into in my D lexer project. Lexing Phobos took upwards of 10-15 seconds, which is extremely slow considering that dmd can *compile* it in just a few seconds. So I did some profiling, and found out that most of the time was spent formatting tokens into strings and running writeln. After I commented that out, the running time dropped to just a few seconds. :-)

T

-- 
The fact that anyone still uses AOL shows that even the presence of options doesn't stop some people from picking the pessimal one. - Mike Ellis

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Dmitry Olshansky
in reply to Andrei Alexandrescu

Dmitry Olshansky

Posted in reply to Andrei Alexandrescu

On 29.02.2012 20:31, Andrei Alexandrescu wrote:
>
>> Recalling EBNF parser idea that I run with before finally being dragged
>> down by real life. Roughly I thought to follow hybrid LL(*) aproach,
>> while I had a solid plan on doing DFA for lexer and parser lookahead,
>> the other things were more or less floating.
>
> Well, maybe you could integrate that within your up-and-coming research.
> Grammars have been considered a solved problem every five years since
> 1970, and there's always new work coming :o).
>

It might involve parsing and grammars, unless academia folk seduce me with their ideas of optimization for highly parallel architectures :).

>> There is prototype of interactive regex matcher that
>> works directly on stream (buried in std.regex), it even passed dry-run
>> unittests back then. Though I had to postpone till I/O is sorted out. I
>> really loved Steven's design with it's easy access to buffer and well
>> thought out primitives, hope it will come about sometime soon.
>
> An interactive regex would be a dream come true to me...
>
>

Where you've been? ;) I mean during GSOC we've spent days with Fawzi talking about shoehorning regex matcher to work on buffered streams. He actually did that prototype, I was mostly reviewing/fixing the source to make sure it fits. Recalling the inner works it was even expected to optionally do NFC normalization on the fly (wow, now that was ambitious) and all that without copying stuff around 99% of the time.

-- 
Dmitry Olshansky

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by d coder

d coder

Attachments:

text/html part

>
>
> This sounds like something I ran into in my D lexer project. Lexing Phobos took upwards of 10-15 seconds, which is extremely slow considering that dmd can *compile* it in just a few seconds. So I did some profiling, and found out that most of the time was spent formatting tokens into strings and running writeln. After I commented that out, the running time dropped to just a few seconds. :-)
>
>
And how different was just a few seconds from 10-15 seconds? :-)

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by H. S. Teoh

H. S. Teoh

On Thu, Mar 01, 2012 at 12:19:30AM +0530, d coder wrote:
> >
> >
> > This sounds like something I ran into in my D lexer project. Lexing Phobos took upwards of 10-15 seconds, which is extremely slow considering that dmd can *compile* it in just a few seconds. So I did some profiling, and found out that most of the time was spent formatting tokens into strings and running writeln. After I commented that out, the running time dropped to just a few seconds. :-)
> >
> >
> And how different was just a few seconds from 10-15 seconds? :-)

OK so I was imprecise. It was 2-3 seconds compared to 10-15 seconds. So it's about an order of magnitude difference. :)


T

-- 
"I speak better English than this villain Bush" -- Mohammed Saeed al-Sahaf, Iraqi Minister of Information

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Nick Sabalausky
in reply to d coder

Nick Sabalausky

Posted in reply to d coder

"d coder" <dlang.coder@gmail.com> wrote in message news:mailman.241.1330541814.24984.digitalmars-d@puremagic.com...
> >
>>
>> This sounds like something I ran into in my D lexer project. Lexing Phobos took upwards of 10-15 seconds, which is extremely slow considering that dmd can *compile* it in just a few seconds. So I did some profiling, and found out that most of the time was spent formatting tokens into strings and running writeln. After I commented that out, the running time dropped to just a few seconds. :-)
>>
>>
> And how different was just a few seconds from 10-15 seconds? :-)
>

~10 seconds

;)

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Timon Gehr
in reply to Martin Nowak

Timon Gehr

Posted in reply to Martin Nowak

On 02/29/2012 07:28 PM, Martin Nowak wrote:
> On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr@gmx.ch> wrote:
>
>> On 02/28/2012 07:46 PM, Martin Nowak wrote:
>>>
>>> https://gist.github.com/1255439 - lexer generator
>>> https://gist.github.com/1262321 - complete and fast D lexer
>>>
>>
>> Well, it is slower at lexing than DMD at parsing. What is the bottleneck?
>
> No, it's as fast as dmd's lexer.
>
> Writing the tokens to stdout takes a lot of time though.
> Just disable the "writeln(tok);" in the main loop.

I did that.

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Dmitry Olshansky
in reply to Andrei Alexandrescu

Dmitry Olshansky

Posted in reply to Andrei Alexandrescu

On 29.02.2012 20:47, Andrei Alexandrescu wrote:
> On 2/29/12 3:48 AM, Dmitry Olshansky wrote:
>> Someway of stacking grammars is bonus points for the project. As Andrei
>> started with lexer+parser, I see no problem with it being
>> lexer+parser(+parser)*.
>
> What does stacking grammars mean?
>

In general it should be read as using one parser as lexer for another parser. But there is some nice ways to integrate another parser within recursive descent parser.
Let's assume simple recursive descent parser that uses operator precedence grammar. Suppose we have the main grammar:
Declaration = Type Identifier;
Statement = 'if' '(' Expr ')' | Expr
Type = int | float | ...

Where Expr is connected to the second grammar expressed as ...
Ok, let's sketch it, skipping semantic actions:
operatorGrammar!q{
Expr
Unit: Constant | Identifier //assume we have lexer
Operators:
	Not, !, prefix, 4
	UnMinus, -, prefix, 4
	Mul, *, binary, 3
	Mul, /, binary, 3
	Plus, +, binary, 2
	BiMinus, -, binary, 2
	Index, [], brace-like postfix
	Braces, (), brace-like
};

Effectively operator grammar for expressions parses "token" Expr. It's easy in recursive descent because it can just try to call this parser, then try the other alternative if it fails.
I'm not sure this stacking does add that much, but it's something to keep in mind.

> Anyhow, one thing that would become important is tree walkers - e.g. you
> have the tree and you define code for evaluating it etc.
>

Yes, it's a must have if AST is generated generically via annotations.

>> Such a framework relying on mixins is bound to produce awful error
>> messages at compile-time, unless explicitly stuffed up to watterline
>> with some kind of static assert(__traits(compiles,...),"Error x + info");
>
> We need to figure out ways to make that simpler.
>
>
> Andrei

-- 
Dmitry Olshansky

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Martin Nowak
in reply to Timon Gehr

Martin Nowak

Posted in reply to Timon Gehr

On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr@gmx.ch> wrote:

> On 02/29/2012 07:28 PM, Martin Nowak wrote:
>> On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr@gmx.ch> wrote:
>>
>>> On 02/28/2012 07:46 PM, Martin Nowak wrote:
>>>>
>>>> https://gist.github.com/1255439 - lexer generator
>>>> https://gist.github.com/1262321 - complete and fast D lexer
>>>>
>>>
>>> Well, it is slower at lexing than DMD at parsing. What is the bottleneck?
>>
>> No, it's as fast as dmd's lexer.
>>
>> Writing the tokens to stdout takes a lot of time though.
>> Just disable the "writeln(tok);" in the main loop.
>
> I did that.

Interesting, I've commented it out https://gist.github.com/1262321#L1559 and get the following.

<<<
PHOBOS=~/Code/D/DPL/phobos
mkdir test_lexer
cd test_lexer
curl https://raw.github.com/gist/1255439/lexer.d > lexer.d
curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d
curl https://raw.github.com/gist/1262321/entity.d > entity.d
dmd -O -release -inline dlexer lexer entity
wc -l ${PHOBOS}/std/*.d
time ./dlexer ${PHOBOS}/std/*.d
>>>
./dlexer ${PHOBOS}/std/*.d  0.21s user 0.00s system 99% cpu 0.211 total

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Timon Gehr
in reply to Martin Nowak

Timon Gehr

Posted in reply to Martin Nowak

On 02/29/2012 09:03 PM, Martin Nowak wrote:
> On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr@gmx.ch> wrote:
>
>> On 02/29/2012 07:28 PM, Martin Nowak wrote:
>>> On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr@gmx.ch>
>>> wrote:
>>>
>>>> On 02/28/2012 07:46 PM, Martin Nowak wrote:
>>>>>
>>>>> https://gist.github.com/1255439 - lexer generator
>>>>> https://gist.github.com/1262321 - complete and fast D lexer
>>>>>
>>>>
>>>> Well, it is slower at lexing than DMD at parsing. What is the
>>>> bottleneck?
>>>
>>> No, it's as fast as dmd's lexer.
>>>
>>> Writing the tokens to stdout takes a lot of time though.
>>> Just disable the "writeln(tok);" in the main loop.
>>
>> I did that.
>
> Interesting, I've commented it out https://gist.github.com/1262321#L1559
> and get the following.
>
> <<<
> PHOBOS=~/Code/D/DPL/phobos
> mkdir test_lexer
> cd test_lexer
> curl https://raw.github.com/gist/1255439/lexer.d > lexer.d
> curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d
> curl https://raw.github.com/gist/1262321/entity.d > entity.d
> dmd -O -release -inline dlexer lexer entity
> wc -l ${PHOBOS}/std/*.d
> time ./dlexer ${PHOBOS}/std/*.d
>>>>
> ./dlexer ${PHOBOS}/std/*.d 0.21s user 0.00s system 99% cpu 0.211 total

I get 0.160s for lexing using your lexer.
Parsing the same file with DMDs parser takes 0.155 seconds. The difference grows with larger files.

February 29, 2012

Re: Lexer and parser generators using CTFE

Posted by Christopher Bergqvist
in reply to Andrei Alexandrescu

Christopher Bergqvist

Posted in reply to Andrei Alexandrescu

On Wednesday, 29 February 2012 at 16:41:22 UTC, Andrei Alexandrescu wrote:
> On 2/28/12 7:16 PM, Christopher Bergqvist wrote:
>> What am I failing to pick up on?
>
> Barrier of entry and granularity of approach, I think.
>
> Currently if one wants to parse some simple grammar, there are options such as (a) do it by hand, (b) use boost::spirit, or (c) use lex/yacc.
>
> Parsing by hand has the obvious disadvantages. Using boost::spirit has a steep learning curve and tends to create very contorted grammar representations, full of representation noise, and scales very poorly. Using lex/yacc is hamfisted - there's an additional build step, generated files to deal with, and the related logistics, which make lex/yacc a viable choice only for "big" grammars.
>
> An efficient, integrated parser generator would lower the barrier of entry dramatically - if we play our cards right, even a sprintf specifier string could be parsed simpler and faster using an embedded grammar, instead of painfully writing the recognizer by hand. Parsing config files, XML, JSON, CSV, various custom file formats and many others - all would all be a few lines away. Ideally a user who has a basic understanding of grammars should have an easier time using a small grammar to parse simple custom formats, than writing the parsing code by hand.
>
>
> Andrei

Thanks for your response.  The lowered barrier of entry in parsing something like a customized JSON format or config files is nice, and something I could see myself use.  I'm still skeptical about the level of "killer-featureness" but I would be glad to be proven wrong.

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation