View mode: basic / threaded / horizontal-split · Log in · Help
February 29, 2012
Re: Lexer and parser generators using CTFE
On Wed, Feb 29, 2012 at 07:28:48PM +0100, Martin Nowak wrote:
> On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr@gmx.ch> wrote:
> 
> >On 02/28/2012 07:46 PM, Martin Nowak wrote:
> >>
> >>https://gist.github.com/1255439 - lexer generator
> >>https://gist.github.com/1262321 - complete and fast D lexer
> >>
> >
> >Well, it is slower at lexing than DMD at parsing. What is the bottleneck?
> 
> No, it's as fast as dmd's lexer.
> 
> Writing the tokens to stdout takes a lot of time though.
> Just disable the "writeln(tok);" in the main loop.
> There is also an --echo option which recreates the complete
> source from the tokens.

This sounds like something I ran into in my D lexer project. Lexing
Phobos took upwards of 10-15 seconds, which is extremely slow
considering that dmd can *compile* it in just a few seconds. So I did
some profiling, and found out that most of the time was spent formatting
tokens into strings and running writeln. After I commented that out, the
running time dropped to just a few seconds. :-)


T

-- 
The fact that anyone still uses AOL shows that even the presence of
options doesn't stop some people from picking the pessimal one. - Mike
Ellis
February 29, 2012
Re: Lexer and parser generators using CTFE
On 29.02.2012 20:31, Andrei Alexandrescu wrote:
>
>> Recalling EBNF parser idea that I run with before finally being dragged
>> down by real life. Roughly I thought to follow hybrid LL(*) aproach,
>> while I had a solid plan on doing DFA for lexer and parser lookahead,
>> the other things were more or less floating.
>
> Well, maybe you could integrate that within your up-and-coming research.
> Grammars have been considered a solved problem every five years since
> 1970, and there's always new work coming :o).
>

It might involve parsing and grammars, unless academia folk seduce me 
with their ideas of optimization for highly parallel architectures :).

>> There is prototype of interactive regex matcher that
>> works directly on stream (buried in std.regex), it even passed dry-run
>> unittests back then. Though I had to postpone till I/O is sorted out. I
>> really loved Steven's design with it's easy access to buffer and well
>> thought out primitives, hope it will come about sometime soon.
>
> An interactive regex would be a dream come true to me...
>
>

Where you've been? ;) I mean during GSOC we've spent days with Fawzi 
talking about shoehorning regex matcher to work on buffered streams. He 
actually did that prototype, I was mostly reviewing/fixing the source to 
make sure it fits. Recalling the inner works it was even expected to 
optionally do NFC normalization on the fly (wow, now that was ambitious) 
and all that without copying stuff around 99% of the time.


-- 
Dmitry Olshansky
February 29, 2012
Re: Lexer and parser generators using CTFE
>
>
> This sounds like something I ran into in my D lexer project. Lexing
> Phobos took upwards of 10-15 seconds, which is extremely slow
> considering that dmd can *compile* it in just a few seconds. So I did
> some profiling, and found out that most of the time was spent formatting
> tokens into strings and running writeln. After I commented that out, the
> running time dropped to just a few seconds. :-)
>
>
And how different was just a few seconds from 10-15 seconds? :-)
February 29, 2012
Re: Lexer and parser generators using CTFE
On Thu, Mar 01, 2012 at 12:19:30AM +0530, d coder wrote:
> >
> >
> > This sounds like something I ran into in my D lexer project. Lexing
> > Phobos took upwards of 10-15 seconds, which is extremely slow
> > considering that dmd can *compile* it in just a few seconds. So I did
> > some profiling, and found out that most of the time was spent formatting
> > tokens into strings and running writeln. After I commented that out, the
> > running time dropped to just a few seconds. :-)
> >
> >
> And how different was just a few seconds from 10-15 seconds? :-)

OK so I was imprecise. It was 2-3 seconds compared to 10-15 seconds. So
it's about an order of magnitude difference. :)


T

-- 
"I speak better English than this villain Bush" -- Mohammed Saeed al-Sahaf, Iraqi Minister of Information
February 29, 2012
Re: Lexer and parser generators using CTFE
"d coder" <dlang.coder@gmail.com> wrote in message 
news:mailman.241.1330541814.24984.digitalmars-d@puremagic.com...
> >
>>
>> This sounds like something I ran into in my D lexer project. Lexing
>> Phobos took upwards of 10-15 seconds, which is extremely slow
>> considering that dmd can *compile* it in just a few seconds. So I did
>> some profiling, and found out that most of the time was spent formatting
>> tokens into strings and running writeln. After I commented that out, the
>> running time dropped to just a few seconds. :-)
>>
>>
> And how different was just a few seconds from 10-15 seconds? :-)
>

~10 seconds

;)
February 29, 2012
Re: Lexer and parser generators using CTFE
On 02/29/2012 07:28 PM, Martin Nowak wrote:
> On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr@gmx.ch> wrote:
>
>> On 02/28/2012 07:46 PM, Martin Nowak wrote:
>>>
>>> https://gist.github.com/1255439 - lexer generator
>>> https://gist.github.com/1262321 - complete and fast D lexer
>>>
>>
>> Well, it is slower at lexing than DMD at parsing. What is the bottleneck?
>
> No, it's as fast as dmd's lexer.
>
> Writing the tokens to stdout takes a lot of time though.
> Just disable the "writeln(tok);" in the main loop.

I did that.
February 29, 2012
Re: Lexer and parser generators using CTFE
On 29.02.2012 20:47, Andrei Alexandrescu wrote:
> On 2/29/12 3:48 AM, Dmitry Olshansky wrote:
>> Someway of stacking grammars is bonus points for the project. As Andrei
>> started with lexer+parser, I see no problem with it being
>> lexer+parser(+parser)*.
>
> What does stacking grammars mean?
>

In general it should be read as using one parser as lexer for another 
parser. But there is some nice ways to integrate another parser within 
recursive descent parser.
Let's assume simple recursive descent parser that uses operator 
precedence grammar. Suppose we have the main grammar:
Declaration = Type Identifier;
Statement = 'if' '(' Expr ')' | Expr
Type = int | float | ...

Where Expr is connected to the second grammar expressed as ...
Ok, let's sketch it, skipping semantic actions:
operatorGrammar!q{
Expr
Unit: Constant | Identifier //assume we have lexer
Operators:
	Not, !, prefix, 4
	UnMinus, -, prefix, 4
	Mul, *, binary, 3
	Mul, /, binary, 3
	Plus, +, binary, 2
	BiMinus, -, binary, 2
	Index, [], brace-like postfix
	Braces, (), brace-like
};

Effectively operator grammar for expressions parses "token" Expr. It's 
easy in recursive descent because it can just try to call this parser, 
then try the other alternative if it fails.
I'm not sure this stacking does add that much, but it's something to 
keep in mind.


> Anyhow, one thing that would become important is tree walkers - e.g. you
> have the tree and you define code for evaluating it etc.
>

Yes, it's a must have if AST is generated generically via annotations.

>> Such a framework relying on mixins is bound to produce awful error
>> messages at compile-time, unless explicitly stuffed up to watterline
>> with some kind of static assert(__traits(compiles,...),"Error x + info");
>
> We need to figure out ways to make that simpler.
>
>
> Andrei


-- 
Dmitry Olshansky
February 29, 2012
Re: Lexer and parser generators using CTFE
On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr@gmx.ch> wrote:

> On 02/29/2012 07:28 PM, Martin Nowak wrote:
>> On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr@gmx.ch>  
>> wrote:
>>
>>> On 02/28/2012 07:46 PM, Martin Nowak wrote:
>>>>
>>>> https://gist.github.com/1255439 - lexer generator
>>>> https://gist.github.com/1262321 - complete and fast D lexer
>>>>
>>>
>>> Well, it is slower at lexing than DMD at parsing. What is the  
>>> bottleneck?
>>
>> No, it's as fast as dmd's lexer.
>>
>> Writing the tokens to stdout takes a lot of time though.
>> Just disable the "writeln(tok);" in the main loop.
>
> I did that.

Interesting, I've commented it out https://gist.github.com/1262321#L1559  
and get the following.

<<<
PHOBOS=~/Code/D/DPL/phobos
mkdir test_lexer
cd test_lexer
curl https://raw.github.com/gist/1255439/lexer.d > lexer.d
curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d
curl https://raw.github.com/gist/1262321/entity.d > entity.d
dmd -O -release -inline dlexer lexer entity
wc -l ${PHOBOS}/std/*.d
time ./dlexer ${PHOBOS}/std/*.d
>>>
./dlexer ${PHOBOS}/std/*.d  0.21s user 0.00s system 99% cpu 0.211 total
February 29, 2012
Re: Lexer and parser generators using CTFE
On 02/29/2012 09:03 PM, Martin Nowak wrote:
> On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr@gmx.ch> wrote:
>
>> On 02/29/2012 07:28 PM, Martin Nowak wrote:
>>> On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr@gmx.ch>
>>> wrote:
>>>
>>>> On 02/28/2012 07:46 PM, Martin Nowak wrote:
>>>>>
>>>>> https://gist.github.com/1255439 - lexer generator
>>>>> https://gist.github.com/1262321 - complete and fast D lexer
>>>>>
>>>>
>>>> Well, it is slower at lexing than DMD at parsing. What is the
>>>> bottleneck?
>>>
>>> No, it's as fast as dmd's lexer.
>>>
>>> Writing the tokens to stdout takes a lot of time though.
>>> Just disable the "writeln(tok);" in the main loop.
>>
>> I did that.
>
> Interesting, I've commented it out https://gist.github.com/1262321#L1559
> and get the following.
>
> <<<
> PHOBOS=~/Code/D/DPL/phobos
> mkdir test_lexer
> cd test_lexer
> curl https://raw.github.com/gist/1255439/lexer.d > lexer.d
> curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d
> curl https://raw.github.com/gist/1262321/entity.d > entity.d
> dmd -O -release -inline dlexer lexer entity
> wc -l ${PHOBOS}/std/*.d
> time ./dlexer ${PHOBOS}/std/*.d
>>>>
> ./dlexer ${PHOBOS}/std/*.d 0.21s user 0.00s system 99% cpu 0.211 total

I get 0.160s for lexing using your lexer.
Parsing the same file with DMDs parser takes 0.155 seconds. The 
difference grows with larger files.
February 29, 2012
Re: Lexer and parser generators using CTFE
On Wednesday, 29 February 2012 at 16:41:22 UTC, Andrei 
Alexandrescu wrote:
> On 2/28/12 7:16 PM, Christopher Bergqvist wrote:
>> What am I failing to pick up on?
>
> Barrier of entry and granularity of approach, I think.
>
> Currently if one wants to parse some simple grammar, there are 
> options such as (a) do it by hand, (b) use boost::spirit, or 
> (c) use lex/yacc.
>
> Parsing by hand has the obvious disadvantages. Using 
> boost::spirit has a steep learning curve and tends to create 
> very contorted grammar representations, full of representation 
> noise, and scales very poorly. Using lex/yacc is hamfisted - 
> there's an additional build step, generated files to deal with, 
> and the related logistics, which make lex/yacc a viable choice 
> only for "big" grammars.
>
> An efficient, integrated parser generator would lower the 
> barrier of entry dramatically - if we play our cards right, 
> even a sprintf specifier string could be parsed simpler and 
> faster using an embedded grammar, instead of painfully writing 
> the recognizer by hand. Parsing config files, XML, JSON, CSV, 
> various custom file formats and many others - all would all be 
> a few lines away. Ideally a user who has a basic understanding 
> of grammars should have an easier time using a small grammar to 
> parse simple custom formats, than writing the parsing code by 
> hand.
>
>
> Andrei

Thanks for your response.  The lowered barrier of entry in 
parsing something like a customized JSON format or config files 
is nice, and something I could see myself use.  I'm still 
skeptical about the level of "killer-featureness" but I would be 
glad to be proven wrong.
2 3 4 5 6 7 8 9
Top | Discussion index | About this forum | D home