January 10, 2008
> Dan Wrote:
>
>>
>> I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry.
>>
>> At the moment, I'm trying recursive descent parsing.
>

Have you looked at ANTLR? I have used it for two projects and I am quite impressed. It does not have a D backend, nor a D grammar AFAIK, but if you are writing a parser, you might find a lot of the work has been done for you already. If you go for ANTLR, I strongly recommend the book advertised on the antlr.org page; I liked it very much.

Good luck.
January 10, 2008
Ap Lee wrote:
> Have you looked at ANTLR? I have used it for two projects and I am quite impressed. It does not have a D backend, nor a D grammar AFAIK, but if you are writing a parser, you might find a lot of the work has been done for you already. If you go for antlr, I strongly recommend the book advertized on the antlr.org page, I liked it very much.
> 
> Good luck.

I haven't tested it, but there is an ANTLR D backend:
http://www.mbutscher.de/antlrd/
January 10, 2008
I tried antlrd on Debian. While I could get ANTLR itself to install (via apt-get) and download antlrd, all the examples resulted in an error message:
#runantlr /tmp/SimpleC.g
/tmp/SimpleC.g:3:1: unexpected token: grammar

That message did not turn up any Google results hinting at potential reasons.

Regards
Alan


Frank Benoit wrote:
> Ap Lee wrote:
>> Have you looked at ANTLR? I have used it for two projects and I am quite impressed. It does not have a D backend, nor a D grammar AFAIK, but if you are writing a parser, you might find a lot of the work has been done for you already. If you go for antlr, I strongly recommend the book advertized on the antlr.org page, I liked it very much.
>>
>> Good luck.
> 
> I haven't tested it, but there is a Antlr D backend:
> http://www.mbutscher.de/antlrd/
January 10, 2008
Dan wrote:
>> Yeah, i became aware of that through the feedback. Without thinking too much, i assumed that using parsers would be something you'd do only if you dealt with parsers intimately. 
> 
> I took linguistics, and I'm an interpreter writer, and I still haven't looked at BNF or YACC/BISON notation yet.  To be honest, I'm not interested in it.  Like MathML, it's way too far from the machine to generate an *efficient* parser.  Mine might not be efficient, but that wouldn't be intrinsic to it being low-level.

That depends on the grammar being parsed. Simple grammars can often be parsed faster with hand-optimized parsers; the more complex the grammar, the less impact the generated parser's overhead has.
Apaged was tailored for parsing D, and it's very fast at that. Last time I checked, it parsed the complete Tango package in less than 2 seconds (including disk I/O).


>> A Terminal is a symbol that is "final" wrt. expansion, while a Non-Terminal can be expanded by some rules. Terminals, Tokens and Lexemes are *practically* more or less the same. 
> 
> Linguistics assigns them very different meanings.

Formal language theory does too, but for practical use the difference doesn't matter much.

>> If you think of parse trees when dealing with your grammar
> 
> Are those like sentence trees, with noun phrase, verb phrase, etc?

Probably. I don't know linguistics. Example:
S -> A
A -> aBa
B -> bCb
C -> c
C -> A

parsing ababcbaba yields
        S
        |
  _____ A _____
 /      |      \
a   ___ B ___   a
   /    |    \
  b     C     b
        |
    ___ A ___
   /    |    \
  a     B     a
       /|\
      b C b
        |
        c
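A hand-written recursive descent parser for this toy grammar is only a few lines. Sketched in Python for brevity (the thread is about D, but the structure is identical); the tuple-based tree representation is my own choice, not anything from the thread:

```python
# Recursive descent parser for the toy grammar above:
#   S -> A,  A -> aBa,  B -> bCb,  C -> c | A
# Each nonterminal becomes one function; the parse tree is nested tuples.

def parse(text):
    pos = 0

    def expect(ch):
        nonlocal pos
        if pos >= len(text) or text[pos] != ch:
            raise SyntaxError(f"expected {ch!r} at position {pos}")
        pos += 1
        return ch

    def S():
        return ("S", A())

    def A():
        return ("A", expect("a"), B(), expect("a"))

    def B():
        return ("B", expect("b"), C(), expect("b"))

    def C():
        # One token of lookahead decides which alternative to take.
        if pos < len(text) and text[pos] == "c":
            return ("C", expect("c"))
        return ("C", A())

    tree = S()
    if pos != len(text):
        raise SyntaxError("trailing input")
    return tree

print(parse("ababcbaba"))
```

Running it on "ababcbaba" yields exactly the tree drawn above, with the intermediate A node under C.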


>>> - How to handle classic situations
> 
> When I first started with Walnut 0.x, I was grinding my brain trying to figure out how to match these correctly:
> 
> {  { /* } */ }  , { } }

It's a matter of what else is allowed in { }. Besides, /* */ comments are usually handled as whitespace lexemes, which solves the problem before parsing even starts.
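That point is easy to demonstrate: once a (hypothetical) lexer replaces /* ... */ comments with whitespace, the stray '}' inside the comment disappears and the braces in the quoted example balance. A minimal Python sketch:

```python
import re

# Treat /* ... */ comments as whitespace before the parser ever sees them.
# Non-greedy .*? with re.S handles multi-line comments too.
COMMENT = re.compile(r"/\*.*?\*/", re.S)

def significant_chars(src):
    return COMMENT.sub(" ", src)

src = "{  { /* } */ }  , { } }"
stripped = significant_chars(src)
# The '}' hidden in the comment is gone: 3 opening and 3 closing braces remain.
print(stripped)
```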

> Another classical problem is JavaScript RegExp literals or divide:
> 
> /bob/i  can be "divide bob divide i", or a regexp, depending on whether we expect an operator or operand.
> 
> How would you write that?
> How would the machine read that?

The whole problem of parsing such a thing doesn't arise until it is embedded into the grammar; it therefore depends on what else interferes with this syntax. I don't know exactly what's allowed in JavaScript, but you can probably distinguish these expressions by the leading '/' - in an arithmetic expression that isn't allowed, so the parser tries to match a regexp expression. Very simplified:
Expr -> ReEx | ArEx
ArEx -> ArEx '/' ArEx | Identifier | NumericLiteral
ReEx -> '/' StringLiteral '/' OptParameters

Since neither Identifier nor NumericLiteral may start with '/' (i.e. '/' is not in the first-set of ArEx), the grammar is unambiguous.
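The same disambiguation can also be done in the lexer by checking the previous significant token: after an operand, '/' is division; otherwise an operand is expected, so '/' opens a regex literal. A Python sketch (the token names and regexes are illustrative, not a complete JavaScript lexer):

```python
import re

# Illustrative token patterns; whitespace is matched but not emitted.
TOKEN = re.compile(r"\s+|(?P<num>\d+)|(?P<id>[A-Za-z_]\w*)|(?P<op>[+\-*/()=])")
REGEX = re.compile(r"/(?:\\.|[^/\\])+/[A-Za-z]*")

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        if src[pos] == "/":
            # After an operand (identifier, number, ')'), '/' is division;
            # otherwise we expect an operand, so '/' starts a regex literal.
            prev = tokens[-1] if tokens else None
            operand_before = prev is not None and (
                prev[0] in ("id", "num") or prev == ("op", ")")
            )
            if not operand_before:
                m = REGEX.match(src, pos)
                if m:
                    tokens.append(("regex", m.group()))
                    pos = m.end()
                    continue
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"bad input at {pos}")
        pos = m.end()
        if m.lastgroup:  # whitespace has no named group -> skipped
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("x = /bob/i"))   # '/' after '=' -> one regex token
print(tokenize("a / b / c"))    # '/' after an identifier -> division
```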
January 12, 2008
Jascha Wetzel Wrote:

> Dan wrote:
>>Like MathML, it's way too far from the machine to generate an *efficient* parser.
> That depends on the grammar that is being parsed. Simple grammars can often be parsed faster with hand-optimized parsers. The more complex the grammar is, the less impact the generated parsers' overhead has.

Ah, but there's always overhead.

> Apaged was tailored for parsing D, and it's very fast for that. Last time i checked, it parsed the complete Tango package in less than 2 seconds (including disk io).

Eep.  That would instantly make any JavaScript interpreter a failure; scripts need to *run* in the < 500ms (unnoticeable) range to even be considered.

> it's a matter of what else is allowed in { }. besides, usually /* */ comments are handled as whitespace lexemes, solving the problem before parsing.

Aha!  Well then, the way I wrote my scanner/parser, a whole tree is built before parsing.  It's not fully functional yet, but I'm not seeing any design failures.

> 
> > Another classical problem is JavaScript RegExp literals or divide:
> > 
> > /bob/i  can be "divide bob divide i", or a regexp, depending on whether we expect an operator or operand.
> > 
> > How would you write that?
> > How would the machine read that?
> 
> the whole problem of parsing such a thing doesn't arise until embedded
> into the grammar. it therefore depends on what else interferes with this
> syntax. i don't know exactly what's allowed in JavaScript, but you can
> probably distinguish these expressions by the leading / - in a
> arithmetic expression that isn't allowed, therefore the parser tries to
> match a regexp expression. Very simplified:
> Expr -> ReEx | ArEx
> ArEx -> ArEx '/' ArEx | Identifier | NumericLiteral
> ReEx -> '/' StringLiteral '/' OptParameters
> 
> Since neither Identifier nor NumericLiteral may start with '/' (i.e. '/' is not in the first-set of ArEx), the grammar unambiguous.

So you mean to say, it does it by looking at the tokens before and after and inferring the value based on whether an operator or operand is expected?

That makes sense.  I did it similar to that.

January 12, 2008
Dan wrote:
>> That depends on the grammar that is being parsed. Simple grammars can often be parsed faster with hand-optimized parsers. The more complex the grammar is, the less impact the generated parsers' overhead has.
> 
> Ah, but there's always overhead.

What I meant was that this overhead has a purpose: once your grammar uses it, it's not overhead anymore.
Automatically generated parsers usually employ bottom-up parsing algorithms (LALR, GLR), which have several advantages over the top-down algorithms (LL(k), usually implemented as recursive descent) used in manually written parsers.
LALR, for example, is completely iterative (i.e. no stack action, no calls, less code -> fits completely in cache), making it more efficient than recursive descent. GLR is almost completely iterative, depending on the implementation.
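To illustrate the "completely iterative" point, here is the driver loop of a table-driven LR parser for the tiny grammar E -> E '+' n | n, sketched in Python. The table is hand-built for illustration; a real LALR table comes from a generator, the state numbers are mine, and a real parser would build tree nodes on each reduce:

```python
# Hand-built SLR-style table for:  E -> E '+' n  |  n
# ACTION maps (state, lookahead) -> shift/reduce/accept;
# GOTO maps (state, nonterminal) -> state after a reduce.
ACTION = {
    (0, "n"): ("shift", 1),
    (1, "+"): ("reduce", "E", 1),   # E -> n
    (1, "$"): ("reduce", "E", 1),
    (2, "+"): ("shift", 3),
    (2, "$"): ("accept", None),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", "E", 3),   # E -> E + n
    (4, "$"): ("reduce", "E", 3),
}
GOTO = {(0, "E"): 2}

def parse(tokens):
    stack = [0]                     # explicit state stack; no recursion at all
    toks = list(tokens) + ["$"]
    i = 0
    while True:                     # the entire parser is this one loop
        act = ACTION.get((stack[-1], toks[i]))
        if act is None:
            raise SyntaxError(f"unexpected {toks[i]!r}")
        if act[0] == "shift":
            stack.append(act[1])
            i += 1
        elif act[0] == "reduce":
            _, lhs, rhs_len = act
            del stack[-rhs_len:]    # pop one state per right-hand-side symbol
            stack.append(GOTO[(stack[-1], lhs)])
        else:                       # accept
            return True

print(parse(["n", "+", "n", "+", "n"]))
```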

If an automatically generated parser is slower, it is usually due to features that make it more flexible with respect to grammars. Therefore, as stated initially, once you need this flexibility, the parser performs nicely.

Another thing is backtracking. RD can use lookahead, but it is more work to implement manually, and top-down lookahead generally isn't as effective as bottom-up lookahead. That is, it takes a more complex grammar to create the need for backtracking in a bottom-up parser than in a top-down parser. This is where bottom-up parsing is categorically faster than top-down.

All this makes me claim that it'll be hard to write, for example, a recursive descent D parser by hand that performs better than its automatically generated GLR counterpart. You might still succeed by a small margin, but you will have spent a *lot* more time to do so. And you will have a parser that is nowhere near as easily maintainable as the grammar used for the automatically generated parser.

>> Apaged was tailored for parsing D, and it's very fast for that. Last time i checked, it parsed the complete Tango package in less than 2 seconds (including disk io).
> 
> Eep.  That would instantly make any JavaScript interpreter a failure; scripts need to *run* in the < 500ms (unnoticeable) range to even be considered.

The Tango package has >400 files with multiple MB of code. I doubt there is any script or library written in JavaScript that comes even close to that size. Parsing a single file of <1000 LoC takes <1ms with apaged or any other LALR/GLR parser - no need to worry ;).

> So you mean to say, it does it by looking at the tokens before and after and inferring the value based on whether an operator or operand is expected?
> 
> That makes sense.  I did it similar to that.

In this case it only needs to look at the next token. What you did to solve such a case is usually fine as long as you don't need to backtrack; if you do, you should double-check that there is no other way.