Let's stop parser Hell (page 7)

On Saturday, 7 July 2012 at 10:05:30 UTC, Dmitry Olshansky wrote: > I can tell you that they are not slower then one another in principle. Quality of implementations trumps every theoretical aspect here. LALR is usually fast even if implemented by book but they are hard to optimize futher and quite restrictive on "semantic extensions". > > Proper CFG parsers all are liner-time anyway. Exactly, that is also my point. I think that Pegged can be heavily optimized in performance, and there are no fundamental issues which would make it inherently slower than LALR or a hand-written D-specific parser.

On 2012-07-07 03:12, Jonathan M Davis wrote: > Now, the issue of a "strong, dependable formalization of D's syntax" is > another thing entirely. Porting dmd's lexer and parser to Phobos would keep > the Phobos implementation in line with dmd much more easily and avoid > inconsistencies in the language definition and the like. However, if we write a > new lexer and parser specifically for Phobos which _doesn't_ port the lexer or > parser from dmd, then that _would_ help drive making the spec match the > compiler (or vice versa). So, I agree that could be a definite argument for > writing a lexer and parser from scratch rather than porting the one from dmd, > but I don't buy the bit about it smothering parser generators at all. I think > that the use cases are completely different. I think the whole point of having a compiler as a library is that the compiler should use the library as well. Otherwise the two will get out of sync. Just look at Clang, LLVM, LLDB and Xcode, they took the correct approach. Clang and LLVM (and I think LLDB) are available as libraries. Then the compiler, debugger (lldb) and IDE uses these libraries as part of their implementation. They don't have their own implementation that is similar to the libraries, making it "easy" to stay in sync. They _use_ the libraries as libraries. This is what DMD and Phobos should do as well. If it's too complicated to port the lexer/parser to D then it would be better, at least as a first step, to modify DMD as needed. Create a C API for DMD and then create D bindings to be put into Phobos. -- /Jacob Carlborg

On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote: > I think that Pegged can be heavily optimized in performance, and there are no > fundamental issues which would make it inherently slower than LALR or a hand-written D-specific parser. Hmm... found an interesting article: http://www.antlr.org/papers/LL-star-PLDI11.pdf It describes some disadvantages of Packrat parsing, like problems with debugging and error recovery. These are important for DCT, so I'll have to perform additional research.

On 07-Jul-12 15:30, Roman D. Boiko wrote: > On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote: >> I think that Pegged can be heavily optimized in performance, and there >> are no >> fundamental issues which would make it inherently slower than LALR or >> a hand-written D-specific parser. > > Hmm... found an interesting article: > http://www.antlr.org/papers/LL-star-PLDI11.pdf > > It describes some disadvantages of Packrat parsing, like problems with > debugging and error recovery. These are important for DCT, so I'll have > to perform additional research. Yup, LL(*) is my favorite so far. As for debugging and error recovery they are always a problem. -- Dmitry Olshansky

On Saturday, 7 July 2012 at 11:33:18 UTC, Dmitry Olshansky wrote: > Yup, LL(*) is my favorite so far. As for debugging and error recovery they are always a problem. It's interesting, that I also wanted to use DFA for non-terminals (similarly to LL(*)), but was considering their usage as pure performance optimization. Concerns raised in the article seem to be very reasonable, I will likely reconsider my previous plans... This forum is an endless source of high-quality feed-back for me.

> I can tell you that they are not slower then one another in principle. Quality of implementations trumps every theoretical aspect here. LALR is usually fast even if implemented by book but they are hard to optimize futher and quite restrictive on "semantic extensions". > > Proper CFG parsers all are liner-time anyway. To be picky here: The languages that can be parsed in linear time are a strict subset of CFGs. However I do agree, that error handling and flexibility are more important than raw speed. I don't want to get a concrete syntax tree full of unneeded productions, like the ones you get when you encode precedence rules in your grammar.

On Saturday, 7 July 2012 at 14:17:54 UTC, Tobias Pankrath wrote: >> Proper CFG parsers all are liner-time anyway. > > To be picky here: > The languages that can be parsed in linear time are a strict subset of CFGs. Also any PEG defines a language with linear-time parsing. Some non-context-free languages can be described with PEG.

On 07/05/2012 08:28 AM, Roman D. Boiko wrote: > Pegged may eventually become standard, if it will be performanceoptimized and a bit more customizable Interesting, I thought you were hand-writing this stuff. I'm a fan of pegged and made some pull requests, one of which was aimed at making it more customizable by allowing the user to define what type gets used as a parse tree node, thus allowing one to potentially use their parse tree as an AST (and maybe do a couple other things too). It's a WIP, but the proof of concept is done: DMD can handle the extra templating at compile time, so it works. What kind of things did you want in terms of customizability?

July 07, 2012

Re: Let's stop parser Hell

Posted by Roman D. Boiko
in reply to Chad J

Permalink

Roman D. Boiko

Posted in reply to Chad J

Permalink

On Saturday, 7 July 2012 at 15:42:30 UTC, Chad J wrote:
> On 07/05/2012 08:28 AM, Roman D. Boiko wrote:
>> Pegged may eventually become standard, if it will be performanceoptimized and a bit more customizable
>
> Interesting, I thought you were hand-writing this stuff.
>
> I'm a fan of pegged and made some pull requests, one of which was aimed at making it more customizable by allowing the user to define what type gets used as a parse tree node, thus allowing one to potentially use their parse tree as an AST (and maybe do a couple other things too). It's a WIP, but the proof of concept is done: DMD can handle the extra templating at compile time, so it works.
>
> What kind of things did you want in terms of customizability?

There are many possible customization point which would make it a better fit for my project while also being useful in general.

The most critical was the one you implemented: ability to define a custom parse tree. I also needed the ability to use immutable structures (almost) everywhere, while currently they must be mutable. Also it would be great to avoid UTF conversions (instead use functions and types templated by the string type). I also wanted to add ability to reuse parts of previously generated AST which correspond to source code that has not been changed, or to identical source code parsed previously. (This would allow incremental parsing, e.g., while user is typing.) The latter would require to track source code positions separately, or even generating them on demand (this is the feature actively criticized by deadalnix in some previous discussions but central to DCT architecture). AST would only hold information about widths of its nodes.

I've also written some notes (10-15 suggestions) while studying Pegged code, which will be shared later.

However, as I found a few hours ago, Packrat parsing (typically used to handle PEG) has serious disadvantages: it complicates debugging because of frequent backtracking, it has problems with error recovery, and typically disallows to add actions with side effects (because of possibility of backtracking). These are important enough to reconsider my plans of using Pegged. I will try to analyze whether the issues are so fundamental that I (or somebody else) will have to create an ANTLR-like parser instead, or whether it is possible to introduce changes into Pegged that would fix these problems.

On Saturday, 7 July 2012 at 15:39:52 UTC, Roman D. Boiko wrote: > On Saturday, 7 July 2012 at 14:17:54 UTC, Tobias Pankrath wrote: >>> Proper CFG parsers all are liner-time anyway. >> >> To be picky here: >> The languages that can be parsed in linear time are a strict subset of CFGs. > Also any PEG defines a language with linear-time parsing. Some non-context-free languages can be described with PEG. Interesting, I thought that PEG ⊂ CFG holds.

Forums