July 08, 2012
On Sunday, 8 July 2012 at 21:22:39 UTC, Roman D. Boiko wrote:
> On Sunday, 8 July 2012 at 21:03:41 UTC, Jonathan M Davis wrote:
>> It's been too long since I was actively working on parsers to give any details, but it is my understanding that because a hand-written parser is optimized for a specific grammar, it's going to be faster.
>
> My aim is to identify any potential bottlenecks and ensure that they can be eliminated. So, let's try.
>
> I believe it would not hurt the generality or quality of a parser generator if it contained seams (hooks) for inserting custom (optimized) code where necessary, including code that takes advantage of particular aspects of the D grammar. Thus I claim that optimization for the D grammar is possible.

I'm convinced that the output of a parser generator (PG) can be very nearly as fast as hand-written code. ANTLR's output (last I checked) was not ideal, but the one I planned to make (a few years ago) would have produced faster code.

By default the PG's output will not match the speed of hand-written code, but the user can optimize it. Assuming an ANTLR-like PG, the user can inspect the original output looking for inefficient lookahead, or for cases where the parser tests rare alternatives before common ones, and then improve the grammar and insert ... I forget the exact ANTLR terminology ... syntactic predicates or whatever, to optimize the parser.
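To make the "common cases before rare cases" point concrete, here is a toy sketch (in Python, with a made-up grammar; this is the kind of reordering a user might apply by hand, not actual PG output): the parser dispatches on one token of lookahead and tests the frequent alternatives first.

```python
def parse_statement(tokens, pos):
    """Toy one-token-lookahead dispatch, common cases tested first.
    Returns a (node, next_pos) pair; the grammar is hypothetical."""
    tok = tokens[pos]
    if tok == "return":                       # common: test early
        return ("return", tokens[pos + 1]), pos + 2
    if tok == "asm":                          # rare: test last
        return ("asm", tokens[pos + 1]), pos + 2
    return ("expr", tok), pos + 1             # fallback: expression statement

parse_statement(["return", "x", ";"], 0)      # -> (("return", "x"), 2)
```

A naively generated parser might test the alternatives in grammar-declaration order; moving the frequent branches to the top (or letting the PG do so from profile data) is exactly the sort of local rewrite that doesn't change the language accepted.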

> So far discussion goes in favor of LL(*) parser like ANTLR, which is top-down recursive-descent. Either Pegged will be optimized with LL(*) algorithms, or a new parser generator created.

Right. For instance, I am interested in writing a top-down PG because I understand them better, and I prefer the top-down approach for its flexibility (semantic actions, allowing custom code) and understandability (the user can realistically understand the output; in fact, readability would be a specific goal of mine).

Roman, regarding what you were saying to me earlier:
>In stage 2 you have only performed some basic analysis, like, e.g., matched braces to define some hierarchy. This means that at the time when you find a missing brace, for example, you cannot tell anything more than that braces don't match.

Stage 2 can actually tell more than just "a brace is missing somewhere", because so many languages are C-like. So given this situation:

   frob (c &% x)
      blip # gom;
   }

It doesn't need to know what language this is to tell where the brace belongs. Even in a more nebulous case like:

   frob (c &% x) bar @ lic
      blip # gom;
   }

probably the brace belongs at the end of the first line.
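That guess can be sketched as a language-agnostic heuristic (a hypothetical Python sketch, not anyone's actual stage 2): remember the last line that looks like a block header, and when an unmatched `}` turns up, blame that line.

```python
def guess_missing_open_brace(lines):
    """Guess which line is missing an opening '{' in C-like code.
    Hypothetical heuristic: a non-empty line that ends without ';',
    '{' or '}' looks like a block header (an if-condition, a function
    signature, ...). On an unmatched '}', blame the last such header."""
    depth = 0
    header = None
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped and stripped[-1] not in ";{}":
            header = i                       # candidate block header
        depth += line.count("{") - line.count("}")
        if depth < 0:                        # unmatched '}' found
            return header
    return None

guess_missing_open_brace(
    ["frob (c &% x)", "   blip # gom;", "}"])            # -> 0
guess_missing_open_brace(
    ["frob (c &% x) bar @ lic", "   blip # gom;", "}"])  # -> 0
```

Note that the heuristic never needs to know what `frob`, `&%` or `@` mean; it only relies on the C-like conventions that statements end in `;` and headers open blocks.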

Perhaps your point is that there are situations where a parser that knows the "entire" grammar could make a better guess about where the missing brace/paren belongs. That's certainly true.

However, just because it can guess better doesn't mean it can reinterpret the code based on that guess. I mean, I don't see any way to "back up" a parser by an arbitrary amount. A hypothetical stage 2 would probably be hand-written and could realistically back up and insert a brace/paren anywhere the heuristics dictate, because it is producing a simple data structure and doesn't need to perform any semantic actions as it parses. A "full" parser, on the other hand, has done a lot of work that it can't undo, so the best it can do is report to the user "line 54: error: brace mismatch; did you forget a brace on line 13?" The heuristic is still helpful, but it has already parsed lines 13 to 54 in the wrong context (and, in some cases, has already spat out a series of error messages unrelated to the user's actual mistake).

> As I demonstrated in some examples, it could get the output which implies incorrect structure

I was unable to find the examples you refer to... this thread's getting a little unwieldy :)
July 08, 2012
David, I would suggest that you create a proof-of-concept prototype and pay attention to some non-trivial use cases. That way you would likely be able to reduce the major risks significantly before making any serious time investment.

July 09, 2012
On 2012-07-08 22:05, Jonathan M Davis wrote:

> Well, for the lexer and parser, we're probably okay (or very close to it). As
> Daniel pointed out elsewhere in this thread, that part of the frontend doesn't
> change much (particularly the lexer). There's definitely still some churn, but
> it's nothing like it used to be.

It doesn't feel like that; didn't we get lambdas or UFCS in the last release?

-- 
/Jacob Carlborg
July 09, 2012
On 2012-07-08 22:15, Roman D. Boiko wrote:

> Can you provide a *specific* example of performance optimization which a
> custom D compiler would have, but parser generator would be unlikely to
> catch up.

I'm no expert in this field, but I've heard that it's harder to get good error reporting with a parser generator.

-- 
/Jacob Carlborg
July 09, 2012
On Monday, July 09, 2012 08:33:31 Jacob Carlborg wrote:
> On 2012-07-08 22:05, Jonathan M Davis wrote:
> > Well, for the lexer and parser, we're probably okay (or very close to it). As Daniel pointed out elsewhere in this thread, that part of the frontend doesn't change much (particularly the lexer). There's definitely still some churn, but it's nothing like it used to be.
> 
> It doesn't feel like that; didn't we get lambdas or UFCS in the last release?

lambdas are a bit older than that IIRC, and I don't think that UFCS actually affects the lexer or parser. Yes, changes are still made, but they're increasingly rare, and most of the stuff being changed is on the semantic level (most of which is bug fixes). So, out of all of the parts of the compiler to duplicate, those are the least likely to have major changes made to them later (especially the lexer).

- Jonathan M Davis
July 09, 2012
On 2012-07-09 08:39, Jonathan M Davis wrote:

> lambdas are a bit older than that IIRC, and I don't think that UFCS actually
> affects the lexer or parser. Yes, changes are still made, but they're
> increasingly rare, and most of the stuff being changed is on the semantic level
> (most of which is bug fixes). So, out of all of the parts of the compiler to
> duplicate, those are the least likely to have major changes made to
> them later (especially the lexer).

I'm pretty sure UFCS affects lexing or parsing. How else would this be legal:

4.foo();

-- 
/Jacob Carlborg
July 09, 2012
On Monday, July 09, 2012 09:16:41 Jacob Carlborg wrote:
> On 2012-07-09 08:39, Jonathan M Davis wrote:
> > lambdas are a bit older than that IIRC, and I don't think that UFCS actually affects the lexer or parser. Yes, changes are still made, but they're increasingly rare, and most of the stuff being changed is on the semantic level (most of which is bug fixes). So, out of all of the parts of the compiler to duplicate, those are the least likely to have major changes made to them later (especially the lexer).
> 
> I'm pretty sure UFCS affects lexing or parsing. How else would this be legal:
> 
> 4.foo();

That definitely wouldn't affect lexing, because it doesn't change the tokens at all. Whether it affects the parser depends on the grammar: there's a good chance it would, but it might not; it depends on how numeric literals are treated. Beyond that, though, UFCS wouldn't affect parsing at all, because it's purely a matter of symbol resolution, so any changes that had to be made to the parser would almost certainly have been minimal.
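The lexing side of that claim is easy to illustrate with a toy tokenizer (a Python sketch with made-up token patterns, nothing like the real D lexer): the token stream for `4.foo();` is the same whether or not the grammar allows UFCS; what changes is only which token sequences the later phases accept.

```python
import re

# Token patterns covering just this snippet (a sketch, not the real D lexer):
# integer literals, identifiers, and a few punctuation characters.
TOKEN_RE = re.compile(r"\d+|[A-Za-z_]\w*|[().;]")

def lex(src):
    """Return the flat token list for a tiny C-like snippet."""
    return TOKEN_RE.findall(src)

lex("4.foo();")   # ['4', '.', 'foo', '(', ')', ';']
lex("foo(4);")    # ['foo', '(', '4', ')', ';']
```

(A real D lexer also has to decide whether `4.` starts a floating-point literal, but that is a lexing subtlety in its own right; UFCS itself only determines what the semantic pass does with the `.` once the tokens exist.)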

Yes, some changes are still being made to the parser, but since the language is mostly frozen, almost all of the changes have gone into bug fixes, which are generally semantic issues and don't affect lexing or parsing at all. So, while there would still be some maintenance issues in keeping a separate lexer and parser in line with dmd, I really don't think that it would take much at this point. Most of the work would be in getting them to match dmd when they're first written.

- Jonathan M Davis
July 09, 2012
deadalnix, in message (digitalmars.D:171330), wrote:
> D isn't 100% CFG. But it is close.

What makes D fail to be a CFG?
July 09, 2012
On Monday, 9 July 2012 at 06:35:38 UTC, Jacob Carlborg wrote:
> I'm no expert in this field, but I've heard that it's harder to get good error reporting with a parser generator.

Good point!
July 09, 2012
On 07/09/2012 08:33 AM, Jacob Carlborg wrote:
> On 2012-07-08 22:05, Jonathan M Davis wrote:
>
>> Well, for the lexer and parser, we're probably okay (or very close to
>> it). As
>> Daniel pointed out elsewhere in this thread, that part of the frontend
>> doesn't
>> change much (particularly the lexer). There's definitely still some
>> churn, but
>> it's nothing like it used to be.
>
> It doesn't feel like that; didn't we get lambdas or UFCS in the last release?
>

Those changes are trivial to incorporate into a well-written parser.