View mode: basic / threaded / horizontal-split · Log in · Help
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 16:14:13 UTC, Tobias Pankrath wrote:
> Interesting, I thought that PEG ⊂ CFG holds.
See section 3.4 of http://bford.info/pub/lang/peg.pdf

It contains a simple proof that a non-context-free language (a^n) 
(b^n) (c^n) can be described with PEG syntax. There are 
infinitely many such languages.
July 07, 2012
Re: Let's stop parser Hell
On Sat, Jul 7, 2012 at 6:06 PM, Roman D. Boiko <rb@d-coding.com> wrote:

> The most critical was the one you implemented: ability to define a custom
> parse tree. I also needed the ability to use immutable structures (almost)
> everywhere, while currently they must be mutable. Also it would be great to
> avoid UTF conversions (instead use functions and types templated by the
> string type).

I added dstrings because

1- at the time (a few months ago), the lists here were awash in UTF-32
discussions and I thought that'd be the way to go anyway
2- other D parsing libraries seemed to go to UTF32 also (CTPG)
3- I wanted to be able to parse mathematical notation like nabla,
derivatives, etc. which all have UTF32 symbols.


> However, as I found a few hours ago, Packrat parsing (typically used to
> handle PEG) has serious disadvantages: it complicates debugging because of
> frequent backtracking, it has problems with error recovery, and typically
> disallows to add actions with side effects (because of possibility of
> backtracking). These are important enough to reconsider my plans of using
> Pegged. I will try to analyze whether the issues are so fundamental that I
> (or somebody else) will have to create an ANTLR-like parser instead, or
> whether it is possible to introduce changes into Pegged that would fix these
> problems.

Note that PEG does not impose to use packrat parsing, even though it
was developed to use it. I think it's a historical 'accident' that put
the two together: Bryan Ford thesis used the two together.

Note that many PEG parsers do not rely on packrat (Pegged does not).
There are a bunch of articles on Bryan Ford's website by a guy
writting a PEG parser for Java, and who found that storing the last
rules was enought to get a slight speed improvement, buth that doing
anymore sotrage was detrimental to the parser's overall efficiency.
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:
> I added dstrings because
>
> 1- at the time (a few months ago), the lists here were awash in 
> UTF-32
> discussions and I thought that'd be the way to go anyway
> 2- other D parsing libraries seemed to go to UTF32 also (CTPG)
> 3- I wanted to be able to parse mathematical notation like 
> nabla,
> derivatives, etc. which all have UTF32 symbols.

I propose to switch code to use S if(isSomeString!S) everywhere. 
Client code would first determine source encoding scheme, and 
then instantiate parsers specifying a string type. This is not a 
trivial change, but I'm willing to help implementing it.

> Note that PEG does not impose to use packrat parsing, even 
> though it was developed to use it. I think it's a historical 
> 'accident' that put the two together: Bryan Ford thesis used 
> the two together.
>
> Note that many PEG parsers do not rely on packrat (Pegged does 
> not).
> There are a bunch of articles on Bryan Ford's website by a guy
> writting a PEG parser for Java, and who found that storing the 
> last rules was enought to get a slight speed improvement, buth 
> that doing anymore sotrage was detrimental to the parser's 
> overall efficiency.

That's great! Anyway I want to understand the advantages and 
limitations of both Pegged and ANTLR, and probably study some 
more techniques. Such research consumes a lot of time but can be 
done incrementally along with development.
July 07, 2012
Re: Let's stop parser Hell
On 07/07/2012 12:05, Dmitry Olshansky wrote:
> On 07-Jul-12 13:06, Roman D. Boiko wrote:
>> On Saturday, 7 July 2012 at 00:45:36 UTC, David Piepgrass wrote:
>>> This work on parsers might be a good place for me to dive in. I have
>>> an interest in parsers and familiarity with LL, LALR, PEGs, and even
>>> Pratt parsers (fun!), but I am still inexperienced.
>> ...
>>> One thing that has always concerned me about PEGs is that they always
>>> say PEGs are slower than traditional two-phase LALR(1) or LL(k)
>>> parsers. However, I have never seen any benchmarks. Does anyone know
>>> exactly how much performance you lose in an (optimized) PEG compared
>>> to an (optimized) LALR/LL parser + LL/regex lexer?
>>
>> I decided to ask a question about this:
>>
>> http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk
>>
>>
>>
>> Don't hesitate to edit it if you would like to clarify some aspect.
>
> I can tell you that they are not slower then one another in principle.
> Quality of implementations trumps every theoretical aspect here. LALR is
> usually fast even if implemented by book but they are hard to optimize
> futher and quite restrictive on "semantic extensions".
>
> Proper CFG parsers all are liner-time anyway.
>

D isn't 100% CFG. But it is close.
July 07, 2012
Re: Let's stop parser Hell
On Sat, Jul 7, 2012 at 6:25 PM, Roman D. Boiko <rb@d-coding.com> wrote:
> On Saturday, 7 July 2012 at 16:14:13 UTC, Tobias Pankrath wrote:
>>
>> Interesting, I thought that PEG ⊂ CFG holds.
>
> See section 3.4 of http://bford.info/pub/lang/peg.pdf
>
> It contains a simple proof that a non-context-free language (a^n) (b^n)
> (c^n) can be described with PEG syntax. There are infinitely many such
> languages.

Not that anyone is interested in such languages, but still :)
July 07, 2012
Re: Let's stop parser Hell
On 07/07/2012 13:05, Jacob Carlborg wrote:
> On 2012-07-07 03:12, Jonathan M Davis wrote:
>
>> Now, the issue of a "strong, dependable formalization of D's syntax" is
>> another thing entirely. Porting dmd's lexer and parser to Phobos would
>> keep
>> the Phobos implementation in line with dmd much more easily and avoid
>> inconsistencies in the language definition and the like. However, if
>> we write a
>> new lexer and parser specifically for Phobos which _doesn't_ port the
>> lexer or
>> parser from dmd, then that _would_ help drive making the spec match the
>> compiler (or vice versa). So, I agree that could be a definite
>> argument for
>> writing a lexer and parser from scratch rather than porting the one
>> from dmd,
>> but I don't buy the bit about it smothering parser generators at all.
>> I think
>> that the use cases are completely different.
>
> I think the whole point of having a compiler as a library is that the
> compiler should use the library as well. Otherwise the two will get out
> of sync.
>
> Just look at Clang, LLVM, LLDB and Xcode, they took the correct
> approach. Clang and LLVM (and I think LLDB) are available as libraries.
> Then the compiler, debugger (lldb) and IDE uses these libraries as part
> of their implementation. They don't have their own implementation that
> is similar to the libraries, making it "easy" to stay in sync. They
> _use_ the libraries as libraries.
>
> This is what DMD and Phobos should do as well. If it's too complicated
> to port the lexer/parser to D then it would be better, at least as a
> first step, to modify DMD as needed.

I tried that. This is almost impossible, dmd's parser and AST are very 
tightly mixed with dmd's internals.
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 16:49:09 UTC, deadalnix wrote:
> I tried that. This is almost impossible, dmd's parser and AST 
> are very tightly mixed with dmd's internals.

+1
July 07, 2012
Re: Let's stop parser Hell
On 07/07/2012 12:37 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:
>
>> Note that PEG does not impose to use packrat parsing, even though it
>> was developed to use it. I think it's a historical 'accident' that put
>> the two together: Bryan Ford thesis used the two together.
>>
>> Note that many PEG parsers do not rely on packrat (Pegged does not).
>> There are a bunch of articles on Bryan Ford's website by a guy
>> writting a PEG parser for Java, and who found that storing the last
>> rules was enought to get a slight speed improvement, buth that doing
>> anymore sotrage was detrimental to the parser's overall efficiency.
>
> That's great! Anyway I want to understand the advantages and limitations
> of both Pegged and ANTLR, and probably study some more techniques. Such
> research consumes a lot of time but can be done incrementally along with
> development.

Yeah, it's good to hear this notion reinforced.  I had this suspicion 
that the packrat parser is not necessarily the best/fastest solution, 
mostly because of the large allocation that has to happen before you get 
O(n) performance.  Thus I figured that pegged might eventually use 
different parsing strategies underneath it all, possibly with a lot of 
special-casing and clever hand-tuned and profiled optimizations.  At 
least that's what makes sense to me.
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:
> Yeah, it's good to hear this notion reinforced.  I had this 
> suspicion that the packrat parser is not necessarily the 
> best/fastest solution, mostly because of the large allocation 
> that has to happen before you get O(n) performance.  Thus I 
> figured that pegged might eventually use different parsing 
> strategies underneath it all, possibly with a lot of 
> special-casing and clever hand-tuned and profiled 
> optimizations.  At least that's what makes sense to me.

At the very least, we could use DFA instead of backtracking where 
possible. This is the approach implemented in ANTLR, but I wanted 
to introduce them before I knew about existence of the latter, 
simply because this would likely produce the fastest parsers 
possible.
July 07, 2012
Re: Let's stop parser Hell
On Saturday, July 07, 2012 18:37:54 Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:
> > I added dstrings because
> > 
> > 1- at the time (a few months ago), the lists here were awash in
> > UTF-32
> > discussions and I thought that'd be the way to go anyway
> > 2- other D parsing libraries seemed to go to UTF32 also (CTPG)
> > 3- I wanted to be able to parse mathematical notation like
> > nabla,
> > derivatives, etc. which all have UTF32 symbols.
> 
> I propose to switch code to use S if(isSomeString!S) everywhere.
> Client code would first determine source encoding scheme, and
> then instantiate parsers specifying a string type. This is not a
> trivial change, but I'm willing to help implementing it.

I don't know about this particular case, because I haven't really looked at 
pegged, but in general, string parsing stuff should be taking ranges of dchar 
and then specializing on string type where appropriate for efficiency.

- Jonathan M Davis
4 5 6 7 8 9 10 11 12
Top | Discussion index | About this forum | D home