July 07, 2012
On Saturday, 7 July 2012 at 16:14:13 UTC, Tobias Pankrath wrote:
> Interesting, I thought that PEG ⊂ CFG holds.
See section 3.4 of http://bford.info/pub/lang/peg.pdf

It contains a simple proof that a non-context-free language (a^n) (b^n) (c^n) can be described with PEG syntax. There are infinitely many such languages.

July 07, 2012
On Sat, Jul 7, 2012 at 6:06 PM, Roman D. Boiko <rb@d-coding.com> wrote:

> The most critical was the one you implemented: ability to define a custom parse tree. I also needed the ability to use immutable structures (almost) everywhere, while currently they must be mutable. Also it would be great to avoid UTF conversions (instead use functions and types templated by the string type).

I added dstrings because

1- at the time (a few months ago), the lists here were awash in UTF-32
discussions and I thought that'd be the way to go anyway
2- other D parsing libraries seemed to go to UTF32 also (CTPG)
3- I wanted to be able to parse mathematical notation like nabla,
derivatives, etc. which all have UTF32 symbols.


> However, as I found a few hours ago, Packrat parsing (typically used to handle PEG) has serious disadvantages: it complicates debugging because of frequent backtracking, it has problems with error recovery, and typically disallows to add actions with side effects (because of possibility of backtracking). These are important enough to reconsider my plans of using Pegged. I will try to analyze whether the issues are so fundamental that I (or somebody else) will have to create an ANTLR-like parser instead, or whether it is possible to introduce changes into Pegged that would fix these problems.

Note that PEG does not impose to use packrat parsing, even though it was developed to use it. I think it's a historical 'accident' that put the two together: Bryan Ford thesis used the two together.

Note that many PEG parsers do not rely on packrat (Pegged does not). There are a bunch of articles on Bryan Ford's website by a guy writting a PEG parser for Java, and who found that storing the last rules was enought to get a slight speed improvement, buth that doing anymore sotrage was detrimental to the parser's overall efficiency.
July 07, 2012
On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:
> I added dstrings because
>
> 1- at the time (a few months ago), the lists here were awash in UTF-32
> discussions and I thought that'd be the way to go anyway
> 2- other D parsing libraries seemed to go to UTF32 also (CTPG)
> 3- I wanted to be able to parse mathematical notation like nabla,
> derivatives, etc. which all have UTF32 symbols.

I propose to switch code to use S if(isSomeString!S) everywhere. Client code would first determine source encoding scheme, and then instantiate parsers specifying a string type. This is not a trivial change, but I'm willing to help implementing it.

> Note that PEG does not impose to use packrat parsing, even though it was developed to use it. I think it's a historical 'accident' that put the two together: Bryan Ford thesis used the two together.
>
> Note that many PEG parsers do not rely on packrat (Pegged does not).
> There are a bunch of articles on Bryan Ford's website by a guy
> writting a PEG parser for Java, and who found that storing the last rules was enought to get a slight speed improvement, buth that doing anymore sotrage was detrimental to the parser's overall efficiency.

That's great! Anyway I want to understand the advantages and limitations of both Pegged and ANTLR, and probably study some more techniques. Such research consumes a lot of time but can be done incrementally along with development.
July 07, 2012
On 07/07/2012 12:05, Dmitry Olshansky wrote:
> On 07-Jul-12 13:06, Roman D. Boiko wrote:
>> On Saturday, 7 July 2012 at 00:45:36 UTC, David Piepgrass wrote:
>>> This work on parsers might be a good place for me to dive in. I have
>>> an interest in parsers and familiarity with LL, LALR, PEGs, and even
>>> Pratt parsers (fun!), but I am still inexperienced.
>> ...
>>> One thing that has always concerned me about PEGs is that they always
>>> say PEGs are slower than traditional two-phase LALR(1) or LL(k)
>>> parsers. However, I have never seen any benchmarks. Does anyone know
>>> exactly how much performance you lose in an (optimized) PEG compared
>>> to an (optimized) LALR/LL parser + LL/regex lexer?
>>
>> I decided to ask a question about this:
>>
>> http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk
>>
>>
>>
>> Don't hesitate to edit it if you would like to clarify some aspect.
>
> I can tell you that they are not slower then one another in principle.
> Quality of implementations trumps every theoretical aspect here. LALR is
> usually fast even if implemented by book but they are hard to optimize
> futher and quite restrictive on "semantic extensions".
>
> Proper CFG parsers all are liner-time anyway.
>

D isn't 100% CFG. But it is close.
July 07, 2012
On Sat, Jul 7, 2012 at 6:25 PM, Roman D. Boiko <rb@d-coding.com> wrote:
> On Saturday, 7 July 2012 at 16:14:13 UTC, Tobias Pankrath wrote:
>>
>> Interesting, I thought that PEG ⊂ CFG holds.
>
> See section 3.4 of http://bford.info/pub/lang/peg.pdf
>
> It contains a simple proof that a non-context-free language (a^n) (b^n)
> (c^n) can be described with PEG syntax. There are infinitely many such
> languages.

Not that anyone is interested in such languages, but still :)
July 07, 2012
On 07/07/2012 13:05, Jacob Carlborg wrote:
> On 2012-07-07 03:12, Jonathan M Davis wrote:
>
>> Now, the issue of a "strong, dependable formalization of D's syntax" is
>> another thing entirely. Porting dmd's lexer and parser to Phobos would
>> keep
>> the Phobos implementation in line with dmd much more easily and avoid
>> inconsistencies in the language definition and the like. However, if
>> we write a
>> new lexer and parser specifically for Phobos which _doesn't_ port the
>> lexer or
>> parser from dmd, then that _would_ help drive making the spec match the
>> compiler (or vice versa). So, I agree that could be a definite
>> argument for
>> writing a lexer and parser from scratch rather than porting the one
>> from dmd,
>> but I don't buy the bit about it smothering parser generators at all.
>> I think
>> that the use cases are completely different.
>
> I think the whole point of having a compiler as a library is that the
> compiler should use the library as well. Otherwise the two will get out
> of sync.
>
> Just look at Clang, LLVM, LLDB and Xcode, they took the correct
> approach. Clang and LLVM (and I think LLDB) are available as libraries.
> Then the compiler, debugger (lldb) and IDE uses these libraries as part
> of their implementation. They don't have their own implementation that
> is similar to the libraries, making it "easy" to stay in sync. They
> _use_ the libraries as libraries.
>
> This is what DMD and Phobos should do as well. If it's too complicated
> to port the lexer/parser to D then it would be better, at least as a
> first step, to modify DMD as needed.

I tried that. This is almost impossible, dmd's parser and AST are very tightly mixed with dmd's internals.
July 07, 2012
On Saturday, 7 July 2012 at 16:49:09 UTC, deadalnix wrote:
> I tried that. This is almost impossible, dmd's parser and AST are very tightly mixed with dmd's internals.

+1
July 07, 2012
On 07/07/2012 12:37 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:
>
>> Note that PEG does not impose to use packrat parsing, even though it
>> was developed to use it. I think it's a historical 'accident' that put
>> the two together: Bryan Ford thesis used the two together.
>>
>> Note that many PEG parsers do not rely on packrat (Pegged does not).
>> There are a bunch of articles on Bryan Ford's website by a guy
>> writting a PEG parser for Java, and who found that storing the last
>> rules was enought to get a slight speed improvement, buth that doing
>> anymore sotrage was detrimental to the parser's overall efficiency.
>
> That's great! Anyway I want to understand the advantages and limitations
> of both Pegged and ANTLR, and probably study some more techniques. Such
> research consumes a lot of time but can be done incrementally along with
> development.

Yeah, it's good to hear this notion reinforced.  I had this suspicion that the packrat parser is not necessarily the best/fastest solution, mostly because of the large allocation that has to happen before you get O(n) performance.  Thus I figured that pegged might eventually use different parsing strategies underneath it all, possibly with a lot of special-casing and clever hand-tuned and profiled optimizations.  At least that's what makes sense to me.
July 07, 2012
On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:
> Yeah, it's good to hear this notion reinforced.  I had this suspicion that the packrat parser is not necessarily the best/fastest solution, mostly because of the large allocation that has to happen before you get O(n) performance.  Thus I figured that pegged might eventually use different parsing strategies underneath it all, possibly with a lot of special-casing and clever hand-tuned and profiled optimizations.  At least that's what makes sense to me.

At the very least, we could use DFA instead of backtracking where possible. This is the approach implemented in ANTLR, but I wanted to introduce them before I knew about existence of the latter, simply because this would likely produce the fastest parsers possible.
July 07, 2012
On Saturday, July 07, 2012 18:37:54 Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:
> > I added dstrings because
> > 
> > 1- at the time (a few months ago), the lists here were awash in
> > UTF-32
> > discussions and I thought that'd be the way to go anyway
> > 2- other D parsing libraries seemed to go to UTF32 also (CTPG)
> > 3- I wanted to be able to parse mathematical notation like
> > nabla,
> > derivatives, etc. which all have UTF32 symbols.
> 
> I propose to switch code to use S if(isSomeString!S) everywhere. Client code would first determine source encoding scheme, and then instantiate parsers specifying a string type. This is not a trivial change, but I'm willing to help implementing it.

I don't know about this particular case, because I haven't really looked at pegged, but in general, string parsing stuff should be taking ranges of dchar and then specializing on string type where appropriate for efficiency.

- Jonathan M Davis