July 07, 2012
On Saturday, July 07, 2012 13:05:29 Jacob Carlborg wrote:
> On 2012-07-07 03:12, Jonathan M Davis wrote:
> > Now, the issue of a "strong, dependable formalization of D's syntax" is
> > another thing entirely. Porting dmd's lexer and parser to Phobos would
> > keep
> > the Phobos implementation in line with dmd much more easily and avoid
> > inconsistencies in the language definition and the like. However, if we
> > write a new lexer and parser specifically for Phobos which _doesn't_ port
> > the lexer or parser from dmd, then that _would_ help drive making the
> > spec match the compiler (or vice versa). So, I agree that could be a
> > definite argument for writing a lexer and parser from scratch rather than
> > porting the one from dmd, but I don't buy the bit about it smothering
> > parser generators at all. I think that the use cases are completely
> > different.
> 
> I think the whole point of having a compiler as a library is that the compiler should use the library as well. Otherwise the two will get out of sync.
> 
> Just look at Clang, LLVM, LLDB and Xcode: they took the correct approach. Clang and LLVM (and I think LLDB) are available as libraries. Then the compiler, the debugger (lldb) and the IDE use these libraries as part of their implementation. They don't have their own implementation that is similar to the libraries, making it "easy" to stay in sync. They _use_ the libraries as libraries.
> 
> This is what DMD and Phobos should do as well. If it's too complicated to port the lexer/parser to D then it would be better, at least as a first step, to modify DMD as needed. Create a C API for DMD and then create D bindings to be put into Phobos.

There are multiple issues here. The one that Andrei is raising is the fact that D isn't properly formalized. Essentially, the compiler _is_ the spec, and while the online spec _mostly_ follows it, it doesn't do so entirely, and it isn't always as precise as it needs to be regardless. With a fully formalized spec, it should be possible to fully implement a D compiler from the spec alone, and I don't think that that's currently possible.

Writing a D lexer and parser (if not a full-blown compiler) from scratch would help highlight the places in the spec which are lacking, and having it be part of Phobos would arguably increase Walter's incentive to make sure that the spec is in line with the compiler (and vice versa) so that stuff _other_ than the compiler which is based on the spec would be able to match the compiler.

Clang is in a _completely_ different situation. It's a C/C++ compiler, and both C and C++ already have official, formalized specs which Clang conforms to (or is supposed to, anyway). Clang has no control over the spec at all. It merely implements it. So, unlike the situation we effectively have with dmd, there is no issue of other tools or compilers having to stay in line with Clang because it serves as the language's spec. It may help the tools which use Clang to be fully in line with Clang and not have to worry about whether Clang implements the spec slightly differently, but in theory, if they all follow the spec correctly, that isn't an issue (though obviously in practice it can be).

In D's case, all of the major D compilers use the same frontend, which helps compatibility but harms the spec, because there's less incentive to keep it precise and up-to-date. So, from the perspective of the spec, implementing the D lexer and parser for Phobos separately from dmd would be of great benefit.

IMHO, the primary benefit of porting dmd's lexer and parser is maintenance. It makes it much easier to keep Phobos' lexer and parser in line with dmd, making discrepancies less likely, but it arguably reduces the incentive to improve the spec.

The benefits of having a lexer and parser as a library (regardless of whether it's written from scratch or ported from dmd) derive primarily from the fact that it makes it much easier to create tools which use them. Tool authors no longer have to write their own lexers or parsers.

If the compiler uses the same library, there's the added benefit that any tool using the library will be in sync with the compiler, but if the spec is properly formalized and up-to-date, and the library is kept up-to-date with _it_, that shouldn't really be a problem. You still have the debate as to whether it's better to have a separate implementation based on the spec (thereby making it more likely that the spec is correct) or whether it's better to have the compiler share the implementation so that the library is guaranteed to match the compiler (though not necessarily the spec), but I think that that's a separate debate from whether we should have the lexer and parser as a library.

In all honesty though, I would be surprised if you could convince Walter to switch dmd's frontend to Phobos' lexer and parser even once they've been implemented. So, while I agree that there are benefits in doing so, I'm not sure how much chance you have of ever getting any traction with that.

Another thing to consider is that this might make it so that gdc and ldc couldn't share the same frontend with dmd (assuming that they have to keep their frontends in C or C++; I don't know if they do). But if so, that would increase the incentive for the spec to be correct if dmd ever started using a D frontend.

- Jonathan M Davis
July 07, 2012
On 7/7/12 6:05 AM, Dmitry Olshansky wrote:
> I can tell you that they are not slower than one another in principle.
> Quality of implementation trumps every theoretical aspect here. LALR is
> usually fast even if implemented by the book, but such parsers are hard
> to optimize further and are quite restrictive on "semantic extensions".

Isn't it the case that PEGs require more memory for certain grammars?
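
(For context on where the memory goes: a packrat parser memoizes the result of every (rule, position) pair so that parsing stays linear in time, which means the memo table can grow on the order of rules times input length. A minimal sketch in D of that trade-off; the names are hypothetical and this is not Pegged's actual API:)

struct ParseResult { bool success; size_t end; }

// Memo table keyed by (rule, position). Every attempted parse is cached,
// which is where packrat's memory appetite comes from.
ParseResult[ulong] memo;

ParseResult parseRule(size_t ruleId, string input, size_t pos,
                      ParseResult delegate(string, size_t) ruleBody)
{
    immutable key = cast(ulong) ruleId * (input.length + 1) + pos;
    if (auto cached = key in memo)
        return *cached;        // linear time: never re-parse the same spot...
    auto r = ruleBody(input, pos);
    memo[key] = r;             // ...but the table retains an entry per attempt
    return r;
}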

Andrei
July 07, 2012
On 7/7/12 6:24 AM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:
>> http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk
>>
>
> So far it looks like LALR parsers may have lower constant factors than
> Packrat.
>
> The difference could be minimized by paying attention to parsing of
> terminal symbols, which was in my plans already. It is not necessary to
> strictly follow the Packrat parsing algorithm.
>
> The benefits of Pegged, in my view, are its support of Parsing
> Expression Grammar (PEG) and compile-time evaluation. It is easily
> extensible and modifiable.

Isn't it also a benefit that lexing and parsing are integrated? With traditional LALR you need a separate tokenizer.

> When I implemented a recursive-descent parser by hand in one of the early
> drafts of DCT, I strongly felt the need to generalize the code in a way
> which in retrospect I would call PEG-like. The structure of my
> hand-written recursive-descent parser was a one-to-one mapping to an
> implemented subset of the D specification, and I considered it problematic
> because I needed to duplicate the same structure in the resulting AST.
>
> PEG is basically a language that describes both the implementation of
> the parser and the language syntax. It greatly reduces implicit code
> duplication.
>
> I think that generated code can be made almost as fast as a hand-written
> parser for a particular language (probably a few percent slower),
> especially if that language is similar to D (context-free, with a
> fine-grained hierarchical grammar). Optimizations might require
> abandoning strict adherence to any theoretical algorithm, but that
> should be OK.

All that sounds really encouraging. I'm really looking forward to more work in that area. If you stumble upon bugs that block you, let us know and Walter agreed he'll boost their priority.


Andrei
July 07, 2012
> Note that PEG does not impose to use packrat parsing, even though it
> was developed to use it. I think it's a historical 'accident' that put
> the two together: Bryan Ford thesis used the two together.

Interesting. After trying to use ANTLR-C# several years back, I got disillusioned because nobody was interested in fixing the bugs in it (ANTLR's author is a Java guy first and foremost), and the required libraries didn't come with source code or a license (wtf.)

So, for a while I was thinking about how I might make my own parser generator that was "better" than ANTLR. I liked the syntax of PEG descriptions, but I was concerned about the performance hit of packrat and, besides, I already liked the syntax and flexibility of ANTLR. So my idea was to make something that was LL(k) and mixed the syntax of ANTLR and PEG while using more sane (IMO) semantics than ANTLR did at the time (I've no idea if ANTLR 3 still uses the same semantics today...) All of this is 'water under the bridge' now, but I hand-wrote a lexer to help me plan out how my parser-generator would produce code. The output code was to be both more efficient and significantly more readable than ANTLR's output. I didn't get around to writing the parser-generator itself, but I'll have a look back at my handmade lexer for inspiration.
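
To give a flavor of the sort of output intended here, a minimal hand-style lexer fragment in D (hypothetical and heavily simplified; each token kind is a plain, debuggable branch rather than generated tables):

import std.ascii : isAlpha, isAlphaNum, isDigit, isWhite;

enum TokType { identifier, number, other, eof }

struct Token { TokType type; string text; }

Token nextToken(ref string src)
{
    while (src.length && isWhite(src[0]))
        src = src[1 .. $];                  // skip whitespace
    if (!src.length)
        return Token(TokType.eof, "");
    size_t i = 0;
    if (isAlpha(src[0]) || src[0] == '_')   // identifier: [A-Za-z_][A-Za-z0-9_]*
    {
        while (i < src.length && (isAlphaNum(src[i]) || src[i] == '_'))
            ++i;
    }
    else if (isDigit(src[0]))               // number: [0-9]+
    {
        while (i < src.length && isDigit(src[i]))
            ++i;
    }
    else
        i = 1;                              // single-char token; operators etc.
    auto type = isAlpha(src[0]) || src[0] == '_' ? TokType.identifier
              : isDigit(src[0]) ? TokType.number
              : TokType.other;
    auto tok = Token(type, src[0 .. i]);
    src = src[i .. $];
    return tok;
}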

>> However, as I found a few hours ago, Packrat parsing (typically used to
>> handle PEG) has serious disadvantages: it complicates debugging because of
>> frequent backtracking, it has problems with error recovery, and typically
>> disallows adding actions with side effects (because of the possibility of
>> backtracking). These are important enough to reconsider my plans of using
>> Pegged. I will try to analyze whether the issues are so fundamental that I
>> (or somebody else) will have to create an ANTLR-like parser instead, or
>> whether it is possible to introduce changes into Pegged that would fix these
>> problems.

I don't like the sound of this either. Even if PEGs were fast, difficulty in debugging, error handling, etc. would give me pause. I insist on well-rounded tools. For example, even though LALR(1) may be the fastest type of parser (is it?), I prefer not to use it due to its inflexibility (it just doesn't like some reasonable grammars), and the fact that the generated code is totally unreadable and hard to debug (mind you, when I learned LALR in school I found that it is possible to visualize how it works in a pretty intuitive way--but debuggers won't do that for you.)

While PEGs are clearly far more flexible than LALR and probably more flexible than LL(k), I am a big fan of old-fashioned recursive descent because it's very flexible (easy to insert actions during parsing, and it's possible to use custom parsing code in certain places, if necessary*) and the parser generator's output is potentially very straightforward to understand and debug. In my mind, the main reason you want to use a parser generator instead of hand-coding is convenience, e.g. (1) to compress the grammar down so you can see it clearly, (2) have the PG compute the first-sets and follow-sets for you, (3) get reasonably automatic error handling.

* (If the language you want to parse is well-designed, you'll probably not need much custom parsing. But it's a nice thing to offer in a general-purpose parser generator.)
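
For instance, a sketch with a hypothetical toy grammar (not D's): inserting an action mid-rule in recursive descent is just ordinary code between parse calls:

// Toy grammar:  sum := term ('+' term)*
int parseTerm(ref string src)
{
    size_t i = 0;
    while (i < src.length && src[i] >= '0' && src[i] <= '9')
        ++i;
    int value = 0;
    foreach (c; src[0 .. i])
        value = value * 10 + (c - '0');
    src = src[i .. $];
    return value;
}

int parseSum(ref string src)
{
    int value = parseTerm(src);       // descend into the sub-rule
    while (src.length && src[0] == '+')
    {
        src = src[1 .. $];            // consume '+'
        value += parseTerm(src);      // inline action: fold while parsing
    }
    return value;
}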

I'm not totally sure yet how to support good error messages, efficiency and straightforward output at the same time, but by the power of D I'm sure I could think of something...

I would like to submit another approach to parsing that I dare say is my favorite, even though I have hardly used it at all yet. ANTLR offers something called "tree parsing" that is extremely cool. It parses trees instead of linear token streams, and produces other trees as output. I don't have a good sense of how tree parsing works, but I think that some kind of tree-based parser generator could become the basis for a very flexible and easy-to-understand D front-end. If a PG operates on trees instead of linear token streams, I have a sneaky suspicion that it could revolutionize how a compiler front-end works.

Why? Because right now parsers operate just once, on the user's input, and from there you manipulate the AST with "ordinary" code. But if you have a tree parser, you can routinely manipulate and transform parts of the tree with a sequence of independent parsers and grammars. Thus, parsers would replace a lot of things for which you would otherwise use a visitor pattern, or something. I think I'll try to sketch out this idea in more detail later.
July 07, 2012
On 7/7/12 7:33 AM, Dmitry Olshansky wrote:
> On 07-Jul-12 15:30, Roman D. Boiko wrote:
>> On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:
>>> I think that Pegged can be heavily optimized in performance, and there
>>> are no
>>> fundamental issues which would make it inherently slower than LALR or
>>> a hand-written D-specific parser.
>>
>> Hmm... found an interesting article:
>> http://www.antlr.org/papers/LL-star-PLDI11.pdf
>>
>> It describes some disadvantages of Packrat parsing, like problems with
>> debugging and error recovery. These are important for DCT, so I'll have
>> to perform additional research.
>
> Yup, LL(*) is my favorite so far.

That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative.

How many semantic hacks does D's syntax need for an LL(*) parser?


Andrei


July 07, 2012
On 07/07/2012 01:01 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:
>> Yeah, it's good to hear this notion reinforced. I had this suspicion
>> that the packrat parser is not necessarily the best/fastest solution,
>> mostly because of the large allocation that has to happen before you
>> get O(n) performance. Thus I figured that pegged might eventually use
>> different parsing strategies underneath it all, possibly with a lot of
>> special-casing and clever hand-tuned and profiled optimizations. At
>> least that's what makes sense to me.
>
> At the very least, we could use DFAs instead of backtracking where
> possible. This is the approach implemented in ANTLR, but I wanted to
> introduce them before I knew about the existence of the latter, simply
> because this would likely produce the fastest parsers possible.

These were my thoughts exactly, although somewhat unsubstantiated in my case ;)
July 07, 2012
On 07/07/2012 02:23 PM, David Piepgrass wrote:
>> Note that PEG does not impose to use packrat parsing, even though it
>> was developed to use it. I think it's a historical 'accident' that put
>> the two together: Bryan Ford thesis used the two together.
>
> Interesting. After trying to use ANTLR-C# several years back, I got
> disillusioned because nobody was interested in fixing the bugs in it
> (ANTLR's author is a Java guy first and foremost), and the required
> libraries didn't come with source code or a license (wtf.)
>
> So, for a while I was thinking about how I might make my own parser
> generator that was "better" than ANTLR. I liked the syntax of PEG
> descriptions, but I was concerned about the performance hit of packrat
> and, besides, I already liked the syntax and flexibility of ANTLR. So my
> idea was to make something that was LL(k) and mixed the syntax of ANTLR
> and PEG while using more sane (IMO) semantics than ANTLR did at the time
> (I've no idea if ANTLR 3 still uses the same semantics today...) All of
> this is 'water under the bridge' now, but I hand-wrote a lexer to help
> me plan out how my parser-generator would produce code. The output code
> was to be both more efficient and significantly more readable than
> ANTLR's output. I didn't get around to writing the parser-generator
> itself but I'll have a look back at my handmade lexer for inspiration.
>
>>> However, as I found a few hours ago, Packrat parsing (typically used to
>>> handle PEG) has serious disadvantages: it complicates debugging
>>> because of
>>> frequent backtracking, it has problems with error recovery, and
>>> typically
>>> disallows adding actions with side effects (because of the possibility of
>>> backtracking). These are important enough to reconsider my plans of
>>> using
>>> Pegged. I will try to analyze whether the issues are so fundamental
>>> that I
>>> (or somebody else) will have to create an ANTLR-like parser instead, or
>>> whether it is possible to introduce changes into Pegged that would
>>> fix these
>>> problems.
>
> I don't like the sound of this either. Even if PEGs were fast,
> difficulty in debugging, error handling, etc. would give me pause. I
> insist on well-rounded tools. For example, even though LALR(1) may be
> the fastest type of parser (is it?), I prefer not to use it due to its
> inflexibility (it just doesn't like some reasonable grammars), and the
> fact that the generated code is totally unreadable and hard to debug
> (mind you, when I learned LALR in school I found that it is possible to
> visualize how it works in a pretty intuitive way--but debuggers won't do
> that for you.)
>
> While PEGs are clearly far more flexible than LALR and probably more
> flexible than LL(k), I am a big fan of old-fashioned recursive descent
> because it's very flexible (easy to insert actions during parsing, and
> it's possible to use custom parsing code in certain places, if
> necessary*) and the parser generator's output is potentially very
> straightforward to understand and debug. In my mind, the main reason you
> want to use a parser generator instead of hand-coding is convenience,
> e.g. (1) to compress the grammar down so you can see it clearly, (2)
> have the PG compute the first-sets and follow-sets for you, (3) get
> reasonably automatic error handling.
>
> * (If the language you want to parse is well-designed, you'll probably
> not need much custom parsing. But it's a nice thing to offer in a
> general-purpose parser generator.)
>
> I'm not totally sure yet how to support good error messages, efficiency
> and straightforward output at the same time, but by the power of D I'm
> sure I could think of something...
>
> I would like to submit another approach to parsing that I dare say is my
> favorite, even though I have hardly used it at all yet. ANTLR offers
> something called "tree parsing" that is extremely cool. It parses trees
> instead of linear token streams, and produces other trees as output. I
> don't have a good sense of how tree parsing works, but I think that some
> kind of tree-based parser generator could become the basis for a very
> flexible and easy-to-understand D front-end. If a PG operates on trees
> instead of linear token streams, I have a sneaky suspicion that it could
> revolutionize how a compiler front-end works.
>
> Why? Because right now parsers operate just once, on the user's input,
> and from there you manipulate the AST with "ordinary" code. But if you
> have a tree parser, you can routinely manipulate and transform parts of
> the tree with a sequence of independent parsers and grammars. Thus,
> parsers would replace a lot of things for which you would otherwise use
> a visitor pattern, or something. I think I'll try to sketch out this
> idea in more detail later.

I was thinking the same thing.

My intent is to create a kind of regular-expression-of-nodes with push/pop operators to recognize ascent and descent on the tree.  Such a regular expression would allow one to capture subtrees out of generalized patterns and then place them into new trees that then become the input for the next pattern or set of patterns.  I think this is much closer to how I conceptualize semantic analysis than how it is done in front ends like DMD: it should be done with pattern recognition and substitution, not with myriad nested if-statements and while-loops.

My vision is to have code similar to this in the front-end:

/+
Lower
	while ( boolExpr )
	{
		statements;
	}

Into

	loopAgain:
	if ( !boolExpr )
		goto exitLoop
	statements;
	goto loopAgain
	exitLoop:
+/
void lowerWhileStatement( SyntaxElement* syntaxNode )
{
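	// Hypothetical pattern ops: OP_ENTER_NODE/OP_LEAVE_NODE descend into
	// and climb out of the while node; each OP_CAPTURE(n) records the
	// subtree matched by the OP_BEGIN ... OP_END group that follows it.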
	auto captures = syntaxNode.matchNodes(
		TOK_WHILE_NODE,
		OP_ENTER_NODE,
			OP_CAPTURE(0),
			OP_BEGIN,
				TOK_EXPRESSION,
			OP_END,
			OP_CAPTURE(1),
			OP_BEGIN,
				TOK_STATEMENT,
			OP_END,
		OP_LEAVE_NODE);
	
	if ( captures is null )
		return;
	
	syntaxNode.replaceWith(
		LabelNode("loopAgain"),
		TOK_IF_STATEMENT,
		OP_INSERT,
		OP_BEGIN,
			TOK_NEGATE,
			OP_INSERT,
			OP_BEGIN,
				captures[0], // Expression
			OP_END,
			GotoStatement("exitLoop"),
		OP_END,
		captures[1], // statements
		GotoStatement("loopAgain"),
		LabelNode("exitLoop")
		);
}


The specifics will easily change.  One problem with the above code is that it could probably stand to use more templates and compile-time action to allow patterns belonging to the same pass to be merged together into one expression, thus preventing any unnecessary rescanning.  It all becomes DFAs or DPDAs operating on syntax trees.

In this vision I do not use classes and inheritance for my AST.  Instead I use structs that contain some kind of nodeType member that would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE in the above code.  Dynamic dispatch is instead performed by (very fast) DFAs recognizing parts of the AST.
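
For concreteness, such a node might look roughly like this (a sketch; only the nodeType member comes from the code above, everything else is hypothetical):

enum NodeType { TOK_WHILE_NODE, TOK_IF_STATEMENT, TOK_EXPRESSION, TOK_STATEMENT /* ... */ }

struct SyntaxElement
{
    NodeType nodeType;          // which grammar symbol this node represents
    SyntaxElement*[] children;  // ordered subtrees; no virtual dispatch needed
    string text;                // slice of the source text, for leaf nodes
}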

This kind of architecture leads to other interesting benefits, like being able to assert which symbols a pattern is designed to handle or which symbols are allowed to exist in the AST at any point in time. Thus if you write a lowering that introduces nodes that a later pass can't handle, you'll know very quickly, at least in principle.

I wanted to make such a front-end so that I could easily make a C backend.  I believe such a compiler would be able to do that with great ease.  I really want a D compiler that can output ANSI C code that can be used with few or no OS/CPU dependencies.  I would be willing to lose a lot of the nifty parallelism/concurrency stuff and deal with reference counting instead of full garbage collection, as long as it lets me EASILY target new systems (any phone, console platform, and some embedded microcontrollers).  Then what I have is something that's as ubiquitous as C, but adds a lot of useful features like exception handling, dynamic arrays, templates, CTFE, etc etc.  My ideas for how to deal with ASTs in pattern recognition and substitution followed from this.

Needing to use D in places where it isn't available is a real pain-point for me right now, and I'll probably find ways to spend time on it eventually.
July 07, 2012
On Saturday, 7 July 2012 at 18:55:57 UTC, Chad J wrote:
> I was thinking the same thing.
>
> My intent is to create a kind of regular-expression-of-nodes with push/pop operators to recognize ascent and descent on the tree.  Such a regular expression would allow one to capture subtrees out of generalized patterns and then place them into new trees that then become the input for the next pattern or set of patterns.  I think this is much closer to how I conceptualize semantic analysis than how it is done in front ends like DMD: it should be done with pattern recognition and substitution, not with myriad nested if-statements and while-loops.
Funny, Dmitry and I discussed a similar idea a few weeks ago: introducing some hybrid of regex and XPath for querying, searching and traversing the AST. A custom NDepend-like Code Query Language was the major alternative we considered. The discussion started on this forum and continued via email.
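
As a sketch of that querying idea (hypothetical node type and API; roughly what an XPath-like descendant query would do):

struct Node { string kind; Node*[] children; }

// Collect every node of the given kind in the subtree rooted at `root`.
Node*[] descendants(Node* root, string kind)
{
    Node*[] found;
    void walk(Node* n)
    {
        if (n.kind == kind)
            found ~= n;
        foreach (child; n.children)
            walk(child);
    }
    walk(root);
    return found;
}

// Apply each step of the path to the frontier found by the previous step,
// e.g. query(root, ["While", "Expression"]).
Node*[] query(Node* root, string[] path)
{
    Node*[] frontier = [root];
    foreach (kind; path)
    {
        Node*[] next;
        foreach (n; frontier)
            next ~= descendants(n, kind);
        frontier = next;
    }
    return frontier;
}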

> In this vision I do not use classes and inheritance for my AST.
>  Instead I use structs that contain some kind of nodeType member that would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE in the above code.  Dynamic dispatch is instead performed by (very fast) DFAs recognizing parts of the AST.

Exactly. This idea first came to me in April, after I implemented the first top-down recursive-descent parser for a D subset. I tried the Visitor pattern before that and wasn't happy. There are some subtle difficulties which I believe will be possible to overcome, the most important being the need to introduce a mechanism for hierarchical classification (like a pow expression being an assign expression at the same time).
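
One possible mechanism, sketched under the assumption that node kinds can be laid out deliberately: give each category a contiguous range of enum values, so classification becomes an interval check rather than dynamic dispatch:

enum NodeKind
{
    exprBegin,
        assignExprBegin = exprBegin,
            powExpr,        // a pow expression is also an assign expression
            addExpr,
        assignExprEnd,
    exprEnd,
    whileStatement,
}

// "Is k some kind of assign expression?" is just a range test.
bool isAssignExpr(NodeKind k)
{
    return k >= NodeKind.assignExprBegin && k < NodeKind.assignExprEnd;
}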

July 07, 2012
On 07-Jul-12 22:25, Andrei Alexandrescu wrote:
> On 7/7/12 7:33 AM, Dmitry Olshansky wrote:
>> On 07-Jul-12 15:30, Roman D. Boiko wrote:
>>> On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:
>>>> I think that Pegged can be heavily optimized in performance, and there
>>>> are no
>>>> fundamental issues which would make it inherently slower than LALR or
>>>> a hand-written D-specific parser.
>>>
>>> Hmm... found an interesting article:
>>> http://www.antlr.org/papers/LL-star-PLDI11.pdf
>>>
>>> It describes some disadvantages of Packrat parsing, like problems with
>>> debugging and error recovery. These are important for DCT, so I'll have
>>> to perform additional research.
>>
>> Yup, LL(*) is my favorite so far.
>
> That's Terence Parr's discovery, right? I've always liked ANTLR, so if
> PEGs turn out to have issues LL(*) sounds like a promising alternative.
>
> How many semantic hacks does D's syntax need for an LL(*) parser?
>

I believe that it may need very few _semantic_ predicates. But there are cases where infinite lookahead is a must. Can't recall which cases offhand.

>
> Andrei
>
>


-- 
Dmitry Olshansky


July 07, 2012
On 07-Jul-12 22:23, Andrei Alexandrescu wrote:
> On 7/7/12 6:24 AM, Roman D. Boiko wrote:
>> On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:
>>> http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk
>>>
>>>
>>
>> So far it looks like LALR parsers may have lower constant factors than
>> Packrat.
>>
>> The difference could be minimized by paying attention to parsing of
>> terminal symbols, which was in my plans already. It is not necessary to
>> strictly follow the Packrat parsing algorithm.
>>
>> The benefits of Pegged, in my view, are its support of Parsing
>> Expression Grammar (PEG) and compile-time evaluation. It is easily
>> extensible and modifiable.
>
> Isn't it also a benefit that lexing and parsing are integrated? With
> traditional LALR you need a separate tokenizer.
>

I'll have to point out that the whole point about integrated lexing is moot. It's more of a liability than a benefit. At the very least it's just an implementation curiosity, not an advantage.

>> When I implemented a recursive-descent parser by hand in one of the
>> early drafts of DCT, I strongly felt the need to generalize the code in
>> a way which in retrospect I would call PEG-like. The structure of my
>> hand-written recursive-descent parser was a one-to-one mapping to an
>> implemented subset of the D specification, and I considered it
>> problematic because I needed to duplicate the same structure in the
>> resulting AST.
>>
>> PEG is basically a language that describes both the implementation of
>> the parser and the language syntax. It greatly reduces implicit code
>> duplication.
>>
>> I think that generated code can be made almost as fast as a hand-written
>> parser for a particular language (probably a few percent slower),
>> especially if that language is similar to D (context-free, with a
>> fine-grained hierarchical grammar). Optimizations might require
>> abandoning strict adherence to any theoretical algorithm, but that
>> should be OK.
>
> All that sounds really encouraging. I'm really looking forward to more
> work in that area. If you stumble upon bugs that block you, let us know
> and Walter agreed he'll boost their priority.
>
>
> Andrei


-- 
Dmitry Olshansky