July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 20:09:56 UTC, Andrei Alexandrescu 
wrote:
> On 7/7/12 3:17 PM, Dmitry Olshansky wrote:
>> I'll have to point out that the whole point about integrated 
>> lexing is
>> moot. It's more of liability then benefit. At very least it's 
>> just
>> implementation curiosity not advantage.
>
> Interesting. I'll have to ask for more details on why.
>
> Andrei

+1. Personally I associate PEG with a parser that includes a 
__distributed__ lexer inside, which gives the advantage of having 
fewer alternatives to check at each step (that's similar to 
deterministic parsing). If the lexer is separate, it seems to be 
more difficult to scan for only a subset of possible tokens.
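
To make the idea concrete, here is a small Python sketch (the 
token names and patterns are invented for illustration) of a 
lexer that only probes the token kinds the parser currently 
expects, rather than trying every token in the language:

```python
import re

# Hypothetical token kinds and patterns, for illustration only.
TOKEN_PATTERNS = {
    "NUMBER": r"\d+",
    "IDENT": r"[A-Za-z_]\w*",
    "PLUS": r"\+",
    "LPAREN": r"\(",
}

def lex_expected(src, pos, expected):
    """Probe only the token kinds the parser currently expects."""
    for kind in expected:
        m = re.match(TOKEN_PATTERNS[kind], src[pos:])
        if m:
            return kind, m.group(0), pos + m.end()
    return None  # none of the expected tokens start here

# At the start of an expression the parser probes a small subset:
print(lex_expected("foo + 1", 0, ["NUMBER", "IDENT", "LPAREN"]))
# prints ('IDENT', 'foo', 3)
```

A standalone lexer would have to try every token kind here; the 
"distributed" version gets the expected set for free from the 
grammar position.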
July 07, 2012
Re: Let's stop parser Hell
> 	auto captures = syntaxNode.matchNodes(
> 		TOK_WHILE_NODE,
> 		OP_ENTER_NODE,
> 			OP_CAPTURE(0),
> 			OP_BEGIN,
> 				TOK_EXPRESSION,
> 			OP_END,
> 			OP_CAPTURE(1),
> 			OP_BEGIN,
> 				TOK_STATEMENT,
> 			OP_END,
> 		OP_LEAVE_NODE);

I'm glad to hear you like the tree-parsing approach, Chad, 
although the particular syntax here looks pretty unfriendly :O -- 
does this represent something that you are working on right now?

> This kind of architecture leads to other interesting benefits, 
> like being able to assert which symbols a pattern is designed 
> to handle or which symbols are allowed to exist in the AST at 
> any point in time. Thus if you write a lowering that introduces 
> nodes that a later pass can't handle, you'll know very quickly, 
> at least in principle.
>
> I wanted to make such a front-end so that I could easily make a 
> C backend.  I believe such a compiler would be able to do that 
> with great ease.  I really want a D compiler that can output 
> ANSI C code that can be used with few or no OS/CPU 
> dependencies.  I would be willing to lose a lot of the nifty 
> parallelism/concurrency stuff and deal with reference counting 
> instead of full garbage collection, as long as it lets me 
> EASILY target new systems (any phone, console platform, and 
> some embedded microcontrollers).  Then what I have is something 
> that's as ubiquitous as C, but adds a lot of useful features 
> like exception handling, dynamic arrays, templates, CTFE, etc 
> etc.  My ideas for how to deal with ASTs in pattern recognition 
> and substitution followed from this.

I tend to agree that it would be better to have a "general" node 
class with the node type as a property rather than a subtype and 
rather than a myriad of independent types, although in the past I 
haven't been able to figure out how to make this approach 
simultaneously general, efficient, and easy to use. I'm kind of a 
perfectionist which perhaps holds me back sometimes :)
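
For what it's worth, here's a tiny Python sketch of the "general 
node" idea, with the node type carried as plain data rather than 
as a subclass (the kind names and the `find` helper are invented 
for illustration):

```python
class Node:
    """One general node class; the kind is data, not a subclass."""
    def __init__(self, kind, value=None, children=None):
        self.kind = kind          # e.g. "while", "expr", "stmt"
        self.value = value        # leaf payload (token text), if any
        self.children = children or []

    def find(self, kind):
        """Yield all descendants of the given kind, in preorder."""
        for child in self.children:
            if child.kind == kind:
                yield child
            yield from child.find(kind)

loop = Node("while", children=[Node("expr", "x < 10"),
                               Node("stmt", "x++;")])
print([n.value for n in loop.find("stmt")])   # prints ['x++;']
```

Generic traversals like `find` fall out almost for free with this 
layout; the hard part (which this sketch dodges) is keeping it 
efficient and type-safe at the same time.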

I'd like to add that if we give tree parsing first-class 
treatment, I believe the most logical approach to parsing has 
three or more stages instead of the traditional two (lex+parse):

1. Lexer
2. Tree-ification
3. Parsing to AST (which may itself use multiple stages, e.g. 
parse the declarations first, then parse function bodies later)

The new stage two simply groups things that are in parenthesis 
and braces. So an input stream such as the following:

A man (from a [very ugly] house in the suburbs) was quoted as 
saying {
    I saw Batman (and Robin) last night!
}

Is converted to a tree where everything parenthesized or braced 
gets to be a child:

A man (
   from a [
       very ugly
   ] house in the suburbs
) was quoted as saying {
   ...
}
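
For concreteness, here's a minimal Python sketch of such a 
grouping stage (the list-based tree shape and the already-split 
token stream are simplifying assumptions of the sketch):

```python
OPENERS = {"(": ")", "[": "]", "{": "}"}

def treeify(tokens):
    """Group a flat token stream into a nested list by (), [], {}."""
    root = []
    stack = [(root, None)]     # (children being built, expected closer)
    for tok in tokens:
        if tok in OPENERS:
            group = [tok]
            stack[-1][0].append(group)
            stack.append((group, OPENERS[tok]))
        elif tok in OPENERS.values():
            group, closer = stack.pop()
            if tok != closer:
                raise SyntaxError("mismatched " + tok)
            group.append(tok)
        else:
            stack[-1][0].append(tok)
    if len(stack) != 1:
        raise SyntaxError("unclosed bracket")
    return root

print(treeify(["A", "man", "(", "from", "[", "very", "ugly", "]",
               "house", ")", "!"]))
# prints ['A', 'man', ['(', 'from', ['[', 'very', 'ugly', ']'],
#         'house', ')'], '!']
```

Note it needs no knowledge of the language beyond the three 
bracket pairs, which is exactly the language-agnostic property 
claimed below.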

Some of the things I like about this approach are:

1. It's language-agnostic. Lots of languages and DSLs could 
re-use exactly the same code from stage 2. (Stage 1, also, is 
fairly similar between languages and I wonder if a parameterized 
standard lexer is a worthwhile pursuit.)

2. It mostly eliminates the need for arbitrary-length lookahead 
for things like D's template_functions(...)(...). Of course, the 
tokens will almost always end up getting scanned twice, but hey, 
at least you know you won't need to scan them more than twice, 
right? (er, of course the semantic analysis will scan it several 
times anyway. Maybe this point is moot.)

3. It is very efficient for tools that don't need to examine 
function bodies. Such tools can easily leave out that part of the 
parser simply by not invoking the function-body sub-parser.

4. It leaves open the door to supporting embedded DSLs in the 
future. It's trivial to just ignore a block of text in braces and 
hand it off to a DSL later. It is similar to the way PEGs allow 
several different parties to contribute parts of a grammar, 
except that this approach does not constrain all the parties to 
actually use PEGs; for instance if I am a really lazy DSL author 
and I already have a SQL parser lying around (whether it's 
LL(k), LALR, whatever), I can just feed the original input text 
to that parser (or, better, use the flat token stream, sans 
comments, that came out of the lexer.)

5. It's risky 'cause I've never heard of anyone taking this 
approach before. Bring on the danger!

I have observed that most PLs (Programming Langs) use one of two 
versions of stage 2: (1) C-style, with structure indicated 
entirely with {}, (), [], and possibly <> (shudder), or (2) 
Python-style, with structure indicated by indentation instead of 
{}. My favorite is the Boo language, which combines these two, 
using Python style by default, but also having a WSA parsing mode 
(whitespace-agnostic) with braces, and switching to WSA mode 
inside a Python-style module whenever the user uses an opener 
("(,{,[") (IIRC). So a single Boo-like stage 2 could be re-used 
for almost all PLs, and thus would be a reasonable candidate as 
part of the standard library or a "D Boost library" (there is no 
such thing, is there?).

> I wanted to make such a front-end so that I could easily make a 
> C backend.  I believe such a compiler would be able to do that 
> with great ease.
>
> Needing to use D in places where it isn't available is a real 
> pain-point for me right now, and I'll probably find ways to 
> spend time on it eventually.

Yeah, with a tree-transforming parser, I imagine the same thing, 
except my current fetish is to convert a certain subset of D to 
multiple other languages automatically. Then I could write 
libraries that can easily be used by an astonishingly large 
audience. I certainly would like to see D targeting Android, but 
that's best done directly from D to ARM.

Anyway, the devil's in the detail. Originally I wanted to do a 
parser generator and a "completely general AST" in C# and 
couldn't seem to work out the details, but D is more flexible and 
is likely better suited to the task.
July 07, 2012
Re: Let's stop parser Hell
On 07/07/2012 08:55 PM, Chad J wrote:
> On 07/07/2012 02:23 PM, David Piepgrass wrote:
>>> Note that PEG does not impose to use packrat parsing, even though it
>>> was developed to use it. I think it's a historical 'accident' that put
>>> the two together: Bryan Ford thesis used the two together.
>>
>> Interesting. After trying to use ANTLR-C# several years back, I got
>> disillusioned because nobody was interested in fixing the bugs in it
>> (ANTLR's author is a Java guy first and foremost) and the source code of
>> the required libraries didn't have source code or a license (wtf.)
>>
>> So, for awhile I was thinking about how I might make my own parser
>> generator that was "better" than ANTLR. I liked the syntax of PEG
>> descriptions, but I was concerned about the performance hit of packrat
>> and, besides, I already liked the syntax and flexibility of ANTLR. So my
>> idea was to make something that was LL(k) and mixed the syntax of ANTLR
>> and PEG while using more sane (IMO) semantics than ANTLR did at the time
>> (I've no idea if ANTLR 3 still uses the same semantics today...) All of
>> this is 'water under the bridge' now, but I hand-wrote a lexer to help
>> me plan out how my parser-generator would produce code. The output code
>> was to be both more efficient and significantly more readable than
>> ANTLR's output. I didn't get around to writing the parser-generator
>> itself but I'll have a look back at my handmade lexer for inspiration.
>>
>>>> However, as I found a few hours ago, Packrat parsing (typically used to
>>>> handle PEG) has serious disadvantages: it complicates debugging
>>>> because of
>>>> frequent backtracking, it has problems with error recovery, and
>>>> typically
>>>> disallows to add actions with side effects (because of possibility of
>>>> backtracking). These are important enough to reconsider my plans of
>>>> using
>>>> Pegged. I will try to analyze whether the issues are so fundamental
>>>> that I
>>>> (or somebody else) will have to create an ANTLR-like parser instead, or
>>>> whether it is possible to introduce changes into Pegged that would
>>>> fix these
>>>> problems.
>>
>> I don't like the sound of this either. Even if PEGs were fast,
>> difficulty in debugging, error handling, etc. would give me pause. I
>> insist on well-rounded tools. For example, even though LALR(1) may be
>> the fastest type of parser (is it?), I prefer not to use it due to its
>> inflexibility (it just doesn't like some reasonable grammars), and the
>> fact that the generated code is totally unreadable and hard to debug
>> (mind you, when I learned LALR in school I found that it is possible to
>> visualize how it works in a pretty intuitive way--but debuggers won't do
>> that for you.)
>>
>> While PEGs are clearly far more flexible than LALR and probably more
>> flexible than LL(k), I am a big fan of old-fashioned recursive descent
>> because it's very flexible (easy to insert actions during parsing, and
>> it's possible to use custom parsing code in certain places, if
>> necessary*) and the parser generator's output is potentially very
>> straightforward to understand and debug. In my mind, the main reason you
>> want to use a parser generator instead of hand-coding is convenience,
>> e.g. (1) to compress the grammar down so you can see it clearly, (2)
>> have the PG compute the first-sets and follow-sets for you, (3) get
>> reasonably automatic error handling.
>>
>> * (If the language you want to parse is well-designed, you'll probably
>> not need much custom parsing. But it's a nice thing to offer in a
>> general-purpose parser generator.)
>>
>> I'm not totally sure yet how to support good error messages, efficiency
>> and straightforward output at the same time, but by the power of D I'm
>> sure I could think of something...
>>
>> I would like to submit another approach to parsing that I dare say is my
>> favorite, even though I have hardly used it at all yet. ANTLR offers
>> something called "tree parsing" that is extremely cool. It parses trees
>> instead of linear token streams, and produces other trees as output. I
>> don't have a good sense of how tree parsing works, but I think that some
>> kind of tree-based parser generator could become the basis for a very
>> flexible and easy-to-understand D front-end. If a PG operates on trees
>> instead of linear token streams, I have a sneaky suspicion that it could
>> revolutionize how a compiler front-end works.
>>
>> Why? because right now parsers operate just once, on the user's input,
>> and from there you manipulate the AST with "ordinary" code. But if you
>> have a tree parser, you can routinely manipulate and transform parts of
>> the tree with a sequence of independent parsers and grammars. Thus,
>> parsers would replace a lot of things for which you would otherwise use
>> a visitor pattern, or something. I think I'll try to sketch out this
>> idea in more detail later.
>
> I was thinking the same thing.
>
> My intent is to create a kind of regular-expression-of-nodes with
> push/pop operators to recognize ascent and descent on the tree.  Such a
> regular expression would allow one to capture subtrees out of
> generalized patterns and then place them into new trees that then become
> the input for the next pattern or set of patterns.  I think this is much
> closer to how I conceptualize semantic analysis than how semantic
> analysis is done in front ends like DMD: it should to be done with
> pattern recognition and substitution, not with myriad nested
> if-statements and while-loops.
>
> My vision is to have code similar to this in the front-end:
>
> /+
> Lower
>      while ( boolExpr )
>      {
>          statements;
>      }
>
> Into
>
>      loopAgain:
>      if ( !boolExpr )
>          goto exitLoop
>      statements;
>      goto loopAgain
>      exitLoop:
> +/
> void lowerWhileStatement( SyntaxElement* syntaxNode )
> {
>      auto captures = syntaxNode.matchNodes(
>          TOK_WHILE_NODE,
>          OP_ENTER_NODE,
>              OP_CAPTURE(0),
>              OP_BEGIN,
>                  TOK_EXPRESSION,
>              OP_END,
>              OP_CAPTURE(1),
>              OP_BEGIN,
>                  TOK_STATEMENT,
>              OP_END,
>          OP_LEAVE_NODE);
>
>      if ( captures is null )
>          return;
>
>      syntaxNode.replaceWith(
>          LabelNode("loopAgain"),
>          TOK_IF_STATEMENT,
>          OP_INSERT,
>          OP_BEGIN,
>              TOK_NEGATE,
>              OP_INSERT,
>              OP_BEGIN,
>                  captures[0], // Expression
>              OP_END,
>              GotoStatement("exitLoop"),
>          OP_END,
>          captures[1], // statements
>          GotoStatement("loopAgain"),
>          LabelNode("exitLoop")
>          );
> }
>
>
> The specifics will easily change.

I'd suggest:

AstOp!`
Lower
    while ( boolExpr )
    {
        statements;
    }

Into
    loopAgain:
    if ( !boolExpr ) {
        statements;
    } else goto exitLoop
    goto loopAgain
    exitLoop:

`.run(syntaxNode);
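
Whatever the surface syntax ends up being, the transformation 
underneath is simple. Here's a hedged Python sketch of the 
while-loop lowering on tuple-shaped nodes (the node shapes are 
invented for this sketch, and it deliberately ignores 
break/continue in the body):

```python
def lower_while(node):
    """Rewrite ('while', cond, body) into explicit labels/gotos:
       loopAgain: if (!cond) goto exitLoop; body;
       goto loopAgain; exitLoop:
    """
    if not (isinstance(node, tuple) and node and node[0] == "while"):
        return node                 # pattern didn't match; leave as-is
    _, cond, body = node
    return ("block",
            ("label", "loopAgain"),
            ("if", ("not", cond), ("goto", "exitLoop")),
            body,
            ("goto", "loopAgain"),
            ("label", "exitLoop"))

print(lower_while(("while", ("lt", "x", "10"), ("call", "f"))))
```

A pattern DSL like the one proposed above would essentially 
generate the match-and-rebuild code that this function writes out 
by hand.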
July 07, 2012
Re: Let's stop parser Hell
On 08-Jul-12 00:09, Andrei Alexandrescu wrote:
> On 7/7/12 3:17 PM, Dmitry Olshansky wrote:
>> I'll have to point out that the whole point about integrated lexing is
>> moot. It's more of liability then benefit. At very least it's just
>> implementation curiosity not advantage.
>
> Interesting. I'll have to ask for more details on why.
>

I think I said this before, but anyway the main point is:
the reason a lexer exists is performance.

It's obvious that one can write a grammar without ever using a 
lexer. If the notation is good, say regex or the one used in PEG 
(which is not quite the same, it turns out), then tokens can 
easily be described in the grammar. A lexer can give us only 2 
things:
	- manageability: it's just splitting the grammar in 2 parts 
that can be maintained separately (in fact some "lexers" are not 
DFAs, nor even regular); sometimes it could go in the opposite 
direction - it could be better to have one integrated grammar;
	- speed: the reason it can be faster is that a lexer can use a 
highly deterministic grammar.

Now back to PEGs and packrats. What I see everywhere I look is 
essentially the integration of a backtracking-NFA lexer, which is 
neither fast nor particularly convenient. The notation is a whole 
other matter.

So my view on the matter: whether or not the lexer is integrated 
is a two-fold question: notation vs. implementation.

E.g. the notation may make regular, lexer-like things part of the 
grammar, thus having rules like:
	Identifier  -> Alpha (Alpha|Number|_)*

Yet the implementation may or may not use the same engine for 
_all_ _rules_; in fact, if the implementation optimizes things by 
ripping the regular parts of the grammar out into small DFA 
engines, it makes some kind of lexer behind the scenes!
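
To illustrate, here's a rough Python sketch of that optimization: 
the regular sub-rules of a grammar are precompiled into dedicated 
matchers that the parsing engine calls as a unit (Python's `re` 
is only a stand-in for a true DFA engine, and the rule names are 
invented for the example):

```python
import re

# Regular sub-rules of the grammar, compiled once up front.
REGULAR_RULES = {
    "Identifier": re.compile(r"[A-Za-z][A-Za-z0-9_]*"),
    "Number": re.compile(r"[0-9]+"),
}

def match_regular(rule, src, pos):
    """Run one ripped-out regular rule at a position; the parsing
    engine never steps through these characters one by one."""
    m = REGULAR_RULES[rule].match(src, pos)
    return (m.group(0), m.end()) if m else None

print(match_regular("Identifier", "foo42 bar", 0))
# prints ('foo42', 5)
```

The character-level alternatives disappear from the parser 
proper, which is exactly the "lexer behind the scenes" effect.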

Anyway, highly hybrid parsers are all the rage now, and the 
reasons are obvious - they run fast and can be bent by some magic 
to the quite convoluted grammars of modern languages ;)

-- 
Dmitry Olshansky
July 07, 2012
Re: Let's stop parser Hell
On 08-Jul-12 00:19, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 20:09:56 UTC, Andrei Alexandrescu wrote:
>> On 7/7/12 3:17 PM, Dmitry Olshansky wrote:
>>> I'll have to point out that the whole point about integrated lexing is
>>> moot. It's more of liability then benefit. At very least it's just
>>> implementation curiosity not advantage.
>>
>> Interesting. I'll have to ask for more details on why.
>>
>> Andrei
>
> +1. Personally I associate PEG with a parser that includes a
> __distributed__ lexer inside,


Well put, yet it's still a lexer. And most horribly, it's not 
optimized to DFAs most of the time, last time I checked the 
papers.

> which gives the advantage of having to
> check less alternatives at each step (that's similar to deterministic
> parsing). If lexer is separate, it seems to be more difficult to scan
> for only a subset of possible tokens.
>
And given the backtracking nature of PEGs, you'll do your 
distributed thing many times over or ... spend a lot of RAM to 
remember not to redo it. I recall lexing takes even more time 
than parsing itself.

-- 
Dmitry Olshansky
July 07, 2012
Re: Let's stop parser Hell
On 07/07/2012 10:26 PM, Timon Gehr wrote:
> On 07/07/2012 08:55 PM, Chad J wrote:
>>  ...
>>
>> The specifics will easily change.
>
> I'd suggest:
>
> AstOp!`
> Lower
>      while ( boolExpr )
>      {
>          statements;
>      }
>
> Into
>      loopAgain:
>      if ( !boolExpr ) {
>          statements;
>      } else goto exitLoop
>      goto loopAgain
>      exitLoop:
>
> `.run(syntaxNode);
>

Also, I'd like to point out that the transformation is not 
actually valid, because the loop body may contain break/continue 
statements.
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 20:26:07 UTC, David Piepgrass wrote:
> I'd like to add that if we give tree parsing first-class 
> treatment, I believe the most logical approach to parsing has 
> three or more stages instead of the traditional two (lex+parse):
>
> 1. Lexer
> 2. Tree-ification
> 3. Parsing to AST (which may itself use multiple stages, e.g. 
> parse the declarations first, then parse function bodies later)
>
> The new stage two simply groups things that are in parenthesis 
> and braces. So an input stream such as the following:

I bet that after stage 2 you would have performed almost the same 
amount of work (in other words, spent almost the same time) as 
you would if you did full parsing. After that you would need to 
iterate the whole tree (possibly multiple times), modify (or 
recreate if the AST is immutable) its nodes, etc. Altogether this 
might be a lot of overhead.

My opinion is that tree manipulation is something that should be 
available to clients of parser-as-a-library or even of 
parser+semantic analyzer, but not necessarily advantageous for 
parser itself.
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 20:29:26 UTC, Dmitry Olshansky wrote:
> And given the backtracking nature of PEGs you'll do your 
> distributed thing many times over or ... spend a lot of RAM to 
> remember not to redo it. I recall lexing takes even more then 
> parsing itself.

I think that your conclusions are about statistical evidence of 
PEG misuse and poor PEG parser implementations. My point was 
that there is nothing fundamentally worse in having the lexer 
integrated with the parser, but there are performance advantages 
in having fewer possible cases to check when the structural 
information is available (so that lexSmth could be called when 
Smth is expected, thus requiring fewer switch branches if a 
switch is used).

As for lexing multiple times, simply use a free list of terminals 
(aka tokens). I still assume that the grammar is properly 
defined, so that there is only one way to split the source into 
tokens.
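
As a rough illustration of that free-list idea, here's a Python 
sketch that memoizes terminals by position, so backtracking 
alternatives never re-lex the same spot (the toy token pattern is 
invented; a real implementation would also key the memo by source 
buffer):

```python
import re

TOKEN_RE = re.compile(r"\w+|\S")   # toy tokenizer, for illustration
_memo = {}                         # position -> (token text, next position)

def token_at(src, pos):
    """Lex the token at pos once; backtracking alternatives that
    revisit the same position reuse the stored terminal."""
    if pos not in _memo:
        m = TOKEN_RE.match(src, pos)
        _memo[pos] = (m.group(0), m.end()) if m else None
    return _memo[pos]

# Two alternatives backtracking over the same input share the work:
print(token_at("while (x)", 0))   # prints ('while', 5)
print(token_at("while (x)", 0))   # cache hit, no re-lexing
```

This only pays off because the grammar splits the source into 
tokens in exactly one way, as assumed above.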
July 07, 2012
Re: Let's stop parser Hell
On 08-Jul-12 00:50, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 20:29:26 UTC, Dmitry Olshansky wrote:
>> And given the backtracking nature of PEGs you'll do your distributed
>> thing many times over or ... spend a lot of RAM to remember not to
>> redo it. I recall lexing takes even more then parsing itself.
>
> I think that your conclusions are about statistical evidences of PEG
> misuses and poor PEG parser implementations. My point was that there is
> nothing fundamentally worse in having lexer integrated with parser, but
> there are performance advantages of having to check less possible cases
> when the structural information is available (so that lexSmth could be
> called when Smth is expected, thus requiring less switch branches if
> switch is used).

You may have misunderstood me as well; the point is still the 
same: there are 2 things - notation and implementation. The fact 
that the lexer is integrated in the notation, as in PEGs, is not 
related to the fact that PEGs in their classic definition never 
use the term "token" and do backtracking parsing essentially at 
the character level.



> As for lexing multiple times, simply use a free list of terminals (aka
> tokens). I still assume that grammar is properly defined, so that there
> is only one way to split source into tokens.
>

Tokens... there is no such term in use if we talk about 'pure' 
PEG.

-- 
Dmitry Olshansky
July 07, 2012
Re: Let's stop parser Hell
> Since I didn't understand your question I assume that my 
> statement was somehow incorrect (likely because I made some 
> wrong assumptions about ANTLR). I didn't know about its 
> existence until today and still don't understand it completely. 
> What I think I understood is that it uses DFA for deciding 
> which grammar rule to apply instead of doing backtracking. I 
> also think that it uses DFA for low-level scanning (I'm not 
> sure).

ANTLR 3 doesn't use a DFA unless it needs to. If unlimited 
lookahead is not called for, it uses standard LL(k) or perhaps it 
uses the modified (approximate? I forget the name) LL(k) from 
ANTLR 2. A DFA comes into play, for instance, if you need to 
check what comes after an argument list (of unlimited length) 
before you can decide that it *is* an argument list and start the 
"real" parsing. (The author says LL(k) is too inefficient, so he 
used a restricted form of it; personally I'm not convinced, but I 
digress.)