July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 21:08:43 UTC, Dmitry Olshansky wrote:
> You may have misunderstood me as well; the point is still the same:
> there are 2 things - notation and implementation. The fact that the 
> lexer is integrated in the notation, like in PEGs, is not related to 
> the fact that PEGs in their classic definition never use the term 
> Token and do backtracking parsing essentially at the character 
> level.

But PEG itself doesn't require backtracking parsing, does it? So 
that's an implementation detail, or a specific algorithm. Lexer 
separation seems to be independent of this.

>> As for lexing multiple times, simply use a free list of 
>> terminals (aka tokens). I still assume that grammar is 
>> properly defined, so that there is only one way to split 
>> source into tokens.
>>
>
> Tokens.. there is no such term in use if we talk about 'pure' 
> PEG.

Terminal symbols.
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 20:27:14 UTC, Dmitry Olshansky wrote:
> So my view on the matter: whether or not the lexer is integrated 
> is a two-fold question: notation vs. implementation.
>
> E.g. it may make regular lexer-like things a part of notation 
> thus having rules like:
> 	Identifier  -> Alpha (Alpha|Number|_)*
>
> Yet it may or may not use the same engine for _all_ _rules_; in 
> fact, if the implementation optimizes things by ripping off the 
> regular parts of the grammar to small DFA engines, it would make 
> some kind of lexer behind the scenes!

This is the implementation I have in mind as the only sane way to 
parse PEGs efficiently. Plus specializing the DFA to check only for 
those terminals which are allowed according to the structural 
information available at the moment of invocation.
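The idea of restricting the lexer to context-allowed terminals could be sketched like this (a Python illustration, not D; all terminal names and patterns are made up):

```python
import re

# Hypothetical terminal definitions for illustration only.
TERMINALS = {
    "IDENT":  re.compile(r"[A-Za-z_]\w*"),
    "NUMBER": re.compile(r"\d+"),
    "PLUS":   re.compile(r"\+"),
    "LPAREN": re.compile(r"\("),
}

def next_terminal(src, pos, allowed):
    """Try only the terminals permitted by the current structural context."""
    for name in allowed:                      # skip terminals the grammar rules out here
        m = TERMINALS[name].match(src, pos)
        if m:
            return name, m.group(), m.end()
    return None

# At the start of an expression, only IDENT/NUMBER/LPAREN can appear,
# so PLUS is never even attempted:
print(next_terminal("x + 1", 0, ["IDENT", "NUMBER", "LPAREN"]))
```

The parser passes in the `allowed` set based on where it currently is in the grammar, so the specialized recognizer never wastes work on terminals that could not occur.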

>
> Anyway, highly hybrid parsers are all the rage now, and the reasons 
> are obvious - they run fast and can be bent by some magic to fit 
> quite convoluted grammars of modern languages ;)
July 07, 2012
Re: Let's stop parser Hell
On 08-Jul-12 01:24, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 21:08:43 UTC, Dmitry Olshansky wrote:
>> You may have misunderstood me as well; the point is still the same:
>> there are 2 things - notation and implementation. The fact that the lexer
>> is integrated in the notation, like in PEGs, is not related to the fact
>> that PEGs in their classic definition never use the term Token and do
>> backtracking parsing essentially at the character level.
>
> But PEG itself doesn't require backtracking parsing, does it?

It does. Specifically, the algorithm is to try option A, try option B, 
try option C...

It's obvious how backtracking comes in ... when, say, A fails somewhere 
in the middle of it. Of course there is a fair amount of bookkeeping 
that reduces doing the work all over again, but it does ... allocate.

This could be avoided via disambiguation a-la LL(k) but that won't be 
true PEG anymore ;)
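The try-A-then-B-then-C behaviour could be sketched with toy combinators (a Python illustration, not a packrat: no memoization, so failed options rescan the input from the saved position):

```python
def lit(s):
    def parse(text, pos):
        if text.startswith(s, pos):
            return pos + len(s)       # success: new position
        return None                   # failure: caller backtracks to `pos`
    return parse

def choice(*options):
    def parse(text, pos):
        for opt in options:           # try A, then B, then C ...
            end = opt(text, pos)      # each retry restarts from `pos`: backtracking
            if end is not None:
                return end
        return None
    return parse

def seq(*parts):
    def parse(text, pos):
        for p in parts:
            pos = p(text, pos)
            if pos is None:
                return None           # failure mid-sequence discards partial progress
        return pos
    return parse

# "ab" / "ac": matching "ac" first tries "ab", fails on the second
# character, backtracks to position 0, then tries "ac".
g = choice(seq(lit("a"), lit("b")), seq(lit("a"), lit("c")))
print(g("ac", 0))  # -> 2
```

The work done inside the failed "ab" branch is thrown away and redone; memoizing intermediate results is exactly the bookkeeping (and allocation) mentioned above.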

> So that's an implementation detail, or a specific algorithm.
No.

> Lexer separation seems to be independent of this.
Yes.

>>> As for lexing multiple times, simply use a free list of terminals
>>> (aka tokens). I still assume that grammar is properly defined, so
>>> that there is only one way to split source into tokens.
>>>
>>
>> Tokens.. there is no such term in use if we talk about 'pure' PEG.
>
> Terminal symbols.
>
Characters.

-- 
Dmitry Olshansky
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 20:39:18 UTC, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 20:26:07 UTC, David Piepgrass wrote:
>> I'd like to add that if we give tree parsing first-class 
>> treatment, I believe the most logical approach to parsing has 
>> three or more stages instead of the traditional two 
>> (lex+parse):
>>
>> 1. Lexer
>> 2. Tree-ification
>> 3. Parsing to AST (which may itself use multiple stages, e.g. 
>> parse the declarations first, then parse function bodies later)
>>
>> The new stage two simply groups things that are in parenthesis 
>> and braces. So an input stream such as the following:
>
> I bet that after stage 2 you would have performed almost the 
> same amount of work (in other words, spent almost the same 
> time) as you would if you did full parsing. After that you 
> would need to iterate the whole tree (possibly multiple times), 
> modify (or recreate if the AST is immutable) its nodes, etc. 
> Altogether this might be a lot of overhead.
>
> My opinion is that tree manipulation is something that should 
> be available to clients of parser-as-a-library or even of 
> parser+semantic analyzer, but not necessarily advantageous for 
> parser itself.

Hmm, you've got a good point there, although simple 
tree-ification is clearly less work than standard parsing, since 
statements like "auto x = y + z;" would be quickly "blitted" into 
the same node in phase 2, but would become multiple separate 
nodes in phase 3.

What I like about it is not its performance, but how it matches 
the way we think about languages. Humans tend to see overall 
structure first, and examine the fine details later. The tree 
parsing approach is similarly nonlinear and can be modularized in 
a way that might be more intuitive than traditional EBNF.

On the other hand, one could argue it is /too/ flexible, 
admitting so many different approaches to parsing that a 
front-end based on this approach is not necessarily intuitive to 
follow; and of course, not using a standard EBNF-type grammar 
could be argued to be bad.

Still... it's a fun concept, and even if the initial parsing ends 
up using the good-old lex-parse approach, semantic analysis and 
lowering can benefit from a tree parser. Tree parsing, of course, 
is just a generalization of linear parsing and so a tree parser 
generator (TPG) could work equally well for flat input.
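The "tree-ification" stage described above could be sketched as follows (a Python illustration with a toy token list; a real implementation would of course track source locations and check that bracket kinds match):

```python
OPEN = {"(": ")", "{": "}"}
CLOSE = {")", "}"}

def treeify(tokens):
    """Phase 2: group a flat token stream by (), {} into a shallow tree,
    leaving everything between brackets as flat runs of tokens."""
    stack = [[]]
    for tok in tokens:
        if tok in OPEN:
            group = [tok]             # open a new subtree
            stack[-1].append(group)
            stack.append(group)
        elif tok in CLOSE:
            stack[-1].append(tok)
            stack.pop()               # close the current subtree
        else:
            stack[-1].append(tok)     # plain tokens stay flat
    return stack[0]

toks = ["void", "f", "(", ")", "{", "auto", "x", "=", "y", "+", "z", ";", "}"]
print(treeify(toks))
```

Note how `auto x = y + z ;` stays as one flat run inside the `{}` group; turning it into real expression nodes is left entirely to phase 3.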
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 21:35:52 UTC, Dmitry Olshansky wrote:
> On 08-Jul-12 01:24, Roman D. Boiko wrote:
>> But PEG itself doesn't require backtracking parsing, does it?
>
> It does. Specifically the algorithm is to try option A, try 
> option B, try option C...
There is no algorithm, only __grammar__. It specifies that option 
A has higher priority than option B in expression A / B / C. But 
it doesn't forbid, for example, analyzing all options in parallel 
(tracking state transitions) for each character consecutively, as 
a proper DFA implementation would do, until some option succeeds 
and all previous ones fail. A greedy regex would also have to 
check all successor options (C in the example above) to determine 
the longest one.

> it's obvious how backtracking comes in ... when, say, A fails 
> somewhere in the middle of it.
Simply stop tracking the respective state. Process the others in 
parallel, as I described above.

> Of course there is a fair amount of bookkeeping that reduces 
> doing work all over again but it does ... allocate.

The amount of memory used would be proportional to the length of 
input after the branching point, times the number of rules 
(actually, of states in the DFA, which is likely larger but still 
finite for a particular grammar).
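The parallel tracking could be sketched like this (a Python illustration where the "states" are just literal strings standing in for DFA states; this only demonstrates dropping failed alternatives per character, not full PEG semantics):

```python
def parallel_match(options, text):
    """Advance every live alternative one character at a time, dropping the
    ones that fail instead of backtracking. Returns the highest-priority
    option that matched a prefix of `text`, with its match length."""
    live = [0] * len(options)          # live[i] = position inside options[i]
    matched = [None] * len(options)    # matched[i] = match length, if any
    for pos, ch in enumerate(text):
        for i, opt in enumerate(options):
            if live[i] is None:
                continue               # this state was already dropped
            if live[i] < len(opt) and opt[live[i]] == ch:
                live[i] += 1
                if live[i] == len(opt):
                    matched[i] = pos + 1
                    live[i] = None     # finished; nothing more to track
            else:
                live[i] = None         # simply stop tracking this state
        if all(s is None for s in live):
            break                      # no live states left
    # Ordered choice: the first (highest-priority) successful option wins.
    for i, m in enumerate(matched):
        if m is not None:
            return options[i], m
    return None

print(parallel_match(["ab", "ac", "a"], "ac"))
```

Only one pass over the input is made; the memory cost is the per-alternative state, which is what the proportionality argument above is about.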

>>> Tokens.. there is no such term in use if we talk about 'pure' 
>>> PEG.
>>
>> Terminal symbols.
>>
> Characters.
Nope. According to the definition, a PEG is a set of non-terminal 
symbols, terminal symbols, production rules, and a starting 
non-terminal. Terminals are not necessarily characters.
July 07, 2012
Re: Let's stop parser Hell
> What I like about it is not its performance, but how it matches 
> the way we think about languages. Humans tend to see overall 
> structure first, and examine the fine details later. The tree 
> parsing approach is similarly nonlinear and can be modularized 
> in a way that might be more intuitive than traditional EBNF.

That reminds me, I forgot to mention another advantage I expected 
the three-phase approach to have, namely, that with three phases 
it seems easier to tell what the programmer "meant" in the face of 
errors. I mean, phase 2 can tell when braces and parentheses are 
not matched up properly, and then it can make reasonable guesses 
about where those missing braces/parentheses were meant to be, 
based on things like indentation. That would be especially
helpful when the parser is used in an IDE, since if the IDE 
guesses the intention correctly, it can still understand broken 
code and provide code completion for it. And since phase 2 is a 
standard tool, anybody's parser can use it.

Example:

void f() {
    if (cond)
        x = y + 1;
        y = z + 1;
    }
} // The error appears to be here, but it's really 4 lines up.
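The indentation heuristic could be sketched like this (a Python illustration; a real tool would be considerably smarter, but even this naive version points at the opening line rather than only at the final unmatched brace):

```python
def check_braces(lines):
    """Pair up '{' and '}' and flag closers whose indentation does not match
    their opener: a hint that a brace went missing near the opener, not at
    the last unmatched '}' where a linear parser first notices trouble."""
    stack = []      # (line_index, indent) of each open '{'
    problems = []
    for i, line in enumerate(lines):
        indent = len(line) - len(line.lstrip())
        for ch in line:
            if ch == "{":
                stack.append((i, indent))
            elif ch == "}":
                if not stack:
                    problems.append(("extra '}'", i))
                else:
                    j, open_indent = stack.pop()
                    if indent != open_indent:
                        problems.append(("indent mismatch with line", j))
    return problems

code = [
    "void f() {",
    "    if (cond)",
    "        x = y + 1;",
    "        y = z + 1;",
    "    }",
    "}",
]
print(check_braces(code))
```

For the example above, the mismatch is reported against the `void f() {` line rather than just the trailing `}`, which is enough of a hint for an IDE to guess that the missing `{` belongs after `if (cond)`.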
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 21:43:58 UTC, David Piepgrass wrote:
> Still... it's a fun concept, and even if the initial parsing 
> ends up using the good-old lex-parse approach, semantic 
> analysis and lowering can benefit from a tree parser. Tree 
> parsing, of course, is just a generalization of linear parsing 
> and so a tree parser generator (TPG) could work equally well 
> for flat input.

Exactly. Semantic analysis is what would benefit from that, as 
well as client code (for example, refactoring tools, etc.)

The parser would become quite complicated. Probably it would need 
to perform complex tree navigation (which might decrease performance 
in proportion to the average tree depth, or even more if the parsing 
algorithms were themselves fast). Any non-trivial query would 
require direct character manipulation (like comparing sub-strings 
of captures with string literals, etc.). Probably you would get 
frequent cache misses because of the need to jump throughout the 
(incomplete) AST. It would definitely be problematic to build an 
immutable AST model, thus (IMO) significantly and needlessly 
constraining you and users of your library. And likely you would 
have to deal with many more problems __without__ significant 
gains.
July 07, 2012
Re: Let's stop parser Hell
On 07/07/2012 03:13 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 18:55:57 UTC, Chad J wrote:
>
>> In this vision I do not use classes and inheritance for my AST.
>> Instead I use structs that contain some kind of nodeType member that
>> would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE
>> in the above code. Dynamic dispatch is instead performed by (very
>> fast) DFAs recognizing parts of the AST.
>
> Exactly. This idea first came to me in April after I implemented the
> first top-down recursive descent custom parser for a D subset. I tried
> Visitor pattern before that and wasn't happy. There are some subtle
> difficulties which I believe will be possible to overcome, most
> important being the need to introduce a mechanism for hierarchical
> classification (like a pow expression being an assign expression at the
> same time).
>

From some of my earlier scribblings:

alias ulong SyntaxElement; // base type needed for the enum to compile

enum : SyntaxElement
{
  AST_EXPRESSION          = 0x0001_0000_0000_0000,
    AST_UNARY_EXPR        = 0x0000_0001_0000_0000 | AST_EXPRESSION,
      AST_NEGATE_EXPR     = 0x0000_0000_0001_0000 |  AST_UNARY_EXPR,
      AST_COMPLIMENT_EXPR = 0x0000_0000_0002_0000 |  AST_UNARY_EXPR,
      AST_POST_ADD_EXPR   = 0x0000_0000_0003_0000 |  AST_UNARY_EXPR,
      AST_POST_SUB_EXPR   = 0x0000_0000_0004_0000 |  AST_UNARY_EXPR,
      AST_PRE_ADD_EXPR    = 0x0000_0000_0005_0000 |  AST_UNARY_EXPR,
      AST_PRE_SUB_EXPR    = 0x0000_0000_0006_0000 |  AST_UNARY_EXPR,
    AST_BINARY_EXPR       = 0x0000_0002_0000_0000 | AST_EXPRESSION,
      AST_AND_EXPR        = 0x0000_0000_0001_0000 |  AST_BINARY_EXPR,
      AST_OR_EXPR         = 0x0000_0000_0002_0000 |  AST_BINARY_EXPR,
      AST_XOR_EXPR        = 0x0000_0000_0003_0000 |  AST_BINARY_EXPR,
      AST_AND_AND_EXPR    = 0x0000_0000_0004_0000 |  AST_BINARY_EXPR,
      AST_OR_OR_EXPR      = 0x0000_0000_0005_0000 |  AST_BINARY_EXPR,
      AST_ADD_EXPR        = 0x0000_0000_0006_0000 |  AST_BINARY_EXPR,
    AST_TRINARY_EXPR      = 0x0000_0003_0000_0000 | AST_EXPRESSION,
    AST_NARY_EXPR         = 0x0000_0004_0000_0000 | AST_EXPRESSION,
  AST_STATEMENT           = 0x0002_0000_0000_0000,
}


bool isA( SyntaxElement leafier, SyntaxElement rootier )
{
	SyntaxElement mask = 0;
	
	if ( rootier & 0x0000_0000_FFFF_FFFF )
	{
		if ( rootier & 0x0000_0000_0000_FFFF )
			mask = 0xFFFF_FFFF_FFFF_FFFF;
		else
			mask = 0xFFFF_FFFF_FFFF_0000;
	}
	else
	{
		if ( rootier & 0x0000_FFFF_FFFF_FFFF )
			mask = 0xFFFF_FFFF_0000_0000;
		else
			mask = 0xFFFF_0000_0000_0000;
	}
	
	return (leafier & mask) == rootier;
}

unittest
{
	assert(  isA( AST_EXPRESSION,  AST_EXPRESSION) );
	assert(  isA( AST_NEGATE_EXPR, AST_NEGATE_EXPR) );
	assert(  isA( AST_NEGATE_EXPR, AST_EXPRESSION) );
	assert(  isA( AST_NEGATE_EXPR, AST_UNARY_EXPR) );
	assert( !isA( AST_EXPRESSION,  AST_STATEMENT) );
	assert( !isA( AST_NEGATE_EXPR, AST_BINARY_EXPR) );
	assert( !isA( AST_NEGATE_EXPR, AST_STATEMENT) );
	assert( !isA( AST_NEGATE_EXPR, AST_COMPLIMENT_EXPR) );
}

This approach of course has shameful nesting limitations, but I feel 
like these determinations could be fairly well optimized even for the 
general case.  For example: another approach that I might be more 
inclined to take is to give each token/symbol a low-valued index into a 
small inheritance table.

I would expect the regex engine to call the isA function as one of its 
operations.  Thus placing an AST_EXPRESSION into your expression would 
also match an AST_NEGATE_EXPR.
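The alternative mentioned above (a low-valued index into a small inheritance table instead of bit tricks) could look roughly like this (a Python sketch with made-up node kinds; it has no nesting-depth limit, at the cost of a precomputed table):

```python
# Hypothetical node kinds and their parents; EXPRESSION and STATEMENT are roots.
KINDS = ["EXPRESSION", "UNARY_EXPR", "NEGATE_EXPR", "BINARY_EXPR", "STATEMENT"]
PARENT = {
    "UNARY_EXPR": "EXPRESSION",
    "NEGATE_EXPR": "UNARY_EXPR",
    "BINARY_EXPR": "EXPRESSION",
}

# Precompute the full ancestor set once; each isA query is then one lookup.
ANCESTORS = {}
for k in KINDS:
    chain, cur = {k}, k
    while cur in PARENT:
        cur = PARENT[cur]
        chain.add(cur)
    ANCESTORS[k] = chain

def is_a(leafier, rootier):
    """True if `leafier` is `rootier` or inherits from it."""
    return rootier in ANCESTORS[leafier]

print(is_a("NEGATE_EXPR", "EXPRESSION"))   # -> True
print(is_a("NEGATE_EXPR", "STATEMENT"))    # -> False
```

With interned integer indices instead of strings, the table becomes a flat bitset per kind, so the regex engine's isA call stays a couple of instructions.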
July 07, 2012
Re: Let's stop parser Hell
On 08-Jul-12 01:49, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 21:35:52 UTC, Dmitry Olshansky wrote:
>> On 08-Jul-12 01:24, Roman D. Boiko wrote:
>>> But PEG itself doesn't require backtracking parsing, does it?
>>
>> It does. Specifically the algorithm is to try option A, try option B,
>> try option C...
> There is no algorithm, only __grammar__. It specifies that option A has
> higher priority than option B in expression A / B / C. But it doesn't
> forbid, for example, analyzing all options in parallel (tracking state
> transitions) for each character consecutively, as a proper DFA
> implementation would do, until some option succeeds and all previous
> ones fail.

That would not always work, but yes that's what we should do IMHO.
It's not what packrats do however (when I say algorithm I mean 
specifically packrat). And PEGs are commonly associated with packrats.

> A greedy regex would also have to check all successor options (C
> in the example above) to determine the longest one.
>
Yes, yet a regex engine (in simple form) can 'merge' identical parallel 
states* easily; it's not so simple with general CFGs.

*As in NFA states; I like to think in terms of multiple NFA states vs. a 
single DFA state.

>> it's obvious how backtracking comes in ... whan say A fails somewhere
>> in the middle of it.
> Simply stop tracking respective state. Process others in parallel as I
> described above.
>
Yeah, it could be done, but it doesn't buy you a thing in a lot of cases, 
unlike in a regular grammar. The problem is that the state is not as 
simple as in a regex, for instance (even in a regex that must capture 
some submatches, a DFA won't do, BTW). Thus you can't merge seemingly 
identical states, because they depend on (different) interpretations of 
the previous part of the input.

That's why the DFA in ANTLR is only used for lookahead or lexing; it 
doesn't maintain a path through states during parsing.

>> Of course there is a fair amount of bookkeeping that reduces doing
>> work all over again but it does ... allocate.
>
> The amount of used memory would be proportional to the length of input
> after the branching point times number of rules (actually, of states in
> the DFA, which is likely larger but still finite for a particular grammar).
>
Finite and proportional to input size? Another way of saying it's the 
O(n) memory requirement of packrats?

>>>> Tokens.. there is no such term in use if we talk about 'pure' PEG.
>>>
>>> Terminal symbols.
>>>
>> Characters.
> Nope. According to the definition, PEG is a set of non-terminal symbols,
> terminal symbols, production rules, and a starting non-terminal.
> Terminals are not necessarily characters.

Yup. But I'm not talking definitions here, I'm talking about the notion 
of "integrated lexing", and lexing is certainly about characters... or 
was it?


-- 
Dmitry Olshansky
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 21:52:09 UTC, David Piepgrass wrote:
> it seems easier to tell what the programmer "meant" with three 
> phases, in the face of errors. I mean, phase 2 can tell when 
> braces and parenthesis are not matched up properly and then it 
> can make reasonable guesses about where those missing 
> braces/parenthesis were meant to be, based on things like 
> indentation. That would be especially helpful when the parser 
> is used in an IDE, since if the IDE guesses the intention 
> correctly, it can still understand broken code and provide code 
> completion for it. And since phase 2 is a standard tool, 
> anybody's parser can use it.

There could be multiple errors that compensate for each other, 
making your phase 2 succeed and preventing phase 3 from doing 
proper error handling. Even knowing that there is an error, in 
many cases you would not be able to create a meaningful error 
message. And any error would make your phase-2 tree incorrect, so 
it would be difficult to recover from it by inserting an 
additional token or ignoring tokens until the parser is able to 
continue its work properly. All of this would suffer for the same 
reason: you lose information.