July 07, 2012
On Saturday, 7 July 2012 at 21:08:43 UTC, Dmitry Olshansky wrote:
> You may have misunderstood me as well; the point is still the same:
> there are two things - notation and implementation. The fact that the lexer is integrated into the notation, like in PEGs, is not related to the fact that PEGs in their classic definition never use the term Token and do backtracking parsing essentially at the character level.

But PEG itself doesn't require backtracking parsing, does it? So that's an implementation detail, or a specific algorithm. Lexer separation seems to be independent of this.

>> As for lexing multiple times, simply use a free list of terminals (aka tokens). I still assume that grammar is properly defined, so that there is only one way to split source into tokens.
>>
>
> Tokens.. there is no such term in use if we talk about 'pure' PEG.

Terminal symbols.

July 07, 2012
On Saturday, 7 July 2012 at 20:27:14 UTC, Dmitry Olshansky wrote:
> So my view on the matter: whether or not the lexer is integrated is a two-fold question: notation vs. implementation.
>
> E.g. it may make regular, lexer-like things a part of the notation, thus having rules like:
> 	Identifier  -> Alpha (Alpha|Number|_)*
>
> Yet it may or may not use the same engine for _all_ _rules_; in fact, if the implementation optimizes things by splitting the regular parts of the grammar off into small DFA engines, it would effectively create some kind of lexer behind the scenes!

This is the implementation I have in mind as the only sane way to parse a PEG efficiently. Plus specializing each DFA to check only for those terminals which are allowed according to the structural information available at the moment of its invocation.
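Roughly what I have in mind (a sketch only; the names and types here are invented for illustration, not from an actual implementation):

// Hypothetical sketch: each grammar position knows which terminals may start
// the next construct, and the scanning engine is restricted to exactly that set.
enum Term { Identifier, IntLiteral, LParen, Semicolon }

struct ScanResult { bool ok; Term term; size_t length; }

// Only the small DFAs for the allowed terminals are consulted at this call site.
ScanResult scanRestricted(string input, size_t pos, const(Term)[] allowed)
{
    foreach (t; allowed)
    {
        auto r = runDfaFor(t, input, pos); // one small DFA per terminal class
        if (r.ok)
            return r;
    }
    return ScanResult(false);
}

// Stub standing in for a generated per-terminal DFA.
ScanResult runDfaFor(Term t, string input, size_t pos)
{
    return ScanResult(false, t, 0);
}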

>
> Anyway, highly hybrid parsers are all the rage now, and the reasons are obvious - they run fast and can be bent by some magic to fit the quite convoluted grammars of modern languages ;)


July 07, 2012
On 08-Jul-12 01:24, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 21:08:43 UTC, Dmitry Olshansky wrote:
>> You may have misunderstood me as well; the point is still the same:
>> there are two things - notation and implementation. The fact that the
>> lexer is integrated into the notation, like in PEGs, is not related to
>> the fact that PEGs in their classic definition never use the term Token
>> and do backtracking parsing essentially at the character level.
>
> But PEG itself doesn't require backtracking parsing, does it?

It does. Specifically the algorithm is to try option A, try option B, try option C...

It's obvious how backtracking comes in ... when, say, A fails somewhere in the middle of it. Of course there is a fair amount of bookkeeping that avoids redoing the work all over again, but it does ... allocate.

This could be avoided via disambiguation à la LL(k), but that won't be true PEG anymore ;)
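For illustration, a minimal hand-written sketch of how ordered choice behaves in a recursive-descent PEG-style parser (the rule names and the Parser type are invented here, not from any real implementation):

// Hypothetical recursive-descent fragment for the PEG choice  R <- A / B / C.
// Each alternative is tried in order; on failure the input position is
// restored (that restore is the backtracking being discussed).
struct Parser
{
    string input;
    size_t pos;

    bool parseR()
    {
        size_t saved = pos;          // remember where the choice started
        if (parseA()) return true;   // try option A
        pos = saved;                 // A failed somewhere in the middle: backtrack
        if (parseB()) return true;   // try option B
        pos = saved;
        return parseC();             // try option C, the last alternative
    }

    // Alternatives would be generated from the grammar; stubs here.
    bool parseA() { return false; }
    bool parseB() { return false; }
    bool parseC() { return false; }
}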

> So that's an implementation detail, or a specific algorithm.

No.

> Lexer separation seems to be independent of this.

Yes.

>>> As for lexing multiple times, simply use a free list of terminals
>>> (aka tokens). I still assume that grammar is properly defined, so
>>> that there is only one way to split source into tokens.
>>>
>>
>> Tokens.. there is no such term in use if we talk about 'pure' PEG.
>
> Terminal symbols.
>
Characters.

-- 
Dmitry Olshansky


July 07, 2012
On Saturday, 7 July 2012 at 20:39:18 UTC, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 20:26:07 UTC, David Piepgrass wrote:
>> I'd like to add that if we give tree parsing first-class treatment, I believe the most logical approach to parsing has three or more stages instead of the traditional two (lex+parse):
>>
>> 1. Lexer
>> 2. Tree-ification
>> 3. Parsing to AST (which may itself use multiple stages, e.g. parse the declarations first, then parse function bodies later)
>>
>> The new stage two simply groups things that are in parenthesis and braces. So an input stream such as the following:
>
> I bet that after stage 2 you would have performed almost the same amount of work (in other words, spent almost the same time) as you would if you did full parsing. After that you would need to iterate the whole tree (possibly multiple times), modify (or recreate if the AST is immutable) its nodes, etc. Altogether this might be a lot of overhead.
>
> My opinion is that tree manipulation is something that should be available to clients of parser-as-a-library or even of parser+semantic analyzer, but not necessarily advantageous for parser itself.

Hmm, you've got a good point there, although simple tree-ification is clearly less work than standard parsing, since statements like "auto x = y + z;" would be quickly "blitted" into the same node in phase 2, but would become multiple separate nodes in phase 3.
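To illustrate the kind of work phase 2 does (just a sketch; the Token/Node types are invented here, and the relative order of plain tokens and nested groups is not preserved, which a real implementation would of course need):

// Hypothetical phase 2: group an already-lexed token stream by matching
// (), {} pairs; everything else is blitted into the current node unchanged.
struct Token { string text; }

struct Node
{
    Token[] leaves;   // tokens that belong directly to this group
    Node*[] children; // nested (...) or {...} groups
}

Node* treeify(Token[] tokens, ref size_t i, string closer = null)
{
    auto node = new Node;
    while (i < tokens.length)
    {
        auto t = tokens[i];
        if (t.text == "(" || t.text == "{")
        {
            ++i;
            node.children ~= treeify(tokens, i, t.text == "(" ? ")" : "}");
        }
        else if (t.text == closer)
        {
            ++i;              // consume the matching closer, return to parent
            return node;
        }
        else
        {
            node.leaves ~= t; // e.g. "auto x = y + z ;" just stays flat here
            ++i;
        }
    }
    return node;              // closer missing: later phases can report it
}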

What I like about it is not its performance, but how it matches the way we think about languages. Humans tend to see overall structure first, and examine the fine details later. The tree parsing approach is similarly nonlinear and can be modularized in a way that might be more intuitive than traditional EBNF.

On the other hand, one could argue it is /too/ flexible, admitting so many different approaches to parsing that a front-end based on this approach is not necessarily intuitive to follow; and of course, not using a standard EBNF-type grammar could be argued to be bad.

Still... it's a fun concept, and even if the initial parsing ends up using the good-old lex-parse approach, semantic analysis and lowering can benefit from a tree parser. Tree parsing, of course, is just a generalization of linear parsing and so a tree parser generator (TPG) could work equally well for flat input.

July 07, 2012
On Saturday, 7 July 2012 at 21:35:52 UTC, Dmitry Olshansky wrote:
> On 08-Jul-12 01:24, Roman D. Boiko wrote:
>> But PEG itself doesn't require backtracking parsing, does it?
>
> It does. Specifically the algorithm is to try option A, try option B, try option C...
There is no algorithm, only __grammar__. It specifies that option A has higher priority than option B in the expression A / B / C. But it doesn't forbid, for example, analyzing all options in parallel (tracking state transitions) character by character, as a proper DFA implementation would do, until some option succeeds and all previous ones fail. A greedy regex would also have to check all successor options (C in the example above) to determine the longest one.

> it's obvious how backtracking comes in ... whan say A fails somewhere in the middle of it.
Simply stop tracking the respective state. Process the others in parallel, as I described above.

> Of course there is a fair amount of bookkeeping that reduces doing work all over again but it does ... allocate.

The amount of memory used would be proportional to the length of the input after the branching point times the number of rules (actually, of states in the DFA, which is likely larger but still finite for a particular grammar).
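A rough sketch of that "all alternatives in parallel" idea, in the spirit of a Thompson-style NFA simulation over literal alternatives (everything here is invented for illustration; a real grammar would need much richer state):

// Hypothetical: each live state is (alternative id, position inside that
// alternative's pattern); all live states are advanced by one input character
// per step, and states that fail to match are simply dropped.
struct State { size_t alt; size_t patPos; }

// Alternatives of the choice A / B / C, here just literal patterns.
immutable string[] alts = ["abc", "abd", "a"];

// Returns the index of the highest-priority alternative that matches a
// prefix of `input`, or size_t.max if none does.
size_t matchChoice(string input)
{
    State[] live;
    foreach (i; 0 .. alts.length)
        live ~= State(i, 0);

    size_t best = size_t.max;
    foreach (pos; 0 .. input.length + 1)
    {
        State[] next;
        foreach (s; live)
        {
            if (s.patPos == alts[s.alt].length)       // this alternative completed
            {
                if (s.alt < best) best = s.alt;       // prefer the earlier option
            }
            else if (pos < input.length && input[pos] == alts[s.alt][s.patPos])
                next ~= State(s.alt, s.patPos + 1);   // still alive, advance
            // otherwise: mismatch, the state is dropped (no backtracking)
        }
        live = next;
        if (live.length == 0) break;
    }
    return best;
}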

>>> Tokens.. there is no such term in use if we talk about 'pure' PEG.
>>
>> Terminal symbols.
>>
> Characters.
Nope. According to the definition, a PEG consists of a set of non-terminal symbols, a set of terminal symbols, a set of production rules, and a starting non-terminal. Terminals are not necessarily characters.
July 07, 2012
> What I like about it is not its performance, but how it matches the way we think about languages. Humans tend to see overall structure first, and examine the fine details later. The tree parsing approach is similarly nonlinear and can be modularized in a way that might be more intuitive than traditional EBNF.

That reminds me, I forgot to mention another advantage I expected the three-phase approach to have, namely that it seems easier to tell what the programmer "meant" with three phases, in the face of errors. I mean, phase 2 can tell when braces and parentheses are not matched up properly, and then it can make reasonable guesses about where those missing braces/parentheses were meant to be, based on things like indentation. That would be especially helpful when the parser is used in an IDE, since if the IDE guesses the intention correctly, it can still understand broken code and provide code completion for it. And since phase 2 is a standard tool, anybody's parser can use it.

Example:

void f() {
    if (cond)
        x = y + 1;
        y = z + 1;
    }
} // The error appears to be here, but it's really 4 lines up.
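Just to sketch the kind of heuristic I mean (purely illustrative; a real tool would have to be much smarter about strings, comments, tabs, etc.):

// Toy heuristic in the spirit of the example above: when a closing brace is
// found whose indentation does not match the line that opened the current
// block, suspect that an opening brace is missing somewhere in between.
import std.string : stripLeft;

struct Open { size_t line; size_t indent; }

size_t indentOf(string line) { return line.length - stripLeft(line).length; }

// Returns the 0-based line numbers of closing braces that look misplaced.
size_t[] suspiciousClosers(string[] lines)
{
    Open[] stack;
    size_t[] suspects;
    foreach (lineNo, line; lines)
    {
        foreach (ch; line)
        {
            if (ch == '{')
                stack ~= Open(lineNo, indentOf(line));
            else if (ch == '}')
            {
                if (stack.length == 0 || indentOf(line) != stack[$ - 1].indent)
                    suspects ~= lineNo;   // probably a missing '{' further up
                if (stack.length) stack = stack[0 .. $ - 1];
            }
        }
    }
    return suspects;
}

On the example above, both closing braces get flagged, and the indentation mismatch points the tool back toward the unbraced "if (cond)".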
July 07, 2012
On Saturday, 7 July 2012 at 21:43:58 UTC, David Piepgrass wrote:
> Still... it's a fun concept, and even if the initial parsing ends up using the good-old lex-parse approach, semantic analysis and lowering can benefit from a tree parser. Tree parsing, of course, is just a generalization of linear parsing and so a tree parser generator (TPG) could work equally well for flat input.

Exactly. Semantic analysis is what would benefit from that, as well as client code (for example, refactoring tools, etc.)

The parser would become quite complicated. It would probably need to perform complex tree navigation (which might decrease performance proportionally to the average tree depth, or even more, if the parsing algorithms themselves were fast). Any non-trivial query would require direct character manipulation (like comparing sub-strings of captures with string literals, etc.). You would probably get frequent cache misses because of the need to jump throughout the (incomplete) AST. It would definitely be problematic to build an immutable AST model, thus (IMO) significantly and needlessly constraining you and the users of your library. And you would likely have to deal with many more problems __without__ significant gains.
July 07, 2012
On 07/07/2012 03:13 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 18:55:57 UTC, Chad J wrote:
>
>> In this vision I do not use classes and inheritance for my AST.
>> Instead I use structs that contain some kind of nodeType member that
>> would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE
>> in the above code. Dynamic dispatch is instead performed by (very
>> fast) DFAs recognizing parts of the AST.
>
> Exactly. This idea first came to me in April after I implemented the
> first top-down recursive descent custom parser for a D subset. I tried
> Visitor pattern before that and wasn't happy. There are some subtle
> difficulties which I believe will be possible to overcome, most
> important being the need to introduce a mechanism for hierarchical
> classification (like a pow expression being an assign expression at the
> same time).
>

From some of my earlier scribblings:

// Assuming SyntaxElement is a 64-bit integer type, e.g.:
alias ulong SyntaxElement;

enum : SyntaxElement
{
  AST_EXPRESSION          = 0x0001_0000_0000_0000,
    AST_UNARY_EXPR        = 0x0000_0001_0000_0000 | AST_EXPRESSION,
      AST_NEGATE_EXPR     = 0x0000_0000_0001_0000 |  AST_UNARY_EXPR,
      AST_COMPLIMENT_EXPR = 0x0000_0000_0002_0000 |  AST_UNARY_EXPR,
      AST_POST_ADD_EXPR   = 0x0000_0000_0003_0000 |  AST_UNARY_EXPR,
      AST_POST_SUB_EXPR   = 0x0000_0000_0004_0000 |  AST_UNARY_EXPR,
      AST_PRE_ADD_EXPR    = 0x0000_0000_0005_0000 |  AST_UNARY_EXPR,
      AST_PRE_SUB_EXPR    = 0x0000_0000_0006_0000 |  AST_UNARY_EXPR,
    AST_BINARY_EXPR       = 0x0000_0002_0000_0000 | AST_EXPRESSION,
      AST_AND_EXPR        = 0x0000_0000_0001_0000 |  AST_BINARY_EXPR,
      AST_OR_EXPR         = 0x0000_0000_0002_0000 |  AST_BINARY_EXPR,
      AST_XOR_EXPR        = 0x0000_0000_0003_0000 |  AST_BINARY_EXPR,
      AST_AND_AND_EXPR    = 0x0000_0000_0004_0000 |  AST_BINARY_EXPR,
      AST_OR_OR_EXPR      = 0x0000_0000_0005_0000 |  AST_BINARY_EXPR,
      AST_ADD_EXPR        = 0x0000_0000_0006_0000 |  AST_BINARY_EXPR,
    AST_TRINARY_EXPR      = 0x0000_0003_0000_0000 | AST_EXPRESSION,
    AST_NARY_EXPR         = 0x0000_0004_0000_0000 | AST_EXPRESSION,
  AST_STATEMENT           = 0x0002_0000_0000_0000,
}


bool isA( SyntaxElement leafier, SyntaxElement rootier )
{
	SyntaxElement mask = 0;

	// Each 16-bit group encodes one level of the hierarchy.  Build a mask
	// that keeps only the levels at which 'rootier' is defined, so that any
	// more-specific bits in 'leafier' are ignored by the comparison below.
	if ( rootier & 0x0000_0000_FFFF_FFFF )
	{
		if ( rootier & 0x0000_0000_0000_FFFF )
			mask = 0xFFFF_FFFF_FFFF_FFFF; // rootier is at the deepest level
		else
			mask = 0xFFFF_FFFF_FFFF_0000;
	}
	else
	{
		if ( rootier & 0x0000_FFFF_FFFF_FFFF )
			mask = 0xFFFF_FFFF_0000_0000;
		else
			mask = 0xFFFF_0000_0000_0000; // rootier is a top-level category
	}

	return (leafier & mask) == rootier;
}

unittest
{
	assert(  isA( AST_EXPRESSION,  AST_EXPRESSION) );
	assert(  isA( AST_NEGATE_EXPR, AST_NEGATE_EXPR) );
	assert(  isA( AST_NEGATE_EXPR, AST_EXPRESSION) );
	assert(  isA( AST_NEGATE_EXPR, AST_UNARY_EXPR) );
	assert( !isA( AST_EXPRESSION,  AST_STATEMENT) );
	assert( !isA( AST_NEGATE_EXPR, AST_BINARY_EXPR) );
	assert( !isA( AST_NEGATE_EXPR, AST_STATEMENT) );
	assert( !isA( AST_NEGATE_EXPR, AST_COMPLIMENT_EXPR) );
}

This approach of course has shameful nesting limitations, but I feel like these determinations could be fairly well optimized even for the general case. For example, another approach that I might be more inclined to take is to give each token/symbol a low-valued index into a small inheritance table.
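Something along these lines, assuming node kinds are small consecutive integers (the names are invented for the sketch):

// Hypothetical: each node kind gets a small index, and a table maps every
// kind to its parent kind; isA() just walks the parent chain.
enum Kind : ubyte
{
    Expression, UnaryExpr, NegateExpr, BinaryExpr, AddExpr, Statement, None
}

// parentOf[k] is the kind that k "inherits" from; roots point to Kind.None.
immutable Kind[Kind.max + 1] parentOf = [
    Kind.None,        // Expression (a root)
    Kind.Expression,  // UnaryExpr
    Kind.UnaryExpr,   // NegateExpr
    Kind.Expression,  // BinaryExpr
    Kind.BinaryExpr,  // AddExpr
    Kind.None,        // Statement (a root)
    Kind.None,        // None
];

bool isA(Kind leafier, Kind rootier)
{
    for (Kind k = leafier; k != Kind.None; k = parentOf[k])
        if (k == rootier)
            return true;
    return false;
}

unittest
{
    assert( isA(Kind.NegateExpr, Kind.Expression));
    assert(!isA(Kind.NegateExpr, Kind.BinaryExpr));
}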

I would expect the regex engine to call the isA function as one of its operations.  Thus placing an AST_EXPRESSION into your expression would also match an AST_NEGATE_EXPR too.
July 07, 2012
On 08-Jul-12 01:49, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 21:35:52 UTC, Dmitry Olshansky wrote:
>> On 08-Jul-12 01:24, Roman D. Boiko wrote:
>>> But PEG itself doesn't require backtracking parsing, does it?
>>
>> It does. Specifically the algorithm is to try option A, try option B,
>> try option C...
> There is no algorithm, only __grammar__. It specifies that option A has
> higher priority than option B in the expression A / B / C. But it doesn't
> forbid, for example, analyzing all options in parallel (tracking state
> transitions) character by character, as a proper DFA implementation would
> do, until some option succeeds and all previous ones fail.

That would not always work, but yes, that's what we should do IMHO.
It's not what packrats do, however (when I say algorithm, I mean specifically packrat). And PEGs are commonly associated with packrats.

> A greedy regex would also have to check all successor options (C in the
> example above) to determine the longest one.
>
Yes, yet a regex engine (in its simple form) can 'merge' identical parallel states* easily; it's not so simple with general CFGs.

*As in NFA states; I like to think in terms of multiple NFA states vs. a single DFA state.

>> It's obvious how backtracking comes in ... when, say, A fails somewhere
>> in the middle of it.
> Simply stop tracking the respective state. Process the others in parallel,
> as I described above.
>
Yeah, it could be done, but in a lot of cases it doesn't buy you a thing, unlike with a regular grammar. The problem is that the state is not as simple as in a regex, for instance (even for a regex that must capture some submatches, a DFA won't do, BTW). Thus you can't merge seemingly identical states, because they depend on (different) interpretations of the previous part of the input.

That's why the DFA in ANTLR is only used for lookahead or lexing; it doesn't maintain a path through states during parsing.

>> Of course there is a fair amount of bookkeeping that avoids redoing the
>> work all over again, but it does ... allocate.
>
> The amount of memory used would be proportional to the length of the input
> after the branching point times the number of rules (actually, of states in
> the DFA, which is likely larger but still finite for a particular grammar).
>
Finite and proportional to the input size? Isn't that just another way of stating the O(n) memory requirement of packrats?
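(For reference, that O(n) comes from the memo table: roughly one cell per (rule, input position) pair. A toy sketch, not taken from any real packrat implementation:)

// Toy illustration of packrat memoization: one memo cell per (rule, position),
// hence O(#rules * inputLength) memory overall.
enum Rule { A, B, C }

struct MemoEntry { bool known; bool success; size_t end; }

struct Packrat
{
    string input;
    MemoEntry[][Rule.max + 1] memo;   // memo[rule][position]

    this(string s)
    {
        input = s;
        foreach (ref row; memo)
            row = new MemoEntry[s.length + 1];
    }

    // Apply `rule` at `pos`, consulting and filling the memo table so that
    // each (rule, position) pair is ever computed at most once.
    bool apply(Rule rule, ref size_t pos)
    {
        auto e = &memo[rule][pos];
        if (!e.known)
        {
            size_t p = pos;
            e.success = parseRule(rule, p); // the actual (backtracking) work
            e.end = p;
            e.known = true;
        }
        if (e.success) { pos = e.end; return true; }
        return false;
    }

    bool parseRule(Rule rule, ref size_t pos) { return false; } // stub
}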

>>>> Tokens.. there is no such term in use if we talk about 'pure' PEG.
>>>
>>> Terminal symbols.
>>>
>> Characters.
> Nope. According to the definition, a PEG consists of a set of non-terminal
> symbols, a set of terminal symbols, a set of production rules, and a
> starting non-terminal. Terminals are not necessarily characters.

Yup. But I'm not talking definitions here; I'm talking about the notion of "integrated lexing", and lexing is certainly about characters... or was it?


-- 
Dmitry Olshansky


July 07, 2012
On Saturday, 7 July 2012 at 21:52:09 UTC, David Piepgrass wrote:
> it seems easier to tell what the programmer "meant" with three phases, in the face of errors. I mean, phase 2 can tell when braces and parentheses are not matched up properly, and then it can make reasonable guesses about where those missing braces/parentheses were meant to be, based on things like indentation. That would be especially helpful when the parser is used in an IDE, since if the IDE guesses the intention correctly, it can still understand broken code and provide code completion for it. And since phase 2 is a standard tool, anybody's parser can use it.

There could be multiple errors that compensate for each other, making your phase 2 succeed and preventing phase 3 from doing proper error handling. Even knowing that there is an error, in many cases you would not be able to create a meaningful error message. And any error would make your phase-2 tree incorrect, so it would be difficult to recover from it by inserting an additional token or ignoring tokens until the parser is able to continue its work properly. All this would suffer for the same reason: you lose information.