July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 22:03:20 UTC, Chad J wrote:
> enum : SyntaxElement
> {
>   AST_EXPRESSION          = 0x0001_0000_0000_0000,
>     AST_UNARY_EXPR        = 0x0000_0001_0000_0000 |

This would waste space (probably a lot of it). It would definitely 
not be easy to implement in a parser generator, when various 
language properties are not known beforehand for fine-grained 
tuning.

> This approach of course has shameful nesting limitations, but I 
> feel like these determinations could be fairly well optimized 
> even for the general case.  For example: another approach that 
> I might be more inclined to take is to give each token/symbol a 
> low-valued index into a small inheritance table.

Depending on implementation, that might introduce the multiplier 
overhead of table access per each comparison (and there would be 
many in case of searching for nodes of specific type).

> I would expect the regex engine to call the isA function as one 
> of its operations.  Thus placing an AST_EXPRESSION into your 
> expression would also match an AST_NEGATE_EXPR too.

But actually it is not so difficult to implement in a very 
similar way to what you described. I was thinking about a lookup 
table, but different from a traditional inheritance table. It 
would be indexed by AST node type (integral enum value), and 
store various classification information as bits. Maybe this is 
what you meant and I misunderstood you... Example is here: 
https://github.com/roman-d-boiko/dct/blob/May2012drafts/fe/core.d 
(sorry, it doesn't show how to do classification, and has a 
different context, but I hope you get the idea). The advantage 
over storing hierarchical information directly in each token is 
obviously memory usage.
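
Roughly, the kind of table I have in mind would look like the sketch 
below (the names are invented for illustration and don't match core.d):

import std.stdio;

// Hypothetical node kinds; the integral enum value indexes the table.
enum AstKind : ushort
{
    expression,
    unaryExpr,
    negateExpr,
    binaryExpr,
    statement,
}

// Classification bits, one per category we want to query.
enum Class : uint
{
    expression = 1 << 0,
    unary      = 1 << 1,
    binary     = 1 << 2,
    statement  = 1 << 3,
}

// One entry per node *type* (indexed by AstKind), not per node instance.
immutable uint[] classOf =
[
    Class.expression,                   // AstKind.expression
    Class.expression | Class.unary,     // AstKind.unaryExpr
    Class.expression | Class.unary,     // AstKind.negateExpr
    Class.expression | Class.binary,    // AstKind.binaryExpr
    Class.statement,                    // AstKind.statement
];

// True if the node kind belongs to the given classification.
bool isA(AstKind kind, Class c)
{
    return (classOf[kind] & c) != 0;
}

void main()
{
    writeln(isA(AstKind.negateExpr, Class.expression)); // true
    writeln(isA(AstKind.statement,  Class.unary));      // false
}

The point is that isA costs one array lookup plus a mask test, and the 
bits live in a single table entry per node type rather than in every 
node instance.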
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 22:07:02 UTC, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 21:52:09 UTC, David Piepgrass wrote:
>> it seems easier to tell what the programmer "meant" with three 
>> phases, in the face of errors. I mean, phase 2 can tell when 
>> braces and parentheses are not matched up properly and then it 
>> can make reasonable guesses about where those missing 
>> braces/parentheses were meant to be, based on things like 
>> indentation. That would be especially helpful when the parser 
>> is used in an IDE, since if the IDE guesses the intention 
>> correctly, it can still understand broken code and provide 
>> code completion for it. And since phase 2 is a standard tool, 
>> anybody's parser can use it.
>
> There could be multiple errors that compensate for each other and 
> make your phase 2 succeed and prevent phase 3 from doing proper 
> error handling. Even knowing that there is an error, in many 
> cases you would not be able to create a meaningful error 
> message. And any error would make your phase-2 tree incorrect, 
> so it would be difficult to recover from it by inserting an 
> additional token or ignoring tokens until the parser is able to 
> continue its work properly. All this would suffer for the same 
> reason: you lose information.

This is all true, but forgetting a brace commonly results in a 
barrage of error messages anyway. Code that guesses what you 
meant obviously won't work all the time, and phase 3 would have 
to take care not to emit an error message about a "{" token that 
doesn't actually exist (that was merely "guessed-in"). But at 
least it's nice for a parser to be /able/ to guess what you 
meant; for a typical parser it would be out of the question, upon 
detecting an error, to back up four source lines, insert a brace 
and try again.
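
To sketch the sort of guessing I mean (purely hypothetical code, not 
taken from any existing parser): record the indentation of the line 
that opened each unmatched brace, and assume the missing "}" belongs 
before the first later line that dedents back to that level.

import std.stdio;
import std.string : splitLines, stripLeft;

// Indentation width of a line (leading whitespace only).
size_t indentOf(string line)
{
    return line.length - line.stripLeft.length;
}

// Returns 0-based line numbers before which a '}' is probably missing.
size_t[] guessMissingBraces(string source)
{
    auto lines = source.splitLines;
    size_t[] openIndents;   // indentation of each still-unmatched '{'
    size_t[] guesses;

    foreach (lineNo, line; lines)
    {
        auto stripped = line.stripLeft;

        // A non-blank line (other than a real '}') that dedents to the
        // level of an open brace suggests the brace should already be
        // closed before this line.
        while (openIndents.length && stripped.length && stripped[0] != '}'
               && indentOf(line) <= openIndents[$ - 1])
        {
            guesses ~= lineNo;
            openIndents = openIndents[0 .. $ - 1];
        }

        // Track the braces that actually appear on this line.
        foreach (ch; line)
        {
            if (ch == '{')
                openIndents ~= indentOf(line);
            else if (ch == '}' && openIndents.length)
                openIndents = openIndents[0 .. $ - 1];
        }
    }

    // Anything still open is guessed to close at end of input.
    foreach (_; openIndents)
        guesses ~= lines.length;
    return guesses;
}

void main()
{
    auto code = "void f()\n{\n    if (x) {\n        y();\n    z();\n}";
    writeln(guessMissingBraces(code)); // [4]: '}' probably missing before line 4
}

It is only a heuristic, of course, but it gives phase 3 a tree to work 
with instead of a cascade of spurious errors.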
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 22:05:05 UTC, Dmitry Olshansky wrote:
> That would not always work, but yes that's what we should do 
> IMHO.
I haven't done deeper research on this, just shared my current 
understanding. When that doesn't work in a particular case, we 
will need to find the problem and solve it.

> It's not what packrats do however (when I say algorithm I mean 
> specifically packrat). And PEGs are commonly associated with 
> packrats.
As Philippe Sigaud admitted, that is an incorrect association.

>>> it's obvious how backtracking comes in ... when say A fails 
>>> somewhere in the middle of it.
>> Simply stop tracking the respective state. Process the others in 
>> parallel as I described above.
>>
> Yeah it could be done, but it doesn't buy you a thing in a lot 
> of cases, unlike with a regular grammar. The problem is that the 
> state is not as simple as in a regex, for instance (even for a 
> regex that must capture some submatches a DFA won't do, BTW). 
> Thus you can't merge seemingly identical states, because they 
> depend on (different) interpretations of the previous part of 
> the input.
Well, if you must track the path to a particular state, then DFA 
simply doesn't match the problem and there's nothing we can do.

> That's why DFA in ANTLR is only used to do lookahead or lexing, 
> it doesn't maintain path through states during parsing.
See above. Is maintaining the path really necessary? I thought that 
a DFA with additional states could be created for any particular 
situation of ambiguity, etc. This needs further research.

>>> Of course there is a fair amount of bookkeeping that reduces 
>>> doing
>>> work all over again but it does ... allocate.
>>
>> The amount of used memory would be proportional to the length 
>> of input
>> after the branching point times number of rules (actually, of 
>> states in
>> the DFA, which is likely larger but still finite for a 
>> particular grammar).
>>
> Finite and proportional to input size? Another way to say why 
> an O(n) memory requirement for packrats?
Sorry, I incorrectly included the O(n) multiplier. It should be 
finite. When you go to the next character, you don't need 
previous states any more.
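
A toy illustration of what I mean (regex-style matching rather than 
real parser code, with invented names): alternatives are advanced in 
parallel, and once a character is consumed the previous generation of 
states is thrown away, so memory stays bounded by the number of states, 
not the length of the input.

import std.stdio;

// Run several literal alternatives in parallel, character by character.
bool matchesAny(string input, string[] alternatives)
{
    // Current state per alternative: how many characters matched so far.
    // A dead alternative is marked with size_t.max.
    auto pos = new size_t[alternatives.length];

    foreach (c; input)
    {
        foreach (i, alt; alternatives)
        {
            if (pos[i] == size_t.max)
                continue;                   // already failed
            if (pos[i] < alt.length && alt[pos[i]] == c)
                ++pos[i];                   // advance this alternative
            else
                pos[i] = size_t.max;        // stop tracking this state
        }
    }

    foreach (i, alt; alternatives)
        if (pos[i] == alt.length)
            return true;                    // some alternative matched exactly
    return false;
}

void main()
{
    writeln(matchesAny("foreach", ["for", "foreach", "while"])); // true
    writeln(matchesAny("foo",     ["for", "foreach", "while"])); // false
}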

> Yup. But I'm not talking definitions here I'm talking the 
> notion of "integrated lexing" and lexing is certainly about 
> characters or was it?

I thought integrated lexing was about doing it along with parsing 
and specifying its rules directly inside the grammar. From the 
former we get structural context to narrow down the available 
options, and from the latter we get a uniform grammar description.
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 22:25:00 UTC, David Piepgrass wrote:
> This is all true, but forgetting a brace commonly results in a 
> barrage of error messages anyway. Code that guesses what you 
> meant obviously won't work all the time, and phase 3 would have 
> to take care not to emit an error message about a "{" token 
> that doesn't actually exist (that was merely "guessed-in"). But 
> at least it's nice for a parser to be /able/ to guess what you 
> meant; for a typical parser it would be out of the question, 
> upon detecting an error, to back up four source lines, insert a 
> brace and try again.

So you simply admit that error recovery is difficult to 
implement. For me, it is a must-have, and thus throwing away 
information is bad.
July 07, 2012
Re: Let's stop parser Hell
On 7/7/12 4:15 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 20:04:21 UTC, Andrei Alexandrescu wrote:
>> Doesn't ANTLR use full-fledged character-level LL(*) parsing even in
>> the tokenizer?
>
> Since I didn't understand your question I assume that my statement was
> somehow incorrect (likely because I made some wrong assumptions about
> ANTLR). I didn't know about its existence until today and still don't
> understand it completely. What I think I understood is that it uses DFA
> for deciding which grammar rule to apply instead of doing backtracking.
> I also think that it uses DFA for low-level scanning (I'm not sure).

This should help:

http://www.antlr.org/wiki/display/~admin/ANTLR+v4+lexers

Andrei
July 07, 2012
Re: Let's stop parser Hell
On 07/07/2012 06:18 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 22:03:20 UTC, Chad J wrote:
>> enum : SyntaxElement
>> {
>>   AST_EXPRESSION          = 0x0001_0000_0000_0000,
>>     AST_UNARY_EXPR        = 0x0000_0001_0000_0000 |
>
> This would waste space (probably a lot of it). It would definitely not
> be easy to implement in a parser generator, when various language
> properties are not known beforehand for fine-grained tuning.
>
>> This approach of course has shameful nesting limitations, but I feel
>> like these determinations could be fairly well optimized even for the
>> general case. For example: another approach that I might be more
>> inclined to take is to give each token/symbol a low-valued index into
>> a small inheritance table.
>
> Depending on implementation, that might introduce the multiplier
> overhead of table access per each comparison (and there would be many in
> case of searching for nodes of specific type).
>
>> I would expect the regex engine to call the isA function as one of
>> its operations. Thus placing an AST_EXPRESSION into your expression
>> would also match an AST_NEGATE_EXPR too.
>
> But actually it is not so difficult to implement in a very similar way
> to what you described. I was thinking about a lookup table, but
> different from a traditional inheritance table. It would be indexed by
> AST node type (integral enum value), and store various classification
> information as bits. Maybe this is what you meant and I misunderstood
> you... Example is here:
> https://github.com/roman-d-boiko/dct/blob/May2012drafts/fe/core.d
> (sorry, it doesn't show how to do classification, and has a different
> context, but I hope you get the idea). The advantage over storing
> hierarchical information directly in each token is obviously memory usage.

I see what you mean.

I'm not sure that I buy that language properties aren't known 
beforehand.  The front-end knows what the language grammar looks like 
and it knows what kinds of things it can find in an AST.

Thus you can make the regex DSL do this transformation and let the DFAs 
handle everything:

expr -> (expr | unary_expr | negate_expr | binary_complement | incr_expr 
| decr_expr | binary_expr | assign_expr | add_assign_expr | nary_expr | ...)

Because what we really want it to do is match any of those expression 
kinds when we place the expression symbol in our regex.

I think the important point here though is that inheritance can be 
reduced to bit-twiddling optimizations in this case.
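
As a rough sketch of the bit-twiddling (the extra kinds and their bit 
values below are placeholders, not a worked-out encoding): each kind 
carries its ancestors' bits, so isA is a single mask test that the 
regex engine can call.

import std.stdio;

enum SyntaxElement : ulong
{
    AST_EXPRESSION  = 0x0001_0000_0000_0000,
    AST_UNARY_EXPR  = AST_EXPRESSION | 0x0000_0001_0000_0000,
    AST_NEGATE_EXPR = AST_UNARY_EXPR | 0x0000_0000_0001_0000,
    AST_BINARY_EXPR = AST_EXPRESSION | 0x0000_0002_0000_0000,
    AST_STATEMENT   = 0x0002_0000_0000_0000,
}

// True if `kind` is `ancestor` or one of its descendants:
// the descendant's bits are a superset of the ancestor's bits.
bool isA(SyntaxElement kind, SyntaxElement ancestor)
{
    return (kind & ancestor) == ancestor;
}

void main()
{
    writeln(isA(SyntaxElement.AST_NEGATE_EXPR, SyntaxElement.AST_EXPRESSION)); // true
    writeln(isA(SyntaxElement.AST_STATEMENT,   SyntaxElement.AST_EXPRESSION)); // false
}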
July 07, 2012
Re: Let's stop parser Hell
On 07/07/2012 04:26 PM, Timon Gehr wrote:
> On 07/07/2012 08:55 PM, Chad J wrote:
>>
>> The specifics will easily change.
>
> I'd suggest:
>
> AstOp!`
>     Lower
>         while ( boolExpr )
>         {
>             statements;
>         }
>
>     Into
>         loopAgain:
>         if ( boolExpr ) {
>             statements;
>         } else goto exitLoop;
>         goto loopAgain;
>         exitLoop:
>
> `.run(syntaxNode);
>

I wish.

Can you make it happen?
July 07, 2012
Re: Let's stop parser Hell
On Saturday, 7 July 2012 at 22:45:01 UTC, Chad J wrote:
> I see what you mean.
And I think that I expressed myself badly. Let me rephrase.

When the inheritance hierarchy is deep, every level would require at 
least one bit position. Or even every base class would require a 
separate bit. (I think that the former + several bits to 
distinguish among hierarchies.) Otherwise it would not be easy to 
check by a bit-mask. Even if the above is incorrect (and that is 
likely since I didn't try to encode that compactly for the real 
grammar), I think that in general that information would only be 
possible to store in a fairly large integral. Especially if we 
try to generate parser from grammar, and thus can't do 
fine-tuning to pack the information tightly. This overhead would 
be paid per each AST node __instance__. But that is not 
necessary, since we could store information in a lookup table 
only once per node __type__.
July 07, 2012
Re: Let's stop parser Hell
On 07/07/2012 04:26 PM, David Piepgrass wrote:
>> auto captures = syntaxNode.matchNodes(
>>     TOK_WHILE_NODE,
>>     OP_ENTER_NODE,
>>         OP_CAPTURE(0),
>>         OP_BEGIN,
>>             TOK_EXPRESSION,
>>         OP_END,
>>         OP_CAPTURE(1),
>>         OP_BEGIN,
>>             TOK_STATEMENT,
>>         OP_END,
>>     OP_LEAVE_NODE);
>
> I'm glad to hear you like the tree-parsing approach, Chad, although the
> particular syntax here looks pretty unfriendly :O -- does this represent
> something that you are working on right now?
>

Yes and yes.

I didn't choose it because it's pretty.

I chose it because:
(1) It's easy to implement.
(2) Both the implementation and syntax can be altered easily.

I do not have time to write a language for the tree pattern recognition 
and substitution that is needed to do this in an aesthetically pleasing 
way.  I've tried to sketch what it might look like before, and even then 
it is hard to make it nice, much less begin implementing the thing.  I'd 
love to have such a language, but resource constraints exist.

I also think that this approach would allow me to find out what my usage 
patterns look like before I commit to a more complicated 
architecture/tool.  I really think the design of this regex/language/DSL 
thing should be dominated by its usage.  This is a tricky 
chicken-and-egg thing because it's not currently used.  The hacky syntax 
you see is the bootstrapping.

Point (2) is important because, since we don't have existing usage 
patterns, this thing is going to change.  It's going to change /a lot/. 
 I want it to be easy to change.  I think a complete DSL will be harder 
to change quickly.

I also like how it doesn't require a lot of CTFE trickery or pushing DMD 
too far.  D has really cool features, but I find that when I use things 
like CTFE aggressively then I lose productivity because I end up 
spending a lot of time finding compiler bugs.  This leads to my current 
strategy: use the simpler features that work for sure, and only use the 
more advanced stuff when I really need to.  I think my syntax fits this 
strategy and thus contributes to point (1).

That said, it is good that even mostly-working CTFE exists and that a 
powerful template and metaprogramming system exists, because I don't 
think a compiler like this would be very practical to program otherwise. 
 It would be doable in other languages, but could easily suffer from 
performance pessimizations due to being forced to compute everything at 
runtime.

If anyone has an approach that shares the above strengths and looks 
nicer or is more powerful, I'd love to see it.

>
> 5. It's risky 'cause I've never heard of anyone taking this approach
> before. Bring on the danger!
>

The danger is the fun part! <g>

>
>> I wanted to make such a front-end so that I could easily make a C
>> backend. I believe such a compiler would be able to do that with great
>> ease.
>>
>> Needing to use D in places where it isn't available is a real
>> pain-point for me right now, and I'll probably find ways to spend time
>> on it eventually.
>
> Yeah, with a tree-transforming parser, I imagine the same thing, except
> my current fetish is to convert a certain subset of D to multiple other
> languages automatically. Then I could write libraries that can easily be
> used by an astonishingly large audience. I certainly would like to see D
> targetting Android, but that's best done directly from D to ARM.
>

That does sound very cool.  Possibly difficult though, due to having to 
cater to the lowest-common-denominator in all of your API designs.  No 
templated functions or ranges in your API, that's for sure.  I'm sure 
there are some things where this is very doable though; it probably 
depends on what kind of libraries you are writing.

As for D targeting Android, my intent is really to target X where X is 
any CPU/OS combo you can think of.  I want to be able to get D, the 
language, not necessarily phobos or other niceties, to work on any 
platform, and to do so without much work.  Cross-compiling to a new 
platform that has never been cross-compiled before should require zero 
coding.  Perhaps it might be possible to have a text file with some 
key-value configuration that tells it certain common features are 
available on the target, thus allowing you to have more features with 
almost no effort involved.

Still, I'll always take a crippled but existent D compiler that targets 
Android over a perfect but non-existent D compiler that targets Android.

I think that the D-directly-to-ARM is the current approach for 
cross-compiling.  I critique it for its underwhelming lack of results.

> Anyway, the devil's in the detail. Originally I wanted to do a parser
> generator and a "completely general AST" in C# and couldn't seem to work
> out the details, but D is more flexible and is likely better suited to
> the task.

I can easily see how this is the case.  I don't think I'd be interested 
in doing a project like this in any other language.  I imagined trying 
to do something like this in C or Java or C# and it just doesn't seem 
practical.  For instance, I don't think the "use regular expressions to 
match AST structures" approach would work well in those languages because 
it would either (1) incur a bunch of overhead for compiling the 
expressions into DFAs at runtime or (2) suffer from integration problems 
if you try to compile the expressions in separate files before compiling 
the rest of the front-end.
July 07, 2012
Re: Let's stop parser Hell
On 07/07/2012 07:04 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 22:45:01 UTC, Chad J wrote:
>> I see what you mean.
> And I think that I expressed myself badly. Let me rephrase.
>
> When the inheritance hierarchy is deep, every level would require at least
> one bit position. Or even every base class would require a separate bit.
> (I think that the former + several bits to distinguish among
> hierarchies.) Otherwise it would not be easy to check by a bit-mask.
> Even if the above is incorrect (and that is likely since I didn't try to
> encode that compactly for the real grammar), I think that in general
> that information would only be possible to store in a fairly large
> integral. Especially if we try to generate parser from grammar, and thus
> can't do fine-tuning to pack the information tightly. This overhead
> would be paid per each AST node __instance__. But that is not necessary,
> since we could store information in a lookup table only once per node
> __type__.

Yep. Makes sense.