July 07, 2012
On Saturday, 7 July 2012 at 22:03:20 UTC, Chad J wrote:
> enum : SyntaxElement
> {
>   AST_EXPRESSION          = 0x0001_0000_0000_0000,
>     AST_UNARY_EXPR        = 0x0000_0001_0000_0000 |

This would waste space (probably a lot of it). It would definitely not be easy to implement in a parser generator, where various language properties are not known beforehand for fine-grained tuning.

> This approach of course has shameful nesting limitations, but I feel like these determinations could be fairly well optimized even for the general case.  For example: another approach that I might be more inclined to take is to give each token/symbol a low-valued index into a small inheritance table.

Depending on the implementation, that might introduce the overhead of a table access on each comparison (and there would be many comparisons when searching for nodes of a specific type).

> I would expect the regex engine to call the isA function as one of its operations.  Thus placing an AST_EXPRESSION into your expression would also match an AST_NEGATE_EXPR.

But actually it is not so difficult to implement in a way very similar to what you described. I was thinking about a lookup table, but one different from a traditional inheritance table: it would be indexed by AST node type (an integral enum value) and store various classification information as bits. Maybe this is what you meant and I misunderstood you. An example is here: https://github.com/roman-d-boiko/dct/blob/May2012drafts/fe/core.d (sorry, it doesn't show how to do the classification, and has a different context, but I hope you get the idea). The advantage over storing hierarchical information directly in each token is obviously memory usage.
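
A minimal sketch of the kind of table I mean, in D (illustrative names, not code from DCT):

enum AstKind : ushort { negateExpr, addExpr, whileStmt }

enum Class : uint
{
    expression = 1u << 0,
    unaryExpr  = 1u << 1,
    statement  = 1u << 2,
}

// Classification bits are stored once per node *type*; each node
// instance only carries its small AstKind value.
immutable uint[AstKind.max + 1] classBits =
[
    Class.expression | Class.unaryExpr, // negateExpr
    Class.expression,                   // addExpr
    Class.statement,                    // whileStmt
];

bool isA(AstKind kind, Class c) pure nothrow
{
    return (classBits[kind] & c) != 0;
}

static assert( isA(AstKind.negateExpr, Class.expression));
static assert(!isA(AstKind.whileStmt,  Class.expression));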
July 07, 2012
On Saturday, 7 July 2012 at 22:07:02 UTC, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 21:52:09 UTC, David Piepgrass wrote:
>> it seems easier to tell what the programmer "meant" with three phases, in the face of errors. I mean, phase 2 can tell when braces and parentheses are not matched up properly, and then it can make reasonable guesses about where those missing braces/parentheses were meant to be, based on things like indentation. That would be especially helpful when the parser is used in an IDE, since if the IDE guesses the intention correctly, it can still understand broken code and provide code completion for it. And since phase 2 is a standard tool, anybody's parser can use it.
>
> There could be multiple errors that compensate for each other, making your phase 2 succeed and preventing phase 3 from doing proper error handling. Even knowing that there is an error, in many cases you would not be able to produce a meaningful error message. And any error would make your phase-2 tree incorrect, so it would be difficult to recover by inserting an additional token or ignoring tokens until the parser can continue properly. All of this suffers for the same reason: you lose information.

This is all true, but forgetting a brace commonly results in a barrage of error messages anyway. Code that guesses what you meant obviously won't work all the time, and phase 3 would have to take care not to emit an error message about a "{" token that doesn't actually exist (that was merely "guessed-in"). But at least it's nice for a parser to be /able/ to guess what you meant; for a typical parser it would be out of the question, upon detecting an error, to back up four source lines, insert a brace and try again.
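
To illustrate the kind of guessing I mean, here's a toy D sketch (invented for this post, not from any real parser) that pairs braces with a stack and reports the opener's indentation as a hint for where the missing "}" belongs:

import std.stdio : writefln;
import std.string : stripLeft;

struct OpenBrace { size_t line, indent; }

void checkBraces(string[] lines)
{
    OpenBrace[] stack;
    foreach (lineNo, line; lines)
    {
        // Indentation of the line, kept as a clue for later guessing.
        immutable indent = line.length - line.stripLeft.length;
        foreach (ch; line)
        {
            if (ch == '{')
                stack ~= OpenBrace(lineNo + 1, indent);
            else if (ch == '}')
            {
                if (stack.length > 0)
                    stack = stack[0 .. $ - 1];
                else
                    writefln("line %s: unmatched '}'", lineNo + 1);
            }
        }
    }
    // Anything left open is a guess at where a '}' was intended.
    foreach (b; stack)
        writefln("'{' on line %s never closed; a '}' at indent %s was probably intended",
                 b.line, b.indent);
}

A real phase 2 would of course also look at the indentation of the lines that follow the opener, not just the opener itself.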
July 07, 2012
On Saturday, 7 July 2012 at 22:05:05 UTC, Dmitry Olshansky wrote:
> That would not always work, but yes that's what we should do IMHO.
I haven't done deep research on this, just shared my current understanding. When it doesn't work in a particular case, we will need to find the problem and solve it.

> It's not what packrats do however (when I say algorithm I mean specifically packrat). And PEGs are commonly associated with packrats.
As Philippe Sigaud admitted, that is an incorrect association.

>>> it's obvious how backtracking comes in ... when, say, A fails somewhere in the middle of it.
>> Simply stop tracking the respective state. Process the others in parallel, as I described above.
>>
> Yeah, it could be done, but in a lot of cases it doesn't buy you a thing, unlike with a regular grammar. The problem is that the state is not as simple as in a regex, for instance (even in a regex that must capture some submatches, a DFA won't do, BTW). Thus you can't merge seemingly identical states, because they depend on (different) interpretations of the previous part of the input.
Well, if you must track the path to a particular state, then a DFA simply doesn't match the problem and there's nothing we can do.

> That's why the DFA in ANTLR is only used for lookahead or lexing; it doesn't maintain a path through states during parsing.
See above. Is maintaining the path really necessary? I thought that a DFA with additional states could be created for any particular situation of ambiguity, etc. This needs further research.

>>> Of course there is a fair amount of bookkeeping that reduces doing
>>> work all over again but it does ... allocate.
>>
>> The amount of memory used would be proportional to the length of the input
>> after the branching point times the number of rules (actually, the number
>> of states in the DFA, which is likely larger but still finite for a
>> particular grammar).
>>
> Finite and proportional to input size? Another way of saying it's the O(n) memory requirement of packrats?
Sorry, I incorrectly included the O(n) multiplier; the memory requirement should be bounded regardless of input length. When you move to the next character, you no longer need the previous states.
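
Roughly what I have in mind, as a toy D sketch (states are bit positions in a uint; the transition table would be generated from the grammar, here it is just a parameter):

// step() advances every live state in parallel on one character.
// Dead branches simply drop out of the set, and no history is kept,
// so memory is bounded by the number of states, not the input length.
uint step(uint states, char c, const uint[256][] delta)
{
    uint next = 0;
    foreach (s; 0 .. delta.length)
        if (states & (1u << s))
            next |= delta[s][c]; // successor set of state s on c
    return next;
}

bool run(string input, uint start, uint accepting, const uint[256][] delta)
{
    uint states = start;
    foreach (c; input)
    {
        states = step(states, c, delta);
        if (states == 0)
            return false; // every tracked branch has failed
    }
    return (states & accepting) != 0;
}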

> Yup. But I'm not talking definitions here; I'm talking about the notion of "integrated lexing", and lexing is certainly about characters... or was it?

I thought integrated lexing was about doing lexing along with parsing, and specifying its rules directly inside the grammar. From the former we get structural context to narrow down the available options, and from the latter we get a uniform grammar description.
July 07, 2012
On Saturday, 7 July 2012 at 22:25:00 UTC, David Piepgrass wrote:
> This is all true, but forgetting a brace commonly results in a barrage of error messages anyway. Code that guesses what you meant obviously won't work all the time, and phase 3 would have to take care not to emit an error message about a "{" token that doesn't actually exist (that was merely "guessed-in"). But at least it's nice for a parser to be /able/ to guess what you meant; for a typical parser it would be out of the question, upon detecting an error, to back up four source lines, insert a brace and try again.

So you simply admit that error recovery is difficult to implement. For me, it is a must-have, and thus throwing away information is bad.
July 07, 2012
On 7/7/12 4:15 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 20:04:21 UTC, Andrei Alexandrescu wrote:
>> Doesn't ANTLR use full-fledged character-level LL(*) parsing even in
>> the tokenizer?
>
> Since I didn't understand your question I assume that my statement was
> somehow incorrect (likely because I made some wrong assumptions about
> ANTLR). I didn't know about its existence until today and still don't
> understand it completely. What I think I understood is that it uses DFA
> for deciding which grammar rule to apply instead of doing backtracking.
> I also think that it uses DFA for low-level scanning (I'm not sure).

This should help:

http://www.antlr.org/wiki/display/~admin/ANTLR+v4+lexers

Andrei
July 07, 2012
On 07/07/2012 06:18 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 22:03:20 UTC, Chad J wrote:
>> enum : SyntaxElement
>> {
>>   AST_EXPRESSION          = 0x0001_0000_0000_0000,
>>     AST_UNARY_EXPR        = 0x0000_0001_0000_0000 |
>
> This would waste space (probably a lot of it). It would definitely not
> be easy to implement in a parser generator, where various language
> properties are not known beforehand for fine-grained tuning.
>
>> This approach of course has shameful nesting limitations, but I feel
>> like these determinations could be fairly well optimized even for the
>> general case. For example: another approach that I might be more
>> inclined to take is to give each token/symbol a low-valued index into
>> a small inheritance table.
>
> Depending on the implementation, that might introduce the overhead of a
> table access on each comparison (and there would be many comparisons
> when searching for nodes of a specific type).
>
>> I would expect the regex engine to call the isA function as one of
>> its operations. Thus placing an AST_EXPRESSION into your expression
>> would also match an AST_NEGATE_EXPR.
>
> But actually it is not so difficult to implement in a way very similar
> to what you described. I was thinking about a lookup table, but one
> different from a traditional inheritance table: it would be indexed by
> AST node type (an integral enum value) and store various classification
> information as bits. Maybe this is what you meant and I misunderstood
> you. An example is here:
> https://github.com/roman-d-boiko/dct/blob/May2012drafts/fe/core.d
> (sorry, it doesn't show how to do the classification, and has a
> different context, but I hope you get the idea). The advantage over
> storing hierarchical information directly in each token is obviously
> memory usage.

I see what you mean.

I'm not sure that I buy that language properties aren't known beforehand.  The front-end knows what the language grammar looks like, and it knows what kinds of things it can find in an AST.

Thus you can make the regex DSL do this transformation and let the DFAs handle everything:

expr -> (expr | unary_expr | negate_expr | binary_complement | incr_expr | decr_expr | binary_expr | assign_expr | add_assign_expr | nary_expr | ...)

Because what we really want it to do is match any of those expression kinds when we place the expression symbol in our regex.

I think the important point here though is that inheritance can be reduced to bit-twiddling optimizations in this case.
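
For instance, with the nesting encoding from my earlier post, the isA test itself boils down to a single mask-and-compare (hand-written illustration):

enum SyntaxElement : ulong
{
    AST_EXPRESSION  = 0x0001_0000_0000_0000UL,
    AST_UNARY_EXPR  = AST_EXPRESSION | 0x0000_0001_0000_0000UL,
    AST_NEGATE_EXPR = AST_UNARY_EXPR | 0x0000_0000_0001_0000UL,
}

// A kind's value contains all of its ancestors' bits, so membership
// in a category is one AND plus one comparison.
bool isA(SyntaxElement node, SyntaxElement ancestor) pure nothrow
{
    return (node & ancestor) == ancestor;
}

static assert( isA(SyntaxElement.AST_NEGATE_EXPR, SyntaxElement.AST_EXPRESSION));
static assert(!isA(SyntaxElement.AST_EXPRESSION,  SyntaxElement.AST_UNARY_EXPR));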
July 07, 2012
On 07/07/2012 04:26 PM, Timon Gehr wrote:
> On 07/07/2012 08:55 PM, Chad J wrote:
>>
>> The specifics will easily change.
>
> I'd suggest:
>
> AstOp!`
>     Lower
>         while ( boolExpr )
>         {
>             statements;
>         }
>
>     Into
>         loopAgain:
>         if ( !boolExpr )
>             goto exitLoop;
>         statements;
>         goto loopAgain;
>         exitLoop:
> `.run(syntaxNode);
>

I wish.

Can you make it happen?
July 07, 2012
On Saturday, 7 July 2012 at 22:45:01 UTC, Chad J wrote:
> I see what you mean.
And I think that I expressed myself badly. Let me rephrase.

When the type hierarchy is deep, every level would require at least one bit position, or possibly every base class would require a separate bit. (I think it's the former, plus several bits to distinguish among hierarchies.) Otherwise it would not be easy to check with a bit-mask. Even if the above is incorrect (which is likely, since I didn't try to encode this compactly for a real grammar), I think that in general this information could only be stored in a fairly large integral type, especially if we generate the parser from a grammar and thus can't do fine-tuning to pack the information tightly. That overhead would be paid per AST node __instance__. But it isn't necessary, since we could store the information in a lookup table only once per node __type__.
July 07, 2012
On 07/07/2012 04:26 PM, David Piepgrass wrote:
>> auto captures = syntaxNode.matchNodes(
>>     TOK_WHILE_NODE,
>>     OP_ENTER_NODE,
>>         OP_CAPTURE(0),
>>         OP_BEGIN,
>>             TOK_EXPRESSION,
>>         OP_END,
>>         OP_CAPTURE(1),
>>         OP_BEGIN,
>>             TOK_STATEMENT,
>>         OP_END,
>>     OP_LEAVE_NODE);
>
> I'm glad to hear you like the tree-parsing approach, Chad, although the
> particular syntax here looks pretty unfriendly :O -- does this represent
> something that you are working on right now?
>

Yes and yes.

I didn't choose this because it's pretty.

I chose it because:
(1) It's easy to implement.
(2) Both the implementation and syntax can be altered easily.

I do not have time to write a language for the tree pattern recognition and substitution that is needed to do this in an aesthetically pleasing way.  I've tried to sketch what it might look like before, and even then it is hard to make it nice, much less begin implementing the thing.  I'd love to have such a language, but resource constraints exist.

I also think that this approach would allow me to find out what my usage patterns look like before I commit to a more complicated architecture/tool.  I really think the design of this regex/language/DSL thing should be dominated by its usage.  This is a tricky chicken-and-egg thing because it's not currently used.  The hacky syntax you see is the bootstrapping.

Point (2) is important because, since we don't have existing usage patterns, this thing is going to change.  It's going to change /a lot/.  I want it to be easy to change.  I think a complete DSL will be harder to change quickly.

I also like how it doesn't require a lot of CTFE trickery or pushing DMD too far.  D has really cool features, but I find that when I use things like CTFE aggressively then I lose productivity because I end up spending a lot of time finding compiler bugs.  This leads to my current strategy: use the simpler features that work for sure, and only use the more advanced stuff when I really need to.  I think my syntax fits this strategy and thus contributes to point (1).

That said, it is good that even mostly-working CTFE exists and that a powerful template and metaprogramming system exists, because I don't think a compiler like this would be very practical to program otherwise.  It would be doable in other languages, but could easily suffer from performance pessimizations due to being forced to compute everything at runtime.
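
As a trivial example of the kind of low-risk CTFE I mean (invented here for illustration, not from my front-end): build a character-class table at compile time, so the runtime lexer pays only an array index per character.

// Runs through CTFE because it initializes a static immutable below.
bool[256] makeIdentTable() pure
{
    bool[256] t;
    foreach (i; 0 .. 256)
    {
        immutable c = cast(char) i;
        t[i] = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_';
    }
    return t;
}

immutable bool[256] isIdentChar = makeIdentTable();

static assert(isIdentChar['x'] && !isIdentChar[' ']);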

If anyone has an approach that shares the above strengths and looks nicer or is more powerful, I'd love to see it.

>
> 5. It's risky 'cause I've never heard of anyone taking this approach
> before. Bring on the danger!
>

The danger is the fun part! <g>

>
>> I wanted to make such a front-end so that I could easily make a C
>> backend. I believe such a compiler would be able to do that with great
>> ease.
>>
>> Needing to use D in places where it isn't available is a real
>> pain-point for me right now, and I'll probably find ways to spend time
>> on it eventually.
>
> Yeah, with a tree-transforming parser, I imagine the same thing, except
> my current fetish is to convert a certain subset of D to multiple other
> languages automatically. Then I could write libraries that can easily be
> used by an astonishingly large audience. I certainly would like to see D
> targetting Android, but that's best done directly from D to ARM.
>

That does sound very cool.  Possibly difficult though, due to having to cater to the lowest-common-denominator in all of your API designs.  No templated functions or ranges in your API, that's for sure.  I'm sure there are some things where this is very doable though; it probably depends on what kind of libraries you are writing.

As for D targeting Android, my intent is really to target X where X is any CPU/OS combo you can think of.  I want to be able to get D, the language, not necessarily phobos or other niceties, to work on any platform, and to do so without much work.  Cross-compiling to a new platform that has never been cross-compiled before should require zero coding.  Perhaps it might be possible to have a text file with some key-value configuration that tells it certain common features are available on the target, thus allowing you to have more features with almost no effort involved.
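
Something like this, perhaps (a completely made-up format; every key here is invented just to illustrate the idea):

# hypothetical target description
arch         = arm
os           = android
pointer_bits = 32
endian       = little
has_fpu      = true
has_tls      = false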

Still, I'll always take a crippled but existent D compiler that targets Android over a perfect but non-existent D compiler that targets Android.

I think that D-directly-to-ARM is the current approach to cross-compiling.  I critique it for its underwhelming lack of results.

> Anyway, the devil's in the detail. Originally I wanted to do a parser
> generator and a "completely general AST" in C# and couldn't seem to work
> out the details, but D is more flexible and is likely better suited to
> the task.

I can easily see how this is the case.  I don't think I'd be interested in doing a project like this in any other language.  I've imagined trying to do something like this in C or Java or C#, and it just doesn't seem practical.  For instance, I don't think the "use regular expressions to match AST structures" idea would work well in those languages, because it would either (1) incur a bunch of runtime overhead for compiling the expressions into DFAs at runtime, or (2) suffer from integration problems if you try to compile the expressions in separate files before compiling the rest of the front-end.
July 07, 2012
On 07/07/2012 07:04 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 22:45:01 UTC, Chad J wrote:
>> I see what you mean.
> And I think that I expressed myself badly. Let me rephrase.
>
> When the type hierarchy is deep, every level would require at least one
> bit position, or possibly every base class would require a separate bit.
> (I think it's the former, plus several bits to distinguish among
> hierarchies.) Otherwise it would not be easy to check with a bit-mask.
> Even if the above is incorrect (which is likely, since I didn't try to
> encode this compactly for a real grammar), I think that in general this
> information could only be stored in a fairly large integral type,
> especially if we generate the parser from a grammar and thus can't do
> fine-tuning to pack the information tightly. That overhead would be paid
> per AST node __instance__. But it isn't necessary, since we could store
> the information in a lookup table only once per node __type__.

Yep. Makes sense.