Pegged: Syntax Highlighting (page 2)

March 18, 2012

Re: Pegged: Syntax Highlighting

Posted by Andrei Alexandrescu
in reply to Philippe Sigaud

Permalink

Andrei Alexandrescu

Posted in reply to Philippe Sigaud

Permalink

On 3/17/12 3:53 PM, Philippe Sigaud wrote:
> On Sat, Mar 17, 2012 at 18:11, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org>  wrote:
>
>>> The D grammar is a 1000-line / hundreds of rules monster. I finished
>>> writing it and am now crushing bugs.
>>> God, that generates a 10_000 line module to parse it. I should
>>> simplify the code generator somewhat.
>>
>>
>> Science is done. Welcome to implementation :o).
>
> Hey, it's only 3.000 lines now :) Coming from a thousand-lines
> grammar, it's not that much an inflation.

That's quite promising.

> Indeed, but I fear the D grammar is a bit too complex to be easily
> walked. Now that I read it, I realize that '1' is parsed as a
> 10-levels deep leaf!
> Compared to lisp, it's... not in the same league, to say the least. I
> will see to drastically simplify the parse tree.

This is where custom directives for helping AST creation might help. Also, ANTLR solves that problem by allowing people to define tree walkers. They have much simpler grammars (heck, the hard job has already been done - no more ambiguities). At an extreme, languages such as ML are good at walking trees because they essentially embed a tree walker in their pattern matching grammar for function parameters.

> Does anyone have experience with other languages similar to D and that
> offer AST-walking? Doesn't C# have something like this?
> (I'll have a look at Scala macros)

Heck, I just found this which destroys ANTLR's tree walkers:

http://www.antlr.org/article/1170602723163/treewalkers.html

Didn't read it yet, but clearly it's an opposing viewpoint and relevant to your work (don't forget to also read the article to which it's replying http://antlr.org/article/1100569809276/use.tree.grammars.tml).

>> 1. Would you consider submitting your work to Phobos?
>
> Yes, of course. It's already Boost-licensed.
> Seeing the review processes for other modules, it'd most certainly put
> the code in great shape. But then, it's far from being submittable
> right now.

Let us know how we can help. This is an important project.

>> 2. Do you think your approach can generate parsers competitive with
>> hand-written ones? If not, why?
>
> Right now, no, if only because I didn't take any step in making it
> fast or in limiting its RAM consumption.
> After applying some ideas I have, I don't know. There are many people
> here that are parser-aware and could help make the code faster. But at
> the core, to allow mutually recursive rules, the design use classes:
>
> class A : someParserCombinationThatMayUseA { ... }
>
> Which means A.parse (a static method) is just typeof(super).parse
> (also static, and so on). Does that entail any crippling disadvantage
> compared to hand-written parser?

I'm not sure without seeing more code.


Andrei

On 3/17/12, Philippe Sigaud <philippe.sigaud@gmail.com> wrote: > If ddmd-clean is OK for you, that's cool. Keep us informed how that went. Seems to work ok: http://i.imgur.com/qGVZD.png I'd love to see if I can do it with Pegged too. I've yet to see how Pegged works internally though and whether I can expose a nice API for this one purpose.

On Tue, Mar 27, 2012 at 16:41, Andrej Mitrovic <andrej.mitrovich@gmail.com> wrote: > On 3/17/12, Philippe Sigaud <philippe.sigaud@gmail.com> wrote: >> If ddmd-clean is OK for you, that's cool. Keep us informed how that went. > > Seems to work ok: http://i.imgur.com/qGVZD.png Nice one. Care to explain how you did it?

On 3/27/12, Philippe Sigaud <philippe.sigaud@gmail.com> wrote: > Nice one. Care to explain how you did it? Sure. Currently the "editor" is just a viewer (can't edit text ironically :p), and is a port of one of the lessons of Neatpad (http://www.catch22.net/tuts/neatpad). It's win32-specific and later lessons cover very platform-specific unicode stuff so I haven't really bothered with the rest of the tutorial. What I have is one large char[]/wchar[] buffer, I store indices to newlines within this buffer and when I need to lex a certain line I just pass a slice into DDMD's lexer based on the position of the newlines. I then store the beginning of each token and its type (e.g. { index 5, TOK.TOKImport }) as an array for that specific line. It's easy to paint a line this way. But I do have a couple of issues. One is that I have no way to figure out where empty spaces are and not just spaces within string literals. The DDMD API only exposes the beginning of each token and not its length. And the lexer doesn't tokenize empty spaces between real tokens. So with a string like this: import foo; The space between 'import' and 'foo' ends up being treated as 'TOK.TOKImport'. It's not a big issue when I only have foreground coloring (empty space won't be drawn), but when I have background coloring I end up with this: http://i.imgur.com/0wUcR.png The other issue is that DDMD explicitly takes a char[] and not just any input range. The WinAPI text-drawing APIs require UTF16 arrays (the unicode-aware functions anyway), so I end up having to store two buffers, one UTF8 and one UTF16. I'm sure these issues can be fixed in DDMD though. With that being said the paint routine only takes about ~150 microseconds to finish which is pretty neat. Anyway if you have a win32 box you can clone: https://github.com/AndrejMitrovic/DNeatpad Then run: DNeatpad\WindowsAPI\build.bat DNeatpad\ddmd\build.bat DNeatpad\textview\build.bat That last one builds the "neatpad" folder as well. Anyway I was just doing this for fun I have no intention on writing text editors. :)

On 3/28/12, Andrej Mitrovic <andrej.mitrovich@gmail.com> wrote: > snip Btw it crashes sometime when I open std.datetime and scroll and resize the window. I've no idea what's causing it. I don't seem to index pass array bounds, and I'm not allocating win32 handles all the time either. A catch(Throwable) doesn't help. Oh well..

On 3/28/12, Andrej Mitrovic <andrej.mitrovich@gmail.com> wrote: > On 3/28/12, Andrej Mitrovic <andrej.mitrovich@gmail.com> wrote: >> snip Accidentally left out ddmd from the repo but now it's in. I think it should compile now. Let me know if it doesn't.

On 3/27/12, Philippe Sigaud <philippe.sigaud@gmail.com> wrote: > snip Philippe your example on this wiki page doesn't seem to work: https://github.com/PhilippeSigaud/Pegged/wiki/ import pegged.grammar; mixin(grammar( "Expr <- Factor AddExpr* AddExpr <- ('+'/'-') Factor Factor <- Primary MulExpr* MulExpr <- ('*'/'/') Primary Primary <- Parens / Number / Variable / '-' Primary Parens <- '(' Expr ')' Number <~ [0-9]+ Variable <- Identifier")); void main() { auto parseTree2 = Expr.parse(" 0 + 123 - 456 "); writeln(parseTree2.capture); } ["Expr failure at pos [index: 0, line: 0, col: 0]", "Factor failure at pos [index: 0, line: 0, col: 0]", "Primary failure at pos [index: 0, line: 0, col: 0]", "Parens failure at pos [index: 0, line: 0, col: 0]", "Lit!(() failure at pos [index: 0, line: 0, col: 0]"]

On 3/28/12, Andrej Mitrovic <andrej.mitrovich@gmail.com> wrote: > On 3/27/12, Philippe Sigaud <philippe.sigaud@gmail.com> wrote: >> snip > > Philippe your example on this wiki page doesn't seem to work: https://github.com/PhilippeSigaud/Pegged/wiki/ Actually it seems to work if I remove all the spaces from the input string. Maybe the grammar is just missing another rule that allows spaces?

On 3/28/12, Andrej Mitrovic <andrej.mitrovich@gmail.com> wrote: > On 3/28/12, Andrej Mitrovic <andrej.mitrovich@gmail.com> wrote: >> On 3/27/12, Philippe Sigaud <philippe.sigaud@gmail.com> wrote: >>> snip >> >> Philippe your example on this wiki page doesn't seem to work: https://github.com/PhilippeSigaud/Pegged/wiki/ > > Actually it seems to work if I remove all the spaces from the input string. Maybe the grammar is just missing another rule that allows spaces? Okay I got it, you've recently changed some code. I can see it mentioned in the readme: By default, the grammars do not silently consume spaces, as this is the standard behavior for PEGs. There is an opt-out though, with the simple `<` arrow instead of `<-` (you can see it in the previous example) So yeah, if I change to '<' it works. :)

Forums