March 01, 2012
On 2012-03-01 18:21, Philippe Sigaud wrote:
>  > > But how to associate actions with a rule, in that case? I mean, some
>  > > rules will have actions, some not.
>  >
>  > You could maybe just put D code in the grammar string, which gets
>  > compiled as a string mixin by CTFE?
>
> I could, but one of my driving goals while beginning this project a
> month ago was to use the shiny new lambda syntax directly :)
>
> "rule1", o => o
> "rule2", o => o
>
> etc.
>

Maybe we can take some ideas from the CoffeeScript compiler (written in CoffeeScript):

Grammar: http://jashkenas.github.com/coffee-script/documentation/docs/grammar.html

Lexer: http://jashkenas.github.com/coffee-script/documentation/docs/lexer.html

-- 
/Jacob Carlborg
March 04, 2012
(2012/03/01 1:19), Andrei Alexandrescu wrote:
> On 2/28/12 8:58 AM, Hisayuki Mima wrote:
>> Admittedly, I haven't yet written the documentation for my library, ctpg.
>> (But a few examples available here:
>> https://github.com/youkei/ctpg/tree/master/example)
>
> Nice! I have a few comments. I should say that they entail a fair amount
> of work.
>
> The current setup makes things very difficult for a common case -
> generating a parse tree. The user would need to insert a lot of custom
> code to create the appropriate nodes, but that's very easy for the
> parser because it already has the components. The parser should have a
> "build AST" option, in which case it should build all nodes without much
> effort from the user. That's what ANTLR does, and it has a special
> operator "^" to indicate where the root of the tree should be
> (http://www.antlr2.org/doc/trees.html).
>
> So here's what I think your examples should look like:
>
> The syntax could be changed a bit - it should be less like D and more
> like PEG. The "$" is implicit and shouldn't be needed. The ";" is a
> useful terminator for large productions spanning more than one line, so
> let's keep it. I don't understand the use of "!", for example the PEG
> for expression parsing at
> http://en.wikipedia.org/wiki/Parsing_expression_grammar is:
>
> Value ← [0-9]+ / '(' Expr ')'
> Product ← Value (('*' / '/') Value)*
> Sum ← Product (('+' / '-') Product)*
> Expr ← Sum
>
> but your grammar has:
>
> int primary = !"(" addExp !")" / intLit_p;
>
> whereas it seems to me it should be
>
> int primary = "(" addExp ")" / intLit_p;
>
> No?
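For reference, the Wikipedia grammar quoted above maps directly onto a hand-written recursive-descent evaluator. A minimal Python sketch (the function names are mine, not part of ctpg or any other library):

```python
def parse_value(s, i):
    """Value <- [0-9]+ / '(' Expr ')'"""
    if i < len(s) and s[i].isdigit():
        j = i
        while j < len(s) and s[j].isdigit():
            j += 1
        return int(s[i:j]), j
    if i < len(s) and s[i] == '(':
        v, j = parse_expr(s, i + 1)
        assert j < len(s) and s[j] == ')'  # sketch: no real error reporting
        return v, j + 1
    raise SyntaxError(f"expected Value at index {i}")

def parse_product(s, i):
    """Product <- Value (('*' / '/') Value)*"""
    v, i = parse_value(s, i)
    while i < len(s) and s[i] in '*/':
        op = s[i]
        rhs, i = parse_value(s, i + 1)
        v = v * rhs if op == '*' else v / rhs
    return v, i

def parse_sum(s, i):
    """Sum <- Product (('+' / '-') Product)*"""
    v, i = parse_product(s, i)
    while i < len(s) and s[i] in '+-':
        op = s[i]
        rhs, i = parse_product(s, i + 1)
        v = v + rhs if op == '+' else v - rhs
    return v, i

def parse_expr(s, i=0):
    """Expr <- Sum; returns (value, index past the parsed text)."""
    return parse_sum(s, i)
```

Note how each production becomes one function and the PEG needs no separate drop operator here, because the evaluator simply never stores the parentheses.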
>
> Here's what I think a parser that also builds an AST looks like:
>
> mixin generateParsers!(
> Options.createAST,
> q{
> root <- addExp;
> addExp <- mulExp (("+" / "-")^ addExp)?
> mulExp <- primary (("*" / "/")^ mulExp)?
> primary <- "(" addExp ")"% / intLit_p;
> }
> );
>
> I used the following notation: a node suffixed with "^" becomes the root
> of the produced AST, and a node suffixed with "%" will be dropped
> altogether (that's because ")" is not needed in the AST after the
> expression has been parsed). With only three characters I informed the
> parser what the shape of the AST will be.
>
> At this point calling parse!root("1+2*3") would return an object of type
> ASTNode!"addExp", which in turn has two children, lhs and rhs. The lhs
> node has type ASTNode!"intLit_p" which has inside a value "1". The rhs
> node has type ASTNode!"mulExp", and that in turn has two children nodes
> lhs and rhs, both of type ASTNode!"intLit_p", each with its own value
> ("2" and "3", respectively). All these nodes are dynamically created by
> the parsing process and inherit a common type, e.g. ASTNode!"" or some
> type ASTRootNode.
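As an illustration of the tree shape described above, here is a rough Python model (the ASTNode class and builder functions are hypothetical stand-ins, not ctpg code): the "^"-marked operator becomes the root of its subtree, and the "%"-dropped ")" never appears in the result.

```python
class ASTNode:
    """A dynamically created node; 'rule' plays the role of the
    ASTNode!"..." template argument in the description above."""
    def __init__(self, rule, value=None, lhs=None, rhs=None):
        self.rule, self.value, self.lhs, self.rhs = rule, value, lhs, rhs

def primary(s, i):
    # primary <- "(" addExp ")"% / intLit_p   -- ')' dropped, like '%'
    if s[i] == '(':
        node, j = add_exp(s, i + 1)
        return node, j + 1
    j = i
    while j < len(s) and s[j].isdigit():
        j += 1
    return ASTNode('intLit_p', value=int(s[i:j])), j

def mul_exp(s, i):
    # mulExp <- primary (("*" / "/")^ mulExp)?  -- operator becomes root
    lhs, i = primary(s, i)
    if i < len(s) and s[i] in '*/':
        op = s[i]
        rhs, i = mul_exp(s, i + 1)
        return ASTNode('mulExp', value=op, lhs=lhs, rhs=rhs), i
    return lhs, i

def add_exp(s, i=0):
    # addExp <- mulExp (("+" / "-")^ addExp)?
    lhs, i = mul_exp(s, i)
    if i < len(s) and s[i] in '+-':
        op = s[i]
        rhs, i = add_exp(s, i + 1)
        return ASTNode('addExp', value=op, lhs=lhs, rhs=rhs), i
    return lhs, i
```

Parsing "1+2*3" with this sketch yields an 'addExp' root whose lhs is the intLit_p node for 1 and whose rhs is a 'mulExp' node over 2 and 3, matching the structure described in the paragraph above.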
>
>> So, I'd like to describe it here.
>>
>> To be honest, ctpg is inspired by one module of Scala's standard
>> library, Parser Combinators.
>> One of the differences between Parser Combinators and ctpg is that
>> Parser Combinators generates parsers in run-time, but ctpg generates
>> parsers in compile-time by the power of CTFE and mixin.
>>
>> A parser generated by ctpg is a recursive descent parser, so it does
>> lexical analysis and parsing at the same time.
>
> Oh, I forgot this property of PEGs - integrated lexing and parsing. So
> no need for a separate lexer!
>
>> The input types it accepts are string, wstring, dstring, and any
>> ForwardRange whose element type is char, wchar or dchar.
>
> Great. I think we should actually define the notion of BufferedRange.
> I'll get to that topic later.
>
>> So, dividing lexical analysis and parsing into two processes is
>> difficult in ctpg.
>
> Yes, sorry I forgot about that!
>
>> Importantly, a parser generated by ctpg is statically typed, as the
>> "parsing simple arithmetic expression" example shows.
>> Generally a parser generated by another tool, one that accepts tokens,
>> returns an abstract syntax tree, but here it returns the evaluated value.
>> In other words, it does lexical analysis, parsing and (type) conversion
>> at the same time.
>> If you simply want an abstract syntax tree, it may be a little painful
>> to use ctpg.
>>
>> That's all I'd like to say about ctpg here.
>
> Why would it be difficult to integrate AST creation with parsing? I'm
> not sure I understand this. My understanding is that you should be able
> to add a couple of simple operators ("^" and "%" as described above) to
> inform AST creation, and you're done!
>
>
> Thanks,
>
> Andrei
>

Certainly, the "build AST" option is very interesting.
I don't know whether building an AST is the common case, because I'm unfamiliar with syntax analysis.
But I'd like to complete ctpg as a D version of boost::spirit::qi first.
Currently, ctpg can probably be used like boost::spirit::qi. (Both do typed syntax analysis.)
There are some points where ctpg is superior to boost::spirit::qi. For example, source code written with ctpg is simple because it is close to PEG.
In addition, I'm planning to improve error messages by making good use of #line and __LINE__.
I'd like to write a library which is functionally the same as boost::spirit::qi and written in D style.
Hence, I'd like to make implementing the "build AST" option, i.e. untyped syntax analysis, a low priority.
What do you think?

P.S. The prefix operator "!" is the same as the postfix operator "%" you described.
The types produced by the parsers on both sides of the "/" operator must be compatible.
The type of "(" addExp ")" is Tuple!(string, int, string) and the type of intLit_p is int, so they are not compatible.
The type of !"(" addExp !")" is int, so !"(" addExp !")" and intLit_p are compatible.
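The point about compatible result types can be illustrated outside D's type system. In this hypothetical Python sketch (none of these function names come from ctpg), the "type" of each alternative is the shape of the value it returns:

```python
def int_lit(s, i):
    """intLit_p: the result is an int."""
    j = i
    while j < len(s) and s[j].isdigit():
        j += 1
    return int(s[i:j]), j

def parens_keep(s, i):
    """'(' intLit ')' with nothing dropped: the result is a 3-tuple,
    i.e. Tuple!(string, int, string) in ctpg terms."""
    v, j = int_lit(s, i + 1)
    return ('(', v, ')'), j + 1

def parens_drop(s, i):
    """!'(' intLit !')': the literals are dropped, so the result is a
    bare int, the same shape as int_lit's result."""
    v, j = int_lit(s, i + 1)
    return v, j + 1

# Only parens_drop and int_lit return the same shape, so only those two
# could be the alternatives of a typed '/' (ordered choice) operator.
```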

Thanks,
Hisayuki Mima
March 04, 2012
On 3/4/12 8:34 AM, Hisayuki Mima wrote:
> Certainly, the "build AST" option is very interesting.
> I don't know whether building an AST is the common case, because I'm
> unfamiliar with syntax analysis.
> But I'd like to complete ctpg as a D version of boost::spirit::qi first.
> Currently, ctpg can probably be used like boost::spirit::qi. (Both do
> typed syntax analysis.)
> There are some points where ctpg is superior to boost::spirit::qi. For
> example, source code written with ctpg is simple because it is close
> to PEG.
> In addition, I'm planning to improve error messages by making good use
> of #line and __LINE__.
> I'd like to write a library which is functionally the same as
> boost::spirit::qi and written in D style.
> Hence, I'd like to make implementing the "build AST" option, i.e.
> untyped syntax analysis, a low priority.
> What do you think?

It's your project so you define what's fun to work on. Let me just say that I have worked on a lot of parsing-related stuff, and most often projects that start with "this grammar is simple enough, let's just do processing during parsing" ultimately required a rewrite to generate an AST, process it, and then use it for computation.

You can get away without an AST for the simplest grammars (e.g. printf etc.) but in that case you compete with hand-written code. I wouldn't think of parsing a serious grammar with more than a few productions without generating an AST. My intuition is that your work will best excel at those grammars.


Andrei

March 05, 2012
Hi Nick,

On Wednesday, 29 February 2012 at 17:16:44 UTC, Nick Sabalausky
wrote:
> One thing I've been wanting to see in a lexer's regex would be a way to
> match things like D's nested comments or:

I have already implemented a lexer generator that can handle
recursion in the token definitions (using multiple lexer states).
See http://www.benhanson.net/lexertl.html

The library is C++ and generates lexers at runtime, but the
concepts should be easily transferable. Basically I have a
boolean for pop in the end state and a separate column for
push_dfa_:

     if (end_state_)
     {
         // Return longest match
         if (pop_)
         {
             start_state_ = results_.stack.top ().first;
             results_.stack.pop ();
         }
         else if (push_dfa_ != results_.npos ())
         {
             results_.stack.push (typename results::id_type_pair
                 (push_dfa_, id_));
         }
.
.
.
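The push/pop logic above can be sketched as a tiny hand-written scanner for D-style nested /+ +/ comments, with an explicit stack standing in for lexertl's lexer-state stack (illustrative Python, not lexertl's actual API):

```python
def scan_nested_comment(s, i):
    """Scan a D-style '/+ ... +/' comment starting at index i.
    Nesting is tracked with an explicit stack: each '/+' pushes a
    state, each '+/' pops one, mirroring push_dfa_/pop_ above."""
    assert s.startswith('/+', i)
    stack = [i]            # one entry per currently open '/+'
    i += 2
    while i < len(s) and stack:
        if s.startswith('/+', i):
            stack.append(i)    # push: enter a nested comment state
            i += 2
        elif s.startswith('+/', i):
            stack.pop()        # pop: return to the enclosing state
            i += 2
        else:
            i += 1
    if stack:
        raise SyntaxError("unterminated nested comment")
    return i               # index just past the final '+/'
```

A plain regular expression cannot count the nesting depth, which is exactly why the lexer needs a stack of states for tokens like this.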

I'm happy to answer any questions etc.

Regards,

Ben
March 05, 2012
On Sunday, 4 March 2012 at 21:00:00 UTC, Andrei Alexandrescu
wrote:
> You can get away without an AST for the simplest grammars (e.g. printf etc.) but in that case you compete with hand-written code. I wouldn't think of parsing a serious grammar with more than a few productions without generating an AST. My intuition is that your work will best excel at those grammars.

This chimes nicely with the comments made at the Going Native
conference with regard to clang generating an AST and MSVC and
GCC not. Herb Sutter referred to it as "AST envy" (as in he
wished MSVC took the AST route).

I only mention it as maybe not everyone here watched that vid.
Andrei himself was at the conference of course!

Regards,

Ben
March 07, 2012
"Nick Sabalausky" <a@a.a> wrote in message news:jikfpi$25cc$1@digitalmars.com...
> "Nick Sabalausky" <a@a.a> wrote in message news:jikcit$201o$1@digitalmars.com...
>>
>> Hmm, maybe I need to think about what it would take to make Goldie able to parse at compile-time...
>>
>
> Just gave it a quick shot. It was looking like it might not be too bad, but then I hit:
>
> Assertion failure: 'ctfeStack.stackPointer() == 0' on line 4823 in file 'interpret.c'
>
> Bleh.
>
> (This was with DMD 2.058)

Filed: http://d.puremagic.com/issues/show_bug.cgi?id=7667


May 27, 2012
>
>
> Generally a parser generated by another tool, one that accepts tokens,
> returns an abstract syntax tree, but here it returns the evaluated value.
> In other words, it does lexical analysis, parsing and (type) conversion
> at the same time.
> If you simply want an abstract syntax tree, it may be a little painful
> to use ctpg.
>

Hello Youkei,

I am trying to use CTPG for compile-time parsing of a DSL I am working on. I have tried the examples you created in the examples directory.

I would like the parser to have some side effects. For this purpose, I tried including the parser mixin in a class, but I got a strange error saying:

Error: need 'this' to access member parse

I have copied the code I am trying to compile at the end of this email. Let me know what I could be doing wrong here.

Regards
- Puneet


import ctpg;
import std.array: join;
import std.conv: to;

class Foo
{
  int result;
  mixin(generateParsers(q{
        int root = mulExp $;

        int mulExp =
          primary !"*" mulExp >> (lhs, rhs){ return lhs * rhs; }
        / primary;

        int primary = !"(" mulExp !")" / [0-9]+ >> join >> to!int;
      }));

  void frop() {
    result = parse!root("5*8");
  }
}


void main(){
  Foo foo = new Foo();
  foo.frop();
}


May 27, 2012
>
> I would like the parser to have some side effects. For this purpose, I tried including the parser mixin in a class, but I got a strange error saying:
>
> Error: need 'this' to access member parse
>
>
Ok. I see my folly. At compile time, there would not be any "this" since the class has not been instantiated yet.

I will have to think of other solutions. Basically, I am trying to use the parser to create functions that I can use at run time. But I wanted to create two functions from the same parser. I have succeeded in creating one function. I do not want to create two separate parsers, because each of them would add to the memory footprint of the compiler.

Any ideas? Maybe I could use tuples here?

Regards
- Puneet


May 27, 2012
I'm not sure I follow all the details of what Andrei's suggesting and what's being talked about here; this parser/lexer stuff is still very new to me, so this may be a bit off-topic. However, I thought I'd weigh in on something that impressed me about the Nimrod language: its direct AST access/manipulation.

Nim has a "template" which is very much like D's mixin templates. For example:

  # Nim
  template foo(b:string) =
    var bar = b

  block main:
    foo("test")
    assert(bar == "test")

and the equivalent in...

  // D
  mixin template foo(string b) {
    auto bar = b;
  }

  void main() {
    mixin foo("test");
    assert(bar == "test");
  }

which is all very straightforward stuff; the cool part comes with Nim's macros. Nim has two unique types: expr & stmt (expression & statement). They're direct AST structures which can be passed to template/macro procedures and arbitrarily mutated. Example:

  macro foo(s:stmt): stmt =
    result = newNimNode(nnkStmtList)
    for i in 1 .. s.len-1:
      var str = s[i].toStrLit()
      result.add(newCall("echo", str))

  block main:
    foo:
      bar
      baz

the above code prints:
"
  bar
  baz
"

**Some notes: result is what's returned, and the reason you can use "foo" with a statement body is that any macro/template whose last parameter is of type 'stmt' can be called with block semantics, similar to how UFCS works with the first parameter.**

The above *might* look like the following in D:

  macro foo(ASTNode[] stmts...) {
    ASTNode[] result;
    foreach (s; stmts) {
      auto str = s.toASTString();
      result ~= new ASTCall!"writeln"(str);
    }
    return result;
  }

  void main() {
    foo {
      bar;
      baz;
    }
  }
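For comparison, Python's standard ast module allows the same kind of transformation, though at the library level rather than inside the compiler. This sketch does roughly what the Nim macro above does: it rewrites each statement in a block into a print of that statement's own source (a Python stand-in, not a D proposal; requires Python 3.9+ for ast.unparse):

```python
import ast

def foo(src):
    """Rewrite each top-level statement in src into print('<its source>')."""
    tree = ast.parse(src)
    calls = []
    for stmt in tree.body:
        text = ast.unparse(stmt)  # the statement rendered back to source
        calls.append(ast.Expr(ast.Call(
            func=ast.Name(id='print', ctx=ast.Load()),
            args=[ast.Constant(value=text)], keywords=[])))
    module = ast.Module(body=calls, type_ignores=[])
    return ast.unparse(ast.fix_missing_locations(module))
```

Calling foo("bar\nbaz") returns the two-line program print('bar') / print('baz'), which when run prints the statement names just like the Nim example.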

This kind of direct AST manipulation + body semantics opens the doors for a lot of cool things. If you read through Nim's library documentation you'll notice many of the "language" features are actually just library procedures in the system module, which is included by default. This is great because contributions to the *language* can be made very easily. Also, the infrastructure to read/write the AST is no doubt immensely useful for IDE support and other dev tools.

I'm not a huge fan of everything in Nimrod, but this is something they definitely got right, and I think D could gain from their experience.
May 27, 2012
On Thursday, 1 March 2012 at 15:10:36 UTC, Philippe Sigaud wrote:
>>> mixin(Grammar!("Doc <- Node*"
>>> "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
>>> "OpeningTag <- '<' Identifier '>'", OpeningAction,
>>> "ClosingTag <-  `</` Identifier '>'", ClosingAction,
>>> "Text <- (!(OpeningTag / ClosingTag) _)+"));
>>
>>
>> That looks about right, but still has a fair amount of noise. I think the
> approach of putting the entire grammar in one string is best.
>
> Yes, using one string is indeed better. That won't be too difficult to code.

I'm wondering if people have seen LPeg. It's a Lua library, but the design is interesting in that patterns are first-class objects which can be composed with operator overloading.

http://www.inf.puc-rio.br/~roberto/lpeg/
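For those who haven't used it, the core idea of LPeg can be sketched in a few lines of Python: a pattern class whose overloaded operators build bigger patterns. (LPeg itself spells sequence as `*` and ordered choice as `+`; the class below is my own illustration, not LPeg's API.)

```python
class P:
    """An LPeg-style first-class pattern. match(s, i) returns the new
    index on success or None on failure."""
    def __init__(self, what):
        self.match = what if callable(what) else self._literal(what)

    @staticmethod
    def _literal(text):
        def m(s, i):
            return i + len(text) if s.startswith(text, i) else None
        return m

    def __add__(self, other):
        # sequence: match self, then other (LPeg writes this p1 * p2)
        def m(s, i):
            j = self.match(s, i)
            return other.match(s, j) if j is not None else None
        return P(m)

    def __or__(self, other):
        # ordered choice: try self, fall back to other (LPeg: p1 + p2)
        def m(s, i):
            j = self.match(s, i)
            return j if j is not None else other.match(s, i)
        return P(m)

# Patterns compose like ordinary values: "a" followed by "b" or "c".
pat = P("a") + (P("b") | P("c"))
```

Because patterns are plain values, a grammar becomes an ordinary expression the host language can abstract over, which is the design point the LPeg paper makes.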