D grammar? (page 2) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » D » D grammar? (page 2)

August 08, 2003

Posted by Mike Wynn
in reply to Walter

Mike Wynn

Posted in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:bh0tjn$9cu$1@digitaldaemon.com...
>
> "Ivan Senji" <ivan.senji@public.srce.hr> wrote in message news:bgvq53$28sd$1@digitaldaemon.com...
> > I know I'm asking stupid questions:
> > I like D very much and i would like to write a parser for it (for fun).
> > Does arbitrary lookahead mean LR(1) grammar/parser?
> > Im writing a program that creates a LR(1) parser table from the grammar
> > but i would like to know is LR(1) the right parser?
>
> I can never remember the difference between LL, LR, LALR, etc., probably because I never use parser generators. But it is not (1) because it
requires
> arbitrary lookahead, mostly in trying to figure out if a sequence of
tokens
> is a type or an expression, declaration or statement.
>
LL(n) is a top down parser i.e. recursive decent or predictive rd.
LR is a bottom up parser (shift reduce)
LALR is a lookahead LR

LL parsers have problems with e ::= e + t; so you have to rewrite your grammers to avoid left recursion

as far as I know (my dragon book claims anyway) yacc is an LALR parser
JavaCC was LL(n) and Antlr is either LL(n) or LALR

unlike C or C++ you do not know if an id is an id or a type so you have to go quite a way until you know if you have a variable declare or an assign

simple examples to try out
id[id] a;  // declare
id[id] = s; // assign
this could of course be
id.id[id.id[id.id[id]][id]] etc
and func pointers I think cause similar issues
id(*id)(id, id )   // type or function call that returns func ptr that is
called ? given that

worse is c cast ..  (id)(id)  // func call or cast ? but I think they should
be outlawed

I think you'll just have to try it out, a D parser generator would be a good
tool anyway and I'm sure there are tweeks that can be put in to resolve some
issues and if you build a tree then work out what it is I think LALR should
handle it initially.
it would be a good excersise, especially with all the random good ideas ppl
have, would allow someone to try out a given syntax and see if it would
actually work or not.

August 08, 2003

Posted by Russ Lewis
in reply to Bill Cox

Russ Lewis

Posted in reply to Bill Cox

YACC (and Bison) is LALR(1).  LALR(1) cannot parse C, D, or other similar languages without some nasty hacks.

LALR(1) is LookAhead Left Right (1 symbol of lookahead).  It means that it reads the tokens starting at the left, but interprets them starting from the right.  Basically, it means that it can only evaluate the topmost token(s) on the stack.  The 1 symbol of lookahead means that you can read one additional symbol before deciding how to evaluate the topmost tokens.

As Burton said, the classic problem that YACC has is parsing the difference between expressions and type declarations.  A grammar to parse a statment might include the following rules:

statement:
	  declaration ';'
	| expression ';'
;

declaration:
	  type IDENT
;

type:
	  IDENT
	| type '*'
;

expression:
	  IDENT
	| expression '*' expression
;

Now, as an LALR(1) parser, how do you parse this series of tokens:

IDENT '*' IDENT ';'

This could be either a declaration of a pointer, or an expression where two variables are multiplied together.

An LALR(1) parser MUST be able to parse the top tokens on the stack with only 1 symbol of lookahead.  The grammar requires that the parser make a reduction of the first IDENT token, either IDENT->type or IDENT->expression.  However, the parser cannot tell from the single token of lookahead which is valid - either could work!

Thus, the grammar given above CANNOT be parsed by an LALR(1) parser.

You can try to hack around things, if you want:

statement:
	  IDENT stars IDENT ';'
;

stars:
	  '*'
	| stars '*'
;

This grammar is parsable by an LALR(1) parser.  But it isn't really readable!

Right when D first came out, I decided to write a parser for it.  I started with bison (GNU's yacc alternative).  I never had any success parsing D because of problems like this.  The grammar eventually gets too ugly to maintain, and even with all the ugliness, my grammar still had ambiguities.  In my mind, it's an open question whether it is even possible to devicse an LALR(1) grammar that parses C or D.

So I wrote my own parser generator :)

Bill Cox wrote:
> Walter wrote:
> 
>> "Ivan Senji" <ivan.senji@public.srce.hr> wrote in message
>> news:bgvq53$28sd$1@digitaldaemon.com...
>>
>>> I know I'm asking stupid questions:
>>> I like D very much and i would like to write a parser for it (for fun).
>>> Does arbitrary lookahead mean LR(1) grammar/parser?
>>> Im writing a program that creates a LR(1) parser table from the grammar
>>> but i would like to know is LR(1) the right parser?
>>
>>
>>
>> I can never remember the difference between LL, LR, LALR, etc., probably
>> because I never use parser generators. But it is not (1) because it requires
>> arbitrary lookahead, mostly in trying to figure out if a sequence of tokens
>> is a type or an expression, declaration or statement.
>>
>> The parser (parse.c) is set up to make lookahead easy.
>>
>>
> 
> Hi, Walter.
> 
> I believe LR(1) is what YACC does (maybe it's LL(1)).
> 
> YACC has some abilities to look ahead.  It pushes "unreduced" tokens on the token stack, and waits for a rule to be able to reduce them.  I doubt D's type declarations are a problem.  Generally, grammers become non-YACC friendly when parsing depends on context information.  For example, foo(bar) in an expression can be a function call, or a cast expression in C++.  You can't build a reasonable data structure for it when you read it, and instead have to come back later and patch it.
> 
>  From what I've seen of D, the grammer looks YACC friendly.  I could be talked into verifying this, if you could create a text file of BNF rules for D.
> 
> Bill
>

August 09, 2003

Posted by Walter
in reply to Russ Lewis

Walter

Posted in reply to Russ Lewis

"Russ Lewis" <spamhole-2001-07-16@deming-os.org> wrote in message news:bh15g0$gq9$1@digitaldaemon.com...
> So I wrote my own parser generator :)

You're obviously a kindred spirit!

August 09, 2003

Posted by Walter
in reply to Mike Wynn

Walter

Posted in reply to Mike Wynn

"Mike Wynn" <mike.wynn@l8night.co.uk> wrote in message news:bh13pd$f5f$1@digitaldaemon.com...
> I think you'll just have to try it out, a D parser generator would be a
good
> tool anyway and I'm sure there are tweeks that can be put in to resolve
some
> issues and if you build a tree then work out what it is I think LALR
should
> handle it initially.
> it would be a good excersise, especially with all the random good ideas
ppl
> have, would allow someone to try out a given syntax and see if it would actually work or not.

In my experience, I've been able to crank out a hand-built parser than someone using Bison could do. So I never saw the point to using those tools <g>.

The biggest improvement the D grammar has over C/C++ is that it does not require semantic information in order to lex or to parse. This makes it a *lot* easier to build syntax sensitive editors and other types of code analysis tools.

August 09, 2003

Posted by Fabian Giesen
in reply to Walter

Fabian Giesen

Posted in reply to Walter

> I can never remember the difference between LL, LR, LALR, etc., probably because I never use parser generators. But it is not (1) because it requires arbitrary lookahead, mostly in trying to figure out if a sequence of tokens is a type or an expression, declaration or statement.

Any LR(k) grammar can be converted to a LR(1) grammar that parses the
exact same language, at the expense of the need for more states (and a
more complicated source grammar). That's why most LR parser generators
don't bother providing more than one token of lookahead.

-fg

August 09, 2003

Posted by Mike Wynn
in reply to Walter

Mike Wynn

Posted in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:bh1kt5$uua$2@digitaldaemon.com...
>
> "Mike Wynn" <mike.wynn@l8night.co.uk> wrote in message news:bh13pd$f5f$1@digitaldaemon.com...
> > I think you'll just have to try it out, a D parser generator would be a
> good
> > tool anyway and I'm sure there are tweeks that can be put in to resolve
> some
> > issues and if you build a tree then work out what it is I think LALR
> should
> > handle it initially.
> > it would be a good excersise, especially with all the random good ideas
> ppl
> > have, would allow someone to try out a given syntax and see if it would actually work or not.
>
> In my experience, I've been able to crank out a hand-built parser than someone using Bison could do. So I never saw the point to using those
tools
> <g>.
the reason for (as I see it) using Bison, Antlr etc is that you can "prove"
your grammer does not have obscure conflicts.
I agree that a hand crafted parser can be easier, just little things like C
comments are easy to deal with in a hand written parser (once you've read
'/*' ignore everything until you get '*/') but a pain to write in antlr for
example
ML_COMMENT
 : "/*"
  ( { LA(2)!='/' }? '*'
  | '\n' { newline(); }
  | ~('*'|'\n')
  )*
  "*/"
   { $setType(Token.SKIP); }
 ;

but I don't see that that alone makes parser/lexer/treewalker generating
tools redundant they do offer a level of maintainability that a hand written
parser can lack.
it not always a case of who can write the parser first, but also who's is
the most robust and can detect that the original grammer had flaws.
and who could add new features without breaking to much ?
hand written parsers do allow blending of LL and LR techniques but I'm not
100% convinced that that is altogether a good idea, quite a lot of LR
grammers can be converted to LL(n) grammer and if you have a grammer that
can not be written as LR(n) or LL(n) is it a "good" grammer ?


> The biggest improvement the D grammar has over C/C++ is that it does not require semantic information in order to lex or to parse. This makes it a *lot* easier to build syntax sensitive editors and other types of code analysis tools.

that would imply that the grammer should be easy to write.
but you do have to perform 2 passes over the parsed tree due to this lack of
semantic info
in C (or C++) at the time when you parse a type identifier you know what it
is, in D you do not.

August 10, 2003

Posted by Helium
in reply to Walter

Helium

Posted in reply to Walter

>The biggest improvement the D grammar has over C/C++ is that it does not require semantic information in order to lex or to parse. This makes it a *lot* easier to build syntax sensitive editors and other types of code analysis tools.

But are there any tools like this? e.g. a refactoring tool? I haven't seen one.

August 11, 2003

Posted by Peter Hercek
in reply to Ivan Senji

Peter Hercek

Posted in reply to Ivan Senji

Attachments:

d-grammar.txt

Hi Ivan,

It would be great if you (or anybody) would create D grammar in BNF
 or EBNF. I tried it myself a few month ago from the web documentation,
 but the grammar (on the web) had tons of trivial errors and was not
 complete. I tried to get it from sources, but found out that the parser
 is hand writen. Uff, I have given up. I did not feel like courageous
 enough to reverse engineer the souce code to create BNF grammar :o(
 Shame on me, but to reverse engineer the souce code was too much.

I attached the BNF grammar, I was able to extract from the web. Anyway,
 it is few months old. If you (or anybody) can update the file, I would
 like to play with is and generate a *nice* html grammar for D. (I should
 be able to generate antlr grammar description from it too, but I do not
 have this implemented.) I created some XML schema for grammar description
 and XSLT template to generate the html file; it was my fun project to
 learn what it is XSLT. To have idea how the hyperlinked grammar would
 look like check out this C# grammar (there was no source code reverse
 engineering there :o) )

http://peter.hercek.sk/c-sharp-grammar.html

It contains both forward and backward links (ie "show me all the rules,
 which do use this symbol" can be achieved by clicking on the rule head).
 Tested only for IE.

So, if somebody wants to move D grammar on, please respond to this
 message (to group of course). But it requires a lot of source code
 studying unfortunately! I would tell that D grammar is protected through
 obcurity :o). Seriously now, I may be helpfull with some tools to look
 up trivial problems etc.

Peter.


"Ivan Senji" <ivan.senji@public.srce.hr> wrote in message news:bgvq53$28sd$1@digitaldaemon.com...
> I know I'm asking stupid questions:
> I like D very much and i would like to write a parser for it (for fun).
> Does arbitrary lookahead mean LR(1) grammar/parser?
> Im writing a program that creates a LR(1) parser table from the grammar
> but i would like to know is LR(1) the right parser?
>
> "Walter" <walter@digitalmars.com> wrote in message news:bgon6j$1kfd$2@digitaldaemon.com...
> >
> > "Ivan Senji" <ivan.senji@public.srce.hr> wrote in message news:bgntfi$qhg$1@digitaldaemon.com...
> > > Is it LL(1)? Can it be converted to LL(1)?
> >
> > No, it requires arbitrary lookahead.

August 11, 2003

Posted by Bill Cox
in reply to Russ Lewis

Bill Cox

Posted in reply to Russ Lewis

Russ Lewis wrote:
> YACC (and Bison) is LALR(1).  LALR(1) cannot parse C, D, or other similar languages without some nasty hacks.
> 
> LALR(1) is LookAhead Left Right (1 symbol of lookahead).  It means that it reads the tokens starting at the left, but interprets them starting from the right.  Basically, it means that it can only evaluate the topmost token(s) on the stack.  The 1 symbol of lookahead means that you can read one additional symbol before deciding how to evaluate the topmost tokens.
> 
> As Burton said, the classic problem that YACC has is parsing the difference between expressions and type declarations.  A grammar to parse a statment might include the following rules:
> 
> statement:
>       declaration ';'
>     | expression ';'
> ;
> 
> declaration:
>       type IDENT
> ;
> 
> type:
>       IDENT
>     | type '*'
> ;
> 
> expression:
>       IDENT
>     | expression '*' expression
> ;
> 
> Now, as an LALR(1) parser, how do you parse this series of tokens:
> 
> IDENT '*' IDENT ';'
> 
> This could be either a declaration of a pointer, or an expression where two variables are multiplied together.
> 
> An LALR(1) parser MUST be able to parse the top tokens on the stack with only 1 symbol of lookahead.  The grammar requires that the parser make a reduction of the first IDENT token, either IDENT->type or IDENT->expression.  However, the parser cannot tell from the single token of lookahead which is valid - either could work!
> 
> Thus, the grammar given above CANNOT be parsed by an LALR(1) parser.
> 
> You can try to hack around things, if you want:
> 
> statement:
>       IDENT stars IDENT ';'
> ;
> 
> stars:
>       '*'
>     | stars '*'
> ;
> 
> This grammar is parsable by an LALR(1) parser.  But it isn't really readable!
> 
> Right when D first came out, I decided to write a parser for it.  I started with bison (GNU's yacc alternative).  I never had any success parsing D because of problems like this.  The grammar eventually gets too ugly to maintain, and even with all the ugliness, my grammar still had ambiguities.  In my mind, it's an open question whether it is even possible to devicse an LALR(1) grammar that parses C or D.
> 
> So I wrote my own parser generator :)
> 
> 
> 
> Bill Cox wrote:
> 
>> Walter wrote:
>>
>>> "Ivan Senji" <ivan.senji@public.srce.hr> wrote in message
>>> news:bgvq53$28sd$1@digitaldaemon.com...
>>>
>>>> I know I'm asking stupid questions:
>>>> I like D very much and i would like to write a parser for it (for fun).
>>>> Does arbitrary lookahead mean LR(1) grammar/parser?
>>>> Im writing a program that creates a LR(1) parser table from the grammar
>>>> but i would like to know is LR(1) the right parser?
>>>
>>>
>>>
>>>
>>> I can never remember the difference between LL, LR, LALR, etc., probably
>>> because I never use parser generators. But it is not (1) because it requires
>>> arbitrary lookahead, mostly in trying to figure out if a sequence of tokens
>>> is a type or an expression, declaration or statement.
>>>
>>> The parser (parse.c) is set up to make lookahead easy.
>>>
>>>
>>
>> Hi, Walter.
>>
>> I believe LR(1) is what YACC does (maybe it's LL(1)).
>>
>> YACC has some abilities to look ahead.  It pushes "unreduced" tokens on the token stack, and waits for a rule to be able to reduce them.  I doubt D's type declarations are a problem.  Generally, grammers become non-YACC friendly when parsing depends on context information.  For example, foo(bar) in an expression can be a function call, or a cast expression in C++.  You can't build a reasonable data structure for it when you read it, and instead have to come back later and patch it.
>>
>>  From what I've seen of D, the grammer looks YACC friendly.  I could be talked into verifying this, if you could create a text file of BNF rules for D.
>>
>> Bill
>>
> 


Hi, Russ.

Can the latest bison parse your D grammer?  I think an official machine readable grammer is a good thing, even if no one uses it to create an actual parser.  If you have a D grammer, could you post it?

Also, if I remember, you couldn't share your new parser generator due to issues at work.  Is that right?

Bill

August 11, 2003

Posted by Russ Lewis
in reply to Bill Cox

Russ Lewis

Posted in reply to Bill Cox

Bill Cox wrote:
> Can the latest bison parse your D grammer?  I think an official machine readable grammer is a good thing, even if no one uses it to create an actual parser.  If you have a D grammer, could you post it?
> 
> Also, if I remember, you couldn't share your new parser generator due to issues at work.  Is that right?
> 
> Bill

I haven't tried it since they put the GLR code in.  It's a good idea, but I'd probably have to rebuild it from scratch.  I don't (anymore) have a "clean" bison grammar.  I hacked up the first one to try to make it work (and failed), and the new, up-to-date grammar uses the syntax from my new parser.

Sometime in the future I can try to port it back to Bison.  Ick. Anybody ever tried to write a parser for the for statement in Bison? There are 8 different possible alternatives you have to express.

I like my parser because I could parse a for with a single expression. But yeah, the company's still not allowing me release it.

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation