Writing a Parser
January 05, 2008
I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry.

At the moment, I'm trying recursive descent parsing.

The problem is that I've realized I'm duplicating huge volumes of code to cope with the tristate decision of { unexpected, allow, require } for any given token.

For example, to consume a for loop, you consume something similar to
/for\s*\((.*?)\)\s*\{(.*?)\}/

I have it doing that, but my soul feels heavy with the masses of looped switches it's doing.  Is there any way to ease the pain?

Regards,
Dan
January 05, 2008
"Dan" <murpsoft@hotmail.com> wrote in message news:flmtrv$2jrn$1@digitalmars.com...
>
> I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry.
>
> At the moment, I'm trying recursive descent parsing.
>
> The problem is that I've realized I'm duplicating huge volumes of code to cope with the tristate decision of { unexpected, allow, require } for any given token.
>
> For example, to consume a for loop, you consume something similar to
> /for\s*\((.*?)\)\s*\{(.*?)\}/
>
> I have it doing that, but my soul feels heavy with the masses of looped switches it's doing.  Is there any way to ease the pain?

Separate tokenization and syntax parsing?  It makes things a hell of a lot easier.  You don't even necessarily have to tokenize the source entirely before parsing; just have a lexer which lexes tokens out of the source on demand.  The syntax parsing is then unencumbered from dealing with the raw source and just has to do stuff like "expect 'for', expect left-paren, expect (your condition), expect right-paren" etc.
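(To make that concrete - a minimal, hypothetical D sketch of an on-demand lexer plus expect/accept helpers; the token kinds and names are invented for illustration, not taken from anyone's actual code:)

import std.ascii : isAlpha, isWhite;

enum TokKind { ident, lparen, rparen, lbrace, rbrace, eof }
struct Token { TokKind kind; string text; }

struct Lexer
{
    string src;

    // lex one token on demand - no need to tokenize the whole source first
    Token next()
    {
        while (src.length && isWhite(src[0])) src = src[1 .. $];
        if (!src.length) return Token(TokKind.eof, "");
        switch (src[0])
        {
            case '(': src = src[1 .. $]; return Token(TokKind.lparen, "(");
            case ')': src = src[1 .. $]; return Token(TokKind.rparen, ")");
            case '{': src = src[1 .. $]; return Token(TokKind.lbrace, "{");
            case '}': src = src[1 .. $]; return Token(TokKind.rbrace, "}");
            default:
                size_t i = 0;
                while (i < src.length && isAlpha(src[i])) ++i;
                if (i == 0) i = 1;  // unknown char: pass it through as a 1-char token
                auto t = Token(TokKind.ident, src[0 .. i]);
                src = src[i .. $];
                return t;
        }
    }
}

struct Parser
{
    Lexer lex;
    Token cur;

    void advance() { cur = lex.next(); }

    // "require": the token must be there
    void expect(TokKind k)
    {
        if (cur.kind != k)
            throw new Exception("unexpected token: " ~ cur.text);
        advance();
    }

    // "allow": consume the token if present, report whether we did
    bool accept(TokKind k)
    {
        if (cur.kind != k) return false;
        advance();
        return true;
    }
}

With something like this, the { unexpected, allow, require } tristate mostly collapses into accept() for "allow" and expect() for "require", and consuming a for-loop head becomes a short sequence of expect() calls instead of nested switches.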


January 05, 2008
Yes, normally when doing OO patterns for this:

class Parser : Tokenizer { ... }

either that, or:

class Parser {
	Tokenizer reader;
	// parser code...
}

Recommended reading
- the parser in DMD's source (not DMDScript) is quite nice - it's hand-coded and quite easy to understand.

This is interesting for its use of function pointers to do pattern testing (though it's horribly borked in that it's really the tokenizer..):
http://www.dsource.org/projects/leds/browser/trunk/src/language/php/Parser.d

If you don't want a hand-coded parser, have a look at csLex - it's what Mono used, and is pretty trivial to modify and retarget to generate code for other languages (e.g. D...)

From someone who's written far too many parsers ;)
Regards
Alan




Jarrett Billingsley wrote:
> "Dan" <murpsoft@hotmail.com> wrote in message news:flmtrv$2jrn$1@digitalmars.com...
>> I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry.
>>
>> At the moment, I'm trying recursive descent parsing.
>>
>> The problem is that I've realized I'm duplicating huge volumes of code to cope with the tristate decision of { unexpected, allow, require } for any given token.
>>
>> For example, to consume a for loop, you consume something similar to
>> /for\s*\((.*?)\)\s*\{(.*?)\}/
>>
>> I have it doing that, but my soul feels heavy with the masses of looped switches it's doing.  Is there any way to ease the pain?
> 
> Separate tokenization and syntax parsing?  It makes things a hell of a lot easier.  You don't even necessarily have to tokenize the source entirely before parsing; just have a lexer which lexes tokens out of the source on demand.  The syntax parsing is then unencumbered from dealing with the raw source and just has to do stuff like "expect 'for', expect left-paren, expect (your condition), expect right-paren" etc. 
> 
> 
January 05, 2008
Dan wrote:
> I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry.
> 
> At the moment, I'm trying recursive descent parsing.
> 
> The problem is that I've realized I'm duplicating huge volumes of code to cope with the tristate decision of { unexpected, allow, require } for any given token.
> 
> For example, to consume a for loop, you consume something similar to
> /for\s*\((.*?)\)\s*\{(.*?)\}/
> 
> I have it doing that, but my soul feels heavy with the masses of looped switches it's doing.  Is there any way to ease the pain?

a parser generator :)
Writing a parser or scanner manually is a bit like writing any program in assembler - tedious, error-prone, and not very maintainable. There's a lot of stuff in a parser that can be generated automatically.
Even if you want to write the parser all by yourself, I'd suggest you write a simple parser generator to do that tedious part for you.
January 05, 2008
Jascha Wetzel wrote:
> Dan wrote:
>> I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry.
>>  [...]
>> I have it doing that, but my soul feels heavy with the masses of looped switches it's doing.  Is there any way to ease the pain?
> 
> a parser generator :)
> [...]

Or a parsing combinator library. :)

And it's always nice to roll your own, if you have the extra time.
Not only to have, but to know.
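
(For instance - a tiny, hypothetical sketch of the combinator idea in D2; lit and seq are made-up names, not from any existing library:)

alias Parser = bool delegate(ref string s);

// succeed iff the input starts with tok, consuming it
Parser lit(string tok)
{
    return (ref string s) {
        if (s.length >= tok.length && s[0 .. tok.length] == tok)
        {
            s = s[tok.length .. $];
            return true;
        }
        return false;
    };
}

// run sub-parsers in sequence; all must match, or the input is restored
Parser seq(Parser[] ps...)
{
    auto parsers = ps.dup;  // the variadic slice doesn't outlive the call
    return (ref string s) {
        auto save = s;
        foreach (p; parsers)
            if (!p(s)) { s = save; return false; }
        return true;
    };
}

// usage: auto forHead = seq(lit("for"), lit("("));

The charm is that alternation, repetition, etc. compose the same way, so the grammar ends up written as ordinary D expressions.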

regards, frank
January 06, 2008
Dan Wrote:

> 
> I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry.
> 
> At the moment, I'm trying recursive descent parsing.

I've got it performing rather nicely now, for its completeness.

parseOperand()
  returns an { ident, int, double, string, regexp, expression, arrayliteral, objectliteral *value* }
  incomplete for e notation numbers
  incomplete for expression, arrayliteral, objectliteral

parseBinaryOperator()
  returns one of the binary operator tokens

parseOptionalWS()
  consumes whitespace and doesn't return anything

I've got some more written, but they're not tested.  Once I have numbers done, I'll try writing the arrayLiteral, which will of course be parseOperand() "," parseOperand()
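
(Sketched out, that loop might look like this - hypothetical names in the spirit of the list above, with parseOperand() stubbed to bare identifiers so the sketch stands alone; none of this is actual Walnut code:)

import std.ascii : isAlpha, isWhite;

struct P
{
    string src;
    size_t pos;

    char peek() { return pos < src.length ? src[pos] : '\0'; }
    void next() { ++pos; }
    void expect(char c) { assert(peek() == c); next(); }
    void parseOptionalWS() { while (isWhite(peek())) next(); }

    // stub: the real one also handles numbers, strings, regexps, ...
    string parseOperand()
    {
        auto start = pos;
        while (isAlpha(peek())) next();
        return src[start .. pos];
    }

    string[] parseArrayLiteral()
    {
        expect('[');
        string[] elems;
        parseOptionalWS();
        if (peek() != ']')
        {
            elems ~= parseOperand();
            parseOptionalWS();
            while (peek() == ',')   // operand "," operand "," ...
            {
                next();
                parseOptionalWS();
                elems ~= parseOperand();
                parseOptionalWS();
            }
        }
        expect(']');
        return elems;
    }
}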

: )
January 09, 2008
I did spend some time looking at aPaGeD for this yesterday - here's some feedback that may help (and if you can answer some of the questions, that would help me a lot as well).

- It works (yes, but you would be amazed; for the life of me I could not get antlr to do anything - Java: write once, run nowhere...) - so being able to generate code and working programs from the examples was great, and a real +100 for the project..

- Syntax - on the face of it looks reasonable, and easy to understand. - similar enough to antlr..
(That's the end of the really good stuff)


- Documentation
While I know it's a pain to write, the docs you have so far tend to focus on how the parser is built, and are biased towards someone who understands the internals and phraseology of parsers, rather than an end user - who just wants to know: if I'm looking for this, then put this, and the result is available in these variables.

Specifically, I've no idea what the meanings of these are, and they are rather critical to the docs:
"Terminal" "Non-Terminal"

- Regex's
While I can see the benefits, I'd much rather the compiler built them for me - part of the beauty of the BNF format is that it's easy to read, and explains regex-style situations a lot better. (Otherwise, see below about explaining how they can deal with classic situations...)

- How to handle classic situations
This is the key to success for the documentation (and what is seriously missing), as most people will probably have come from a lex/yacc background...

These are a few classic examples that the Documentation could do with.

* Top level parser starts.
Most grammars start with a top-level statement, e.g.
Program:
	Statements;

In which case the application should only start by solving Statements - the biggest problem I found was that I had no idea how to stop it matching any of the condition rules (that were only relevant to a specific state - e.g. see the next example)


* Parse a string
This is a common pattern, but it's quite difficult to see how to implement it. And as above, when I tried, the parser started matching DoubleQuotedStringChars at the start of the file (even though it's only used in DoubleQuotedString).

DoubleQuotedString:
	QUOTE DoubleQuotedStringChars QUOTE;

DoubleQuotedStringChars:
	(DoubleQuotedStringChar)*

DoubleQuotedStringChar:
	"\" ANYCHAR;
	^QUOTE;
	
* Classic groupings:
	(.....)*    e.g. zero or more of these matches..
	(.....)+    e.g. one or more of these matches..
	(.....)?    e.g. one or none of these matches..
	(.....)=> ...    if forward lookahead succeeds on (...), try matching the next combo.


Regards
Alan









Jascha Wetzel wrote:
> Dan wrote:
>> I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry.
>>
>> At the moment, I'm trying recursive descent parsing.
>>
>> The problem is that I've realized I'm duplicating huge volumes of code to cope with the tristate decision of { unexpected, allow, require } for any given token.
>>
>> For example, to consume a for loop, you consume something similar to
>> /for\s*\((.*?)\)\s*\{(.*?)\}/
>>
>> I have it doing that, but my soul feels heavy with the masses of looped switches it's doing.  Is there any way to ease the pain?
> 
> a parser generator :)
> Writing a parser or scanner manually is a bit like writing any program in assembler - tedious, error-prone, and not very maintainable. There's a lot of stuff in a parser that can be generated automatically.
> Even if you want to write the parser all by yourself, I'd suggest you write a simple parser generator to do that tedious part for you.
January 09, 2008
Alan Knowles wrote:
> - It' works (yes, but you would be amazed, for the life of me I could not get antlr to do anything - java write once, run nowhere...) - so being able to generate code and working programs from the examples was great, and a real +100 for the project..

I'm glad to hear that - although, I must admit, I was so bold as to assume that already ;)

> - Documentation
> While I know it's a pain to write, the docs you have so far tend to focus on how the parser is built, and are biased towards someone who understands the internals and phraseology of parsers, rather than an end user - who just wants to know: if I'm looking for this, then put this, and the result is available in these variables.

Yeah, I became aware of that through the feedback. Without thinking too much, I assumed that using parsers would be something you'd do only if you dealt with parsers intimately. Therefore I felt it should be sufficient to describe everything that's specific to apaged.
Once I find the time, I'll extend the documentation with a thorough hands-on guide.

> Specifically I've no idea what the meanings of these are, and they are rather critical to the docs....:
> "Terminal" "Non-Terminal"

A Terminal is a symbol that is "final" wrt. expansion, while a Non-Terminal can be expanded by some rules. Terminals, Tokens and Lexemes are *practically* more or less the same. In apaged, a Terminal is always a string or a regexp.
A Non-Terminal is specified by a set of production rules that tell how to match a sequence of Terminals and Non-Terminals. In apaged, a Non-Terminal is always something looking like this:

MyNT
{
  "a" MyOtherNT MyNT;
  "b" Nt2
  { /* D Code ... */ }
}

If you think of parse trees when dealing with your grammar (if not, ignore the rest), Non-Terminals are the inner nodes, and Terminals are the leaves.

> - Regex's
> While I can see the benefit's I'd much rather the compiler built them for me.. - part of the beauty of the BNF format is that it's easy to read, and explains regex style situations alot better.. - Otherwise (see below about explaining how they can deal with classic situations...)

Do you mean regexp-like syntax for rules (i.e. EBNF syntax)?
This is a feature on my todo list, but I still have to find a nice way to merge it with the semantic code (which is non-trivial) - a thing for the future.

If you mean regexps as in "asdf[a-z0-9]*", then I don't understand what you mean, since apaged does compile them for you. You don't need to use them at all, though; they're merely a convenience feature.

> - How to handle classic situations
> This is the key to success for the Documentation. (and what is seriously missing) - as most people will probably have come from a lexx/yacc background...
> 
> These are a few classic examples that the Documentation could do with.
> 
> * Top level parser starts.
> Most grammers start with a top level statement, eg.
> Program:
>     Statements;
> 
> In which case the application should only start by solving Statements, - the biggest problem I found was that I had no idea how to stop it matching any of the condition rules (that were only relivant to a specific state - eg. see next example)

The docs state that
"The first non-terminal in a grammar file will be treated as the start symbol for the grammar."

I admit that it should be "...in the main grammar file...", since that doesn't apply to imported grammar files.

> * Parse a string
> This is a common pattern but it's quite difficult to see how to implement it. -- And as above, when I tried, the parser started matching DoubleQuotedStringChars at the start of the file (even though it's only used in  DoubleQuotedString.
> 
> DoubleQuotedString:
>     QUOTE DoubleQuotedStringChars QUOTE;
> 
> DoubleQuotedStringChars:
>     (DoubleQuotedStringChar)*
> 
> DoubleQuotedStringChar:
>     "\" ANYCHAR;
>     ^QUOTE;

Hm, it shouldn't do that. I'll assume that's because of how QUOTE is defined. I'd need to see the whole grammar.
Besides that, you use 2 unsupported EBNF features in the above grammar:
^QUOTE and
(DoubleQuotedStringChar)*
Apaged doesn't support full EBNF syntax for the reason mentioned above.

> * Classic groupings:
>     (.....)*    eg. many of these matches..
>     (.....)+    eg. one or more of these matches..
>     (.....)?    eg. one or none of these matches..
>     (.....)=> ...    if forward lookup succeeds on (...) try match next combo.

None of those are supported. You need to write them BNF style:
A* becomes

As {
  As A;
  epsilon;
}

A+ becomes

Ap {
  Ap A;
  A;
}

A? becomes

Aq {
  A;
  epsilon;
}

Lookahead isn't supported for rules either. You may look ahead a single Terminal symbol, though, using >"a" or >["a" "b"], as documented.
Rule-level lookahead can also be rewritten BNF style, but it may affect more related rules and therefore depends on your instance of the problem.
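
(A generic illustration, not from the apaged docs: when lookahead would only be needed to choose between rules that share a prefix, left-factoring the prefix removes the need for it. So

X {
  "a" "b" C;
  "a" "d" E;
}

becomes

X {
  "a" Xtail;
}

Xtail {
  "b" C;
  "d" E;
}

and plain one-token lookahead suffices again.)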


I hope that helps a bit. If you read up on BNF vs. EBNF and consider that apaged is BNF based, you should find solutions to all of these problems.
January 10, 2008
: D

So far I've got it doing an LL(0) predictive tokenizer which generates a [still pretty buggy] AST.

I'm quite proud of it at the moment, as I'm certain now that I can accomplish LL(0) scanning for everything but binaryOperators, where it's LL(n), n ∈ expression.


Jascha Wetzel Wrote:
> Alan Knowles wrote:
> > - Documentation
> > While I know it's a pain to write, the docs you have so far tend to
> > focus on how the parser is built, and are biased towards someone who
> > understands the internals and phraseology of parsers, rather than an
> > end user - who just wants to know: if I'm looking for this, then put
> > this, and the result is available in these variables.
> 
> Yeah, I became aware of that through the feedback. Without thinking too much, I assumed that using parsers would be something you'd do only if you dealt with parsers intimately.

I took linguistics, and I'm an interpreter writer, and I still haven't looked at BNF or YACC/BISON notation yet.  To be honest, I'm not interested in it.  Like MathML, it's way too far from the machine to generate an *efficient* parser.  Mine might not be efficient, but that wouldn't be intrinsic to it being low-level.

> A Terminal is a symbol that is "final" wrt. expansion, while a Non-Terminal can be expanded by some rules. Terminals, Tokens and Lexemes are *practically* more or less the same.

Linguistics assigns them very different meanings.

> If you think of parse trees when dealing with your grammar

Are those like sentence trees, with noun phrase, verb phrase, etc?

> ignore the rest), Non-Terminals are the inner nodes, and Terminals are the leaves.

That makes sense.

> > - How to handle classic situations

When I first started with Walnut 0.x, I was grinding my brain trying to figure out how to match these correctly:

{  { /* } */ }  , { } }
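
(The usual answer, for the record: do the brace matching over tokens, not raw characters, and let the lexer swallow comments and string bodies whole, so the '}' inside /* } */ never reaches the parser. A hypothetical D sketch of the comment-eating step:)

// assumes s starts just past an opening "/*"
void skipBlockComment(ref string s)
{
    while (s.length >= 2 && s[0 .. 2] != "*/")
        s = s[1 .. $];
    s = s.length >= 2 ? s[2 .. $] : null;  // drop "*/", or hit EOF unterminated
}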

Another classical problem is JavaScript RegExp literals or divide:

/bob/i  can be "divide bob divide i", or a regexp, depending on whether we expect an operator or operand.

How would you write that?
How would the machine read that?

Both are highly important, but I find most parser writers only care about the former, and that the product is a working AST.

If I only wanted that, I could write the interpreter entirely in JavaScript regular expressions and it'd only be 40 lines of code.  In fact, I think someone did that already, but I'm sure he wasn't that terse.
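
(For reference, the standard resolution - a D sketch assuming the lexer remembers the kind of the previous token; this is not Walnut's actual code, and it ignores nastier corners like '/' right after the 'return' keyword or after '}':)

enum Tok { ident, number, stringlit, regexp, rparen, rbracket, operator, eof }

// '/' is the divide operator only right after a token that can end an
// operand; everywhere else it opens a RegExp literal
bool divideContext(Tok prev)
{
    switch (prev)
    {
        case Tok.ident, Tok.number, Tok.stringlit,
             Tok.regexp, Tok.rparen, Tok.rbracket:
            return true;
        default:
            return false;
    }
}

So when the previous token is an identifier or ')', the machine reads '/' as division; after '=' or '(' it starts scanning a regexp literal instead.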

Regards,
Dan

January 10, 2008
> Dan Wrote:
>
>>
>> I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry.
>>
>> At the moment, I'm trying recursive descent parsing.
>

Have you looked at ANTLR? I have used it for two projects and I am quite impressed. It does not have a D backend, nor a D grammar AFAIK, but if you are writing a parser, you might find a lot of the work has been done for you already. If you go for ANTLR, I strongly recommend the book advertised on the antlr.org page; I liked it very much.

Good luck.