September 12, 2013
On 2013-09-12 18:21, Dmitry Olshansky wrote:

> Contrary I see this break down pointless - do you really want to use
> config without the lexer?

std.d.lexer.config might be a bit unnecessary. But yes, in general I would like to see smaller modules.

-- 
/Jacob Carlborg
September 12, 2013
On Thu, Sep 12, 2013 at 06:09:41PM +0200, deadalnix wrote:
> On Thursday, 12 September 2013 at 14:09:43 UTC, H. S. Teoh wrote:
> >This can be handled by using an intermediate grammar rule. Reduce all (...) into an intermediate type, say ArgList, so the reduction happens something like this:
> >
> >	int  foo   ()      ()      {}
> >	Type Ident ArgList ArgList ^
> >
> >Then have the rule:
> >
> >	CompileTimeArgs ::= ArgList
> >	RuntimeArgs ::= ArgList
> >	TemplateDecl ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
> >	FuncDecl ::= Type Ident RuntimeArgs '{' ...
> >
> >So first, all (...) gets parsed to ArgList, but it's not yet fixed whether they are compile-time arguments or runtime arguments. It's only after you see the next '(' or '{' that you decide whether ArgList should reduce to CompileTimeArgs or RuntimeArgs.
> >
> >ArgList itself, of course, will accept all possible parameters (both runtime and compile-time): types, expressions, symbols. Then when you reduce it to RuntimeArgs, you reject stuff that can't be interpreted as parameter declarations.
> >
> 
> And then you got to backtrack the parsing instead of the lexing. You just moved the problem around. You'll have to create some temporary ast nodes that then will fix into what they really are.

No. You can just use ArgListItem for both runtime args and compile-time args. Once you've decided which one it is, invalid arguments are rejected at semantic analysis time (which you have to do anyway).

Let's take a concrete example. Say we're parsing this invalid code:

	int foo(alias A)(alias B) {}

You'd go through these steps:

1) Parse initial prefix of declaration:

	int foo(alias A)(alias B) {}
	       ^
	AST:
	FuncDecl
	 |--RetType: int
	 |--Ident: foo
	 \--[ being built ]

2) Parse first (...):

	int foo(alias A)(alias B) {}
	                ^
	AST:
	FuncDecl
	 |--RetType: int
	 |--Ident: foo
	 |--ArgList
	 |   \-- AliasArg
	 |        \-- ident: A
	 \--[ being built ]

   I'm skipping the intermediate steps here; it's obvious how to
   construct AliasArg through the usual parsing process.

3) Parse second (...):

	int foo(alias A)(alias B) {}
	                          ^
	AST:
	FuncDecl
	 |--RetType: int
	 |--Ident: foo
	 |--ArgList
	 |   \-- AliasArg
	 |        \-- ident: A
	 |--ArgList
	 |   \-- AliasArg
	 |        \-- ident: B
	 \--[ being built ]

4) At this point, you now know the first ArgList is CompileTimeArgList, and the second is RuntimeArgList, so you can just change the type fields (along with narrowing FuncDecl to TemplateFuncDecl):

	AST:
	TemplateFuncDecl (was: FuncDecl)
	 |--RetType: int
	 |--Ident: foo
	 |--CompileTimeArgList (was: ArgList)
	 |   \-- AliasArg
	 |        \-- ident: A
	 |--RuntimeArgList (was: ArgList)
	 |   \-- AliasArg
	 |        \-- ident: B
	 \--[ being built ]

Since you're still constructing FuncDecl, your current parsing context should still have a direct reference to the partially-constructed FuncDecl node, which in turn has direct references to both ArgList child nodes. So this is just dereferencing a couple of pointers. No backtracking.

5) Finish parsing the declaration:

	int foo(alias A)(alias B) {}
	                            ^
	AST:
	TemplateFuncDecl
	 |--RetType: int
	 |--Ident: foo
	 |--CompileTimeArgList (was: ArgList)
	 |   \-- AliasArg
	 |        \-- ident: A
	 |--RuntimeArgList (was: ArgList)
	 |   \-- AliasArg
	 |        \-- ident: B
	 \--FuncBody
	     \-- CompoundStatement
	          \-- [empty body]

6) Run semantic:
   - Create local symbol table for foo, etc..
   - Run semantic on CompileTimeArgList:
      - Check AliasArg for validity
      - Run semantic on AliasArg: add A to function's local symbol
        table, etc.
   - Run semantic on RuntimeArgList:
      - Check AliasArg for validity: ERROR: cannot have alias parameter
        at runtime.
      - (Add B to local symbol table)(skipped due to previous error)
   - (Run semantic on FuncBody)(skipped due to previous error)
   - (Run semantic on RetType (verify return type match, etc.))(skipped
     due to previous error)
   - (Add function to parent scope symbol table)(skipped due to previous
     error)

So, no backtracking is necessary.

Of course, it sounds like DMD's parser doesn't work this way, but that's a limitation of DMD's parser, not an *inherent* need for backtracking.


T

-- 
I see that you JS got Bach.
September 12, 2013
On 09/12/2013 06:40 PM, H. S. Teoh wrote:
>
> I vote for Tok!"..." to denote token types.
>
> Question: what's the implementation of Tok?  Does it fit into an enum?
> What's the underlying representation? I imagine some kind of canonical
> mapping into an integral type would be desired, to maximize runtime
> performance.


This is just a quick hack. One might want to avoid the unwarranted copies and linear searches at compile time:

import std.algorithm;

private enum symbols = ["i",".","..",",","0",/+...+/];

struct TokenType{
    private uint _tag; // or smaller type
    private this(int tag){_tag=tag;}
    string toString(){ // (optional)
        static array(R)(R s){ string[] r; foreach(x;s) r~=x; return r; }
        static immutable strs=array(symbols.map!((a)=>`Tok!"`~a~`"`));
        return strs[_tag];
    }
}

template Tok(string name)if(symbols.countUntil(name)!=-1){
    enum Tok=TokenType(cast(uint)symbols.countUntil(name));
}

import std.stdio;
void main(){
    enum x = Tok!"i";
    writeln(x);
    writeln(Tok!"0");
}

September 12, 2013
On 09/12/2013 08:17 AM, Walter Bright wrote:
> I don't believe that, because you can see about anything for tokens in
> lookahead and so have to duplicate nearly the whole lexer anyway for the
> 'fast path', but you're free to try it out and prove me wrong.

Me neither, and if you have to duplicate the code for this it's a double loss.
I think it would be more fruitful to batch lexing, i.e. fill a whole memory page (4 KB) with tokens, then go back to parsing until you need more.
September 12, 2013
On 09/12/2013 05:39 PM, Timon Gehr wrote:
>
> Lookahead is realized as follows in the parser:
>
> (assume 'code' is the circular buffer range.)
>
> auto saveState(){muteerr++; return code.pushAnchor();} // saves the
> state and mutes all error messages until the state is restored
>
> void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }
>
> The 'Anchor' is a trivial wrapper around a size_t. The circular buffer
> grows automatically to keep around tokens still reachable by an anchor.
> (The range only needs small constant space besides the buffer to support
> this functionality, though it is unable to detect usage errors.)

Do you think it's possible to use CircularBufferRange itself as an anchor?
Then calling save() would return a CircularBufferRange and you could scratch the two functions above. I had some promising experiments in that direction, but the implicit save on copy is annoying because you end up with anchors from temporary copies.
September 12, 2013
On 09/12/2013 04:21 AM, Martin Nowak wrote:
> On 09/12/2013 03:39 AM, Walter Bright wrote:
>> On 9/11/2013 6:30 PM, deadalnix wrote:
>>> Indeed. What solution do you have in mind ?
>>
>> The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
>
> Linked list sounds bad.
> Do you have a rough idea how often lookahead is needed, i.e. is it
> performance relevant? If so it might be worth tuning.
Maybe a fixed-size stack vector with 64 elements or so, plus a linked list for the unusual case, would help...
September 12, 2013
13-Sep-2013 00:46, Robert Schadek wrote:
> On 09/12/2013 04:21 AM, Martin Nowak wrote:
>> On 09/12/2013 03:39 AM, Walter Bright wrote:
>>> On 9/11/2013 6:30 PM, deadalnix wrote:
>>>> Indeed. What solution do you have in mind ?
>>>
>>> The solution dmd uses is to put in an intermediary layer that saves the
>>> lookahead tokens in a linked list.
>>
>> Linked list sounds bad.
>> Do you have a rough idea how often lookahead is needed, i.e. is it
>> performance relevant? If so it might be worth tuning.

> Maybe some fixed size stack vector with 64 elements or so and some
> linked list for the unusual case would help ..
>
And an extra branch to detect which one is currently the case? No, thanks ;)

-- 
Dmitry Olshansky
September 12, 2013
On 09/12/2013 10:03 PM, Martin Nowak wrote:
> On 09/12/2013 05:39 PM, Timon Gehr wrote:
>>
>> Lookahead is realized as follows in the parser:
>>
>> (assume 'code' is the circular buffer range.)
>>
>> auto saveState(){muteerr++; return code.pushAnchor();} // saves the
>> state and mutes all error messages until the state is restored
>>
>> void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }
>>
>> The 'Anchor' is a trivial wrapper around a size_t. The circular buffer
>> grows automatically to keep around tokens still reachable by an anchor.
>> (The range only needs small constant space besides the buffer to support
>> this functionality, though it is unable to detect usage errors.)
>
> Do you think it's possible to use CircularBufferRange itself as an anchor?

Well, this implementation is geared towards the usage pattern found in the top-down parser. If the first anchor is not restored last, it will not work in general, so this does not conform to the range interface; implementing .save would require a more advanced and less efficient implementation.

> Then calling save() would return a CircularBufferRange and you could
> scratch the two functions above. I had some promising experiments in
> that direction,

What did you do? The way I think I'd approach it would be to maintain a binary min-heap using a few pointers in each range instance.

> but the implicit save on copy is annoying because you
> end up with anchors from temporary copies.

Is this referring to suboptimal efficiency? I guess that is rather hard to address in a safe way. Basically, you'd want to forget about "anchor management" when it can be proven that there is another range on the same buffer that stays at most as progressed during the whole lifetime of the range under consideration.

Of course, it is always possible to do it unsafely.
September 12, 2013
On 09/12/2013 11:03 PM, Dmitry Olshansky wrote:
>> Maybe some fixed size stack vector with 64 elements or so and some linked list for the unusual case would help ..
>>
> And an extra branch to detect which one is currently the case? No, thanks ;)
>
I would think that branching and using the vector is faster than constructing and linking nodes in the linked list.
September 12, 2013
On Thursday, September 12, 2013 23:55:23 Robert Schadek wrote:
> On 09/12/2013 11:03 PM, Dmitry Olshansky wrote:
> >> Maybe some fixed size stack vector with 64 elements or so and some linked list for the unusual case would help ..
> > 
> > And an extra branch to detect which one is currently the case? No, thanks ;)
> 
> I would think that branching and using the vector is faster than constructing and linking nodes in the linked list.

Well, it sounds like there are several ideas to try out in this thread. I guess that only benchmarking and profiling will really tell us which (if any) are better though.

- Jonathan M Davis