September 12, 2013
On Thu, Sep 12, 2013 at 04:21:36AM +0200, Martin Nowak wrote:
> On 09/12/2013 03:39 AM, Walter Bright wrote:
> >On 9/11/2013 6:30 PM, deadalnix wrote:
> >>Indeed. What solution do you have in mind ?
> >
> >The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
> 
> Linked list sounds bad.
> Do you have a rough idea how often lookahead is needed, i.e. is it
> performance relevant? If so it might be worth tuning.

I still don't understand why backtracking is necessary in the first place. I would've thought a modern parser should be well able to encode seen tokens in its state so that backtracking is never necessary. Or does D grammar have tricky bits that cannot be handled this way, that I'm unaware of?


T

-- 
The peace of mind---from knowing that viruses which exploit Microsoft system vulnerabilities cannot touch Linux---is priceless. -- Frustrated system administrator.
September 12, 2013
On 11.09.2013 17:01, Dicebot wrote:
> std.d.lexer is standard module for lexing D code, written by
> Brian Schott

a standard benchmark scenario is needed - to compare dmd and std.d.lexer

with dmc_dmd, ldc, gdc and msc_dmd

just to be clear how close std.d.lexer comes to current dmd lexing speed



September 12, 2013
On 9/11/2013 6:54 PM, deadalnix wrote:
> On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright wrote:
>> On 9/11/2013 6:30 PM, deadalnix wrote:
>>> Indeed. What solution do you have in mind ?
>>
>> The solution dmd uses is to put in an intermediary layer that saves the
>> lookahead tokens in a linked list.
>
> But then, you have an extra step when looking up every tokens + memory
> management overhead. How big is the performance improvement ?

Not really - I use a dirty trick in that a static instance is always the start of the list.

But even so, an extra indirection is better than re-lexing. Lexing is a clear bottleneck in the profiling. I've even been thinking of special casing the code that scans comments to use SIMD instructions.
September 12, 2013
On 9/11/2013 7:21 PM, Martin Nowak wrote:
> On 09/12/2013 03:39 AM, Walter Bright wrote:
>> On 9/11/2013 6:30 PM, deadalnix wrote:
>>> Indeed. What solution do you have in mind ?
>>
>> The solution dmd uses is to put in an intermediary layer that saves the
>> lookahead tokens in a linked list.
>
> Linked list sounds bad.
> Do you have a rough idea how often lookahead is needed, i.e. is it performance
> relevant? If so it might be worth tuning.

No, I don't have a rough idea, but it doesn't show up in the profile.
September 12, 2013
On Thursday, 12 September 2013 at 03:37:55 UTC, H. S. Teoh wrote:
> I still don't understand why backtracking is necessary in the first
> place. I would've thought a modern parser should be well able to encode
> seen tokens in its state so that backtracking is never necessary. Or
> does D grammar have tricky bits that cannot be handled this way, that
> I'm unaware of?
>

The problem is that it can cause an exponential (and I literally mean exponential here) amount of complexity.

In some cases the complexity is manageable, but in others it doesn't make any sense (it has to be noted that even full lexing doesn't make any sense here). For instance:

int foo()() {}
       ^

When you are at the caret position, you don't know if you face a function declaration or a template declaration. You could go for some ambiguous parsing, but each template argument can itself be a type, an expression or a symbol.

In such a situation, it is much simpler to skip ahead to the closing parenthesis, check what comes after, and act accordingly. The alternative is to go for ambiguous function/template parameter parsing and resolve at the end, but as template arguments are themselves ambiguous types/expressions/symbols, the amount of complexity in the parser is doomed to explode. Note that the example is not self-contained; for instance, both template parameters and arguments can imply template instantiation, which means ambiguous argument parsing.

SDC does a good deal of ambiguous parsing without backtracking (more than DMD does), but you've got to draw the line somewhere.

What I'd like to see is a "go to the matching closing token" feature, which could eventually take a fast path to do so, or an efficient token-buffering system (I'm not sure which would be faster, but I'm inclined to think it is the first one; however, that won't be suited to a dmd-style parser).
September 12, 2013
On Thursday, 12 September 2013 at 04:57:50 UTC, Walter Bright wrote:
> On 9/11/2013 6:54 PM, deadalnix wrote:
>> On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright wrote:
>>> On 9/11/2013 6:30 PM, deadalnix wrote:
>>>> Indeed. What solution do you have in mind ?
>>>
>>> The solution dmd uses is to put in an intermediary layer that saves the
>>> lookahead tokens in a linked list.
>>
>> But then, you have an extra step when looking up every tokens + memory
>> management overhead. How big is the performance improvement ?
>
> Not really - I use a dirty trick in that a static instance is always the start of the list.
>

If I understand you correctly, that means that a lookahead of one token does not trigger any allocation.

> But even so, an extra indirection is better than re-lexing. Lexing is a clear bottleneck in the profiling. I've even been thinking of special casing the code that scans comments to use SIMD instructions.

See my comment: it is possible, with increased parser complexity, to handle many cases where you don't know yet what you are parsing. Doing so, lookahead is only required to find the matching closing token. I suspect that a fast path in the lexer for that precise use case may be faster than buffering tokens, as it allows saving one branch per token.
September 12, 2013
On 9/11/2013 10:10 PM, deadalnix wrote:
> See my comment, it is possible, with increased parser complexity, to handle many
> cases where you don't know what you are parsing yet. Doing so, lookahead is only
> required to find matching closing token. I suspect that a fast path in the lexer
> for that precise use case may be faster than buffering tokens, as it allow to
> save one branch per token.

I don't believe that, because you can see just about any token in the lookahead, and so you'd have to duplicate nearly the whole lexer anyway for the 'fast path'. But you're free to try it out and prove me wrong.
September 12, 2013
On Thursday, 12 September 2013 at 03:31:42 UTC, H. S. Teoh wrote:
> On Wed, Sep 11, 2013 at 10:06:11PM -0400, Jonathan M Davis wrote:
>> On Thursday, September 12, 2013 03:37:06 deadalnix wrote:
>> > Int, Function, Scope, Import are all valid identifiers.
>> 
>> All of which violate Phobos' naming conventions for enum values (they
>> must start with a lowercase letter), which is why we went with adding
>> an _ on the end. And it's pretty much as close as you can get to the
>> keyword without actually using the keyword, which is a plus IMHO
>> (though from the sounds of it, H.S. Teoh would consider that a
>> negative due to possible confusion with the keyword).
> [...]
>
> Actually, the main issue I have is that some of the enum values end with
> _ while others don't. This is inconsistent. I'd rather have consistency
> than superficial resemblance to the keywords as typed. Either *all* of
> the enum values should end with _, or *none* of them should.  Having a
> mixture of both is an eyesore, and leads to people wondering, should I
> add a _ at the end or not?
>
> If people insist that the 'default' keyword absolutely must be
> represented as TokenType.default_ (I really don't see why), then *all*
> TokenType values should end with _. But honestly, I find that really
> ugly. Writing something like kwDefault, or tokenTypeDefault, would be
> far better.
>
> Sigh, Andrei was right. Once the bikeshed is up for painting, even the
> rainbow won't suffice. :-P
>
>
> T

Delphi would use

  TokenType   ttDefault
  MyType      mtDefault
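[Editor's note: the two naming schemes under debate, sketched as C++ enums for comparison. The member names are illustrative assumptions; the actual std.d.lexer `TokenType` is a D enum.]

```cpp
// Trailing-underscore style: as close to the keyword as legally possible.
enum class TokenTypeTrailing { default_, int_, identifier };

// Prefix style (Delphi-like): consistent, no keyword collision to work around.
enum class TokenTypePrefixed { kwDefault, kwInt, identifier };
```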
September 12, 2013
On 2013-09-12 00:02, H. S. Teoh wrote:

> Constructors with many arguments are evil, because you can't just modify
> one setting and use defaults for the rest; you have to specify all of
> them up to the last default argument you have to change.

Currently this is possible:

LexerConfig config = { iterStyle: IterationStyle.everything, tokenStyle: TokenStyle.source };

I would really like the following to be possible as well:

void foo (LexerConfig config);

foo({ iterStyle: IterationStyle.everything, tokenStyle: TokenStyle.source });

-- 
/Jacob Carlborg
September 12, 2013
On 2013-09-11 22:37, Brian Schott wrote:

> It's possible, but I fear that it would make the code a mess.

Really? It seems you only need to change two places in the code:

https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L1724

And

https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L400

Although I just did a quick search for "line".

-- 
/Jacob Carlborg