September 12, 2013
On Thu, Sep 12, 2013 at 07:00:09AM +0200, deadalnix wrote:
> On Thursday, 12 September 2013 at 03:37:55 UTC, H. S. Teoh wrote:
> >I still don't understand why backtracking is necessary in the first place. I would've thought a modern parser should be well able to encode seen tokens in its state so that backtracking is never necessary.  Or does D grammar have tricky bits that cannot be handled this way, that I'm unaware of?
> >
> 
> The problem is that it can cause an exponential (and I literally mean exponential here) amount of complexity.
> 
> In some cases, the complexity is manageable, but in others it doesn't make any sense (it has to be noted that even full lexing doesn't make any sense here). For instance:
> 
> int foo()() {}
>        ^
> 
> When you are at the caret position, you don't know whether you are facing a function declaration or a template declaration. You could go for some ambiguous parsing, but each template argument can itself be a type, an expression, or a symbol.
[...]

This can be handled by using an intermediate grammar rule. Reduce all (...) into an intermediate type, say ArgList, so the reduction happens something like this:

	int  foo   ()      ()      {}
	Type Ident ArgList ArgList ^

Then have the rule:

	CompileTimeArgs ::= ArgList
	RuntimeArgs ::= ArgList
	TemplateDecl ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
	FuncDecl ::= Type Ident RuntimeArgs '{' ...

So first, all (...) gets parsed to ArgList, but it's not yet fixed whether they are compile-time arguments or runtime arguments. It's only after you see the next '(' or '{' that you decide whether ArgList should reduce to CompileTimeArgs or RuntimeArgs.

ArgList itself, of course, will accept all possible parameters (both runtime and compile-time): types, expressions, symbols. Then when you reduce it to RuntimeArgs, you reject stuff that can't be interpreted as parameter declarations.
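
The final decision then lives in one place: once the ArgLists are reduced, the next token tells you which production applies. A minimal sketch of just that decision step, with hypothetical names (the reduced ArgLists are plain strings here, purely for illustration):

	enum DeclKind { func, template_ }

	// decide only after all "(...)" groups following Type Ident are reduced
	DeclKind classify(string[] argLists)
	{
	    return argLists.length == 2 ? DeclKind.template_ : DeclKind.func;
	}

	unittest
	{
	    assert(classify(["()", "()"]) == DeclKind.template_); // int foo()() {}
	    assert(classify(["()"])       == DeclKind.func);      // int foo() {}
	}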


T

-- 
There are two ways to write error-free programs; only the third one works.
September 12, 2013
Robert Schadek wrote:

> especially with IdentifierList:

I see the shift/reduce conflict if you were indeed trying to syntactically differentiate between template identifiers and other identifiers.

-manfred

September 12, 2013
deadalnix wrote:
> When you are at the caret position, you don't know

If one ever reaches that position, one is using petty lexer/grammar definitions.

-manfred
September 12, 2013
On 09/11/2013 08:49 PM, Walter Bright wrote:
>

>
> 3. I assumed [an instance of] TokenType is a type.

(is(TokenType) is true).

> But it's not, it's an enum.
> Even the document says it's a 'type', but it's not a type.

It's a type tag. The tag uniquely determines the type. (As in 'the type of a token', as opposed to 'the type of an expression'.)

> 4. When naming tokens like .. 'slice', it is giving it a
> syntactic/semantic name rather than a token name. This would be awkward
> if .. took on new meanings in D. Calling it 'dotdot' would be clearer.
> Ditto for the rest. For example that is done better, '*' is called
> 'star', rather than 'dereference'.

FWIW, I use Tok!"..". I.e. a "UDL" for specifying kinds of tokens when interfacing with the parser. Some other kinds of tokens get a canonical representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is the kind of signed integer literal tokens etc.
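
A minimal sketch of the idea (names illustrative, not the actual implementation):

	enum TokenKind { star, dotdot, identifier, intLiteral }

	// eponymous template: maps a token's spelling to its kind at compile time
	template Tok(string s)
	{
	    static if (s == "*")       enum Tok = TokenKind.star;
	    else static if (s == "..") enum Tok = TokenKind.dotdot;
	    else static if (s == "i")  enum Tok = TokenKind.identifier;
	    else static if (s == "0")  enum Tok = TokenKind.intLiteral;
	    else static assert(0, "unknown token spelling: " ~ s);
	}

	static assert(Tok!".." == TokenKind.dotdot);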

>
> 5. The LexerConfig initialization should be a constructor rather than a sequence of assignments.

Using the { a:2, b:3 }-style initialization syntax?
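
I.e., for a hypothetical config struct (field names made up, not the real LexerConfig members):

	struct LexerConfig { int tabWidth; bool keepComments; } // hypothetical fields

	// struct-literal style:
	LexerConfig a = { tabWidth: 4, keepComments: true };

	// versus the sequence-of-assignments style:
	LexerConfig b;
	b.tabWidth = 4;
	b.keepComments = true;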

> 6. No clue how lookahead works with this.

Eg. use a CircularBuffer adapter range. I have an implementation currently coupled with my own lexer implementation. If there is interest, I could factor it out.

Lookahead is realized as follows in the parser:

(assume 'code' is the circular buffer range.)

auto saveState(){muteerr++; return code.pushAnchor();} // saves the state and mutes all error messages until the state is restored

void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer grows automatically to keep around tokens still reachable by an anchor. (The range only needs small constant space besides the buffer to support this functionality, though it is unable to detect usage errors.)
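
In sketch form (the details here are assumptions for illustration, not the exact implementation):

	import std.range : ElementType;

	struct Anchor { size_t index; }

	struct CircularBuffer(R)
	{
	    ElementType!R[] buf; // grows while an anchor pins old tokens
	    size_t cur;          // current read position
	    size_t anchors;      // number of live anchors

	    Anchor pushAnchor() { ++anchors; return Anchor(cur); }

	    void popAnchor(Anchor a)
	    {
	        --anchors;
	        cur = a.index; // rewind to the anchored position
	        // once anchors == 0, slots before cur can be recycled again
	    }
	}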


This approach is typically more efficient than using a free list on contemporary architectures.

September 12, 2013
On Thursday, 12 September 2013 at 14:09:43 UTC, H. S. Teoh wrote:
> This can be handled by using an intermediate grammar rule. Reduce all
> (...) into an intermediate type, say ArgList, so the reduction
> happens something like this:
>
> 	int  foo   ()      ()      {}
> 	Type Ident ArgList ArgList ^
>
> Then have the rule:
>
> 	CompileTimeArgs ::= ArgList
> 	RuntimeArgs ::= ArgList
> 	TemplateDecl ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
> 	FuncDecl ::= Type Ident RuntimeArgs '{' ...
>
> So first, all (...) gets parsed to ArgList, but it's not yet fixed
> whether they are compile-time arguments or runtime arguments. It's only
> after you see the next '(' or '{' that you decide whether ArgList should
> reduce to CompileTimeArgs or RuntimeArgs.
>
> ArgList itself, of course, will accept all possible parameters (both
> runtime and compile-time): types, expressions, symbols. Then when you
> reduce it to RuntimeArgs, you reject stuff that can't be interpreted as
> parameter declarations.
>

And then you have to backtrack the parsing instead of the lexing. You've just moved the problem around. You'll have to create some temporary AST nodes that will later be fixed up into what they really are.
September 12, 2013
12-Sep-2013 19:39, Timon Gehr wrote:
> On 09/11/2013 08:49 PM, Walter Bright wrote:
>> 4. When naming tokens like .. 'slice', it is giving it a
>> syntactic/semantic name rather than a token name. This would be awkward
>> if .. took on new meanings in D. Calling it 'dotdot' would be clearer.
>> Ditto for the rest. For example that is done better, '*' is called
>> 'star', rather than 'dereference'.
>
> FWIW, I use Tok!"..". I.e. a "UDL" for specifying kinds of tokens when
> interfacing with the parser. Some other kinds of tokens get a canonical
> representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is
> the kind of signed integer literal tokens etc.

I like this.
Not only does this have the benefit of not colliding with keywords; I also imagine that it could be incredibly convenient to get back the symbolic representation of a token (when a token is used as a parameter to an AST node, say BinaryExpr!(Tok!"+")). And truth be told, we all know how tokens look in symbolic form, so learning a pack of names for them feels pointless.
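
E.g. something along these lines (a hypothetical sketch, not std.d.lexer code):

	enum TokenKind { plus, minus, star }

	// recover the symbolic form from the kind, so that e.g.
	// BinaryExpr!(Tok!"+") can print itself back as "+"
	string spelling(TokenKind k)
	{
	    final switch (k)
	    {
	        case TokenKind.plus:  return "+";
	        case TokenKind.minus: return "-";
	        case TokenKind.star:  return "*";
	    }
	}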

>> 6. No clue how lookahead works with this.
>
> Eg. use a CircularBuffer adapter range. I have an implementation
> currently coupled with my own lexer implementation. If there is
> interest, I could factor it out.
>
> Lookahead is realized as follows in the parser:
>
> (assume 'code' is the circular buffer range.)
>
> auto saveState(){muteerr++; return code.pushAnchor();} // saves the
> state and mutes all error messages until the state is restored
>
> void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }
>
> The 'Anchor' is a trivial wrapper around a size_t. The circular buffer
> grows automatically to keep around tokens still reachable by an anchor.
> (The range only needs small constant space besides the buffer to support
> this functionality, though it is unable to detect usage errors.)
>
>
> This approach is typically more efficient than using a free list on
> contemporary architectures.
>

This ^^ is how. In fact, std.d.lexer internally does a similar thing with non-random-access ranges of bytes.


-- 
Dmitry Olshansky
September 12, 2013
12-Sep-2013 12:05, Jacob Carlborg wrote:
> On 2013-09-11 17:01, Dicebot wrote:
>> std.d.lexer is standard module for lexing D code, written by Brian Schott
>
> Finally :)
>
> * How does it handle errors? Does it just return TokenType.invalid?
>
> * Personally I think the module is too big. I would go with:
>
> - std.d.lexer.token
> - std.d.lexer.tokentype

These could be one module. There is really no meaningful way to use a token type separately from a token.

> - std.d.lexer.lexer - contains the rest
> - std.d.lexer.config - IterationStyle, TokenStyle, LexerConfig

On the contrary, I see this breakdown as pointless - do you really want to use the config without the lexer?


> - CircularRange, StringCache, possibly put somewhere else. I assume this
> can be used for other things than lexing?
> - Trie related code, same as above

The reason is that no good public interface has been defined. Basically the same as with the Trie in the new std.uni module - it needs its own review.

>
> * I see that errorMessage throws an exception. Do we really want that? I
> would expect it to just return an invalid token.
>
> If we do decide it should throw, it should absolutely _not_ throw a
> plain Exception. Create a new type, LexException or similar. I hate when
> code throws plain Exceptions, it makes it useless to catch them.
>
> I would also expect this LexException to contain a Token. It shouldn't
> be needed to parse the exception message to get line and column
> information.

Better yet would be to have a std exception hierarchy... so that all parsing modules can be tied to a ParseException. So this needs to be resolved in a forward-compatible way.
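
Something along these lines, as a hypothetical sketch (none of these names are settled):

	import std.d.lexer : Token; // the module under review

	class ParseException : Exception
	{
	    this(string msg, string file = __FILE__, size_t line = __LINE__)
	    {
	        super(msg, file, line);
	    }
	}

	class LexException : ParseException
	{
	    Token token; // carries line/column, so callers need not parse the message

	    this(string msg, Token token,
	         string file = __FILE__, size_t line = __LINE__)
	    {
	        super(msg, file, line);
	        this.token = token;
	    }
	}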


-- 
Dmitry Olshansky
September 12, 2013
On 09/11/2013 05:01 PM, Dicebot wrote:
> std.d.lexer is standard module for lexing D code, written by Brian Schott
>
> ---- Input ----
>
> Code: https://github.com/Hackerpilot/phobos/tree/master/std/d
>
> Documentation:
> http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
>
> ...

(Commenting on what's visible in the documentation only for now.)

auto config = ...
... .byToken(config) ...

Seems to be a natural candidate for manual partial specialization.

enum config = ...
... .byToken!config() ...

uint line; ushort column; // is there overflow checking?

"Check to see if the token is of the same type and has the same string representation as the given token."

Tokens with the same string representation are always of the same type, so this seems redundant.

Furthermore, I'd expect (!a.opCmp(b)) === (a == b).

Why provide the operator overloads at all? They don't implement essential or natural functionality.


"includeSpecialTokens". It's not clear what this flag does.


"If the input range supports slicing, the caching layer aliases itself away and the lexing process is much more efficient."

It might be more sensible to require the user to manually wrap his range.

"pure nothrow bool isOperator(const TokenType t);
pure nothrow bool isOperator(ref const Token t);
pure nothrow bool isKeyword(const TokenType t);
pure nothrow bool isKeyword(ref const Token t);
..."

IMO we should get rid of these.



TokenType naming seems inconsistent. E.g.: & is amp, = is assign, == is equal, but &= is bitAndEqual and && is logicAnd.

IMO better: & is and, = is assign, &= is andAssign and && is andAnd.

Of course, it might be best to use a template instead. Tok!"&", Tok!"&=" and Tok!"&&".

September 12, 2013
On Thu, Sep 12, 2013 at 08:10:18PM +0400, Dmitry Olshansky wrote:
> 12-Sep-2013 19:39, Timon Gehr wrote:
> >On 09/11/2013 08:49 PM, Walter Bright wrote:
> >>4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer.  Ditto for the rest. For example that is done better, '*' is called 'star', rather than 'dereference'.
> >
> >FWIW, I use Tok!"..". I.e. a "UDL" for specifying kinds of tokens when interfacing with the parser. Some other kinds of tokens get a canonical representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is the kind of signed integer literal tokens etc.
> 
> I like this.
> Not only this has the benefit of not colliding with keywords. I also
> imagine that it could be incredibly convenient to get back the
> symbolic representation of a token (when token used as parameter to
> AST-node say BinaryExpr!(Tok!"+")). And truth be told we all know
> how tokens look in symbolic form so learning a pack of names for
> them feels pointless.

+1.  This is superior to both the ad hoc _ suffix and my ad hoc prefixing approach.  Tok!"default" is maximally readable, and requires no silly convolutions like _ or 'kw' / 'tokenType' prefixes.

I vote for Tok!"..." to denote token types.

Question: what's the implementation of Tok? Does it fit into an enum? What's the underlying representation? I imagine some kind of canonical mapping into an integral type would be desired, to maximize runtime performance.


T

-- 
There are three kinds of people in the world: those who can count, and those who can't.
September 12, 2013
On 2013-09-12 17:39, Timon Gehr wrote:

> Using the { a:2, b:3 }-style initialization syntax?

Unfortunately, that's currently not possible to pass to functions; see my other post:

http://forum.dlang.org/thread/jsnhlcbulwyjuqcqoepe@forum.dlang.org?page=6#post-l0ro6h:249mk:241:40digitalmars.com

-- 
/Jacob Carlborg