September 11, 2013
On Wednesday, September 11, 2013 11:49:44 Walter Bright wrote:
> 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that.

Appending _ is what we do in Phobos when we need to use a keyword, and it's even called out in the style guide: http://dlang.org/dstyle.html

Now, if it makes sense to use something other than a keyword, that's probably better, but in cases where it really makes the most sense to use a keyword but we can't (e.g. std.traits.FunctionAttribute), we append _ to the keyword.

What makes the most sense ni this case, I don't know, because I haven't looked at either the documentation or the code yet, but if Brian is appending _ to keywords, he's following the official style guide.

- Jonathan M Davis
September 11, 2013
On 9/11/2013 12:17 PM, Manfred Nowak wrote:
> Walter Bright wrote:
>> Parsing D requires arbitrary lookahead.
>
> Why---and since which version?

Since the very beginning.

One example is determining if something is a declaration or an expression.

September 11, 2013
On Wednesday, September 11, 2013 19:17:26 Manfred Nowak wrote:
> Walter Bright wrote:
> > Parsing D requires arbitrary lookahead.
> 
> Why---and since which version?

IIRC, it's at least partially cause by the .. token. You have to look ahead to figure out whether it's .. or a floating point literal. There may be other cases as well, but that particular one was actually complained about by Robert Schadek in his talk at dconf 2013.

- Jonathan M Davis
September 11, 2013
On Wednesday, 11 September 2013 at 19:29:48 UTC, Jonathan M Davis wrote:
> On Wednesday, September 11, 2013 19:17:26 Manfred Nowak wrote:
>> Walter Bright wrote:
>> > Parsing D requires arbitrary lookahead.
>> 
>> Why---and since which version?
>
> IIRC, it's at least partially cause by the .. token. You have to look ahead to
> figure out whether it's .. or a floating point literal. There may be other cases
> as well, but that particular one was actually complained about by Robert
> Schadek in his talk at dconf 2013.
>
> - Jonathan M Davis

Yeah. D requires lookahead in both lexing and parsing. Some examples:

* String literals such as q{}
* If statements need to determine the difference between if (expression) and if (type identifier = expression)
* Determining if something is a lamba expression or a function literal expression
* Determining if something is a declaration or a statement or an expression.
* Differentiating between (type).identifier and a lambda expression
* Differentiating between a type and an expression inside a typeid expression
* Differentiating between associative array key type and tuple slices in type suffixes
September 11, 2013
On Wednesday, September 11, 2013 15:29:33 Jonathan M Davis wrote:
> On Wednesday, September 11, 2013 19:17:26 Manfred Nowak wrote:
> > Walter Bright wrote:
> > > Parsing D requires arbitrary lookahead.
> > 
> > Why---and since which version?
> 
> IIRC, it's at least partially cause by the .. token. You have to look ahead to figure out whether it's .. or a floating point literal. There may be other cases as well, but that particular one was actually complained about by Robert Schadek in his talk at dconf 2013.

LOL. You were asking about _parsing_ requiring arbitrary lookahead, and I answered as to why lexing requires more than one character of lookahead. I should read more carefully...

- Jonathan M Davis
September 11, 2013
On Wed, Sep 11, 2013 at 11:49:44AM -0700, Walter Bright wrote:
> On 9/11/2013 8:01 AM, Dicebot wrote:
> >std.d.lexer is standard module for lexing D code, written by Brian Schott
> 
> Thank you, Brian! This is important work.
> 
> Not a thorough review, just some notes from reading the doc file:
> 
> 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that.

I agree.


> 2. The example uses an if-then sequence of isBuiltType, isKeyword, etc. Should be an enum so a switch can be done for speed.

I believe this is probably a result of having a separate enum value of each token, and at the same time needing to group them into categories for syntax highlighting purposes. I'd suggest a function for converting the TokenType enum value into a TokenCategory enum. Perhaps something like:

	enum TokenCategory { BuiltInType, Keyword, ... }

	// Convert TokenType into TokenCategory
	TokenCategory category(TokenType tt) { ... }

Then in user code, you call category() on the token type, and switch over that. This maximizes performance.

Implementation-wise, I'd suggest either a hash table on TokenType, or perhaps even encoding the category into the TokenType enum values, for example:

	enum TokenCategory {
		BuiltInType, Keyword, ...
	}

	enum TokenType {
		IntToken = (TokenCategory.BuiltInType << 16) | 0x0001,
		FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
		...
		FuncToken = (TokenCategory.Keyword << 16) | 0x0001,
	}

Then the category function can be simply:

	TokenCategory category(TokenType tt) {
		return cast(TokenCategory)(tt >> 16);
	}

Though admittedly, this is a bit hackish. But if you're going for speed, this would be quite fast.


> 3. I assumed TokenType is a type. But it's not, it's an enum. Even the document says it's a 'type', but it's not a type.

I don't understand this complaint. I thought it's pretty clear that it's referring to what type of token it is (or would you prefer TokenKind?).


> 4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For example that is done better, '*' is called 'star', rather than 'dereference'.

I agree. Especially since '*' can also mean multiplication, depending on context. It would be weird (and unnecessarily obscure) for parser code to refer to 'dereference' when parsing expressions. :)


> 5. The LexerConfig initialization should be a constructor rather than a sequence of assignments. LexerConfig documentation is awfully thin. For example, 'tokenStyle' is explained as being 'Token style', whatever that is.

I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me.

Perhaps one way to improve this is to rename LexerConfig to DLexer, and make byToken a member function (or call it via UFCS):

	DLexer lex;
	lex.iterStyle = ...;
	// etc.

	foreach (token; lex.byToken()) {
		...
	}

This reads better: you're getting a list of tokens from the lexer 'lex', as opposed to getting something from byToken(config), which doesn't really *say* what it's doing: is it tokenizing the config object?


> 6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.

What has this got to do with lexing? The parser needs to keep track of its state independently of the lexer anyway, doesn't it?  I'm not sure how DMD's parser works, but the way I usually write parsers is that their state encodes the series of tokens encountered so far, so they don't need the lexer to save any tokens for them. If they need to refer to tokens, they create partial AST trees on the parser stack that reference said tokens. I don't see why it should be the lexer's job to keep track of such things.


[...]
> 8. 'default_' Again with the awful practice of appending _.

Yeah I didn't like that. Prefixing keywords with "kw" or "kw_" is much better, instead of the confusing way of appending _ to some token types but not others.


> 9. Need to insert intra-page navigation links, such as when 'byToken()' appears in the text, it should be link to where byToken is described.

Agreed.


[...]
> I believe the state of the documentation is a showstopper, and needs to be extensively fleshed out before it can be considered ready for voting.

I think the API can be improved. The LexerConfig -> DLexer rename is an important readability issue IMO.

Also, it's unclear what types of input is supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an in-memory buffer first?

Now, somebody pointed out that there's currently no way to tell it that the data you're feeding to it is a partial slice of the full source. I think this should be an easy fix: LexerConfig (or DLexer after the rename) can have a field for specifying the initial line number / column number, defaulting to (1, 1) but can be changed by the caller for parsing code fragments.

A minor concern I have about the current implementation (I didn't look at the code yet, but this is just based on the documentation), is that there's no way to choose what kind of data you want in the token stream. Right now, it appears that it always computes startIndex, line, and column. What if I don't care about this information, and only want, say, the token type and value (say I'm pretty-printing the source, so I don't care how the original was formatted)?  Would it be possible to skip the additional computations required to populate these fields?


T

-- 
MACINTOSH: Most Applications Crash, If Not, The Operating System Hangs
September 11, 2013
On 11.09.2013 20:49, Walter Bright wrote:
> On 9/11/2013 8:01 AM, Dicebot wrote:
>> std.d.lexer is standard module for lexing D code, written by Brian Schott
>
> Thank you, Brian! This is important work.
>
> Not a thorough review, just some notes from reading the doc file:
>
> 1. I don't like the _ suffix for keywords. Just call it kwimport or
> something like that.
>
> 8. 'default_' Again with the awful practice of appending _.

Delphi designers realized this problem years ago and they came up with a solution: http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers

Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have for example a function named "scope".
September 11, 2013
On Wed, Sep 11, 2013 at 12:13:07PM -0700, Walter Bright wrote:
> On 9/11/2013 12:10 PM, Brian Schott wrote:
> >The choice of ending token names with underscores was made according to the Phobos style guide.
> >
> >http://dlang.org/dstyle.html
> 
> I didn't realize that was in the style guide. I guess I can't complain about it, then :-)

I disagree. I think it's more readable to use a consistent prefix, like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's clear you're referring to token types, not the actual keyword.


T

-- 
One Word to write them all, One Access to find them, One Excel to count them all, And thus to Windows bind them. -- Mike Champion
September 11, 2013
On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh wrote:
> I disagree. I think it's more readable to use a consistent prefix, like
> kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's clear
> you're referring to token types, not the actual keyword.

Not unless you want to change the style guide and break existing Phobos code ;)
September 11, 2013
On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
> On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh wrote:
> >I disagree. I think it's more readable to use a consistent prefix, like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's clear you're referring to token types, not the actual keyword.
> 
> Not unless you want to change the style guide and break existing Phobos code ;)

How would that break Phobos code? Phobos code doesn't even use std.d.lexer right now.


T

-- 
Being able to learn is a great learning; being able to unlearn is a greater learning.