September 11, 2013
On Wednesday, 11 September 2013 at 20:08:44 UTC, H. S. Teoh wrote:
> On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
>> On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh wrote:
>> >I disagree. I think it's more readable to use a consistent prefix,
>> >like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's
>> >clear you're referring to token types, not the actual keyword.
>> 
>> Not unless you want to change the style guide and break existing
>> Phobos code ;)
>
> How would that break Phobos code? Phobos code doesn't even use
> std.d.lexer right now.

Phobos code must conform to its style guide. You can't change the guide without changing the existing Phobos code that relies on it. Inconsistent style is the worst of all options.
September 11, 2013
On Wed, Sep 11, 2013 at 10:18:12PM +0200, Dicebot wrote:
> On Wednesday, 11 September 2013 at 20:08:44 UTC, H. S. Teoh wrote:
> >On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
> >>On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh wrote:
> >>>I disagree. I think it's more readable to use a consistent prefix, like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's clear you're referring to token types, not the actual keyword.
> >>
> >>Not unless you want to change the style guide and break existing Phobos code ;)
> >
> >How would that break Phobos code? Phobos code doesn't even use std.d.lexer right now.
> 
> Phobos code must conform to its style guide. You can't change the guide without changing the existing Phobos code that relies on it. Inconsistent style is the worst of all options.

This doesn't violate Phobos style guidelines:

	enum TokenType {
		kwInt,
		kwFloat,
		kwDouble,
		// ...
		kwFunction,
		kwScope,
		// etc.
	}

The token representing a particular keyword is not the same thing as the keyword itself. There's nothing that says "a D lexer must use token type names that are derived from the D keywords it represents".

In fact, using a kw prefix for tokens representing keywords is *more* consistent, because it doesn't require inserting _ for some enum values but not for others.
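
Compare the alternative: without the prefix, members that collide with D keywords need a trailing underscore (per the Phobos style guide), while the rest don't (illustrative names):

	enum TokenType {
		int_,        // 'int' collides with a keyword, so it's mangled
		identifier,  // no collision, no mangling
		// ...
	}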

I never said anything about changing Phobos style guidelines.


T

-- 
I think Debian's doing something wrong, `apt-get install pesticide', doesn't seem to remove the bugs on my system! -- Mike Dresser
September 11, 2013
On 9/11/2013 12:55 PM, H. S. Teoh wrote:
>> 3. I assumed TokenType is a type. But it's not, it's an enum. Even
>> the document says it's a 'type', but it's not a type.
>
> I don't understand this complaint. I thought it's pretty clear that it's
> referring to what type of token it is (or would you prefer TokenKind?).

A type is a well-understood concept in D. This is using 'type' to mean something completely different. This is very jarring. Your idea of using 'kind' is vastly superior.


>> 5. The LexerConfig initialization should be a constructor rather than
>> a sequence of assignments. LexerConfig documentation is awfully thin.
>> For example, 'tokenStyle' is explained as being 'Token style',
>> whatever that is.
>
> I'm on the fence about this one. Setting up the configuration before
> starting the lexing process makes sense to me.

Yes, using a constructor :-)

>> 6. No clue how lookahead works with this. Parsing D requires arbitrary
>> lookahead.
>
> What has this got to do with lexing?

The lexer needs to support backing up and going forward again. That's how the dmd lexer works.


> Also, it's unclear what types of input is supported -- the code example
> only uses string input, but what about file input? Does it support
> byLine? Or does it need me to slurp the entire file contents into an
> in-memory buffer first?

The point of having the input be an InputRange is that you DO NOT CARE whether the input is coming from a string, a file, etc.
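
For instance (a sketch; assuming byToken accepts any range of UTF-8 bytes, as described later in this thread):

	import std.algorithm : joiner;
	import std.stdio : File;

	LexerConfig config;

	// the same call works on very different sources:
	auto fromString = "int x = 42;".byToken(config);
	auto fromFile   = File("app.d").byChunk(4096).joiner.byToken(config);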

September 11, 2013
On 9/11/2013 8:01 AM, Dicebot wrote:
> std.d.lexer is a standard module for lexing D code, written by Brian Schott

One litmus test of the validity of the design is to see if one can build a D parser out of it.

Otherwise we'll likely face the unpleasant task of needing to deprecate std.d.lexer.

September 11, 2013
On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
>> 2. The example uses an if-then sequence of isBuiltType, isKeyword,
>> etc. Should be an enum so a switch can be done for speed.
>
> I believe this is probably a result of having a separate enum value of
> each token, and at the same time needing to group them into categories
> for syntax highlighting purposes. I'd suggest a function for converting
> the TokenType enum value into a TokenCategory enum. Perhaps something
> like:
>
> 	enum TokenCategory { BuiltInType, Keyword, ... }
>
> 	// Convert TokenType into TokenCategory
> 	TokenCategory category(TokenType tt) { ... }
>
> Then in user code, you call category() on the token type, and switch
> over that. This maximizes performance.
>
> Implementation-wise, I'd suggest either a hash table on TokenType, or
> perhaps even encoding the category into the TokenType enum values, for
> example:
>
> 	enum TokenCategory {
> 		BuiltInType, Keyword, ...
> 	}
>
> 	enum TokenType {
> 		IntToken = (TokenCategory.BuiltInType << 16) | 0x0001,
> 		FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
> 		...
> 		FuncToken = (TokenCategory.Keyword << 16) | 0x0001,
> 	}
>
> Then the category function can be simply:
>
> 	TokenCategory category(TokenType tt) {
> 		return cast(TokenCategory)(tt >> 16);
> 	}
>
> Though admittedly, this is a bit hackish. But if you're going for speed,
> this would be quite fast.

There are already plenty of hackish things in that module, so this would fit right in.
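
User code would then look something like this (a sketch using the TokenCategory/category() names proposed above; the highlight* calls are hypothetical):

	switch (category(token.type)) {
		case TokenCategory.BuiltInType:
			highlightAsType(token);
			break;
		case TokenCategory.Keyword:
			highlightAsKeyword(token);
			break;
		default:
			highlightPlain(token);
			break;
	}

One switch, no if-then chain.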

>> 4. When naming tokens like .. 'slice', it is giving it a
>> syntactic/semantic name rather than a token name. This would be
>> awkward if .. took on new meanings in D. Calling it 'dotdot' would
>> be clearer. Ditto for the rest. For example that is done better, '*'
>> is called 'star', rather than 'dereference'.
>
> I agree. Especially since '*' can also mean multiplication, depending on
> context. It would be weird (and unnecessarily obscure) for parser code
> to refer to 'dereference' when parsing expressions. :)

If you read the docs/code you'll see that "*" is called "star" :-). The slice -> dotdot rename is pretty simple to do.

>> 5. The LexerConfig initialization should be a constructor rather than
>> a sequence of assignments. LexerConfig documentation is awfully thin.
>> For example, 'tokenStyle' is explained as being 'Token style',
>> whatever that is.
>
> I'm on the fence about this one. Setting up the configuration before
> starting the lexing process makes sense to me.
>
> Perhaps one way to improve this is to rename LexerConfig to DLexer, and
> make byToken a member function (or call it via UFCS):
>
> 	DLexer lex;
> 	lex.iterStyle = ...;
> 	// etc.
>
> 	foreach (token; lex.byToken()) {
> 		...
> 	}
>
> This reads better: you're getting a list of tokens from the lexer 'lex',
> as opposed to getting something from byToken(config), which doesn't
> really *say* what it's doing: is it tokenizing the config object?

byToken is a free function because its first parameter is the range to tokenize. This allows you to use UFCS on the range (e.g. "sourceCode.byToken()"). Putting it in a struct/class would break this.
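
That is (with a hypothetical config in scope):

	LexerConfig config;

	// these two calls are equivalent; the second uses UFCS:
	auto a = byToken(sourceCode, config);
	auto b = sourceCode.byToken(config);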

>> 6. No clue how lookahead works with this. Parsing D requires arbitrary
>> lookahead.
>
> What has this got to do with lexing? The parser needs to keep track of
> its state independently of the lexer anyway, doesn't it?  I'm not sure
> how DMD's parser works, but the way I usually write parsers is that
> their state encodes the series of tokens encountered so far, so they
> don't need the lexer to save any tokens for them. If they need to refer
> to tokens, they create partial AST trees on the parser stack that
> reference said tokens. I don't see why it should be the lexer's job to
> keep track of such things.

For parsing, you'll likely want to use array() to grab all the tokens. But there are other uses, such as syntax highlighting, that only need one token at a time.
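
Something like (sketch):

	import std.array : array;

	// a parser that needs arbitrary lookahead can slurp the whole
	// token stream up front and index into it freely:
	auto tokens = sourceCode.byToken(config).array();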

>> 9. Need to insert intra-page navigation links, such as when
>> 'byToken()' appears in the text, it should be link to where byToken
>> is described.
>
> Agreed.

I'll work on this later this evening.

>
> [...]
>> I believe the state of the documentation is a showstopper, and needs
>> to be extensively fleshed out before it can be considered ready for
>> voting.
>
> I think the API can be improved. The LexerConfig -> DLexer rename is an
> important readability issue IMO.

I'm not convinced (yet?).

> Also, it's unclear what types of input is supported -- the code example
> only uses string input, but what about file input? Does it support
> byLine? Or does it need me to slurp the entire file contents into an
> in-memory buffer first?

The input is a range of bytes, which the lexer assumes is UTF-8. The lexer works much faster on arrays than it does on arbitrary ranges though. Dmitry Olshansky implemented a circular buffer in this module that's used to cache the input range internally. This buffer is then used for the lexing process.

> Now, somebody pointed out that there's currently no way to tell it that
> the data you're feeding to it is a partial slice of the full source. I
> think this should be an easy fix: LexerConfig (or DLexer after the
> rename) can have a field for specifying the initial line number / column
> number, defaulting to (1, 1) but can be changed by the caller for
> parsing code fragments.

That's simple enough to add.
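
Something along these lines (the field names here are hypothetical):

	LexerConfig config;
	config.startLine   = 100;  // the fragment begins at line 100...
	config.startColumn = 8;    // ...column 8 of the original file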

> A minor concern I have about the current implementation (I didn't look
> at the code yet, but this is just based on the documentation), is that
> there's no way to choose what kind of data you want in the token stream.
> Right now, it appears that it always computes startIndex, line, and
> column. What if I don't care about this information, and only want, say,
> the token type and value (say I'm pretty-printing the source, so I don't
> care how the original was formatted)?  Would it be possible to skip the
> additional computations required to populate these fields?

It's possible, but I fear that it would make the code a mess.
September 11, 2013
On Wednesday, 11 September 2013 at 20:37:11 UTC, Walter Bright wrote:
> On 9/11/2013 8:01 AM, Dicebot wrote:
>> std.d.lexer is a standard module for lexing D code, written by Brian Schott
>
> One litmus test of the validity of the design is to see if one can build a D parser out of it.
>
> Otherwise we'll likely face the unpleasant task of needing to deprecate std.d.lexer.

I'm not sure if you've followed the announce mailing list recently, but this lexer is already used in both the DScanner and DCD projects. There is a parser module that goes along with this lexer included in DScanner (and used by DCD as well). It's not ready for review though. I'm still working on improving its error recovery.

DScanner is a tool for D source that's able to generate syntax-highlighted HTML, perform syntax validation, generate the AST in XML format, generate CTAGS, list imports, and count lines of code.

DCD is an autocompletion client/server program that has integrations for several text editors.
September 11, 2013
On 9/11/2013 1:44 PM, Brian Schott wrote:
> I'm not sure if you've followed the announce mailing list recently, but this
> lexer is already used in both the DScanner and DCD projects.

Awesome! This is just what I asked for.
September 11, 2013
On Wed, Sep 11, 2013 at 01:31:45PM -0700, Walter Bright wrote:
> On 9/11/2013 12:55 PM, H. S. Teoh wrote:
> >>3. I assumed TokenType is a type. But it's not, it's an enum. Even the document says it's a 'type', but it's not a type.
> >
> >I don't understand this complaint. I thought it's pretty clear that it's referring to what type of token it is (or would you prefer TokenKind?).
> 
> A type is a well-understood concept in D. This is using 'type' to mean something completely different. This is very jarring. Your idea of using 'kind' is vastly superior.

No, in typical D code, user-defined type names are capitalized, such as InputRange, Duration, Complex, ByLine, etc. They do not have 'Type' in their names (we don't call them InputRangeType, for example). So this usage doesn't conflict at all.


> >>5. The LexerConfig initialization should be a constructor rather than a sequence of assignments. LexerConfig documentation is awfully thin.  For example, 'tokenStyle' is explained as being 'Token style', whatever that is.
> >
> >I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me.
> 
> Yes, using a constructor :-)

Constructors with many arguments are evil, because you can't just modify one setting and use defaults for the rest; you have to specify all of them up to the last one you need to change.
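
For example, say the constructor were this(iterStyle, tokenStyle, fileName) (parameter and enum names assumed for illustration):

	// to change only fileName, the defaults before it must be restated:
	auto a = LexerConfig(IterationStyle.everything, TokenStyle.default_, "app.d");

	// with field assignments, you touch only what you change:
	LexerConfig b;
	b.fileName = "app.d";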


> >>6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.
> >
> >What has this got to do with lexing?
> 
> The lexer needs to support backing up and going forward again. That's how the dmd lexer works.

Then the parser should be fixed. Keeping track of what tokens have been seen so far is the parser's job, not the lexer's. LALR(1) parsers, for example, deal with multiple similar-looking constructs without requiring more than a single token lookahead. The trick is that you do not assume a particular grammar production until it has been fully disambiguated. The grammar, if not conducive to this, can usually be rearranged to make it possible.  For example, if the grammar has two production rules:

	StatementA ::= TokenA TokenB TokenC TokenD
	StatementB ::= TokenA TokenB TokenC TokenE

then you may claim that you need a 4-token lookahead in order to tell, given TokenA in the current input, which type of statement it's going to be. However, these rules can be rewritten like this:

	StatementABPrefix ::= TokenA TokenB TokenC
	StatementA ::= StatementABPrefix TokenD
	StatementB ::= StatementABPrefix TokenE

Now the parser only needs a 1-token lookahead to do its job, and no backtracking is ever needed. Any information encoded by the first 3 tokens can be stored in the AST node of StatementABPrefix, so it's recoverable without needing to save previous tokens.

Unless you're saying that the D grammar is more complex than LALR(1)... If so, I'd like to see an example of where this is the case.


> >Also, it's unclear what types of input is supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an in-memory buffer first?
> 
> The point of having the input be an InputRange is you DO NOT CARE if the input is coming from a string, file, etc.

Yes, but the docs currently don't say what form this input range must take. Must it be a range of characters? Am I allowed to pass in a range of *strings* (representing source lines)? This needs to be clearly stated.

The fact that tokens slice the input also needs to be clearly stated, since that precludes the use of transient ranges like byLine (your token values will be garbled by the time you get to them).
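
The classic form of the problem (this is byLine's documented buffer reuse, independent of the lexer):

	import std.stdio : File;

	char[][] lines;
	foreach (line; File("app.d").byLine())
		lines ~= line;  // BUG: every entry aliases the same reused buffer

Tokens that slice a transient input would be garbled the same way.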


T

-- 
Let's call it an accidental feature. -- Larry Wall
September 11, 2013
Walter Bright wrote:

> Since the very beginning.
> 
> One example is determining if something is a declaration or an expression.

I see now that you wrote about parsing, not about lexing.

Btw. I wrote an LALR(1) parser for an early version of D. This means a lookahead of one was sufficient, or I made terrible mistakes.

-manfred
September 11, 2013
On 2013-09-11 18:49:44 +0000, Walter Bright <newshound2@digitalmars.com> said:

> 6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.

Since this is a forward range, you can save() it at one location and continue parsing, and then restart parsing from the savepoint.
https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324

This also means that the underlying range used as input to the lexer must itself be a forward range that supports save().
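
In code, the idea is something like this (sketch):

	auto tokens = sourceCode.byToken(config);
	auto mark = tokens.save;  // remember this position
	// ... consume tokens speculatively to disambiguate ...
	tokens = mark;            // rewind and reparse down the other path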

-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca