September 11, 2013
On Wed, Sep 11, 2013 at 10:37:34PM +0200, Brian Schott wrote:
> On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
> >>2. The example uses an if-then sequence of isBuiltType, isKeyword, etc. Should be an enum so a switch can be done for speed.
> >
> >I believe this is probably a result of having a separate enum value for each token, and at the same time needing to group them into categories for syntax highlighting purposes. I'd suggest a function for converting the TokenType enum value into a TokenCategory enum. Perhaps something like:
> >
> >	enum TokenCategory { BuiltInType, Keyword, ... }
> >
> >	// Convert TokenType into TokenCategory
> >	TokenCategory category(TokenType tt) { ... }
> >
> >Then in user code, you call category() on the token type, and switch over that. This maximizes performance.
> >
> >Implementation-wise, I'd suggest either a hash table on TokenType, or perhaps even encoding the category into the TokenType enum values, for example:
> >
> >	enum TokenCategory {
> >		BuiltInType, Keyword, ...
> >	}
> >
> >	enum TokenType {
> >		IntToken = (TokenCategory.BuiltInType << 16) | 0x0001,
> >		FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
> >		...
> >		FuncToken = (TokenCategory.Keyword << 16) | 0x0001,
> >	}
> >
> >Then the category function can be simply:
> >
> >	TokenCategory category(TokenType tt) {
> >		return cast(TokenCategory)(tt >> 16);
> >	}
> >
> >Though admittedly, this is a bit hackish. But if you're going for speed, this would be quite fast.
> 
> There are already plenty of hackish things in that module, so this would fit right in.

Better hope it passes code review, though. :)
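For comparison, the plain switch approach Walter asked for could be quite tame (a sketch only; the token and category names here are invented):

```d
enum TokenCategory { builtInType, keyword, operator }

enum TokenType { intToken, floatToken, funcToken, starToken }

// Map each token to its category. A final switch makes the
// compiler verify that every TokenType member is handled.
TokenCategory category(TokenType tt) pure nothrow
{
    final switch (tt)
    {
        case TokenType.intToken, TokenType.floatToken:
            return TokenCategory.builtInType;
        case TokenType.funcToken:
            return TokenCategory.keyword;
        case TokenType.starToken:
            return TokenCategory.operator;
    }
}
```

No bit-twiddling needed, and the optimizer can still turn this into a jump table.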


> >>4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For an example where this is done better: '*' is called 'star' rather than 'dereference'.
> >
> >I agree. Especially since '*' can also mean multiplication, depending on context. It would be weird (and unnecessarily obscure) for parser code to refer to 'dereference' when parsing expressions. :)
> 
> If you read the docs/code, you'll see that "*" is called "star" :-). The slice -> dotdot rename is pretty simple to do.

I was just going by what Walter said, but OK, fair enough.


> >>5. The LexerConfig initialization should be a constructor rather than a sequence of assignments. LexerConfig documentation is awfully thin.  For example, 'tokenStyle' is explained as being 'Token style', whatever that is.
> >
> >I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me.
> >
> >Perhaps one way to improve this is to rename LexerConfig to DLexer, and make byToken a member function (or call it via UFCS):
> >
> >	DLexer lex;
> >	lex.iterStyle = ...;
> >	// etc.
> >
> >	foreach (token; lex.byToken()) {
> >		...
> >	}
> >
> >This reads better: you're getting a list of tokens from the lexer 'lex', as opposed to getting something from byToken(config), which doesn't really *say* what it's doing: is it tokenizing the config object?
> 
> byToken is a free function because its first parameter is the range to tokenize. This allows you to use UFCS on the range (e.g. "sourceCode.byToken()"). Putting it in a struct/class would break this.

Very good point. In that case, would it be possible to make byToken accept configuration parameters inline? That is:

	auto tokens = File(mysource)
		.byLine() // does this actually work??
		.byToken(IterationStyle.everything, ...);

This seems better suited for UFCS-style chains, as opposed to needing to create a separate config object outside the chain just so you can use it inside. And writing LexerConfig(...) inline seems a bit cumbersome:

	auto tokens = File(mysource)
		.byLine() // does this actually work??
		.byToken(LexerConfig(IterationStyle.everything, ...));

though, in retrospect, perhaps not as bad as it initially sounds.


> >>6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.
> >
> >What has this got to do with lexing? The parser needs to keep track of its state independently of the lexer anyway, doesn't it?  I'm not sure how DMD's parser works, but the way I usually write parsers is that their state encodes the series of tokens encountered so far, so they don't need the lexer to save any tokens for them. If they need to refer to tokens, they create partial AST trees on the parser stack that reference said tokens. I don't see why it should be the lexer's job to keep track of such things.
> 
> For parsing, you'll likely want to use array() to grab all the tokens. But there are other uses such as syntax highlighting that only need one token at a time.

I don't think that's necessary. An appropriately-constructed parser doesn't need arbitrary slicing of tokens; its internal state should encode what tokens have been seen thus far so that it never needs to backtrack.


[...]
> >I think the API can be improved. The LexerConfig -> DLexer rename is an important readability issue IMO.
> 
> I'm not convinced (yet?).

OK, I take that back. UFCS chaining is an important emerging D paradigm that we should support, and the current API seems better suited for that purpose.


> >Also, it's unclear what types of input are supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an in-memory buffer first?
> 
> The input is a range of bytes, which the lexer assumes is UTF-8. The lexer works much faster on arrays than it does on arbitrary ranges though. Dmitry Olshansky implemented a circular buffer in this module that's used to cache the input range internally. This buffer is then used for the lexing process.

OK, if this can be indicated in the docs, that'd be great. :) A
signature constraint of the form `if (is(ElementType!R == char))` should
be good enough, I think, since DDOC in DMD git HEAD will include
signature constraints in the generated docs, and that should make it
clear what is expected.
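Something along these lines would do, I think (a sketch only; the actual parameter list and body of byToken may differ):

```d
import std.range : isInputRange, ElementType;

// Hypothetical signature: the constraint restricts the input to
// ranges of char, and DDOC would show it in the generated docs.
auto byToken(R)(R range, LexerConfig config)
    if (isInputRange!R && is(ElementType!R == char))
{
    // ... lexing implementation elided ...
}
```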


[...]
> >A minor concern I have about the current implementation (I didn't look at the code yet, but this is just based on the documentation), is that there's no way to choose what kind of data you want in the token stream.  Right now, it appears that it always computes startIndex, line, and column. What if I don't care about this information, and only want, say, the token type and value (say I'm pretty-printing the source, so I don't care how the original was formatted)?  Would it be possible to skip the additional computations required to populate these fields?
> 
> It's possible, but I fear that it would make the code a mess.

Fair enough. It's not a significant overhead anyway (I believe), so it
shouldn't be a big issue.


T

-- 
Once bitten, twice cry...
September 11, 2013
On 09/11/2013 05:01 PM, Dicebot wrote:
> std.d.lexer is a standard module for lexing D code, written by Brian Schott
>
Nice!

> Please take this part seriously: "If you identify problems along the
> way, please note if they are minor, serious, or showstoppers."
> (http://wiki.dlang.org/Review/Process). This information later
> will be used to determine if the library is ready for voting.
>
I only had a short look at the source code and was left with a good impression, i.e. it looks like a lexer written in D.

I do like how you made the lexer configurable to serve the different needs. Also it was very easy to setup and use.

The appended underscore for keywords is much nicer than messing with uppercase/lowercase which was what I did.

One thing that bothered me was the usage of const(ubyte)[] as input.
If you operate on UTF-8, it should take const(char)[]; you can cast it to const(ubyte)[] internally.

Also, a convenience function that reads a file and processes UTF BOM marks would be nice (see toUtf8: https://github.com/dawgfoto/lexer/blob/master/dlexer/dlexer.d#L1429), but that could just as well fit somewhere else in Phobos.

> ---- Information request from module author ----
>
> Performance was a major discussed topic in previous thread. Could you
> please provide benchmarking data for version currently ongoing the review?

Nice too, it works as fast as my D lexer (https://github.com/dawgfoto/lexer) which is about as fast as DMD's lexer (when compiled with LDC).
September 11, 2013
On Wednesday, 11 September 2013 at 19:57:28 UTC, Piotr Szturmaj wrote:
> On 11.09.2013 20:49, Walter Bright wrote:
>> On 9/11/2013 8:01 AM, Dicebot wrote:
>>> std.d.lexer is standard module for lexing D code, written by Brian Schott
>>
>> Thank you, Brian! This is important work.
>>
>> Not a thorough review, just some notes from reading the doc file:
>>
>> 1. I don't like the _ suffix for keywords. Just call it kwimport or
>> something like that.
>>
>> 8. 'default_' Again with the awful practice of appending _.
>
> Delphi designers realized this problem years ago and they came up with a solution: http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers
>
> Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have for example a function named "scope".

C# allows this as well. For example, you can have:
void MyVoid(string @int) { ... }
The name is actually int for purposes of reflection and the like, the @ just tells the compiler that it's not a keyword.
September 11, 2013
On 12.09.2013 01:22, Kapps wrote:
> On Wednesday, 11 September 2013 at 19:57:28 UTC, Piotr Szturmaj wrote:
>> On 11.09.2013 20:49, Walter Bright wrote:
>>> On 9/11/2013 8:01 AM, Dicebot wrote:
>>>> std.d.lexer is standard module for lexing D code, written by Brian
>>>> Schott
>>>
>>> Thank you, Brian! This is important work.
>>>
>>> Not a thorough review, just some notes from reading the doc file:
>>>
>>> 1. I don't like the _ suffix for keywords. Just call it kwimport or
>>> something like that.
>>>
>>> 8. 'default_' Again with the awful practice of appending _.
>>
>> Delphi designers realized this problem years ago and they came up with
>> a solution:
>> http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers
>>
>>
>> Basically, Delphi allows escaping reserved identifiers with a '&'. I
>> wonder how D solves that problem when interfacing to COM classes if
>> they have for example a function named "scope".
>
> C# allows this as well. For example, you can have:
> void MyVoid(string @int) { ... }
> The name is actually int for purposes of reflection and the like, the @
> just tells the compiler that it's not a keyword.

Yes, and it is important to actually have the real name, without a '_' or any other prefix or suffix. Is there an enhancement proposal for this in Bugzilla?
September 11, 2013
On 9/11/2013 3:02 PM, H. S. Teoh wrote:
> No, in typical D code, user-defined type names are capitalized, such as
> InputRange, Duration, Complex, ByLine, etc.. They do not have 'Type' in
> their names (we don't call them InputRangeType, for example). So this
> usage doesn't conflict at all.

It's not a type as is understood in D. It's an enumerated value.


> Constructors with many arguments are evil, because you can't just modify
> one setting and use defaults for the rest; you have to specify all of
> them up to the last default argument you have to change.

It's not that big a deal, and certainly not evil.


>>>> 6. No clue how lookahead works with this. Parsing D requires
>>>> arbitrary lookahead.
>>>
>>> What has this got to do with lexing?
>>
>> The lexer needs to support backing up and going forward again.
>> That's how the dmd lexer works.
>
> Then the parser should be fixed.

The parser works fine. It doesn't need fixing.


>> The point of having the input be an InputRange is you DO NOT CARE if
>> the input is coming from a string, file, etc.
>
> Yes, but the docs currently don't say what form this input range must
> take. Must it be a range of characters?

It does say that.

September 11, 2013
On 9/11/2013 3:09 PM, Michel Fortin wrote:
> On 2013-09-11 18:49:44 +0000, Walter Bright <newshound2@digitalmars.com> said:
>
>> 6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.
>
> Since this is a forward range, you can save() it at one location and continue
> parsing, and then restart parsing from the savepoint.
> https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324
>
> This also means that the underlying range used as input to the lexer must be a
> forward range that you can save() too.

That's correct, but that implies re-lexing the tokens, which has negative performance implications.
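To make that cost concrete, here is roughly what save()-based lookahead looks like from the parser's side (a sketch; tryParseDeclaration and parseStatement are hypothetical helpers):

```d
// tokens is a forward range of Token, e.g. the result of byToken()
bool parseDeclarationOrStatement(TokenRange)(ref TokenRange tokens)
{
    auto saved = tokens.save;      // savepoint: copies the range state
    if (tryParseDeclaration(tokens))
        return true;               // speculation succeeded
    tokens = saved;                // backtrack to the savepoint; every
    return parseStatement(tokens); // token after it is lexed again
}
```

Each failed speculation pays for lexing the consumed tokens a second time.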

September 12, 2013
On Wed, Sep 11, 2013 at 04:42:34PM -0700, Walter Bright wrote:
> On 9/11/2013 3:02 PM, H. S. Teoh wrote:
[...]
> >Yes, but the docs currently don't say what form this input range must take. Must it be a range of characters?
> 
> It does say that.

But then the code example proceeds to pass byLine() to it. Is that correct? If it is, then the docs need to be updated, because last time I checked, byLine() isn't a range of char, but a range of char *arrays*.


T

-- 
Real men don't take backups. They put their source on a public FTP-server and let the world mirror it. -- Linus Torvalds
September 12, 2013
On Thursday, 12 September 2013 at 00:13:36 UTC, H. S. Teoh wrote:
> But then the code example proceeds to pass byLine() to it. Is that
> correct? If it is, then the docs need to be updated, because last time I
> checked, byLine() isn't a range of char, but a range of char *arrays*.
>
>
> T

The example doesn't pass the result of byLine to the byToken function directly.
September 12, 2013
On Wednesday, 11 September 2013 at 19:46:25 UTC, Brian Schott wrote:
> Yeah. D requires lookahead in both lexing and parsing. Some examples:
>
> * String literals such as q{}
> * If statements need to determine the difference between if (expression) and if (type identifier = expression)
> * Determining if something is a lambda expression or a function literal expression
> * Determining if something is a declaration or a statement or an expression.

Does not require lookahead.

> * Differentiating between (type).identifier and a lambda expression

Yup !

> * Differentiating between a type and an expression inside a typeid expression

Does not require lookahead.

> * Differentiating between associative array key type and tuple slices in type suffixes

ditto.
September 12, 2013
On Wednesday, 11 September 2013 at 23:44:55 UTC, Walter Bright wrote:
> On 9/11/2013 3:09 PM, Michel Fortin wrote:
>> On 2013-09-11 18:49:44 +0000, Walter Bright <newshound2@digitalmars.com> said:
>>
>>> 6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.
>>
>> Since this is a forward range, you can save() it at one location and continue
>> parsing, and then restart parsing from the savepoint.
>> https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324
>>
>> This also means that the underlying range used as input to the lexer must be a
>> forward range that you can save() too.
>
> That's correct, but that implies re-lexing the tokens, which has negative performance implications.

Indeed. What solution do you have in mind?