January 27, 2013
Request for comments: std.d.lexer
I'm writing a D lexer for possible inclusion in Phobos.

DDOC: 
http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
Code: 
https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d

It's currently able to correctly syntax highlight all of Phobos, 
but does a fairly bad job at rejecting or notifying users/callers 
about invalid input.

I'd like to hear arguments on the various ways to handle errors 
in the lexer. In a compiler it would be useful to throw an 
exception on finding something like a string literal that doesn't 
stop before EOF, but a text editor or IDE would probably want to 
be a bit more lenient. Maybe having it run-time (or 
compile-time) configurable, like std.csv, would be the best option here.

I'm interested in ideas on the API design and other high-level 
issues at the moment. I don't consider this ready for inclusion. 
(The current module being reviewed for inclusion in Phobos is the 
new std.uni.)
January 27, 2013
Re: Request for comments: std.d.lexer
On Sun, Jan 27, 2013 at 10:51 AM, Brian Schott <briancschott@gmail.com> wrote:
> I'm writing a D lexer for possible inclusion in Phobos.
>
> DDOC: http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
> Code:
> https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d

Cool! I remember linking to it in the wiki a week ago, here:

http://wiki.dlang.org/Lexers_Parsers

Feel free to correct the entry.

> It's currently able to correctly syntax highlight all of Phobos, but does a
> fairly bad job at rejecting or notifying users/callers about invalid input.
>
> I'd like to hear arguments on the various ways to handle errors in the
> lexer. In a compiler it would be useful to throw an exception on finding
> something like a string literal that doesn't stop before EOF, but a text
> editor or IDE would probably want to be a bit more lenient. Maybe having it
> run-time (or compile-time) configurable, like std.csv, would be the best
> option here.

Last time we discussed it, IIRC, some people wanted the lexer to stop
at once, others just wanted an Error token.
I personally prefer an Error token, but that means finding a way to
start lexing again after the error (and hence, finding where the error
ends).
I guess any separator/terminator could be used to re-engage the lexer:
space, semicolon, closing brace, closing parenthesis?
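
As a sketch of that recovery idea (a hypothetical helper, assuming the lexer records where the bad token started):

```d
import std.string : indexOfAny;

// Hypothetical recovery helper: after emitting an Error token, skip
// ahead to the next plausible token boundary (whitespace, semicolon,
// closing brace/paren) so lexing can resume.
size_t findRecoveryPoint(string source, size_t errorStart)
{
    immutable i = source[errorStart .. $].indexOfAny(" \t\n;})");
    return i == -1 ? source.length : errorStart + i;
}
```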

> I'm interested in ideas on the API design and other high-level issues at the
> moment. I don't consider this ready for inclusion. (The current module being
> reviewed for inclusion in Phobos is the new std.uni.)

OK, here are a few questions:

* Having a range interface is good. Any reason why you made byToken a
class and not a struct? Most (like, 99%) of ranges in Phobos are
structs. Do you need reference semantics?

* Also, is there a way to keep comments? Any code wanting to modify
the code might need them.
(edit: Ah, I see it: IterationStyle.IncludeComments)

* I'd distinguish between standard comments and documentation
comments. These are different beasts, to my eyes.

* I see Token has a startIndex member. Any reason not to have an
endIndex member? Or can an end index always be deduced from
startIndex and value.length?

* How does it fare with non ASCII code?

* A rough estimate of the number of tokens/s would be good (I know it'll
vary). Walter seems to think if a lexer is not able to vomit thousands
of tokens a second, then it's not good. On a related note, does your
lexer have any problem with 10k+-line files?
January 27, 2013
Re: Request for comments: std.d.lexer
Very happy to see that!

Some remarks:
 - Many parameters should be compile-time parameters instead of 
run-time ones.
 - I'm not a big fan of the byToken name, but let's see what others 
think of it.
 - I'm not sure it's the role of the lexer to process 
__IDENTIFIER__ special tokens.
 - You need to provide a way to specify how the textual 
representation of the token (i.e. value) is set. The best way to do 
it IMO is an alias parameter that returns a string when called 
with a string (then the user can choose to keep the value from the 
original string, create a copy, always get the same copy for the 
same string, etc.).
 - Ideally, the location format should be configurable.
 - You must return at least a forward range, not just an input 
range, otherwise a lexer cannot look ahead.
 - I'm not sure about making TokenRange a class.
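
The alias-parameter idea about the token's textual representation could be sketched roughly like this (the policy names and the `Lexer`/`makeValue` shape are hypothetical):

```d
// Sketch: the lexer is parameterized on a policy that turns the
// matched slice into the stored token value.
string dupPolicy(string slice)
{
    return slice.idup; // always an independent copy
}

string internPolicy(string slice)
{
    static string[string] cache; // one shared copy per distinct lexeme
    if (auto existing = slice in cache)
        return *existing;
    return cache[slice] = slice.idup;
}

struct Lexer(alias valuePolicy)
{
    string makeValue(string slice)
    {
        return valuePolicy(slice);
    }
}
```

With `internPolicy`, repeated identifiers share one string; with `dupPolicy`, each token owns its value.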

As a more positive note, I was just in need of something like 
this today and wondered if such a project was finally going to 
happen. This is a great step forward! Some stuff still needs to 
be polished, but who gets everything right the first time?
January 27, 2013
Re: Request for comments: std.d.lexer
On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
> Last time we discussed it, IIRC, some people wanted the lexer to
> stop at once, others just wanted an Error token.
> I personally prefer an Error token, but that means finding a way
> to start lexing again after the error (and hence, finding where
> the error ends).
> I guess any separator/terminator could be used to re-engage the
> lexer: space, semicolon, closing brace, closing parenthesis?
>

Oh yes, that is very important. Conditions are perfect to handle 
stuff like that.
January 27, 2013
Re: Request for comments: std.d.lexer
On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:
> Very happy to see that!
>
> Some remarks:
>  - Many parameters should be compile-time parameters instead 
> of run-time ones.
>  - I'm not a big fan of the byToken name, but let's see what others 
> think of it.
>  - I'm not sure it's the role of the lexer to process 
> __IDENTIFIER__ special tokens.
>  - You need to provide a way to specify how the textual 
> representation of the token (i.e. value) is set. The best way to 
> do it IMO is an alias parameter that returns a string when 
> called with a string (then the user can choose to keep the 
> value from the original string, create a copy, always get the same 
> copy for the same string, etc.).
>  - Ideally, the location format should be configurable.
>  - You must return at least a forward range, not just an input 
> range, otherwise a lexer cannot look ahead.
>  - I'm not sure about making TokenRange a class.
>
> As a more positive note, I was just in need of something like 
> this today and wondered if such a project was finally going to 
> happen. This is a great step forward! Some stuff still needs to 
> be polished, but who gets everything right the first time?

And the famous Jobs « one more thing »: I'm not a big fan of 
having OPERATORS_BEGIN be of the same type as regular token types. 
Now they make valid tokens. Why not provide a set of functions like 
isOperator?
January 27, 2013
Re: Request for comments: std.d.lexer
On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
> * Having a range interface is good. Any reason why you made 
> byToken a class and not a struct? Most (like, 99%) of ranges in 
> Phobos are structs. Do you need reference semantics?

It implements the InputRange interface from std.range so that 
users have a choice of using template constraints or the OO model 
in their code.
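
For what it's worth, a struct range would not have to give up the OO option: std.range can wrap any struct range in a class implementing the InputRange interface on demand. A generic illustration (using an int array as a stand-in for a token range):

```d
import std.range : inputRangeObject, InputRange, isInputRange;

void example()
{
    auto tokens = [1, 2, 3]; // stand-in for a struct-based token range

    // Template-constraint style works on the struct directly.
    static assert(isInputRange!(typeof(tokens)));

    // OO style is one wrapper call away, only for callers who want it.
    InputRange!int oo = inputRangeObject(tokens);
    assert(oo.front == 1);
}
```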

> * Also, is there a way to keep comments? Any code wanting to 
> modify the code might need them.
> (edit: Ah, I see it: IterationStyle.IncludeComments)
>
> * I'd distinguish between standard comments and documentation
> comments. These are different beasts, to my eyes.

The standard at http://dlang.org/lex.html doesn't differentiate 
between them. It's trivial to write a function that checks if a 
token starts with "///", "/**", or "/++" while iterating over the 
tokens.
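
Such a caller-side check might look like this (a sketch; it assumes the token's text is available as a string, and ignores edge cases like the empty block comment `/**/`):

```d
import std.algorithm.searching : startsWith;

// Distinguishing documentation comments from ordinary comments can be
// layered on top of the lexer's comment tokens. startsWith with
// multiple needles returns the 1-based index of the match, or 0.
bool isDocComment(string commentText)
{
    return commentText.startsWith("///", "/**", "/++") != 0;
}
```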

> * I see Token has a startIndex member. Any reason not to have an
> endIndex member? Or can an end index always be deduced from
> startIndex and value.length?

That's the idea.

> * How does it fare with non ASCII code?

Everything is templated on the character type, but I haven't done 
any testing on UTF-16 or UTF-32. Valgrind still shows functions 
from std.uni being called, so at the moment I assume it works.

> * A rough estimate of the number of tokens/s would be good (I 
> know it'll vary). Walter seems to think if a lexer is not able 
> to vomit thousands of tokens a second, then it's not good. On a 
> related note, does your lexer have any problem with 10k+-line 
> files?

$ time dscanner --sloc ../phobos/std/datetime.d
14950

real	0m0.319s
user	0m0.313s
sys	0m0.006s

$ time dmd -c ../phobos/std/datetime.d

real	0m0.354s
user	0m0.318s
sys	0m0.036s

Yes, I know that "time" is a terrible benchmarking tool, but 
they're fairly close for whatever that's worth.
January 27, 2013
Re: Request for comments: std.d.lexer
In one of the last discussions about a standard lexer/parser I 
remember quite a neat proposal: take a delegate for error handling 
and provide two out of the box (one that throws an exception and 
one that returns an Error token).
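
A rough sketch of that shape (all names hypothetical):

```d
// Sketch of the delegate-based proposal; names are hypothetical.
alias ErrorHandler = void delegate(string message, size_t index);

// Stock handler #1: abort lexing immediately, compiler-style.
ErrorHandler throwingHandler()
{
    return (string message, size_t index) {
        throw new Exception(message);
    };
}

// Stock handler #2: record the problem so the lexer can emit an
// Error token and keep going, editor/IDE-style.
ErrorHandler collectingHandler(string[]* errors)
{
    return (string message, size_t index) {
        *errors ~= message;
    };
}
```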
January 27, 2013
Re: Request for comments: std.d.lexer
On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:
> Very happy to see that!
>
> Some remarks:
>  - Many parameters should be compile-time parameters instead 
> of run-time ones.

I decided not to do this because the lexer actually calls itself 
while parsing token strings. If they were compile-time 
parameters, the compiler would likely generate a lot more code.

>  - I'm not a big fan of the byToken name, but let's see what others 
> think of it.

Chosen for consistency with the various functions in std.stdio, 
but now that you point this out, it's not very consistent with 
std.algorithm or std.range.

>  - I'm not sure it's the role of the lexer to process 
> __IDENTIFIER__ special tokens.

According to http://dlang.org/lex#specialtokens this is the 
correct behavior.

>  - You need to provide a way to specify how the textual 
> representation of the token (i.e. value) is set. The best way to 
> do it IMO is an alias parameter that returns a string when 
> called with a string (then the user can choose to keep the 
> value from the original string, create a copy, always get the same 
> copy for the same string, etc.).

The lexer does not operate on slices of its input. It would be 
possible to special-case for this in the future.

>  - Ideally, the location format should be configurable.
>  - You must return at least a forward range, not just an input 
> range, otherwise a lexer cannot look ahead.

It's easy to wrap this range inside another one that does 
buffering for lookahead. 
https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/circularbuffer.d
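
The linked CircularBuffer does the real work; the idea reduces to something like this (a simplified sketch with no empty-range handling, not the actual implementation):

```d
import std.range.primitives;

// Simplified sketch of lookahead built on top of an input range.
struct Lookahead(R) if (isInputRange!R)
{
    R source;
    ElementType!R[] buffered;

    // peek(0) is the current token, peek(1) the next, and so on;
    // elements are pulled from the source lazily and retained.
    ElementType!R peek(size_t n)
    {
        while (buffered.length <= n)
        {
            buffered ~= source.front;
            source.popFront();
        }
        return buffered[n];
    }
}
```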

> And the famous Jobs « one more thing »: I'm not a big fan of 
> having OPERATORS_BEGIN be of the same type as regular token types. 
> Now they make valid tokens. Why not provide a set of functions 
> like isOperator?

Doing that would eliminate possible uses of the case range 
statement. It may be the case (ha ha) that nobody cares and would 
rather have those functions. I'd be fine with changing that.
January 27, 2013
Re: Request for comments: std.d.lexer
On Sunday, 27 January 2013 at 10:55:49 UTC, Brian Schott wrote:
> On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:
>> Very happy to see that!
>>
>> Some remarks:
>> - Many parameters should be compile-time parameters instead
>> of run-time ones.
>
> I decided not to do this because the lexer actually calls 
> itself while parsing token strings. If they were compile-time 
> parameters, the compiler would likely generate a lot more code.
>

Have you measured?

>> - You need to provide a way to specify how the textual 
>> representation of the token (i.e. value) is set. The best way to 
>> do it IMO is an alias parameter that returns a string when 
>> called with a string (then the user can choose to keep the 
>> value from the original string, create a copy, always get the same 
>> copy for the same string, etc.).
>
> The lexer does not operate on slices of its input. It would be 
> possible to special-case for this in the future.
>

Even without special-casing for slices of the input, generating a 
new string versus reusing an existing one is already an important 
choice. Depending on the processing that comes after, it may be of 
great importance.
January 27, 2013
Re: Request for comments: std.d.lexer
On 01/27/2013 11:42 AM, Brian Schott wrote:
> On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
>> * Having a range interface is good. Any reason why you made byToken a
>> class and not a struct? Most (like, 99%) of ranges in Phobos are
>> structs. Do you need reference semantics?
>
> It implements the InputRange interface from std.range so that users have
> a choice of using template constraints or the OO model in their code.
> ...

The lexer range must be a struct.

> ...
>> * A rough estimate of the number of tokens/s would be good (I know it'll
>> vary). Walter seems to think if a lexer is not able to vomit thousands
>> of tokens a second, then it's not good. On a related note, does your
>> lexer have any problem with 10k+-line files?
>
> $ time dscanner --sloc ../phobos/std/datetime.d
> 14950
>
> real    0m0.319s
> user    0m0.313s
> sys    0m0.006s
>
> $ time dmd -c ../phobos/std/datetime.d
>
> real    0m0.354s
> user    0m0.318s
> sys    0m0.036s
>
> Yes, I know that "time" is a terrible benchmarking tool, but they're
> fairly close for whatever that's worth.
>

You are measuring lexing speed against compilation speed. A 
reasonably well-performing lexer is around one order of magnitude 
faster than that on std.datetime. Maybe you should profile a little?