January 27, 2013
Request for comments: std.d.lexer
I'm writing a D lexer for possible inclusion in Phobos.

DDOC: 
http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
Code: 
https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d

It's currently able to correctly syntax highlight all of Phobos, 
but does a fairly bad job at rejecting or notifying users/callers 
about invalid input.

I'd like to hear arguments on the various ways to handle errors 
in the lexer. In a compiler it would be useful to throw an 
exception on finding something like a string literal that doesn't 
stop before EOF, but a text editor or IDE would probably want to 
be a bit more lenient. Maybe having it run-time (or 
compile-time) configurable, like std.csv, would be the best option here.

I'm interested in ideas on the API design and other high-level 
issues at the moment. I don't consider this ready for inclusion. 
(The current module being reviewed for inclusion in Phobos is the 
new std.uni.)
January 27, 2013
Re: Request for comments: std.d.lexer
On Sun, Jan 27, 2013 at 10:51 AM, Brian Schott <briancschott@gmail.com> wrote:
> I'm writing a D lexer for possible inclusion in Phobos.
>
> DDOC: http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
> Code:
> https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d

Cool! I remember linking to it in the wiki a week ago, here:

http://wiki.dlang.org/Lexers_Parsers

Feel free to correct the entry.

> It's currently able to correctly syntax highlight all of Phobos, but does a
> fairly bad job at rejecting or notifying users/callers about invalid input.
>
> I'd like to hear arguments on the various ways to handle errors in the
> lexer. In a compiler it would be useful to throw an exception on finding
> something like a string literal that doesn't stop before EOF, but a text
> editor or IDE would probably want to be a bit more lenient. Maybe having it
> run-time (or compile-time) configurable, like std.csv, would be the best
> option here.

Last time we discussed it, IIRC, some people wanted the lexer to stop
at once, others just wanted an Error token.
I personally prefer an Error token, but that means finding a way to
start lexing again after the error (and hence, finding where the error
ends).
I guess any separator/terminator could be used to re-engage the lexer:
space, semicolon, closing brace, closing parenthesis?
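
As a sketch of that recovery idea (a hypothetical helper, assuming the lexer records where the bad token started):

```d
import std.string : indexOfAny;

// Hypothetical recovery helper: after emitting an Error token, skip
// ahead to the next plausible token boundary (whitespace, semicolon,
// closing brace/paren) so lexing can resume.
size_t findRecoveryPoint(string source, size_t errorStart)
{
    immutable i = source[errorStart .. $].indexOfAny(" \t\n;})");
    return i == -1 ? source.length : errorStart + i;
}
```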

> I'm interested in ideas on the API design and other high-level issues at the
> moment. I don't consider this ready for inclusion. (The current module being
> reviewed for inclusion in Phobos is the new std.uni.)

OK, here are a few questions:

* Having a range interface is good. Any reason why you made byToken a
class and not a struct? Most (like, 99%) of ranges in Phobos are
structs. Do you need reference semantics?

* Also, is there a way to keep comments? Any code wanting to modify
the code might need them.
(edit: Ah, I see it: IterationStyle.IncludeComments)

* I'd distinguish between standard comments and documentation
comments. These are different beasts, to my eyes.

* I see Token has a startIndex member. Any reason not to have an
endIndex member? Or can an end index always be deduced from
startIndex and value.length?

* How does it fare with non ASCII code?

* A rough estimate of the number of tokens/s would be good (I know it'll
vary). Walter seems to think if a lexer is not able to vomit thousands
of tokens a second, then it's not good. On a related note, does your
lexer have any problem with 10k+-line files?
January 27, 2013
Re: Request for comments: std.d.lexer
Very happy to see that!

Some remarks:
 - Many parameters should be compile-time parameters instead of 
run-time ones.
 - I'm not a big fan of the byToken name, but let's see what others 
think of it.
 - I'm not sure it's the role of the lexer to process 
__IDENTIFIER__ special tokens.
 - You need to provide a way to specify how the textual 
representation of the token (i.e. value) is set. The best way to do 
it IMO is an alias parameter that returns a string when called 
with a string (then the user can choose to keep the value from the 
original string, create a copy, always get the same copy for the 
same string, etc.).
 - Ideally, the location format should be configurable.
 - You must return at least a forward range, not just an input 
range, otherwise a lexer cannot look ahead.
 - I'm not sure about making TokenRange a class.
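
The alias-parameter idea about the token's textual representation could be sketched roughly like this (the policy names and the `Lexer`/`makeValue` shape are hypothetical):

```d
// Sketch: the lexer is parameterized on a policy that turns the
// matched slice into the stored token value.
string dupPolicy(string slice)
{
    return slice.idup; // always an independent copy
}

string internPolicy(string slice)
{
    static string[string] cache; // one shared copy per distinct lexeme
    if (auto existing = slice in cache)
        return *existing;
    return cache[slice] = slice.idup;
}

struct Lexer(alias valuePolicy)
{
    string makeValue(string slice)
    {
        return valuePolicy(slice);
    }
}
```

With `internPolicy`, repeated identifiers share one string; with `dupPolicy`, each token owns its value.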

As a more positive note, I was just in need of something like 
this today and wondered if such a project was finally going to 
happen. This is a great step forward! Some stuff still needs to 
be polished, but who gets everything right the first time?
January 27, 2013
Re: Request for comments: std.d.lexer
On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
> Last time we discussed it, IIRC, some people wanted the lexer to
> stop at once, others just wanted an Error token.
> I personally prefer an Error token, but that means finding a way
> to start lexing again after the error (and hence, finding where
> the error ends).
> I guess any separator/terminator could be used to re-engage the
> lexer: space, semicolon, closing brace, closing parenthesis?
>

Oh yes, that is very important. Conditions are perfect to handle 
stuff like that.
January 27, 2013
Re: Request for comments: std.d.lexer
On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:
> Very happy to see that!
>
> Some remarks:
>  - Many parameters should be compile-time parameters instead 
> of run-time ones.
>  - I'm not a big fan of the byToken name, but let's see what others 
> think of it.
>  - I'm not sure it's the role of the lexer to process 
> __IDENTIFIER__ special tokens.
>  - You need to provide a way to specify how the textual 
> representation of the token (i.e. value) is set. The best way to 
> do it IMO is an alias parameter that returns a string when 
> called with a string (then the user can choose to keep the 
> value from the original string, create a copy, always get the same 
> copy for the same string, etc.).
>  - Ideally, the location format should be configurable.
>  - You must return at least a forward range, not just an input 
> range, otherwise a lexer cannot look ahead.
>  - I'm not sure about making TokenRange a class.
>
> As a more positive note, I was just in need of something like 
> this today and wondered if such a project was finally going to 
> happen. This is a great step forward! Some stuff still needs to 
> be polished, but who gets everything right the first time?

And the famous Jobs « one more thing »: I'm not a big fan of 
having OPERATORS_BEGIN be of the same type as regular token types. 
Now they make valid tokens. Why not provide a set of functions like 
isOperator?
January 27, 2013
Re: Request for comments: std.d.lexer
On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
> * Having a range interface is good. Any reason why you made 
> byToken a class and not a struct? Most (like, 99%) of ranges in 
> Phobos are structs. Do you need reference semantics?

It implements the InputRange interface from std.range so that 
users have a choice of using template constraints or the OO model 
in their code.
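
For what it's worth, a struct range would not have to give up the OO option: std.range can wrap any struct range in a class implementing the InputRange interface on demand. A generic illustration (using an int array as a stand-in for a token range):

```d
import std.range : inputRangeObject, InputRange, isInputRange;

void example()
{
    auto tokens = [1, 2, 3]; // stand-in for a struct-based token range

    // Template-constraint style works on the struct directly.
    static assert(isInputRange!(typeof(tokens)));

    // OO style is one wrapper call away, only for callers who want it.
    InputRange!int oo = inputRangeObject(tokens);
    assert(oo.front == 1);
}
```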

> * Also, is there a way to keep comments? Any code wanting to 
> modify the code might need them.
> (edit: Ah, I see it: IterationStyle.IncludeComments)
>
> * I'd distinguish between standard comments and documentation
> comments. These are different beasts, to my eyes.

The standard at http://dlang.org/lex.html doesn't differentiate 
between them. It's trivial to write a function that checks if a 
token starts with "///", "/**", or "/++" while iterating over the 
tokens.
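
Such a caller-side check might look like this (a sketch; it assumes the token's text is available as a string, and ignores edge cases like the empty block comment `/**/`):

```d
import std.algorithm.searching : startsWith;

// Distinguishing documentation comments from ordinary comments can be
// layered on top of the lexer's comment tokens. startsWith with
// multiple needles returns the 1-based index of the match, or 0.
bool isDocComment(string commentText)
{
    return commentText.startsWith("///", "/**", "/++") != 0;
}
```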

> * I see Token has a startIndex member. Any reason not to have an
> endIndex member? Or can an end index always be deduced from
> startIndex and value.length?

That's the idea.

> * How does it fare with non ASCII code?

Everything is templated on the character type, but I haven't done 
any testing on UTF-16 or UTF-32. Valgrind still shows functions 
from std.uni being called, so at the moment I assume it works.

> * A rough estimate of the number of tokens/s would be good (I 
> know it'll vary). Walter seems to think if a lexer is not able 
> to vomit thousands of tokens a second, then it's not good. On a 
> related note, does your lexer have any problem with 10k+-line 
> files?

$ time dscanner --sloc ../phobos/std/datetime.d
14950

real	0m0.319s
user	0m0.313s
sys	0m0.006s

$ time dmd -c ../phobos/std/datetime.d

real	0m0.354s
user	0m0.318s
sys	0m0.036s

Yes, I know that "time" is a terrible benchmarking tool, but 
they're fairly close for whatever that's worth.
January 27, 2013
Re: Request for comments: std.d.lexer
In one of the last discussions about a standard lexer/parser I 
remember quite a neat proposal: take a delegate for error handling 
and provide two out of the box (one that throws an exception and 
one that returns an Error token).
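
A rough sketch of that shape (all names hypothetical):

```d
// Sketch of the delegate-based proposal; names are hypothetical.
alias ErrorHandler = void delegate(string message, size_t index);

// Stock handler #1: abort lexing immediately, compiler-style.
ErrorHandler throwingHandler()
{
    return (string message, size_t index) {
        throw new Exception(message);
    };
}

// Stock handler #2: record the problem so the lexer can emit an
// Error token and keep going, editor/IDE-style.
ErrorHandler collectingHandler(string[]* errors)
{
    return (string message, size_t index) {
        *errors ~= message;
    };
}
```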
January 27, 2013
Re: Request for comments: std.d.lexer
On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:
> Very happy to see that!
>
> Some remarks:
>  - Many parameters should be compile-time parameters instead 
> of run-time ones.

I decided not to do this because the lexer actually calls itself 
while parsing token strings. If they were compile-time 
parameters, the compiler would likely generate a lot more code.

>  - I'm not a big fan of the byToken name, but let's see what others 
> think of it.

Chosen for consistency with the various functions in std.stdio, 
but now that you point this out, it's not very consistent with 
std.algorithm or std.range.

>  - I'm not sure it's the role of the lexer to process 
> __IDENTIFIER__ special tokens.

According to http://dlang.org/lex#specialtokens this is the 
correct behavior.

>  - You need to provide a way to specify how the textual 
> representation of the token (i.e. value) is set. The best way to 
> do it IMO is an alias parameter that returns a string when 
> called with a string (then the user can choose to keep the 
> value from the original string, create a copy, always get the same 
> copy for the same string, etc.).

The lexer does not operate on slices of its input. It would be 
possible to special-case for this in the future.

>  - Ideally, the location format should be configurable.
>  - You must return at least a forward range, not just an input 
> range, otherwise a lexer cannot look ahead.

It's easy to wrap this range inside another one that does 
buffering for lookahead. 
https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/circularbuffer.d
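
The linked CircularBuffer does the real work; the idea reduces to something like this (a simplified sketch with no empty-range handling, not the actual implementation):

```d
import std.range.primitives;

// Simplified sketch of lookahead built on top of an input range.
struct Lookahead(R) if (isInputRange!R)
{
    R source;
    ElementType!R[] buffered;

    // peek(0) is the current token, peek(1) the next, and so on;
    // elements are pulled from the source lazily and retained.
    ElementType!R peek(size_t n)
    {
        while (buffered.length <= n)
        {
            buffered ~= source.front;
            source.popFront();
        }
        return buffered[n];
    }
}
```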

> And the famous Jobs « one more thing »: I'm not a big fan of 
> having OPERATORS_BEGIN be of the same type as regular token types. 
> Now they make valid tokens. Why not provide a set of functions 
> like isOperator?

Doing that would eliminate possible uses of the case range 
statement. It may be the case (ha ha) that nobody cares and would 
rather have those functions. I'd be fine with changing that.
January 27, 2013
Re: Request for comments: std.d.lexer
On Sunday, 27 January 2013 at 10:55:49 UTC, Brian Schott wrote:
> On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:
>> Very happy to see that!
>>
>> Some remarks:
>> - Many parameters should be compile-time parameters instead
>> of run-time ones.
>
> I decided not to do this because the lexer actually calls 
> itself while parsing token strings. If they were compile-time 
> parameters, the compiler would likely generate a lot more code.
>

Have you measured?

>> - You need to provide a way to specify how the textual 
>> representation of the token (i.e. value) is set. The best way to 
>> do it IMO is an alias parameter that returns a string when 
>> called with a string (then the user can choose to keep the 
>> value from the original string, create a copy, always get the same 
>> copy for the same string, etc.).
>
> The lexer does not operate on slices of its input. It would be 
> possible to special-case for this in the future.
>

Even without special-casing for slices of the input, generating a 
new string versus reusing an existing one is already an important 
choice. Depending on the processing that comes after, it may be of 
great importance.
January 27, 2013
Re: Request for comments: std.d.lexer
On 01/27/2013 11:42 AM, Brian Schott wrote:
> On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
>> * Having a range interface is good. Any reason why you made byToken a
>> class and not a struct? Most (like, 99%) of ranges in Phobos are
>> structs. Do you need reference semantics?
>
> It implements the InputRange interface from std.range so that users have
> a choice of using template constraints or the OO model in their code.
> ...

The lexer range must be a struct.

> ...
>> * A rough estimate of the number of tokens/s would be good (I know it'll
>> vary). Walter seems to think if a lexer is not able to vomit thousands
>> of tokens a second, then it's not good. On a related note, does your
>> lexer have any problem with 10k+-line files?
>
> $ time dscanner --sloc ../phobos/std/datetime.d
> 14950
>
> real    0m0.319s
> user    0m0.313s
> sys    0m0.006s
>
> $ time dmd -c ../phobos/std/datetime.d
>
> real    0m0.354s
> user    0m0.318s
> sys    0m0.036s
>
> Yes, I know that "time" is a terrible benchmarking tool, but they're
> fairly close for whatever that's worth.
>

You are measuring lexing speed against compilation speed. A 
reasonably well-performing lexer is around one order of magnitude 
faster than that on std.datetime. Maybe you should profile a little?