August 02, 2012
std.d.lexer requirements
Given the various proposals for a lexer module for Phobos, I thought I'd share 
some characteristics it ought to have.

First of all, it should be suitable for, at a minimum:

1. compilers

2. syntax highlighting editors

3. source code formatters

4. html creation

To that end:

1. It should accept as input an input range of UTF8. I feel it is a mistake to 
templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use 
an 'adapter' range to convert the input to UTF8. (This is what component 
programming is all about.) See the sketch after this list for the interface 
shape this implies.

2. It should output an input range of tokens

3. tokens should be values, not classes

4. It should avoid memory allocation as much as possible

5. It should read or write any mutable global state outside of its "Lexer"
instance

6. A single "Lexer" instance should be able to serially accept input ranges, 
sharing and updating one identifier table

7. It should accept a callback delegate for errors. That delegate should decide 
whether to:
   1. ignore the error (and "Lexer" will try to recover and continue)
   2. print an error message (and "Lexer" will try to recover and continue)
   3. throw an exception, "Lexer" is done with that input range

8. Lexer should be configurable as to whether it should collect information 
about comments and ddoc comments or not

9. Comments and ddoc comments should be attached to the next following token; 
they should not themselves be tokens

10. High speed matters a lot

11. Tokens should have begin/end line/column markers, though most of the time 
this can be implicitly determined

12. It should come with unittests that, using -cov, show 100% coverage
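
Put together, points 1, 2, 6 and 7 suggest an interface shaped roughly like 
the following. This is only a hand-wavy sketch to fix ideas; every name in it 
(Lexer, tokenize, OnError and so on) is made up here, not a committed design:

import std.range : isInputRange, ElementType;

// What the caller's error delegate may ask for; it can also simply
// throw, which ends lexing of that input range.
enum OnError { ignore, print }

struct Token { /* kind, text slice, line/column, attached comment... */ }

struct Lexer
{
    // One identifier table, shared and updated across every input
    // range this instance lexes (point 6).
    private string[string] identifiers;

    // Lazily lex 'input', an input range of UTF-8 code units (e.g. a
    // ubyte range; point 1), into an input range of Token values
    // (points 2 and 3).
    auto tokenize(R)(R input, OnError delegate(string) onError)
        if (isInputRange!R && is(ElementType!R : ubyte))
    {
        struct TokenRange
        {
            bool empty() { return true; }        // stub
            Token front() { return Token.init; } // stub
            void popFront() {}                   // stub
        }
        return TokenRange();
    }
}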


Basically, I don't want anyone to be motivated to do a separate one after seeing 
this one.
August 02, 2012
Re: std.d.lexer requirements
On Thursday, 2 August 2012 at 00:11:15 UTC, Walter Bright wrote:
> 3. tokens should be values, not classes

I agree with everything but this point. Tokens become quite large 
when they have all kinds of string and position information on 
them, much larger than the typical recommended sizes for 
pass-by-value. It's still possible to provide an interface for 
them that doesn't require as much copying (ref returns and 
getters and whatnot), but it's fiddly and way too easy to copy 
these huge Token structs around, especially if the output is a 
range of Token structs.
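
To put rough numbers on that, consider a plausible layout (the fields are 
hypothetical, sizes assume a 64-bit target):

enum TokenType : uint { identifier /*, ... */ }

struct Token
{
    TokenType type;    //  4 bytes
    string    text;    // 16 bytes (pointer + length)
    string    comment; // 16 bytes, if comments get attached to tokens
    uint      line;    //  4 bytes
    uint      column;  //  4 bytes
}
// Token.sizeof comes to 48 with padding, while "cheap to pass by
// value" usually means one or two machine words.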
August 02, 2012
Re: std.d.lexer requirements
> 5. It should read or write any mutable global state outside of 
> its "Lexer"
> instance

I assume you mean 'shouldn't'. :P
August 02, 2012
Re: std.d.lexer requirements
On 02/08/2012 02:10, Walter Bright wrote:
> 6. A single "Lexer" instance should be able to serially accept input
> ranges, sharing and updating one identifier table
>

I see the lexer as a function that takes a range of char as input and 
gives back a range of tokens. Does it make sense to make an instance of a 
lexer?

> 7. It should accept a callback delegate for errors. That delegate should
> decide whether to:
> 1. ignore the error (and "Lexer" will try to recover and continue)
> 2. print an error message (and "Lexer" will try to recover and continue)
> 3. throw an exception, "Lexer" is done with that input range
>

Off topic, but it looks like the condition proposal from H.S. Teoh and 
myself.

> Basically, I don't want anyone to be motivated to do a separate one
> after seeing this one.

That would be awesome!
August 02, 2012
Re: std.d.lexer requirements
On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
> 1. It should accept as input an input range of UTF8. I feel it is a mistake
> to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
> should use an 'adapter' range to convert the input to UTF8. (This is what
> component programming is all about.)

But that's not how ranges of characters work. They're ranges of dchar. Ranges 
don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create 
special wrappers around string or wstring to have ranges of UTF-8. The way 
that it's normally done is to have ranges of dchar and then special-case 
range-based functions for strings. Then the function can operate on any range 
of dchar but still operates on strings efficiently.

So, it's quite possible to make a lexer operate on ranges of dchar but be 
operating primarily on ASCII by special-casing strings and avoiding decoding 
as much as possible. My lexer does a good job of this, I believe, so it's 
thoroughly optimized for strings, but it's still operating on ranges of dchar, 
not ranges of UTF-8.
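
The pattern, in miniature: a toy function (not from my lexer) that accepts any 
range of dchar but takes a decode-free fast path for strings:

import std.range : isInputRange, ElementType;

// Count ASCII letters in any range of dchar, without decoding when
// the input is actually a string.
size_t countAsciiLetters(R)(R input)
    if (isInputRange!R && is(ElementType!R : dchar))
{
    size_t n;
    static if (is(R : const(char)[]))
    {
        // Fast path: walk the UTF-8 code units directly. ASCII
        // comparisons are safe because multi-byte sequences never
        // contain bytes below 0x80.
        foreach (ubyte c; cast(const(ubyte)[]) input)
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
                ++n;
    }
    else
    {
        // Generic path: the range decodes to dchar as we iterate.
        foreach (dchar c; input)
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
                ++n;
    }
    return n;
}

unittest
{
    assert(countAsciiLetters("abc, déf!") == 5);
}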

> 2. It should output an input range of tokens
> 
> 3. tokens should be values, not classes
> 
> 4. It should avoid memory allocation as much as possible
> 
> 5. It should read or write any mutable global state outside of its "Lexer"
> instance
> 
> 6. A single "Lexer" instance should be able to serially accept input ranges,
> sharing and updating one identifier table

When you say identifier table, do you mean symbol table, or something else?

Isn't the symbol table something for the parser to worry about? Other than 
parsers, I can't think of anything which would even _care_ about a symbol 
table. And depending on how a symbol table were done, you'd want it to take 
scoping rules into account (_I'd_ certainly expect it to), meaning that it 
only includes identifiers which are relevant to the current scope. And if 
that's the case, it _really_ doesn't belong in the lexer. The code using the 
lexer can easily have its own table that it populates however it wants by 
processing the identifier tokens as they get lexed. Not 
to mention, don't symbol tables usually include type information that a lexer 
_wouldn't_ have?

Sure, a symbol table _could_ be put in the lexer (at least if it doesn't care 
about scoping), but I definitely question that that's the right place for it.

If you mean a table that simply lists identifiers rather than a symbol table, 
I'm not quite sure why you'd want it in addition to the symbol table, and 
again, I'd expect only parsers to care about that sort of thing, so I'd expect 
the parser to be implementing it.

So, what exactly are you looking for the lexer to implement as far as an 
identifier table goes, and why is it in the lexer rather than the parser?
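
For reference, if it is just an interning table, one canonical copy per 
distinct identifier (which I gather is roughly what dmd's lexer keeps), a 
minimal sketch with hypothetical names would be:

struct IdentifierTable
{
    private string[string] interned;

    // Return the one canonical copy of this identifier, adding it on
    // first sight. Every later lookup hands back the same slice.
    string intern(string id)
    {
        if (auto p = id in interned)
            return *p;
        interned[id] = id;
        return id;
    }
}

unittest
{
    IdentifierTable tab;
    auto a = tab.intern("foo");
    auto b = tab.intern("foo".idup);
    assert(a is b);  // same memory, so comparisons are pointer-cheap
}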

> 7. It should accept a callback delegate for errors. That delegate should
> decide whether to:
>     1. ignore the error (and "Lexer" will try to recover and continue)
>     2. print an error message (and "Lexer" will try to recover and continue)
> 3. throw an exception, "Lexer" is done with that input range

I'm currently treating errors as tokens. It then becomes easy for the code 
using the lexer to just ignore the errors, to process them immediately, or to 
put off dealing with them until the lexing is complete. So, the code using the 
lexer can handle errors however and whenever it likes without having to worry 
about delegates or exceptions. Since tokens are lexed lazily, the fact that an 
error is reported as a token doesn't require that the lexing continue, but it 
also makes it _easy_ to continue lexing, ignoring or saving the error. I'm 
inclined to think that that's a superior approach to using delegates and 
exceptions.
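
Concretely, the scheme looks something like this (all names hypothetical, not 
my actual lexer):

enum TokenType { identifier, number, comment, ddocComment, error /*, ... */ }

struct Token
{
    TokenType type;
    string text;   // the lexeme, or the error message for error tokens
    uint line, column;
}

// The consumer decides the error policy: no delegates, no exceptions.
void consume(R)(R tokens)  // R: an input range of Token
{
    foreach (tok; tokens)
    {
        if (tok.type == TokenType.error)
        {
            // Ignore, log, collect for later, or break out; all are
            // equally easy, and lexing only continues if we keep
            // iterating, because tokens are lexed lazily.
            continue;
        }
        // ... normal token processing ...
    }
}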

> 8. Lexer should be configurable as to whether it should collect information
> about comments and ddoc comments or not
>
> 9. Comments and ddoc comments should be attached to the next following
> token, they should not themselves be tokens

Why? I'd argue for just processing them as tokens just like I'm doing with 
errors. The code using the lexer can then process them or ignore them as it 
likes (it can even specifically look for them and ignore all other tokens if 
it's something which only operates on comments). And if you want to see which 
token follows the comment, it's the next one in the range. So, I don't see 
much need to attach comments to other tokens. What's the advantage of not 
treating them as tokens?
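
Reusing the hypothetical Token/TokenType from the error sketch above, comment 
policy stays a one-liner on the consumer's side:

import std.algorithm : filter;
import std.range : isInputRange;

// A compiler front end drops all comments...
auto withoutComments(R)(R tokens) if (isInputRange!R)
{
    return tokens.filter!(t => t.type != TokenType.comment
                            && t.type != TokenType.ddocComment);
}

// ...while a documentation generator keeps only the ddoc ones.
auto ddocOnly(R)(R tokens) if (isInputRange!R)
{
    return tokens.filter!(t => t.type == TokenType.ddocComment);
}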

> 12. It should come with unittests that, using -cov, show 100% coverage

100% is actually unreasonable, because there are lines which should _never_ be 
hit (e.g. an assert(0) line in a switch statement). All lines _other_ than 
that sort of line should have code coverage, but depending on the lexer 
implementation, 100% is impossible. std.datetime has a ton of unit tests and 
IIRC it still only manages 98% because of assert(0) lines. So, I agree with 
the spirit of your statement, but it's not necessarily reasonable to do 
exactly what you're saying.
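
A typical line of that kind, for illustration:

int precedence(char op)
{
    switch (op)
    {
        case '*': return 2;
        case '+': return 1;
        default:  assert(0);  // unreachable by construction, so -cov
                              // reports this line as never executed
    }
}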

- Jonathan M Davis
August 02, 2012
Re: std.d.lexer requirements
On 8/1/2012 5:21 PM, Jakob Ovrum wrote:
> On Thursday, 2 August 2012 at 00:11:15 UTC, Walter Bright wrote:
>> 3. tokens should be values, not classes
>
> I agree with everything but this point. Tokens become quite large when they have
> all kinds of string and position information on them, much larger than the
> typical recommended sizes for pass-by-value. It's still possible to provide an
> interface for them that doesn't require as much copying (ref returns and getters
> and whatnot), but it's fiddly and way too easy to copy these huge Token structs
> around, especially if the output is a range of Token structs.

Doing a class requires a heap allocation per token, which will have disastrous 
performance consequences.
August 02, 2012
Re: std.d.lexer requirements
On 8/1/2012 6:30 PM, Bernard Helyer wrote:
>> 5. It should read or write any mutable global state outside of its "Lexer"
>> instance
>
> I assume you mean 'shouldn't'. :P

Good catch.
August 02, 2012
Re: std.d.lexer requirements
On 8/1/2012 7:04 PM, deadalnix wrote:
> On 02/08/2012 02:10, Walter Bright wrote:
>> 6. A single "Lexer" instance should be able to serially accept input
>> ranges, sharing and updating one identifier table
>>
>
> I see the lexer as a function that takes a range of char as input and gives
> back a range of tokens. Does it make sense to make an instance of a lexer?

Yes, as the lexer will have some state and some configuration parameters.
August 02, 2012
Re: std.d.lexer requirements
On Thursday, 2 August 2012 at 04:00:54 UTC, Walter Bright wrote:
> Doing a class requires a heap allocation per token, which will 
> have disastrous performance consequences.

No it doesn't. That's an implication of using the NewExpression, 
not classes.
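
For example, std.typecons.scoped constructs a class instance in a stack 
buffer: class semantics with no GC allocation (a toy sketch):

import std.typecons : scoped;

class TokenC
{
    string text;
    this(string text) { this.text = text; }
}

void main()
{
    // The instance lives in storage inside 'tok', on the stack,
    // and is destroyed when the scope exits. No 'new', no heap.
    auto tok = scoped!TokenC("identifier");
    assert(tok.text == "identifier");
}

Whether that helps for a token *range* is another question, since the storage 
has to be owned and reused somewhere, but per-instance heap allocation is not 
forced by classes as such.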
August 02, 2012
Re: std.d.lexer requirements
On Wednesday, August 01, 2012 21:00:53 Walter Bright wrote:
> On 8/1/2012 7:04 PM, deadalnix wrote:
> > On 02/08/2012 02:10, Walter Bright wrote:
> >> 6. A single "Lexer" instance should be able to serially accept input
> >> ranges, sharing and updating one identifier table
> > 
> > I see the lexer as a function that takes a range of char as input and
> > gives back a range of tokens. Does it make sense to make an instance of a
> > lexer?
> Yes, as the lexer will have some state and some configuration parameters.

Couldn't that just be part of the range type? A separate lexer type shouldn't 
be necessary.
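
A toy sketch of what I mean: a trivial whitespace "lexer" where the state (the 
remaining input) and the configuration travel entirely inside the range, so no 
separate Lexer object exists (all names hypothetical):

import std.string : indexOf, stripLeft;

struct LexerConfig { bool keepComments; /* ... */ }
struct Token { string text; }

struct TokenRange
{
    string input;     // all lexer state lives here
    LexerConfig cfg;  // ...and all configuration here

    bool empty() const { return input.length == 0; }

    Token front() const
    {
        immutable n = input.indexOf(' ');
        return Token(n < 0 ? input : input[0 .. n]);
    }

    void popFront()
    {
        immutable n = input.indexOf(' ');
        input = n < 0 ? null : stripLeft(input[n .. $]);
    }
}

auto byToken(string src, LexerConfig cfg = LexerConfig.init)
{
    return TokenRange(stripLeft(src), cfg);
}

unittest
{
    import std.algorithm : equal, map;
    assert(byToken("one two  three").map!(t => t.text)
               .equal(["one", "two", "three"]));
}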

- Jonathan M Davis