August 02, 2012
On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
> On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
>> 1. It should accept as input an input range of UTF8. I feel it is a mistake
>> to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
>> should use an 'adapter' range to convert the input to UTF8. (This is what
>> component programming is all about.)
>
> But that's not how ranges of characters work. They're ranges of dchar. Ranges
> don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create
> special wrappers around string or wstring to have ranges of UTF-8. The way
> that it's normally done is to have ranges of dchar and then special-case
> range-based functions for strings. Then the function can operate on any range
> of dchar but still operates on strings efficiently.

I have argued against making ranges of characters ranges of dchar, for performance reasons. This will especially hurt the performance of the lexer.

The fact is, files are normally in UTF8 and just about everything else is in UTF8. Prematurely converting to UTF-32 is a performance disaster. Note that the single largest thing holding back regex performance is the premature conversion to dchar and back to char.

If the lexer is required to accept dchar ranges, its performance will drop by at least half, and people are going to go reinvent their own lexers.
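For concreteness, here is a minimal sketch of the kind of entry point being argued for. The names byToken and Token are invented for illustration, not the proposed std.d.lexer API, and ubyte is used to sidestep the char-array autodecoding discussed below: the lexer is constrained to ranges of UTF-8 code units, and a string is fed in undecoded.

import std.range.primitives;
import std.string : representation;
import std.traits : Unqual;

struct Token { int kind; string text; }   // hypothetical token type

// Accepts any input range of UTF-8 code units; the actual lexing is elided.
auto byToken(R)(R codeUnits)
    if (isInputRange!R && is(Unqual!(ElementType!R) == ubyte))
{
    static struct TokenRange
    {
        R src;
        @property bool empty() { return src.empty; }
        @property Token front() { return Token(0, null); }  // placeholder
        void popFront() { src.popFront(); }
    }
    return TokenRange(codeUnits);
}

void main()
{
    // A string goes in as its ubyte representation, i.e. undecoded UTF-8.
    auto tokens = byToken("int x = 3;".representation);
    // A UTF-16 source would be wrapped in a UTF-16 -> UTF-8 adapter range
    // before being handed to byToken, rather than the lexer decoding it.
}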


> So, it's quite possible to make a lexer operate on ranges of dchar but be
> operating primarily on ASCII by special-casing strings and avoiding decoding
> as much as possible. My lexer does a good job of this, I believe, so it's
> thoroughly optimized for strings, but it's still operating on ranges of dchar,
> not ranges of UTF-8.
>
>> 2. It should output an input range of tokens
>>
>> 3. tokens should be values, not classes
>>
>> 4. It should avoid memory allocation as much as possible
>>
>> 5. It should not read or write any mutable global state outside of its "Lexer"
>> instance
>>
>> 6. A single "Lexer" instance should be able to serially accept input ranges,
>> sharing and updating one identifier table
>
> When you say identifier table, do you mean symbol table, or something else?

All identifiers are entered into a hashtable, and are referred to by pointers into that hashtable for the rest of dmd. This makes symbol lookups incredibly fast, as no string comparisons are done.
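A rough sketch of that interning scheme (illustrative D, not dmd's actual code; a real lexer would avoid the GC allocations used here, per point 4):

struct Identifier { string text; }

final class IdentifierTable
{
    private Identifier*[string] table;   // spelling -> unique entry

    // Returns the one Identifier for this spelling, creating it on first use.
    Identifier* intern(string spelling)
    {
        if (auto p = spelling in table)
            return *p;
        auto id = new Identifier(spelling);
        table[spelling] = id;
        return id;
    }
}

unittest
{
    auto tab = new IdentifierTable;
    auto a = tab.intern("foo");
    auto b = tab.intern("foo");
    assert(a is b);                    // later lookups compare pointers only
    assert(a !is tab.intern("bar"));
}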


> Isn't the symbol table something for the parser to worry about? Other than
> parsers, I can't think of anything which would even _care_ about a symbol
> table. And depending on how a symbol table were done, you'd want it to take
> scoping rules into account (_I'd_ certainly expect it to), meaning that it
> only includes identifiers which are relevant to the current scope. And if
> that's the case, it _really_ doesn't belong in the lexer. The code using the
> lexer can easily have its own table that it populates according to however it
> wants it populated by processing the identifier tokens as they get lexed. Not
> to mention, don't symbol tables usually include type information that a lexer
> _wouldn't_ have?

The symbol table becomes a symbol table of pointers into the lexer's identifier table.


> Sure, a symbol table _could_ be put in the lexer (at least if it doesn't care
> about scoping), but I definitely question that that's the right place for it.

The symbol table isn't in the lexer, the identifier string table is.


> If you mean a table that simply lists identifiers rather than a symbol table,
> I'm not quite sure why you'd want it in addition to the symbol table, and
> again, I'd expect only parsers to care about that sort of thing, so I'd expect
> the parser to be implementing it.

Because it is gawdammed fast to do it that way. An identifier is treated as a string exactly once, in the lexer, and thereafter by its unique handle.


> So, what exactly are you looking for the lexer to implement as far as an
> identifier table goes, and why is it in the lexer rather than the parser?

Very simple. A mapping of [identifier string] => [unique value], and a pointer serves nicely as that unique value.


>> 7. It should accept a callback delegate for errors. That delegate should
>> decide whether to:
>>      1. ignore the error (and "Lexer" will try to recover and continue)
>>      2. print an error message (and "Lexer" will try to recover and continue)
>>      3. throw an exception, "Lexer" is done with that input range
>
> I'm currently treating errors as tokens. It then becomes easy for the code
> using the lexer to just ignore the errors, to process them immediately, or to
> put off dealing with them until the lexing is complete. So, the code using the
> lexer can handle errors however and whenever it likes without having to worry
> about delegates or exceptions. Since tokens are lexed lazily, the fact that an
> error is reported as a token doesn't require that the lexing continue, but it
> also makes it _easy_ to continue lexing, ignoring or saving the error. I'm
> inclined to think that that's a superior approach to using delegates and
> exceptions.

I don't agree. It means that the parser has to check everywhere for error tokens. Besides the ugliness, it means a slowdown for those checks.
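For contrast, a sketch of the delegate-based scheme point 7 describes (the enum and names here are invented for illustration, not a proposed API):

import std.stdio : stderr;

enum ErrorAction { ignore, print, throwException }

alias ErrorHandler = ErrorAction delegate(string msg, size_t line);

class LexError : Exception
{
    this(string msg) { super(msg); }
}

// What the lexer would do internally when it hits something it can't lex:
void reportError(ErrorHandler onError, string msg, size_t line)
{
    final switch (onError(msg, line))
    {
        case ErrorAction.ignore:
            break;                                    // recover and continue
        case ErrorAction.print:
            stderr.writefln("line %s: %s", line, msg);
            break;                                    // recover and continue
        case ErrorAction.throwException:
            throw new LexError(msg);   // lexer is done with that input range
    }
}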


>> 8. Lexer should be configurable as to whether it should collect information
>> about comments and ddoc comments or not
>>
>> 9. Comments and ddoc comments should be attached to the next following
>> token, they should not themselves be tokens
>
> Why?

Because then the parser would have to add checks for 0 or more comment tokens between every other token. It uglifies the parser pretty badly, and slows it down.
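Concretely, item 9 means the comment rides along on the token that follows it, so the grammar never mentions comment tokens at all. A sketch of what that implies for the token type (field names invented for illustration):

enum TokenKind { identifier, intLiteral, lparen, rparen, eof /*, ... */ }

struct Token
{
    TokenKind kind;
    string    text;            // slice of the source text for this token
    string    leadingComment;  // ddoc/comment preceding this token; only
                               // filled in when comment collection (item 8)
                               // is switched on, otherwise left empty
}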

> I'd argue for just processing them as tokens just like I'm doing with
> errors. The code using the lexer can then process them or ignore them as it
> likes (it can even specifically look for them and ignore all other tokens if
> it's something which only operates on comments). And if you want to see which
> token follows the comment, it's the next one in the range. So, I don't see
> much need to attach comments to other tokens. What's the advantage of not
> treating them as tokens?

Performance and simplicity. Your grammar will be awfully ugly unless you insert another range to filter out those tokens - but that'll be a slowdown.


>> 12. It should come with unittests that, using -cov, show 100% coverage
>
> 100% is actually unreasonable, because there are lines which should _never_ be
> hit (e.g. an assert(0) line in a switch statement). All lines _other_ than
> those sort of lines should have code coverage, but depending on the lexer
> implementation, 100% is impossible. std.datetime has a ton of unit tests and
> IIRC it still only manages 98% because of assert(0) lines. So, I agree with
> the spirit of your statement, but it's not necessarily reasonable to do
> exactly what you're saying.

I know, but I expect all unreachable code to be justified. The lexer is a well-defined problem, and this should not be difficult.


August 02, 2012
On Wed, Aug 01, 2012 at 09:06:42PM -0700, Jonathan M Davis wrote:
> On Wednesday, August 01, 2012 21:00:53 Walter Bright wrote:
> > On 8/1/2012 7:04 PM, deadalnix wrote:
> > > Le 02/08/2012 02:10, Walter Bright a écrit :
> > >> 6. A single "Lexer" instance should be able to serially accept input ranges, sharing and updating one identifier table
> > > 
> > > I see the lexer as a function that takes a range of char as input and gives back a range of tokens. Does it make sense to make an instance of a lexer?
> > Yes, as the lexer will have some state and some configuration parameters.
> 
> Couldn't that just be part of the range type? A separate lexer type shouldn't be necessary.
[...]

Whether it's part of the range type or a separate lexer type, *definitely* make it possible to have multiple instances. One of the biggest flaws of otherwise-good lexer generators like lex and flex (C/C++) is that the core code assumes a single instance, and multi-instances were glued on after the fact, making it a royal pain to work with anything that needs lexing multiple things at the same time.


T

-- 
The day Microsoft makes something that doesn't suck is probably the day they start making vacuum cleaners... -- Slashdotter
August 02, 2012
On 8/1/2012 9:06 PM, Jonathan M Davis wrote:
> On Wednesday, August 01, 2012 21:00:53 Walter Bright wrote:
>> On 8/1/2012 7:04 PM, deadalnix wrote:
>>> Le 02/08/2012 02:10, Walter Bright a écrit :
>>>> 6. A single "Lexer" instance should be able to serially accept input
>>>> ranges, sharing and updating one identifier table
>>>
>>> I see the lexer as a function that takes a range of char as input and gives
>>> back a range of tokens. Does it make sense to make an instance of a lexer?
>> Yes, as the lexer will have some state and some configuration parameters.
>
> Couldn't that just be part of the range type? A separate lexer type shouldn't
> be necessary.

No, because the user may want to serially present ranges to the same lexer type. Why do that, you might say? That's how dmd works - it means the identifier table will be the same for all source files.
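A sketch of that shape (types invented, with whitespace splitting standing in for real lexing): one Lexer value owns the identifier table, and each source presented to it serially interns into that same table.

import std.array : split;

struct Lexer
{
    // Stand-in for the real interning table: spelling -> unique id.
    private size_t[string] idents;

    // Accept one more source; its identifiers land in the same table
    // that the previous sources used.
    void lex(string source)
    {
        foreach (word; source.split())   // crude stand-in for real lexing
            if (word !in idents)
            {
                auto id = idents.length;
                idents[word] = id;
            }
    }
}

void main()
{
    Lexer lexer;
    lexer.lex("module a ; int foo ;");
    lexer.lex("module b ; int foo ;");
    assert(lexer.idents.length == 6);    // "foo", "int", etc. interned once each
}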

August 02, 2012
On 8/1/2012 9:41 PM, H. S. Teoh wrote:
> Whether it's part of the range type or a separate lexer type,
> *definitely* make it possible to have multiple instances. One of the
> biggest flaws of otherwise-good lexer generators like lex and flex
> (C/C++) is that the core code assumes a single instance, and
> multi-instances were glued on after the fact, making it a royal pain to
> work with anything that needs lexing multiple things at the same time.

Yup. I keep trying to think of a way to lex multiple files at the same time in separate threads, but the problem is that serializing access to the identifier table would likely kill off any perf gain.

August 02, 2012
On 8/1/2012 9:02 PM, Jakob Ovrum wrote:
> No it doesn't. That's an implication of using the NewExpression, not classes.

See my other reply to you on this topic.

August 02, 2012
On Wednesday, August 01, 2012 21:30:44 Walter Bright wrote:
> On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
> > On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
> >> 1. It should accept as input an input range of UTF8. I feel it is a
> >> mistake
> >> to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
> >> should use an 'adapter' range to convert the input to UTF8. (This is what
> >> component programming is all about.)
> > 
> > But that's not how ranges of characters work. They're ranges of dchar. Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create special wrappers around string or wstring to have ranges of UTF-8. The way that it's normally done is to have ranges of dchar and then special-case range-based functions for strings. Then the function can operate on any range of dchar but still operates on strings efficiently.
> 
> I have argued against making ranges of characters dchars, because of performance reasons. This will especially adversely affect the performance of the lexer.
> 
> The fact is, files are normally in UTF8 and just about everything else is in UTF8. Prematurely converting to UTF-32 is a performance disaster. Note that the single largest thing holding back regex performance is that premature conversion to dchar and back to char.
> 
> If lexer is required to accept dchar ranges, its performance will drop at least in half, and people are going to go reinvent their own lexers.

1. The current design of Phobos is to have ranges of dchar, because it fosters code correctness (though it can harm efficiency). It's arguably too late to do otherwise. Certainly, doing otherwise now would break a lot of code. If the lexer tried to operate on UTF-8 as part of its API rather than operating on ranges of dchar and special-casing strings, then it wouldn't fit in with Phobos at all.

2. The lexer does _not_ have to have its performance tank by accepting ranges of dchar. It's true that the performance will be harmed for ranges which _aren't_ strings, but for strings (as would be by far the most common use case) it can be very efficient by special-casing them.

And as much as there are potential performance issues with Phobos' choice of treating strings as ranges of dchar, if it were to continue to treat them as ranges of code units, it's pretty much a guarantee that there would be a _ton_ of bugs caused by it. Most programmers have absolutely no clue about how unicode works and don't really want to know. They want strings to just work. Phobos' approach of defaulting to correct but making it possible to make the code faster through specializations is very much in line with D's typical approach of making things safe by default but allowing the programmer to do unsafe things for optimizations when they know what they're doing.
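A minimal sketch of that special-casing (the function is invented purely for illustration): the public constraint is still "range of dchar", but the string path walks code units and never decodes.

import std.range.primitives;
import std.traits : isSomeString;

size_t countLetterA(R)(R r)
    if (isInputRange!R && is(ElementType!R : dchar))
{
    size_t n;
    static if (isSomeString!R)
    {
        // Fast path: iterate code units directly; 'a' is ASCII, and UTF-8
        // continuation bytes are all >= 0x80, so no decoding is needed.
        foreach (char c; r)
            if (c == 'a') ++n;
    }
    else
    {
        // Generic path: elements arrive as decoded dchar.
        foreach (dchar c; r)
            if (c == 'a') ++n;
    }
    return n;
}

unittest
{
    import std.range : retro;
    assert(countLetterA("aßa") == 2);        // string: code-unit fast path
    assert(countLetterA("aßa".retro) == 2);  // generic range-of-dchar path
}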

> All identifiers are entered into a hashtable, and are referred to by pointers into that hashtable for the rest of dmd. This makes symbol lookups incredibly fast, as no string comparisons are done.

Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing else will care about it. At most, it should be an optional feature of the lexer. But it certainly could be added that way.

- Jonathan M Davis
August 02, 2012
On Wednesday, August 01, 2012 21:47:22 Walter Bright wrote:
> > Couldn't that just be part of the range type? A separate lexer type shouldn't be necessary.
> 
> No, because the user may want to serially present ranges to the same lexer type. Why do that, you might say? That's how dmd works - it means the identifier table will be the same for all source files.

Then just pass the same identifier table to the function which creates the token range. That doesn't require another type.

- Jonathan M Davis
August 02, 2012
On Wednesday, August 01, 2012 21:54:01 Jonathan M Davis wrote:
> On Wednesday, August 01, 2012 21:47:22 Walter Bright wrote:
> > > Couldn't that just be part of the range type? A separate lexer type shouldn't be necessary.
> > 
> > No, because the user may want to serially present ranges to the same lexer type. Why do that, you might say? That's how dmd works - it means the identifier table will be the same for all source files.
> 
> Then just pass the same identifier table to the function which creates the token range. That doesn't require another type.

If you end up with enough pieces of state that you need to carry around between files/strings being lexed, then creating another object makes sense, but I would hope that the amount of global state like that would be minimal.

- Jonathan M Davis
August 02, 2012
On 8/1/2012 9:54 PM, Jonathan M Davis wrote:
> Then just pass the same identifier table to the function which creates the
> token range. That doesn't require another type.

You're still going to require another type, otherwise you'll have to duplicate the state in every token allocation, with resultant heavy memory and initialization costs.

Please keep in mind that a lexer is not something you just pass a few short strings to. It's very very very performance critical as all those extra instructions add up to long delays when you're shoving millions of lines of code into its maw.

For the same reason you're also not going to want the lexer putting pressure on the GC. It could bring your whole system down.

To get a high-performance lexer, you're going to be counting the average number of instructions executed per input character. Each one counts. Shaving one off is a victory. You should also be thinking about memory cache access patterns.
August 02, 2012
On 8/1/2012 9:52 PM, Jonathan M Davis wrote:
> 1. The current design of Phobos is to have ranges of dchar, because it fosters
> code correctness (though it can harm efficiency). It's arguably too late to do
> otherwise. Certainly, doing otherwise now would break a lot of code. If the
> lexer tried to operate on UTF-8 as part of its API rather than operating on
> ranges of dchar and special-casing strings, then it wouldn't fit in with Phobos
> at all.

For performance reasons, the lexer must use char, or it will not be acceptable as anything but a toy.


> 2. The lexer does _not_ have to have its performance tank by accepting ranges
> of dchar. It's true that the performance will be harmed for ranges which
> _aren't_ strings, but for strings (as would be by far the most common use
> case) it can be very efficient by special-casing them.

Somebody has to convert the input files into dchars, and then back into chars. That blows for performance. Think billions and billions of characters going through, not just a few random strings.

Always always think of the lexer as having a firehose of data shoved into its maw, and it better be thirsty!


> And as much as there are potential performance issues with Phobos' choice of
> treating strings as ranges of dchar, if it were to continue to treat them as
> ranges of code units, it's pretty much a guarantee that there would be a _ton_
> of bugs caused by it. Most programmers have absolutely no clue about how
> unicode works and don't really want to know. They want strings to just work.
> Phobos' approach of defaulting to correct but making it possible to make the
> code faster through specializations is very much in line with D's typical
> approach of making things safe by default but allowing the programmer to do
> unsafe things for optimizations when they know what they're doing.

I expect std.d.lexer to handle UTF8 correctly, so I don't think this should be an issue in this particular case. dmd's lexer does handle UTF8 correctly.

Note also that the places where non-ASCII characters can appear in correct D code are severely limited, there are even fewer places where multibyte characters need to be decoded at all, and the lexer takes full advantage of this to boost its speed.

For example, non-ASCII characters can appear in comments, but they DO NOT need to be decoded, and even just having the test for a non-ASCII character in the comment scanner will visibly slow down the lexer.
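To illustrate (a sketch, not dmd's actual code): scanning a // comment only compares bytes against the ASCII line terminators, so multibyte UTF-8 sequences are skipped without ever being decoded. Note that the D grammar also allows the U+2028/U+2029 line terminators, which this sketch ignores.

// Returns the index of the line terminator (or src.length), starting just
// past the "//" that opened the comment. Non-ASCII bytes are all >= 0x80,
// so they can never match '\n' or '\r' and are simply stepped over, undecoded.
size_t skipLineComment(const(char)[] src, size_t i)
{
    while (i < src.length && src[i] != '\n' && src[i] != '\r')
        ++i;
    return i;
}

unittest
{
    auto s = "// hällo wörld\nint x;";
    auto end = skipLineComment(s, 2);
    assert(s[end] == '\n');
}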

>> All identifiers are entered into a hashtable, and are referred to by
>> pointers into that hashtable for the rest of dmd. This makes symbol lookups
>> incredibly fast, as no string comparisons are done.
>
> Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing
> else will care about it. At most, it should be an optional feature of the
> lexer. But it certainly could be added that way.

I hate to say "trust me on this", but if you don't, have a look at dmd's lexer and how it handles identifiers, then look at dmd's symbol table.