August 07, 2012
On 8/7/2012 1:14 AM, Jonathan M Davis wrote:
> But you can also configure the lexer to return an error token instead of using
> the delegate if that's what you prefer. But Walter is right in that if you
> have to check every token for whether it's an error, that will incur overhead.
> So, depending on your use case, that could be unacceptable.

It's not just overhead - it's just plain ugly to constantly check for error tokens. It's also tedious and error prone to insert those checks.

I don't see any advantage to it.


August 07, 2012
On Tuesday, August 07, 2012 02:54:42 Walter Bright wrote:
> On 8/7/2012 1:14 AM, Jonathan M Davis wrote:
> > But you can also configure the lexer to return an error token instead of using the delegate if that's what you prefer. But Walter is right in that if you have to check every token for whether it's an error, that will incur overhead. So, depending on your use case, that could be unacceptable.
> 
> It's not just overhead - it's just plain ugly to constantly check for error tokens. It's also tedious and error prone to insert those checks.
> 
> I don't see any advantage to it.

It's easier to see where in the range of tokens the errors occur. A delegate is disconnected from the point where the range is being consumed, whereas if tokens are used for errors, then the function consuming the range can see exactly where in the range of tokens the error is (and potentially handle it differently based on that information).
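
(A minimal sketch of that point, assuming a hypothetical `Token` type with an `error` kind; the actual token type of the lexer is not shown in this thread. Because errors sit in the token sequence itself, the consumer can locate them by position.)

```d
// Hypothetical token representation, for illustration only.
enum TokenKind { identifier, number, error }

struct Token
{
    TokenKind kind;
    string text;
}

// Returns the index of every error token in an already-lexed sequence.
size_t[] errorPositions(const Token[] tokens)
{
    size_t[] result;
    foreach (i, tok; tokens)
    {
        if (tok.kind == TokenKind.error)
            result ~= i;
    }
    return result;
}

unittest
{
    auto tokens = [Token(TokenKind.identifier, "x"),
                   Token(TokenKind.error, "@@"),
                   Token(TokenKind.number, "42")];
    auto positions = errorPositions(tokens);
    assert(positions.length == 1);
    assert(positions[0] == 1);   // the error is the second token
}
```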

Regardless, I was asked to keep that option in there by at least one person (Philippe Sigaud IIRC), which is why I didn't just switch over to the delegate entirely.

- Jonathan M Davis

August 07, 2012
Walter Bright, in message (digitalmars.D:174393), wrote:
> If the delegate returns, then the lexer recovers.

That's an option, if there is only one way to recover (which is a reasonable assumption).

You wanted the delegate to "decide what to do with the errors (ignore, throw exception, or quit)".

Throwing is handled, but not ignore/quit. Jonathan's solution (delegate returning a bool) is good. It could also be a delegate returning an int, 0 meaning continue, and any other value being an error code that can be retrieved later. It could also be a number of characters to skip (0 meaning break).
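
(A rough sketch of the bool-returning variant, under stated assumptions: `lexWords` is a hypothetical toy lexer that splits on whitespace and treats any word containing '#' as an error. Only the shape of the error handling is the point; the handler returns true to ignore the error and continue, false to quit, and it may also throw.)

```d
import std.algorithm.searching : canFind;
import std.array : split;

// Hypothetical toy lexer. onError decides what happens on bad input:
// true = skip it and continue, false = stop lexing.
string[] lexWords(string source, bool delegate(string badText) onError)
{
    string[] tokens;
    foreach (word; source.split())
    {
        if (word.canFind('#'))
        {
            if (!onError(word))
                break;      // "quit"
            continue;       // "ignore"
        }
        tokens ~= word;
    }
    return tokens;
}

unittest
{
    // Ignore errors: lexing continues past the bad word.
    assert(lexWords("a b# c", (string s) => true) == ["a", "c"]);
    // Quit on the first error: nothing after it is lexed.
    assert(lexWords("a b# c", (string s) => false) == ["a"]);
}
```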

August 07, 2012
Walter Bright, in message (digitalmars.D:174394), wrote:
> On 8/7/2012 1:14 AM, Jonathan M Davis wrote:
>> But you can also configure the lexer to return an error token instead of using the delegate if that's what you prefer. But Walter is right in that if you have to check every token for whether it's an error, that will incur overhead. So, depending on your use case, that could be unacceptable.
> 
> It's not just overhead - it's just plain ugly to constantly check for error tokens. It's also tedious and error prone to insert those checks.

It's not necessarily ugly, because of the powerful range design. You can pipe the lexer into a range adapter that just ignores error tokens, or throws when it meets an error token.

For example, just use:
auto tokens = data.lexer.throwOnErrorToken;

I don't think this is more ugly than:
auto tokens = data.lexer!(complex signature) { throw LexException; };

But yes, there is overhead, so I understand returning error tokens is not satisfactory for everyone.
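
(A sketch of what such adapters could look like, assuming a hypothetical `Token` type and `LexException` class; neither is defined in this thread, and `skipErrorTokens` is a name made up here. Both adapters are lazy, so the exception is only thrown when the offending token is actually reached.)

```d
import std.algorithm.iteration : filter, map;
import std.array : array;
import std.exception : assertThrown;
import std.range.primitives : ElementType, isInputRange;

// Hypothetical types, for illustration only.
enum TokenKind { word, error }
struct Token { TokenKind kind; string text; }

class LexException : Exception
{
    this(string msg) { super(msg); }
}

// Passes tokens through unchanged, throwing when an error token is reached.
auto throwOnErrorToken(R)(R tokens)
if (isInputRange!R && is(ElementType!R == Token))
{
    return tokens.map!((Token tok) {
        if (tok.kind == TokenKind.error)
            throw new LexException("bad token: " ~ tok.text);
        return tok;
    });
}

// Drops error tokens silently.
auto skipErrorTokens(R)(R tokens)
if (isInputRange!R && is(ElementType!R == Token))
{
    return tokens.filter!(tok => tok.kind != TokenKind.error);
}

unittest
{
    auto bad = [Token(TokenKind.word, "a"), Token(TokenKind.error, "#")];
    assert(bad.skipErrorTokens.array.length == 1);
    assertThrown!LexException(bad.throwOnErrorToken.array);
}
```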

> I don't see any advantage to it.

Storing the error somewhere can be of use.
For example, you may want to lex a whole file into an array of tokens,
and then deal with your errors as you analyse the array of tokens.
Of course, you can always make a delegate to store the error somewhere,
but it is easier if this somewhere is in your token pile.

What I don't see any advantage in is using a delegate that can only return
or throw. A policy does the job:
auto tokens = data.lexer!ExceptionPolicy.throwException;
That's clean too.

If you want the delegate to be of any use, then it must have data to process. That's why I said we have to worry about the signature of the delegate.

-- 
Christophe

August 07, 2012
On Tue, Aug 7, 2012 at 12:06 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:

> Regardless, I was asked to keep that option in there by at least one person (Philippe Sigaud IIRC), which is why I didn't just switch over to the delegate entirely.

IIRC, I was not the only one, as people here interested in coding an IDE asked for it too. A lexer is useful for more than 'just' parsing D afterwards: an IDE could easily color tokens according to their type, and an error token is just what is needed to highlight errors.

Also, what I proposed was a *static* decision: with SkipErrors { no, yes }. With a static if inside its guts, the lexer could change its behavior accordingly. Make SkipErrors.yes the default and Walter gets his speed. It's just that an IDE or another parser could use

auto lex = std.lexer.Lexer!(SkipErrors.no)(input);
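
(A minimal sketch of that static decision, under stated assumptions: `lexWords` is a hypothetical toy lexer that treats any word containing '#' as an error, and `Token` is a made-up type. The flag is a template parameter, so the branch is resolved at compile time and the skipping version contains no error-token code at all.)

```d
import std.algorithm.searching : canFind;
import std.array : split;

enum SkipErrors { no, yes }

// Hypothetical token type, for illustration only.
enum TokenKind { word, error }
struct Token { TokenKind kind; string text; }

Token[] lexWords(SkipErrors skip = SkipErrors.yes)(string source)
{
    Token[] tokens;
    foreach (word; source.split())
    {
        if (word.canFind('#'))
        {
            // Compiled in only when error tokens were requested.
            static if (skip == SkipErrors.no)
                tokens ~= Token(TokenKind.error, word);
        }
        else
        {
            tokens ~= Token(TokenKind.word, word);
        }
    }
    return tokens;
}

unittest
{
    // Default: errors are skipped.
    assert(lexWords("a b# c").length == 2);
    // An IDE-style consumer asks for error tokens instead.
    auto all = lexWords!(SkipErrors.no)("a b# c");
    assert(all.length == 3);
    assert(all[1].kind == TokenKind.error);
}
```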


Walter, with all due respect, you sometimes give the impression of forgetting we are talking about D and going back to deeply entrenched C-isms. Compile-time decisions can be used to avoid any overhead as long as you have a clear idea of what the two code paths should look like.

And, as Christophe said, ranges are a powerful API. In another thread Simen and I did some comparisons between C-like code and code using only ranges upon ranges upon ranges. A (limited!) difference in speed appeared only for very long calculations.

August 07, 2012
On 2012-08-07 12:06, Jonathan M Davis wrote:

> It's easier to see where in the range of tokens the errors occur. A delegate
> is disconnected from the point where the range is being consumed, whereas if
> tokens are used for errors, then the function consuming the range can see
> exactly where in the range of tokens the error is (and potentially handle it
> differently based on that information).

Just pass the same token to the delegate that you would have returned otherwise?

-- 
/Jacob Carlborg

August 07, 2012
On 8/7/2012 3:06 AM, Jonathan M Davis wrote:
> It's easier to see where in the range of tokens the errors occur. A delegate
> is disconnected from the point where the range is being consumed, whereas if
> tokens are used for errors, then the function consuming the range can see
> exactly where in the range of tokens the error is (and potentially handle it
> differently based on that information).

The delegate has a context pointer giving it a reference to whatever context the code calling the Lexer needs.
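
(A sketch of what that looks like in practice, under stated assumptions: `lexWords` is a hypothetical toy lexer, and passing the index of the bad word to the handler is an illustrative choice, not a signature proposed in this thread. The delegate literal closes over a local variable of the caller, which it reaches through its context pointer.)

```d
import std.algorithm.searching : canFind;
import std.array : split;

// Hypothetical toy lexer: words containing '#' count as errors. The handler
// receives the bad text and its index in the word sequence.
string[] lexWords(string source,
                  void delegate(string badText, size_t index) onError)
{
    string[] tokens;
    foreach (index, word; source.split())
    {
        if (word.canFind('#'))
            onError(word, index);   // returning means "recover and continue"
        else
            tokens ~= word;
    }
    return tokens;
}

// The literal below captures `positions`, a local of this function. The
// delegate's context pointer is what gives it access to that state.
size_t[] collectErrorIndices(string source)
{
    size_t[] positions;
    lexWords(source, (string badText, size_t index) { positions ~= index; });
    return positions;
}

unittest
{
    auto positions = collectErrorIndices("a b# c d#");
    assert(positions.length == 2);
    assert(positions[0] == 1 && positions[1] == 3);
}
```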

August 07, 2012
On 8/7/2012 7:15 AM, Philippe Sigaud wrote:
> Also, what I proposed was a *static* decision: with SkipErrors { no,
> yes }. With a static if inside its guts, the lexer could change its
> behavior accordingly.

Yes, I understand about static if decisions :-) hell I invented them!


> Walter, with all due respect, you sometimes give the impression of
> forgetting we are talking about D and going back to deeply entrenched C-isms.

Delegates are not C-isms.


> Compile-time decisions can be used to avoid any overhead as long as
> you have a clear idea of what the two code paths should look like.

Yes, I understand that. There's also a point about adding too much complexity to the interface. The delegate callback reduces complexity in the interface.

> And, as Christophe said, ranges are a powerful API. In another thread
> Simen and I did some comparisons between C-like code and code using
> only ranges upon ranges upon ranges. A (limited!) difference in speed
> appeared only for very long calculations.

That's good, and you really don't need to sell me on ranges - I'm already sold.


August 07, 2012
On Tue, Aug 7, 2012 at 9:38 PM, Walter Bright <newshound2@digitalmars.com> wrote:

> Yes, I understand about static if decisions :-) hell I invented them!

And what a wonderful decision that was!

> Yes, I understand that. There's also a point about adding too much complexity to the interface. The delegate callback reduces complexity in the interface.

OK, then let's let Jonathan work, and we will see how it goes.


>> And, as Christophe said, ranges are a powerful API. In another thread Simen and I did some comparisons between C-like code and code using only ranges upon ranges upon ranges. A (limited!) difference in speed appeared only for very long calculations.
>
>
> That's good, and you really don't need to sell me on ranges - I'm already sold.

Well, you gave the impression a bit upstream in this thread that having to filter a token range to eliminate errors was an atrocity (millions of tokens!).

As far as I'm concerned, the recent good news was to (re?)discover that complex calls of ranges upon ranges could still be calculated by CTFE. That's really neat.

August 07, 2012
On Tuesday, August 07, 2012 12:38:26 Walter Bright wrote:
> Yes, I understand that. There's also a point about adding too much complexity to the interface. The delegate callback reduces complexity in the interface.

Allowing a choice between returning a token and using a delegate doesn't really affect much, especially if ignoring errors is treated as a separate option rather than simply using a delegate that skips them (which may or may not be beneficial: it's faster without the delegate, but it's actually kind of hard to get lexing errors).

What worries me more is stuff like providing a way to have the range calculate the current position itself (as Christophe suggested IIRC) or having it provide an efficient way to determine the number of code units between two ranges so that you can slice the range lexed to put in the Token. Determining the number of code units is easily done with ptr for strings, but for everything else, you generally have to count as code units are consumed, which isn't really an issue for small tokens (especially those like symbols where the length is known without counting) but does add up for arbitrarily long ones such as comments or string literals. So, providing a way to calculate it more efficiently where possible might be desirable, but it's yet another layer of complication, and I don't know that it's actually possible to provide such a function in enough situations for it to be worth providing that functionality.
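
(A minimal sketch of the string case, with hypothetical helper names. For two slices of the same string, subtracting the `ptr` values gives the number of code units consumed without any counting, so the token text can be sliced out directly. Other range types have no such shortcut and would need a counter that is bumped on every `popFront`.)

```d
// Code units consumed between `before` and `after`, assuming `after` is a
// later slice of the same string as `before`.
size_t consumedCodeUnits(string before, string after)
{
    return after.ptr - before.ptr;
}

// Stand-in for the lexer advancing over a token: skips up to the first space.
string skipWord(string input)
{
    while (input.length != 0 && input[0] != ' ')
        input = input[1 .. $];
    return input;
}

// Slices the consumed token out of the source using the pointer difference.
string firstWord(string source)
{
    string rest = skipWord(source);
    return source[0 .. consumedCodeUnits(source, rest)];
}

unittest
{
    assert(firstWord("hello world") == "hello");
    assert(firstWord("abc") == "abc");   // whole input consumed
}
```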

I expect that the configuration stuff is going to have to be adjusted after I'm done, since I'm not sure that it's entirely clear what's worth configuring or not.

- Jonathan M Davis