August 06, 2012
On 2012-08-06 21:00, Philippe Sigaud wrote:

> Yes, well we don't have a condition system. And using exceptions
> during lexing would most probably kill its efficiency.
> Errors in lexing are not uncommon. The usual D idiom of having an enum
> StopOnError { no, yes } should be enough.

Especially if the lexer is used in an IDE. The code is constantly changing and will be invalid quite often.

-- 
/Jacob Carlborg
August 06, 2012
On 06-Aug-12 22:03, deadalnix wrote:
> On 04/08/2012 15:45, Dmitry Olshansky wrote:
>> On 04-Aug-12 15:48, Jonathan M Davis wrote:
>>> On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
>>>> I see it as a compile-time policy, that will fit nicely and solve both
>>>> issues. Just provide a template with a few hooks, and add a Noop
>>>> policy
>>>> that does nothing.
>>>
>>> It's starting to look like figuring out what should and shouldn't be
>>> configurable and how to handle it is going to be the largest problem
>>> in the
>>> lexer...
>>>
>>
>> Let's add some meat to my post.
>> I've seen it mostly as follows:
>>
>> //user defines mixin template that is mixed in inside lexer
>> template MyConfig()
>> {
>> enum identifierTable = true; // means there would be calls to
>> // table.insert on each identifier
>> enum countLines = true; // adds line, column properties to the
>> // lexer/Tokens
>>
>> //statically bound callbacks, inside one can use say:
>> // skip() - to skip a char (popFront)
>> // get() - to read next char (via popFront, front)
>> // line, col - as readonly properties
>> // (skip & get do the counting if enabled)
>>
>> bool onError()
>> {
>> skip(); //the most dumb recovery, just skip a char
>> return true; //go on with tokenizing, false - stop prematurely
>> }
>>
>> ...
>> }
>>
>> usage:
>>
>>
>> {
>> auto my_supa_table = ...; // some kind of container (should be a set of
>> // strings and support .insert("blah"); )
>>
>> auto dlex = Lexer!(MyConfig)(my_supa_table);
>> auto all_tokens = array(dlex(joiner(stdin.byChunk(4096))));
>>
>> //or if we had no interest in table but only tokens:
>> auto noop = Lexer!(NoopLex)();
>> ...
>> }
>>
>
> It seems way too much.
>
> The most complex thing that is needed is the policy to allocate
> identifiers in tokens.

An editor that only highlights text may choose not to build an identifier table at all. One may see it as a safe mode (low-resource mode) for a more advanced IDE.

> The second parameter is a bool to tokenize comments or not. Is that
> enough ?
No.

And emitting comments as a special comment token is frankly a bad idea. See Walter's comments in this thread.

Also, for a compiler only the DDoc comments are ever useful, which is not the case for an IDE. Filtering them out later is inefficient; it would be far better not to create them in the first place.

> The onError look like a typical use case for conditions as explained in
> the huge thread on Exception.

Hmm, I lost track of that discussion. Either way, I see a statically bound function as a good enough hook into the process, as it can do anything useful: skip wrong chars, throw an exception, stop parsing prematurely, whatever - pick your poison.
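
A minimal sketch of what such a statically bound hook could look like. It uses a struct with static functions rather than the mixin template proposed earlier, and `SkipPolicy`, `StopPolicy`, and `countNumbers` are made-up names for illustration, not part of any actual lexer:

```d
import std.stdio;

// Hypothetical policy: drop the offending character and keep going.
struct SkipPolicy
{
    static bool onError(ref string input)
    {
        input = input[1 .. $]; // the dumbest recovery: skip one char
        return true;           // true = go on with tokenizing
    }
}

// Hypothetical policy: stop at the first error.
struct StopPolicy
{
    static bool onError(ref string input) { return false; }
}

// Toy "lexer": counts runs of ASCII digits. Spaces are skipped and
// any other character is treated as a lexing error.
size_t countNumbers(Policy)(string input)
{
    size_t count;
    while (input.length)
    {
        immutable c = input[0];
        if (c == ' ')
            input = input[1 .. $];
        else if (c >= '0' && c <= '9')
        {
            while (input.length && input[0] >= '0' && input[0] <= '9')
                input = input[1 .. $];
            ++count;
        }
        else if (!Policy.onError(input)) // resolved at compile time
            break;
    }
    return count;
}

void main()
{
    writeln(countNumbers!SkipPolicy("12 x 34")); // 2
    writeln(countNumbers!StopPolicy("12 x 34")); // 1
}
```

Since the policy is a template argument, the call to `onError` is an ordinary static call that the compiler is free to inline, and a policy that wants to throw can simply do so from inside `onError`.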

-- 
Dmitry Olshansky
August 06, 2012
On 8/6/2012 12:00 PM, Philippe Sigaud wrote:
> Yes, well we don't have a condition system. And using exceptions
> during lexing would most probably kill its efficiency.
> Errors in lexing are not uncommon. The usual D idiom of having an enum
> StopOnError { no, yes } should be enough.


That's why I suggested supplying a callback delegate to decide what to do with errors (ignore, throw exception, or quit) and have the delegate itself do that. That way, there is no customization of the Lexer required.


August 06, 2012
On 2012-08-06 22:26, Dmitry Olshansky wrote:

> No.
>
> And doing Tokens as special comment token is frankly bad idea. See
> Walter's comments in this thread.
>
> Also e.g. For compiler only DDoc ones are ever useful, not so for IDE.
> Filtering them out later is inefficient, as it would be far better not
> to create them in the first place.

The Eclipse plugin Descent can show formatted DDoc comments.

-- 
/Jacob Carlborg
August 06, 2012
On 07-Aug-12 01:48, Jacob Carlborg wrote:
> On 2012-08-06 22:26, Dmitry Olshansky wrote:
>
>> No.
>>
>> And doing Tokens as special comment token is frankly bad idea. See
>> Walter's comments in this thread.
>>
>> Also e.g. For compiler only DDoc ones are ever useful, not so for IDE.
>> Filtering them out later is inefficient, as it would be far better not
>> to create them in the first place.
>
> The Eclipse plugin Descent can show formatted DDoc comments.
>
I meant that an IDE may be interested in more than just DDoc, at least to highlight things properly.

-- 
Dmitry Olshansky
August 07, 2012
On Thursday, 2 August 2012 at 04:48:56 UTC, Walter Bright wrote:
> On 8/1/2012 9:41 PM, H. S. Teoh wrote:
>> Whether it's part of the range type or a separate lexer type,
>> *definitely* make it possible to have multiple instances. One of the
>> biggest flaws of otherwise-good lexer generators like lex and flex
>> (C/C++) is that the core code assumes a single instance, and
>> multi-instances were glued on after the fact, making it a royal pain to
>> work with anything that needs lexing multiple things at the same time.
>
> Yup. I keep trying to think of a way to lex multiple files at the same time in separate threads, but the problem is serializing access to the identifier table will likely kill off any perf gain.

The following is an incredibly fast multithreaded hash table. It is both lock-free and fence-free. Would something like that solve your problem?

http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf

August 07, 2012
Walter Bright, in message (digitalmars.D:174360), wrote:
> On 8/6/2012 12:00 PM, Philippe Sigaud wrote:
>> Yes, well we don't have a condition system. And using exceptions
>> during lexing would most probably kill its efficiency.
>> Errors in lexing are not uncommon. The usual D idiom of having an enum
>> StopOnError { no, yes } should be enough.
> 
> 
> That's why I suggested supplying a callback delegate to decide what to do with errors (ignore, throw exception, or quit) and have the delegate itself do that. That way, there is no customization of the Lexer required.

It may be easier to take into account a few cases (returning an error token and throwing are enough, so that is a basic static if) than to define a way to integrate a delegate (what should the delegate's signature be, what value to return to request stopping, how to provide ways to recover, etc.).
August 07, 2012
On Tuesday, August 07, 2012 08:00:24 Christophe Travert wrote:
> Walter Bright, in message (digitalmars.D:174360), wrote:

> > That's why I suggested supplying a callback delegate to decide what to do with errors (ignore, throw exception, or quit) and have the delegate itself do that. That way, there is no customization of the Lexer required.
> 
> It may be easier to take into account a few cases (returning an error token and throwing are enough, so that is a basic static if) than to define a way to integrate a delegate (what should the delegate's signature be, what value to return to request stopping, how to provide ways to recover, etc.).

For the moment at least, I'm doing this

bool delegate(string errorMsg, SourcePos pos) errorHandler;

where SourcePos is a struct which holds the line and col of the place in the source code where the bad token starts. If the delegate returns true, the token is skipped. If it returns false, an exception is thrown - and of course the delegate can throw its own exception if it wants to.

But you can also configure the lexer to return an error token instead of using the delegate if that's what you prefer. But Walter is right in that if you have to check every token for whether it's an error, that will incur overhead. So, depending on your use case, that could be unacceptable.
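
As a rough illustration of the delegate approach, here is a toy lexer wired to a handler with that signature. The `lex` function and its rules (lowercase letters are tokens, spaces and newlines are skipped, anything else is an error) are assumptions made up for the example, not the actual lexer code, and it flags single characters rather than whole tokens:

```d
import std.stdio;

struct SourcePos { size_t line; size_t col; }

// Hypothetical toy lexer: every lowercase ASCII letter is a token.
// The handler decides what happens on any other character.
string[] lex(string src,
             bool delegate(string errorMsg, SourcePos pos) errorHandler)
{
    string[] tokens;
    size_t line = 1, col = 1;
    foreach (i; 0 .. src.length)
    {
        immutable c = src[i];
        if (c == '\n')
        {
            ++line;
            col = 1;
            continue;
        }
        if (c >= 'a' && c <= 'z')
            tokens ~= src[i .. i + 1];
        else if (c != ' '
                 && !errorHandler("unexpected character", SourcePos(line, col)))
            throw new Exception("lexing aborted"); // handler said "don't skip"
        ++col;
    }
    return tokens;
}

void main()
{
    SourcePos[] errors;

    // Lenient handler: record where the error was and skip the bad input.
    auto tokens = lex("a b\n c # d", (string msg, SourcePos pos) {
        errors ~= pos;
        return true;
    });

    writeln(tokens);
    writeln(errors);
}
```

A strict caller would pass a handler that returns false (or throws its own exception), and no per-token error check is needed on the consuming side.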

- Jonathan M Davis
August 07, 2012
On 8/6/2012 5:14 PM, Jason House wrote:
> The following is an incredibly fast multithreaded hash table. It is both
> lock-free and fence-free. Would something like that solve your problem?
>
> http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf
>

It might if I understood it! There do seem to be some cases where fences are required. This would take considerable study.

August 07, 2012
On 8/7/2012 1:00 AM, Christophe Travert wrote:
>> That's why I suggested supplying a callback delegate to decide what to do with
>> errors (ignore, throw exception, or quit) and have the delegate itself do that.
>> That way, there is no customization of the Lexer required.
>
> It may be easier to take into account a few cases (returning an error token and
> throwing are enough, so that is a basic static if) than to define a way
> to integrate a delegate (what should the delegate's signature be, what
> value to return to request stopping, how to provide ways to recover,
> etc.).


If the delegate returns, then the lexer recovers.

The delegate is passed the error message and the location.

I don't see that it is any more complex than that.