August 04, 2012
On Saturday, 4 August 2012 at 11:58:09 UTC, Jonathan M Davis wrote:
> On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
>> I see it as a compile-time policy, that will fit nicely and solve both
>> issues. Just provide a templates with a few hooks, and add a Noop policy
>> that does nothing.
>
> It's starting to look like figuring out what should and shouldn't be
> configurable and how to handle it is going to be the largest problem in the
> lexer...
>
> - Jonathan M Davis

If we have a really fast lexer that is highly compile-time configureable and has a readable codebase, then this would be a really good showcase for D.


August 04, 2012
On 04-Aug-12 15:48, Jonathan M Davis wrote:
> On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
>> I see it as a compile-time policy, that will fit nicely and solve both
>> issues. Just provide a templates with a few hooks, and add a Noop policy
>> that does nothing.
>
> It's starting to look like figuring out what should and shouldn't be
> configurable and how to handle it is going to be the largest problem in the
> lexer...
>

Let's add some meat to my post.
I've seen it mostly as follows:

//user defines mixin template that is mixed in inside lexer
template MyConfig()
{
	enum identifierTable = true; // means there would be calls to table.insert on each identifier
	enum countLines = true; //adds line, column properties to the lexer/Tokens

	//statically bound callbacks, inside one can use say:
	// skip() - to skip a char (popFront)
	// get() - to read next char (via popFront, front)
	// line, col - as readonly properties
	// (skip & get do the counting if enabled)
	
	bool onError()
	{
		skip(); //the most dumb recovery, just skip a char
		return true; //go on with tokenizing, false - stop prematurely
	}

	...
}

usage:


{
	auto my_supa_table = ...; //some kind of container (should a  set on strings and support .insert("blah"); )

	auto dlex = Lexer!(MyConfig)(table);
	auto all_tokens = array(dlex(joiner(stdin.byChunk(4096))));

	//or if we had no interest in table but only tokens:
	auto noop = Lexer!(NoopLex)();
	...		
}

-- 
Dmitry Olshansky
August 04, 2012
Jonathan M Davis , dans le message (digitalmars.D:174223), a écrit :
> On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
>> I see it as a compile-time policy, that will fit nicely and solve both issues. Just provide a templates with a few hooks, and add a Noop policy that does nothing.
> 
> It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer...

Yes, I figured-out a policy could be used to, but since the begining of the thread, that makes a lot of things to configure! Jonathan would have trouble trying to make them all. Choices have to be made. That's why I proposed to use adapter range to enable to do the buffering instead of slicing, and to build the look up table. Done correctly, it can make the core of the lexer imlementation clean without loosing efficiency (I hope). If this policy for parsing literals if the only thing that remains to be configured directly in the core of the lexer with static if, then it's reasonable.
August 04, 2012
On 08/02/2012 03:09 AM, Bernard Helyer wrote:
> http://i.imgur.com/oSXTc.png
>
> Posted without comment.

Hell yeah Alexander Brandon.
August 05, 2012
To help with performance comparisons I ripped dmd's lexer out and got it building as a few .d files.  It's very crude. It's got tons of casts (more than the original c++ version).  I attempted no cleanup or any other change than the minimum I could to get it to build and run.  Obviously there's tons of room for cleanup, but that's not the point... it's just useful as a baseline.

The branch:
    https://github.com/braddr/phobos/tree/dmd_lexer

The commit with the changes:
    https://github.com/braddr/phobos/commit/040540ef3baa38997b15a56be3e9cd9c4bfa51ab

On my desktop (far from idle, it's running 2 of the auto testers), it consistently takes 0.187s to lex all of the .d files in phobos.

Later,
Brad

On 8/1/2012 5:10 PM, Walter Bright wrote:
> Given the various proposals for a lexer module for Phobos, I thought I'd share some characteristics it ought to have.
> 
> First of all, it should be suitable for, at a minimum:
> 
> 1. compilers
> 
> 2. syntax highlighting editors
> 
> 3. source code formatters
> 
> 4. html creation
> 
> To that end:
> 
> 1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)
> 
> 2. It should output an input range of tokens
> 
> 3. tokens should be values, not classes
> 
> 4. It should avoid memory allocation as much as possible
> 
> 5. It should read or write any mutable global state outside of its "Lexer" instance
> 
> 6. A single "Lexer" instance should be able to serially accept input ranges, sharing and updating one identifier table
> 
> 7. It should accept a callback delegate for errors. That delegate should decide whether to:
>    1. ignore the error (and "Lexer" will try to recover and continue)
>    2. print an error message (and "Lexer" will try to recover and continue)
>    3. throw an exception, "Lexer" is done with that input range
> 
> 8. Lexer should be configurable as to whether it should collect information about comments and ddoc comments or not
> 
> 9. Comments and ddoc comments should be attached to the next following token, they should not themselves be tokens
> 
> 10. High speed matters a lot
> 
> 11. Tokens should have begin/end line/column markers, though most of the time this can be implicitly determined
> 
> 12. It should come with unittests that, using -cov, show 100% coverage
> 
> 
> Basically, I don't want anyone to be motivated to do a separate one after seeing this one.

August 05, 2012
On Saturday, August 04, 2012 17:45:58 Dmitry Olshansky wrote:
> On 04-Aug-12 15:48, Jonathan M Davis wrote:
> > On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
> >> I see it as a compile-time policy, that will fit nicely and solve both issues. Just provide a templates with a few hooks, and add a Noop policy that does nothing.
> > 
> > It's starting to look like figuring out what should and shouldn't be
> > configurable and how to handle it is going to be the largest problem in
> > the
> > lexer...
> 
> Let's add some meat to my post.
> I've seen it mostly as follows:
[snip]

It would probably be a bit more user friendly to pass a struct as a template argument (which you can't do in the normal sense, but you can pass it as an alias). Regardless, the problem isn't so much how to provide a configuration as how the configuration options affect the lexer and what it does. I'll figure it out, but it does definitely complicate things. I wasn't originally planning an having anything be configurable. But if we want it to be such that no one will want to write another one (as Walter is looking for), then it's going to need to be configurable enough to make it efficient for all of the common lexing scenarios rather than efficient for one particular scenario.

- Jonathan M Davis
August 05, 2012
On 8/5/2012 12:59 AM, Brad Roberts wrote:
> To help with performance comparisons I ripped dmd's lexer out and got it building as a few .d files.  It's very crude.
> It's got tons of casts (more than the original c++ version).  I attempted no cleanup or any other change than the
> minimum I could to get it to build and run.  Obviously there's tons of room for cleanup, but that's not the point...
> it's just useful as a baseline.
>
> The branch:
>      https://github.com/braddr/phobos/tree/dmd_lexer
>
> The commit with the changes:
>      https://github.com/braddr/phobos/commit/040540ef3baa38997b15a56be3e9cd9c4bfa51ab
>
> On my desktop (far from idle, it's running 2 of the auto testers), it consistently takes 0.187s to lex all of the .d
> files in phobos.
>
> Later,
> Brad

Thanks, Brad!
August 06, 2012
On 8/1/12 21:10 , Walter Bright wrote:
> 8. Lexer should be configurable as to whether it should collect
> information about comments and ddoc comments or not
>
> 9. Comments and ddoc comments should be attached to the next following
> token, they should not themselves be tokens

I believe there should be an option to get comments as tokens. Or otherwise the attached comments should have source location...
August 06, 2012
Le 04/08/2012 15:45, Dmitry Olshansky a écrit :
> On 04-Aug-12 15:48, Jonathan M Davis wrote:
>> On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
>>> I see it as a compile-time policy, that will fit nicely and solve both
>>> issues. Just provide a templates with a few hooks, and add a Noop policy
>>> that does nothing.
>>
>> It's starting to look like figuring out what should and shouldn't be
>> configurable and how to handle it is going to be the largest problem
>> in the
>> lexer...
>>
>
> Let's add some meat to my post.
> I've seen it mostly as follows:
>
> //user defines mixin template that is mixed in inside lexer
> template MyConfig()
> {
> enum identifierTable = true; // means there would be calls to
> table.insert on each identifier
> enum countLines = true; //adds line, column properties to the lexer/Tokens
>
> //statically bound callbacks, inside one can use say:
> // skip() - to skip a char (popFront)
> // get() - to read next char (via popFront, front)
> // line, col - as readonly properties
> // (skip & get do the counting if enabled)
>
> bool onError()
> {
> skip(); //the most dumb recovery, just skip a char
> return true; //go on with tokenizing, false - stop prematurely
> }
>
> ...
> }
>
> usage:
>
>
> {
> auto my_supa_table = ...; //some kind of container (should a set on
> strings and support .insert("blah"); )
>
> auto dlex = Lexer!(MyConfig)(table);
> auto all_tokens = array(dlex(joiner(stdin.byChunk(4096))));
>
> //or if we had no interest in table but only tokens:
> auto noop = Lexer!(NoopLex)();
> ...
> }
>

It seems way too much.

The most complex thing that is needed is the policy to allocate identifiers in tokens. It can be made by passing a function that have a string as parameter and a string as return value. The default one would be an identity function.

The second parameter is a bool to tokenize comments or not. Is that enough ?

The onError look like a typical use case for conditions as explained in the huge thread on Exception.
August 06, 2012
On Mon, Aug 6, 2012 at 8:03 PM, deadalnix <deadalnix@gmail.com> wrote:

> The most complex thing that is needed is the policy to allocate identifiers in tokens. It can be made by passing a function that have a string as parameter and a string as return value. The default one would be an identity function.

I think one should pass it an empty symbol table and the lexer should fill it and associate each identifier with a unique ID, ID which would appear in the Identifier token.


> The second parameter is a bool to tokenize comments or not. Is that enough ?
>
> The onError look like a typical use case for conditions as explained in the huge thread on Exception.

Yes, well we don't have a condition system. And using exceptions
during lexing would most probably kill its efficiency.
Errors in lexing are not uncommon. The usual D idiom of having an enum
StopOnError { no, yes } should be enough.