January 27, 2013
On Sun, Jan 27, 2013 at 11:42 AM, Brian Schott <briancschott@gmail.com> wrote:
> On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
>>
>> * Having a range interface is good. Any reason why you made byToken a class and not a struct? Most (like, 99%) of ranges in Phobos are structs. Do you need reference semantics?
>
>
> It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code.

Hmm. You're the first person I've seen pushing this. Personally, I'd go for a struct, if only because I don't need reference semantics.
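For context, the two styles under discussion could be sketched like this (hypothetical names, not Brian's actual code): a struct range consumed via template constraints, versus a class implementing the `InputRange` interface from std.range.

```d
import std.range;

struct Token { string value; }

// Struct range: value semantics, consumed via template constraints,
// no heap allocation for the range itself.
struct TokenRange
{
    Token[] tokens;
    @property bool empty() const { return tokens.length == 0; }
    @property Token front() const { return tokens[0]; }
    void popFront() { tokens = tokens[1 .. $]; }
}

static assert(isInputRange!TokenRange);

// Class range: implements std.range's InputRange!Token interface, so it
// can be used through a base-class reference (the OO model), at the
// cost of reference semantics and virtual calls.
class TokenStream : InputRange!Token
{
    private Token[] tokens;
    this(Token[] ts) { tokens = ts; }
    @property bool empty() { return tokens.length == 0; }
    @property Token front() { return tokens[0]; }
    Token moveFront() { return tokens[0]; }
    void popFront() { tokens = tokens[1 .. $]; }
    int opApply(scope int delegate(Token) dg)
    {
        foreach (t; tokens) if (auto r = dg(t)) return r;
        return 0;
    }
    int opApply(scope int delegate(size_t, Token) dg)
    {
        foreach (i, t; tokens) if (auto r = dg(i, t)) return r;
        return 0;
    }
}
```

Note that a struct range can always be wrapped for OO consumers later (std.range provides `inputRangeObject` for exactly this), whereas a class cannot shed its reference semantics.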

>> * Also, is there a way to keep comments? Any code wanting to modify
>> the code might need them.
>> (edit: Ah, I see it: IterationStyle.IncludeComments)
>>
>> * I'd distinguish between standard comments and documentation comments. These are different beasts, to my eyes.
>
>
> The standard at http://dlang.org/lex.html doesn't differentiate between them. It's trivial to write a function that checks if a token starts with "///", "/**", or "/++" while iterating over the tokens.

Yes, but the standard was written with DMD in mind, and DMD has a different code path for generating comments.

It's your project, sure, but I'd appreciate tokens that differentiate
between the different kinds of comments.
Oh, and recognizing some inner Ddoc tokens, like the ---- delimiters for
documentation code blocks. That way, code inside documentation could use
the lexer too.

Pretty please?
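For what it's worth, the check Brian describes could be sketched as a small helper (hypothetical name, not part of his lexer), including the spec's edge case that `/**/` and `/++/` count as empty ordinary comments:

```d
import std.algorithm.searching : startsWith;

// Classify a comment token's text after the fact, since the lexer
// itself would emit every comment under one token type.
bool isDocComment(string comment)
{
    // Per http://dlang.org/lex.html, "/**/" and "/++/" are empty
    // ordinary comments, not documentation comments.
    if (comment == "/**/" || comment == "/++/")
        return false;
    return comment.startsWith("///")
        || comment.startsWith("/**")
        || comment.startsWith("/++");
}
```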


>> * I see Token has a startIndex member. Any reason not to have an endIndex member? Or can an end index always be deduced from startIndex and value.length?
>
>
> That's the idea.

Does it work for UTF-16 and UTF-32 strings?
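If the answer is yes, the deduction would look something like this (hypothetical field names). Note that both values are then measured in code units of the source encoding, which is exactly why the UTF-16/UTF-32 question matters:

```d
// Hypothetical Token shape: startIndex and value.length are both in
// code units of the source, so the end index falls out by addition --
// but only if the lexer counts offsets in the same units for UTF-8,
// UTF-16, and UTF-32 sources alike.
struct Token
{
    string value;      // slice of the source text
    size_t startIndex; // offset of the token's first code unit
}

size_t endIndex(Token t)
{
    return t.startIndex + t.value.length;
}
```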
January 27, 2013
On 2013-01-27 11:42, Brian Schott wrote:

>> * I see Token has a startIndex member. Any reason not to have an
>> endIndex member? Or can an end index always be deduced from
>> startIndex and value.length?
>
> That's the idea.

Always good to try and minimize the size of a token when the lexer should output thousands of tokens per second.

-- 
/Jacob Carlborg
January 27, 2013
Let's add another question:

what about treating q{ } token strings as... well, a list of tokens? IDEs would like this: no need to re-lex the string, the tokens are there directly.
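One hypothetical shape for that API (purely illustrative, not Brian's design) would be a token that carries its already-lexed contents:

```d
// A q{ } token string could expose the tokens it contains, so an IDE
// walks them directly instead of lexing the string a second time.
struct Token
{
    string value;
    Token[] subTokens; // non-empty only for q{ } token strings
}
```

An IDE could then check `tok.subTokens.length` and descend without a second lexing pass.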
January 27, 2013
On 2013-01-27 10:51, Brian Schott wrote:
> I'm writing a D lexer for possible inclusion in Phobos.
>
> DDOC:
> http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
> Code:
> https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d
>
>
> It's currently able to correctly syntax highlight all of Phobos, but
> does a fairly bad job at rejecting or notifying users/callers about
> invalid input.
>
> I'd like to hear arguments on the various ways to handle errors in the
> lexer. In a compiler it would be useful to throw an exception on finding
> something like a string literal that doesn't stop before EOF, but a text
> editor or IDE would probably want to be a bit more lenient. Maybe having
> it run-time (or compile-time configurable) like std.csv would be the
> best option here.
>
> I'm interested in ideas on the API design and other high-level issues at
> the moment. I don't consider this ready for inclusion. (The current
> module being reviewed for inclusion in Phobos is the new std.uni.)

How about changing the underlying type of TokenType to ushort, if all the members fit? Just to minimize the size of a token.
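A sketch of that suggestion (illustrative member names): D enums can choose their base type, and a `static assert` documents the size win.

```d
// Backing the token-kind enum with ushort costs 2 bytes per token
// instead of 4, provided the member count stays under 65_536.
enum TokenType : ushort
{
    identifier,
    intLiteral,
    comment, // ...and a couple of hundred more in a real D lexer
}

static assert(TokenType.sizeof == 2);
```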

-- 
/Jacob Carlborg
January 27, 2013
On 2013-01-27 12:51, Philippe Sigaud wrote:
> Let's add another question:
>
> what about treating q{ } token strings as... well, a list of tokens?
> IDEs would like this: no need to re-lex the string, the tokens are
> there directly.

Perhaps an option for this.

-- 
/Jacob Carlborg
January 27, 2013
Am Sun, 27 Jan 2013 12:38:33 +0100
schrieb Timon Gehr <timon.gehr@gmx.ch>:

> On 01/27/2013 11:42 AM, Brian Schott wrote:
> > On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
> >> * Having a range interface is good. Any reason why you made byToken a class and not a struct? Most (like, 99%) of ranges in Phobos are structs. Do you need reference semantics?
> >
> > It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code. ...
> 
> The lexer range must be a struct.
> 
> > ...
> >> * A rough estimate of number of tokens/s would be good (I know it'll vary). Walter seems to think if a lexer is not able to vomit thousands of tokens a second, then it's not good. On a related note, does your lexer have any problem with 10k+-line files?
> >
> > $ time dscanner --sloc ../phobos/std/datetime.d
> > 14950
> >
> > real    0m0.319s
> > user    0m0.313s
> > sys    0m0.006s
> >
> > $ time dmd -c ../phobos/std/datetime.d
> >
> > real    0m0.354s
> > user    0m0.318s
> > sys    0m0.036s
> >
> > Yes, I know that "time" is a terrible benchmarking tool, but they're fairly close for whatever that's worth.
> >
> 
> You are measuring lexing speed against compilation speed. A reasonably well performing lexer is around one order of magnitude faster on std.datetime. Maybe you should profile a little?
> 

Profiling is always a good idea, but to be fair: his dmd was probably
compiled with gcc and -O2 if it's a normal release build.
So for a fair comparison he should compile the D code with gdc, using
-O2 -release -fno-bounds-check and probably more flags.
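Independent of compiler flags, an in-process timer avoids charging the lexer for startup and, in dmd's case, for parsing and code generation. A hypothetical harness (the `lexAll` delegate stands in for a real lexing loop):

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Time only the work inside the delegate, unlike `time`, which also
// measures process startup and everything else the compiler does.
void benchLex(scope void delegate() lexAll)
{
    auto sw = StopWatch(AutoStart.yes);
    lexAll();
    sw.stop();
    writefln("lexed in %s msecs", sw.peek.total!"msecs");
}
```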
January 27, 2013
Am 27.01.2013 13:31, schrieb Johannes Pfau:
> Am Sun, 27 Jan 2013 12:38:33 +0100
> schrieb Timon Gehr <timon.gehr@gmx.ch>:
>
>> On 01/27/2013 11:42 AM, Brian Schott wrote:
>> > On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
>> >> * Having a range interface is good. Any reason why you made
>> >> byToken a class and not a struct? Most (like, 99%) of ranges in
>> >> Phobos are structs. Do you need reference semantics?
>> >
>> > It implements the InputRange interface from std.range so that users
>> > have a choice of using template constraints or the OO model in
>> > their code. ...
>>
>> The lexer range must be a struct.
>>
>> > ...
>> >> * A rough estimate of number of tokens/s would be good (I know
>> >> it'll vary). Walter seems to think if a lexer is not able to vomit
>> >> thousands of tokens a second, then it's not good. On a related
>> >> note, does your lexer have any problem with 10k+-line files?
>> >
>> > $ time dscanner --sloc ../phobos/std/datetime.d
>> > 14950
>> >
>> > real    0m0.319s
>> > user    0m0.313s
>> > sys    0m0.006s
>> >
>> > $ time dmd -c ../phobos/std/datetime.d
>> >
>> > real    0m0.354s
>> > user    0m0.318s
>> > sys    0m0.036s
>> >
>> > Yes, I know that "time" is a terrible benchmarking tool, but they're
>> > fairly close for whatever that's worth.
>> >
>>
>> You are measuring lexing speed against compilation speed. A
>> reasonably well performing lexer is around one order of magnitude
>> faster on std.datetime. Maybe you should profile a little?
>>
>
> Profiling is always a good idea, but to be fair: His dmd was probably
> compiled with gcc and -O2 if it's a normal release build.
> So to compare that he should use gdc, -O2 -release -fno-bounds-check
> and probably more flags to compile the d code.
>

That makes no sense. We need a tiny piece of benchmark code inside the dmd frontend (and gdc); those results are the only reliable/comparable benchmark.

Does someone know a place in the dmd frontend code where such benchmarking could take place?
January 27, 2013
> That makes no sense. We need a tiny piece of benchmark code inside
> the dmd frontend (and gdc); those results are the only
> reliable/comparable benchmark.
>
> Does someone know a place in the dmd frontend code where such
> benchmarking could take place?
>

The dmd lexer is in

https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c

with a small unit test.
January 27, 2013
On 1/27/2013 1:51 AM, Brian Schott wrote:
> I'm interested in ideas on the API design and other high-level issues at the
> moment. I don't consider this ready for inclusion. (The current module being
> reviewed for inclusion in Phobos is the new std.uni.)

Just a quick comment: byToken() should not accept a filename. Its input should be via an InputRange, not a file.
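A sketch of that signature (the body is a stand-in that merely splits on whitespace, not a real D lexer): byToken is generic over any forward range of characters, and opening the file becomes the caller's job.

```d
import std.algorithm : filter, splitter;
import std.range : ElementType, isForwardRange;
import std.uni : isWhite;

// byToken accepts any range of characters rather than a filename; the
// placeholder body splits on whitespace and drops empty fragments.
auto byToken(R)(R source)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    return source.splitter!isWhite.filter!(word => !word.empty);
}
```

A caller who does want a file could read it into a buffer first (e.g. with `std.file.readText`) and pass that in, keeping the lexer itself I/O-free.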
January 27, 2013
On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
> Walter seems to think if a lexer is not able to vomit thousands
> of tokens a second, then it's not good.

Speed is critical for a lexer.

This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token is not going to do very well at all.
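One common way to get there, sketched under the assumption that the whole source sits in one buffer: make each token's text a slice of that buffer, so nothing is copied in the hot loop.

```d
// Token text borrows from the source buffer; taking a slice allocates
// nothing, unlike building a fresh string per token.
struct Token
{
    string value; // slice into the source, not a heap copy
}

Token sliceToken(string source, size_t start, size_t end)
{
    return Token(source[start .. end]);
}
```

Since D array slices share memory with their parent, the token's `value.ptr` points straight into the source buffer, which is easy to verify and is what makes the per-token cost allocation-free.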