January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Brian Schott

On Sun, Jan 27, 2013 at 11:42 AM, Brian Schott <briancschott@gmail.com> wrote:

> On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
>> * Having a range interface is good. Any reason why you made byToken a class and not a struct? Most (like, 99%) of ranges in Phobos are structs. Do you need reference semantics?
>
> It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code.

Hmm. You're the first person I've seen pushing this. Personally, I'd go for a struct, if only because I don't need reference semantics.

>> * Also, is there a way to keep comments? Any code wanting to modify the code might need them.
>> (edit: Ah, I see it: IterationStyle.IncludeComments)
>>
>> * I'd distinguish between standard comments and documentation comments. These are different beasts, to my eyes.
>
> The standard at http://dlang.org/lex.html doesn't differentiate between them. It's trivial to write a function that checks if a token starts with "///", "/**", or "/++" while iterating over the tokens.

Yes, but the standard lexer was written with DMD in mind, and DMD has a different code path for generating comments. It's your project, sure, but I'd appreciate tokens differentiating between the many kinds of comments. Oh, and recognizing some inner Ddoc tokens, like ---- delimiters for documentation code blocks. That way, code-in-doc could use the lexer too. Pretty please?

>> * I see Token has a startIndex member. Any reason not to have an endIndex member? Or can an end index always be deduced from startIndex and value.length?
>
> That's the idea.

Does it work for UTF-16 and UTF-32 strings?
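[Editor's note: a minimal sketch of the struct-versus-class trade-off discussed above. The names `ByToken` and the `Token` layout are illustrative, not the module's actual API; `inputRangeObject` from std.range is how a struct range can still serve users who want the OO `InputRange` interface.]

```d
import std.range : isInputRange, inputRangeObject;

struct Token { string value; size_t startIndex; }

// A struct-based token range: value semantics, no heap allocation
// for the range object itself, usable with template constraints.
struct ByToken
{
    string source;   // remaining input; tokens would slice into it
    Token current;

    @property bool empty() const { return source.length == 0; }
    @property Token front() { return current; }
    void popFront() { /* lex the next token from 'source' */ }
}

static assert(isInputRange!ByToken);

// Callers who prefer the OO InputRange interface can still get one
// by wrapping the struct, so the library need not commit to a class:
void useObjectInterface(string src)
{
    auto objRange = inputRangeObject(ByToken(src));
}
```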
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Brian Schott

On 2013-01-27 11:42, Brian Schott wrote:
>> * I see Token has a startIndex member. Any reason not to have an
>> endIndex member? Or can an end index always be deduced from
>> startIndex and value.length?
>
> That's the idea.

Always good to try to minimize the size of a token when the lexer should output thousands of tokens per second.

-- 
/Jacob Carlborg
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Brian Schott

Let's add another question: what about treating q{ } token strings as... well, a list of tokens? IDEs would like this: no need to reparse the string, the tokens are there directly.
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Brian Schott

On 2013-01-27 10:51, Brian Schott wrote:
> I'm writing a D lexer for possible inclusion in Phobos.
>
> DDOC:
> http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
> Code:
> https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d
>
> It's currently able to correctly syntax highlight all of Phobos, but
> does a fairly bad job at rejecting or notifying users/callers about
> invalid input.
>
> I'd like to hear arguments on the various ways to handle errors in the
> lexer. In a compiler it would be useful to throw an exception on finding
> something like a string literal that doesn't stop before EOF, but a text
> editor or IDE would probably want to be a bit more lenient. Maybe having
> it run-time (or compile-time) configurable like std.csv would be the
> best option here.
>
> I'm interested in ideas on the API design and other high-level issues at
> the moment. I don't consider this ready for inclusion. (The current
> module being reviewed for inclusion in Phobos is the new std.uni.)

How about changing the type of TokenType to ushort, if all members fit? Just to minimize the size of a token.

-- 
/Jacob Carlborg
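[Editor's note: a sketch of the compact token layout being suggested. The field names are illustrative, not the module's real definitions; the point is the ushort-backed tag and the deduced end index.]

```d
// An enum backed by ushort shrinks the tag from 4 bytes to 2, and the
// end index is recomputed on demand rather than stored.
enum TokenType : ushort
{
    identifier,
    keyword,
    comment,
    // ... the remaining D token kinds
}

struct Token
{
    string value;      // slice of the source text, not a copy
    uint startIndex;   // byte offset into the source
    TokenType type;    // 2 bytes instead of a 4-byte int

    @property size_t endIndex() const
    {
        return startIndex + value.length;   // deduced, never stored
    }
}
```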
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Philippe Sigaud

On 2013-01-27 12:51, Philippe Sigaud wrote:
> Let's add another question:
>
> what about treating q{ } token strings as... well, a list of tokens?
> IDEs would like this: no need to reparse the string, the tokens are
> there directly.

Perhaps an option for this.

-- 
/Jacob Carlborg
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Timon Gehr

Am Sun, 27 Jan 2013 12:38:33 +0100
schrieb Timon Gehr <timon.gehr@gmx.ch>:
> On 01/27/2013 11:42 AM, Brian Schott wrote:
> > On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
> >> * Having a range interface is good. Any reason why you made byToken a class and not a struct? Most (like, 99%) of range in Phobos are structs. Do you need reference semantics?
> >
> > It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code. ...
>
> The lexer range must be a struct.
>
> > ...
> >> * A rough estimate of number of tokens/s would be good (I know it'll vary). Walter seems to think if a lexer is not able to vomit thousands of tokens a seconds, then it's not good. On a related note, does your lexer have any problem with 10k+-lines files?
> >
> > $ time dscanner --sloc ../phobos/std/datetime.d
> > 14950
> >
> > real 0m0.319s
> > user 0m0.313s
> > sys 0m0.006s
> >
> > $ time dmd -c ../phobos/std/datetime.d
> >
> > real 0m0.354s
> > user 0m0.318s
> > sys 0m0.036s
> >
> > Yes, I know that "time" is a terrible benchmarking tool, but they're fairly close for whatever that's worth.
> >
>
> You are measuring lexing speed against compilation speed. A reasonably well performing lexer is around one order of magnitude faster on std.datetime. Maybe you should profile a little?
>
Profiling is always a good idea, but to be fair: his dmd was probably
compiled with gcc and -O2 if it's a normal release build.
So for a fair comparison he should compile the D code with gdc, using
-O2 -release -fno-bounds-check and probably more flags.
January 27, 2013 Re: Request for comments: std.d.lexer
Posted in reply to Johannes Pfau

Am 27.01.2013 13:31, schrieb Johannes Pfau:
> Am Sun, 27 Jan 2013 12:38:33 +0100
> schrieb Timon Gehr <timon.gehr@gmx.ch>:
>
>> [...]
>>
>> You are measuring lexing speed against compilation speed. A
>> reasonably well performing lexer is around one order of magnitude
>> faster on std.datetime. Maybe you should profile a little?
>
> Profiling is always a good idea, but to be fair: His dmd was probably
> compiled with gcc and -O2 if it's a normal release build.
> So to compare that he should use gdc, -O2 -release -fno-bounds-check
> and probably more flags to compile the d code.
That makes no sense - we need a tiny piece of benchmark code inside the dmd frontend (and gdc); those results are the only reliable/comparable benchmark.

Does someone know the place where such benchmarking could take place in the dmd frontend code?
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to dennis luehring

> That makes no sense - we need a tiny piece of benchmark code inside
> the dmd frontend (and gdc); those results are the only
> reliable/comparable benchmark.
>
> Does someone know the place where such benchmarking could take place
> in the dmd frontend code?

The dmd lexer is in https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c, with a small unit test.
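[Editor's note: a sketch of the in-process benchmark being asked for, which avoids the process-startup and compilation overhead that `time dmd` includes. The `byToken` call is the API proposed in this thread and is left commented out; `StopWatch` is from the standard library.]

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.file : readText;
import std.stdio : writefln;

void main()
{
    // Same input used in the timings quoted above.
    auto src = readText("../phobos/std/datetime.d");

    auto sw = StopWatch(AutoStart.yes);
    size_t count;
    // Drain the token range; this measures only the lexer, not I/O
    // or compiler startup:
    // foreach (tok; byToken(src)) ++count;
    sw.stop();

    writefln("%s tokens in %s ms", count, sw.peek.total!"msecs");
}
```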
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Brian Schott

On 1/27/2013 1:51 AM, Brian Schott wrote:
> I'm interested in ideas on the API design and other high-level issues at the
> moment. I don't consider this ready for inclusion. (The current module being
> reviewed for inclusion in Phobos is the new std.uni.)
Just a quick comment: byToken() should not accept a filename. Its input should be via an InputRange, not a file.
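[Editor's note: a sketch of the separation Walter is asking for. The constrained `byToken` signature and its placeholder body are hypothetical; the point is that the lexer consumes characters from any input range, and opening the file stays with the caller.]

```d
import std.file : readText;
import std.range : isInputRange;

// Hypothetical signature: the lexer is generic over any character
// input range and knows nothing about files.
auto byToken(R)(R range) if (isInputRange!R)
{
    // ... lexer implementation would go here ...
    return range;   // placeholder body for the sketch
}

void main()
{
    // File handling is the caller's job:
    auto fileTokens = byToken(readText("std/datetime.d"));

    // A string already in memory works through the same entry point:
    auto stringTokens = byToken("int x = 42;");
}
```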
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Philippe Sigaud

On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
> Walter seems to think if a lexer is not able to vomit thousands
> of tokens a seconds, then it's not good.
Speed is critical for a lexer.
This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token is not going to do very well at all.
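[Editor's note: one common way to meet the no-allocation-per-token requirement is for each token's value to be a slice of the original source buffer, so lexing a token copies nothing. A minimal sketch with illustrative names:]

```d
import std.ascii : isAlphaNum;

struct Token { string value; size_t startIndex; }

// Token.value aliases the source buffer: taking source[start .. pos]
// is pointer arithmetic, not a GC allocation.
Token lexWord(string source, ref size_t pos)
{
    immutable start = pos;
    while (pos < source.length
           && (isAlphaNum(source[pos]) || source[pos] == '_'))
        ++pos;
    return Token(source[start .. pos], start);
}

unittest
{
    size_t pos = 0;
    auto tok = lexWord("foo_bar baz", pos);
    assert(tok.value == "foo_bar");   // a slice, not a copy
    assert(pos == 7);                 // stopped at the space
}
```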
Copyright © 1999-2021 by the D Language Foundation