January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Brian Schott

On Sun, Jan 27, 2013 at 11:42 AM, Brian Schott <briancschott@gmail.com> wrote:

> On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
>> * Having a range interface is good. Any reason why you made byToken a class and not a struct? Most (like, 99%) of ranges in Phobos are structs. Do you need reference semantics?
>
> It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code.

Hmm. You're the first person I've seen pushing this. Personally, I'd go for a struct, if only because I don't need reference semantics.

>> * Also, is there a way to keep comments? Any code wanting to modify the code might need them.
>> (edit: Ah, I see it: IterationStyle.IncludeComments)
>>
>> * I'd distinguish between standard comments and documentation comments. These are different beasts, to my eyes.
>
> The standard at http://dlang.org/lex.html doesn't differentiate between them. It's trivial to write a function that checks if a token starts with "///", "/**", or "/++" while iterating over the tokens.

Yes, but the standard lexer was written with DMD in mind, and DMD has a different code path for generating comments. It's your project, sure, but I'd appreciate tokens differentiating between the many kinds of comments. Oh, and recognizing some inner Ddoc tokens, like ---- delimiters for documentation code blocks. That way, code-in-doc could use the lexer too. Pretty please?

>> * I see Token has a startIndex member. Any reason not to have an endIndex member? Or can an end index always be deduced from startIndex and value.length?
>
> That's the idea.

Does it work for UTF-16 and UTF-32 strings?
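[Editor's note: a minimal sketch of the struct-versus-class trade-off discussed above. The names `ByToken` and the `Token` layout are illustrative, not the module's actual API; `inputRangeObject` from std.range is how a struct range can still serve users who want the OO `InputRange` interface.]

```d
import std.range : isInputRange, inputRangeObject;

struct Token { string value; size_t startIndex; }

// A struct-based token range: value semantics, no heap allocation
// for the range object itself, usable with template constraints.
struct ByToken
{
    string source;   // remaining input; tokens would slice into it
    Token current;

    @property bool empty() const { return source.length == 0; }
    @property Token front() { return current; }
    void popFront() { /* lex the next token from 'source' */ }
}

static assert(isInputRange!ByToken);

// Callers who prefer the OO InputRange interface can still get one
// by wrapping the struct, so the library need not commit to a class:
void useObjectInterface(string src)
{
    auto objRange = inputRangeObject(ByToken(src));
}
```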
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Brian Schott

On 2013-01-27 11:42, Brian Schott wrote:
>> * I see Token has a startIndex member. Any reason not to have an
>> endIndex member? Or can an end index always be deduced from
>> startIndex and value.length?
>
> That's the idea.

Always good to try to minimize the size of a token when the lexer should output thousands of tokens per second.

-- 
/Jacob Carlborg
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Brian Schott

Let's add another question: what about treating q{ } token strings as... well, a list of tokens? IDEs would like this: no need to reparse the string, the tokens are there directly.
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Brian Schott

On 2013-01-27 10:51, Brian Schott wrote:
> I'm writing a D lexer for possible inclusion in Phobos.
>
> DDOC:
> http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
> Code:
> https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d
>
> It's currently able to correctly syntax highlight all of Phobos, but
> does a fairly bad job at rejecting or notifying users/callers about
> invalid input.
>
> I'd like to hear arguments on the various ways to handle errors in the
> lexer. In a compiler it would be useful to throw an exception on finding
> something like a string literal that doesn't stop before EOF, but a text
> editor or IDE would probably want to be a bit more lenient. Maybe having
> it run-time (or compile-time) configurable like std.csv would be the
> best option here.
>
> I'm interested in ideas on the API design and other high-level issues at
> the moment. I don't consider this ready for inclusion. (The current
> module being reviewed for inclusion in Phobos is the new std.uni.)

How about changing the type of TokenType to ushort, if all members fit? Just to minimize the size of a token.

-- 
/Jacob Carlborg
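[Editor's note: a sketch of the compact token layout being suggested. The field names are illustrative, not the module's real definitions; the point is the ushort-backed tag and the deduced end index.]

```d
// An enum backed by ushort shrinks the tag from 4 bytes to 2, and the
// end index is recomputed on demand rather than stored.
enum TokenType : ushort
{
    identifier,
    keyword,
    comment,
    // ... the remaining D token kinds
}

struct Token
{
    string value;      // slice of the source text, not a copy
    uint startIndex;   // byte offset into the source
    TokenType type;    // 2 bytes instead of a 4-byte int

    @property size_t endIndex() const
    {
        return startIndex + value.length;   // deduced, never stored
    }
}
```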
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Philippe Sigaud

On 2013-01-27 12:51, Philippe Sigaud wrote:
> Let's add another question:
>
> what about treating q{ } token strings as... well, a list of tokens?
> IDEs would like this: no need to reparse the string, the tokens are
> there directly.

Perhaps an option for this.

-- 
/Jacob Carlborg
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Timon Gehr

Am Sun, 27 Jan 2013 12:38:33 +0100
schrieb Timon Gehr <timon.gehr@gmx.ch>:
> On 01/27/2013 11:42 AM, Brian Schott wrote:
> > On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
> >> * Having a range interface is good. Any reason why you made byToken a class and not a struct? Most (like, 99%) of range in Phobos are structs. Do you need reference semantics?
> >
> > It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code. ...
>
> The lexer range must be a struct.
>
> > ...
> >> * A rough estimate of number of tokens/s would be good (I know it'll vary). Walter seems to think if a lexer is not able to vomit thousands of tokens a seconds, then it's not good. On a related note, does your lexer have any problem with 10k+-lines files?
> >
> > $ time dscanner --sloc ../phobos/std/datetime.d
> > 14950
> >
> > real 0m0.319s
> > user 0m0.313s
> > sys 0m0.006s
> >
> > $ time dmd -c ../phobos/std/datetime.d
> >
> > real 0m0.354s
> > user 0m0.318s
> > sys 0m0.036s
> >
> > Yes, I know that "time" is a terrible benchmarking tool, but they're fairly close for whatever that's worth.
> >
>
> You are measuring lexing speed against compilation speed. A reasonably well performing lexer is around one order of magnitude faster on std.datetime. Maybe you should profile a little?
>
Profiling is always a good idea, but to be fair: his dmd was probably
compiled with gcc and -O2 if it's a normal release build.
So for a fair comparison he should compile the D code with gdc, using
-O2 -release -fno-bounds-check and probably more flags.
January 27, 2013 Re: Request for comments: std.d.lexer
Posted in reply to Johannes Pfau

Am 27.01.2013 13:31, schrieb Johannes Pfau:
> Am Sun, 27 Jan 2013 12:38:33 +0100
> schrieb Timon Gehr <timon.gehr@gmx.ch>:
>
>> [...]
>>
>> You are measuring lexing speed against compilation speed. A
>> reasonably well performing lexer is around one order of magnitude
>> faster on std.datetime. Maybe you should profile a little?
>
> Profiling is always a good idea, but to be fair: His dmd was probably
> compiled with gcc and -O2 if it's a normal release build.
> So to compare that he should use gdc, -O2 -release -fno-bounds-check
> and probably more flags to compile the d code.
That makes no sense - we need a tiny piece of benchmark code inside the dmd frontend (and gdc); those results are the only reliable/comparable benchmark.

Does someone know the place where such benchmarking could take place in the dmd frontend code?
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to dennis luehring

> That makes no sense - we need a tiny piece of benchmark code inside
> the dmd frontend (and gdc); those results are the only
> reliable/comparable benchmark.
>
> Does someone know the place where such benchmarking could take place
> in the dmd frontend code?

The dmd lexer is in https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c, with a small unit test.
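[Editor's note: a sketch of the in-process benchmark being asked for, which avoids the process-startup and compilation overhead that `time dmd` includes. The `byToken` call is the API proposed in this thread and is left commented out; `StopWatch` is from the standard library.]

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.file : readText;
import std.stdio : writefln;

void main()
{
    // Same input used in the timings quoted above.
    auto src = readText("../phobos/std/datetime.d");

    auto sw = StopWatch(AutoStart.yes);
    size_t count;
    // Drain the token range; this measures only the lexer, not I/O
    // or compiler startup:
    // foreach (tok; byToken(src)) ++count;
    sw.stop();

    writefln("%s tokens in %s ms", count, sw.peek.total!"msecs");
}
```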
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Brian Schott

On 1/27/2013 1:51 AM, Brian Schott wrote:
> I'm interested in ideas on the API design and other high-level issues at the
> moment. I don't consider this ready for inclusion. (The current module being
> reviewed for inclusion in Phobos is the new std.uni.)
Just a quick comment: byToken() should not accept a filename. Its input should be via an InputRange, not a file.
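[Editor's note: a sketch of the separation Walter is asking for. The constrained `byToken` signature and its placeholder body are hypothetical; the point is that the lexer consumes characters from any input range, and opening the file stays with the caller.]

```d
import std.file : readText;
import std.range : isInputRange;

// Hypothetical signature: the lexer is generic over any character
// input range and knows nothing about files.
auto byToken(R)(R range) if (isInputRange!R)
{
    // ... lexer implementation would go here ...
    return range;   // placeholder body for the sketch
}

void main()
{
    // File handling is the caller's job:
    auto fileTokens = byToken(readText("std/datetime.d"));

    // A string already in memory works through the same entry point:
    auto stringTokens = byToken("int x = 42;");
}
```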
January 27, 2013 Re: Request for comments: std.d.lexer

Posted in reply to Philippe Sigaud

On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
> Walter seems to think if a lexer is not able to vomit thousands
> of tokens a seconds, then it's not good.
Speed is critical for a lexer.
This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token is not going to do very well at all.
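[Editor's note: one common way to meet the no-allocation-per-token requirement is for each token's value to be a slice of the original source buffer, so lexing a token copies nothing. A minimal sketch with illustrative names:]

```d
import std.ascii : isAlphaNum;

struct Token { string value; size_t startIndex; }

// Token.value aliases the source buffer: taking source[start .. pos]
// is pointer arithmetic, not a GC allocation.
Token lexWord(string source, ref size_t pos)
{
    immutable start = pos;
    while (pos < source.length
           && (isAlphaNum(source[pos]) || source[pos] == '_'))
        ++pos;
    return Token(source[start .. pos], start);
}

unittest
{
    size_t pos = 0;
    auto tok = lexWord("foo_bar baz", pos);
    assert(tok.value == "foo_bar");   // a slice, not a copy
    assert(pos == 7);                 // stopped at the space
}
```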
Copyright © 1999-2021 by the D Language Foundation