August 01, 2012 What would need to be done to get sdc.lexer to std.lexer quality? | ||||
|---|---|---|---|---|
| ||||
Okay, so I've seen several comments from several people regarding the need for a D lexer in Phobos. I figure I should contribute something to this NG other than misdirected anger, so here it is.

SDC has a lexer, and it's pretty much complete. It handles unicode and script lines, and #line and friends.

It's currently MIT, but I've been meaning to relicense to Boost, so that's not an issue. It used to have some number lexing code stolen from DMD, but I removed that when we moved to MIT.

https://github.com/bhelyer/SDC/blob/master/src/sdc/lexer.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/source.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/tokenstream.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/token.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/location.d

TokenStream would need to become a range; name and specific interface details requested from you fine people.

opKirbyRape will, with great regret, have to go.

Documentation will need to be buffed, and it'll need to be renamed into Phobos style.

I'm willing to do the work if people think it's worthwhile, and I can get some directed suggestions.

-Bernard.
| ||||
August 01, 2012 Re: What would need to be done to get sdc.lexer to std.lexer quality? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Bernard Helyer | I have been informed that deadalnix, that wily Frenchman, has already built a range abstraction on top of it. So that's a plus. | |||
August 01, 2012 Re: What would need to be done to get sdc.lexer to std.lexer quality? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Bernard Helyer | On Wednesday, 1 August 2012 at 23:06:19 UTC, Bernard Helyer wrote:
> Okay, so I've seen several comments from several people
> regarding the need for a D lexer in Phobos. I figure
> I should contribute something to this NG other than
> misdirected anger, so here it is.
>
> SDC has a lexer, and it's pretty much complete. It handles
> unicode and script lines, and #line and friends.
>
> It's currently MIT, but I've been meaning to relicense to
> Boost, so that's not an issue. It used to have some number
> lexing code stolen from DMD, but I removed that when we moved
> to MIT.
>
> https://github.com/bhelyer/SDC/blob/master/src/sdc/lexer.d
> https://github.com/bhelyer/SDC/blob/master/src/sdc/source.d
> https://github.com/bhelyer/SDC/blob/master/src/sdc/tokenstream.d
> https://github.com/bhelyer/SDC/blob/master/src/sdc/token.d
> https://github.com/bhelyer/SDC/blob/master/src/sdc/location.d
>
> TokenStream would need to become a range, name and specific
> interface details requested from you fine people.
>
> opKirbyRape will, with great regret, have to go.
>
> Documentation will need to be buffed, and it'll need to be
> renamed into Phobos style.
>
> I'm willing to do the work if people think it's worthwhile,
> and I can get some directed suggestions.
>
> -Bernard.
Some of the other comments I brought up on IRC:
* Currently files are read in their entirety first, then parsed. It is worth exploring the idea of reading it in chunks lazily.
* The current result (TokenStream) is a wrapper over a GC-allocated array of Token class instances, each instance with its own GC allocation (new Token). It is worth exploring an alternative allocation strategy for the tokens.
There are a *lot* of little things that need to be done, but everything important is in place.
| |||
August 02, 2012 Re: What would need to be done to get sdc.lexer to std.lexer quality? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Bernard Helyer | On 02/08/2012 01:14, Bernard Helyer wrote:
> I have been informed that deadalnix, that wily Frenchman, has
> already built a range abstraction on top of it. So that's a
> plus.
It shouldn't be included in Phobos, but it can be useful for testing things during development.
| |||
August 02, 2012 Re: What would need to be done to get sdc.lexer to std.lexer quality? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Jakob Ovrum | On 8/1/2012 4:18 PM, Jakob Ovrum wrote:
> * Currently files are read in their entirety first, then parsed. It is worth
> exploring the idea of reading it in chunks lazily.
Using an input range will take care of that nicely.
> * The current result (TokenStream) is a wrapper over a GC-allocated array of
> Token class instances, each instance with its own GC allocation (new Token). It
> is worth exploring an alternative allocation strategy for the tokens.
That's just not going to produce a high performance lexer.
The way to do it is in the Lexer instance, have a value which is the current Token instance. That way, in the normal case, one NEVER has to allocate a token instance.
Only when lookahead is done is storage allocation required, and that list should be held by Lexer and recycled as tokens get consumed. This is how the dmd lexer works.
Doing one allocation per token is never going to scale to trying to shove millions upon millions of lines of code through it.
| |||
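A minimal sketch of the design described in the post above, not the dmd or SDC implementation: the lexer is an input range, holds the current token as a plain value, and only appends to a lookahead buffer when `peek` is called. The names `Token`, `TokenType`, `Lexer`, `peek` and `scan` are hypothetical, the token rules are deliberately tiny, and the input is assumed to be a whole `string` rather than a lazily read range, to keep the example short.

```d
import std.ascii : isAlpha, isAlphaNum, isDigit, isWhite;
import std.range.primitives : isInputRange;

enum TokenType { identifier, number, symbol, end }

// Hypothetical token layout; a real D token carries more information.
struct Token
{
    TokenType type;
    string text; // slice of the source, nothing is copied
}

struct Lexer
{
    private string src;
    private size_t pos;
    private Token current;     // the single "current token" value
    private Token[] lookahead; // only filled by peek()
    private size_t laHead;     // first unconsumed lookahead entry

    this(string source)
    {
        src = source;
        current = scan();
    }

    @property bool empty() const { return current.type == TokenType.end; }
    @property Token front() const { return current; }

    void popFront()
    {
        if (laHead < lookahead.length)
        {
            current = lookahead[laHead++];
            if (laHead == lookahead.length)
            {
                // Buffer drained: keep its storage for later peeks.
                lookahead = lookahead[0 .. 0];
                lookahead.assumeSafeAppend();
                laHead = 0;
            }
        }
        else
            current = scan(); // normal case: no allocation at all
    }

    // Look n tokens past front (n >= 1). Only this path can allocate.
    Token peek(size_t n)
    {
        while (lookahead.length - laHead < n)
            lookahead ~= scan();
        return lookahead[laHead + n - 1];
    }

    private Token scan()
    {
        while (pos < src.length && isWhite(src[pos]))
            ++pos;
        if (pos == src.length)
            return Token(TokenType.end, null);
        immutable start = pos;
        if (isAlpha(src[pos]) || src[pos] == '_')
        {
            while (pos < src.length && (isAlphaNum(src[pos]) || src[pos] == '_'))
                ++pos;
            return Token(TokenType.identifier, src[start .. pos]);
        }
        if (isDigit(src[pos]))
        {
            while (pos < src.length && isDigit(src[pos]))
                ++pos;
            return Token(TokenType.number, src[start .. pos]);
        }
        ++pos; // anything else is a one-character symbol
        return Token(TokenType.symbol, src[start .. pos]);
    }
}

static assert(isInputRange!Lexer);
```

Because `Lexer` satisfies `isInputRange`, it can be consumed with `foreach (tok; Lexer(source))` or passed to range algorithms; a parser that never calls `peek` never triggers a token allocation in this sketch.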
August 02, 2012 Re: What would need to be done to get sdc.lexer to std.lexer quality? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Thursday, 2 August 2012 at 04:38:11 UTC, Walter Bright wrote:
> That's just not going to produce a high performance lexer.
>
> The way to do it is in the Lexer instance, have a value which is the current Token instance. That way, in the normal case, one NEVER has to allocate a token instance.
>
> Only when lookahead is done is storage allocation required, and that list should be held by Lexer and recycled as tokens get consumed. This is how the dmd lexer works.
>
> Doing one allocation per token is never going to scale to trying to shove millions upon millions of lines of code through it.
Which is exactly why I'm pointing out the current, poor approach. Having a single array with contiguous Tokens for lookahead is completely doable even when Token is a class, with some simple GC.malloc and emplace composition. I think SDC's Token class is too big to be useful as a struct; you'd pretty much never want to pass it anywhere by value.
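A rough sketch of the "GC.malloc and emplace composition" idea, assuming a simplified, hypothetical `Token` class (SDC's real one has more fields) and a hypothetical helper `makeTokens`. One GC block holds all instances back to back; a second allocation holds the references. Finalizers are not run for objects constructed this way.

```d
import core.memory : GC;
import std.conv : emplace;
import std.traits : classInstanceAlignment;

// Hypothetical stand-in for a token class.
class Token
{
    int type;
    string text;

    this(int type, string text)
    {
        this.type = type;
        this.text = text;
    }
}

// Constructs one Token per entry of `texts` inside a single GC block.
Token[] makeTokens(string[] texts)
{
    enum size = __traits(classInstanceSize, Token);
    enum alignment = classInstanceAlignment!Token;
    // Round the instance size up so every slot stays aligned.
    enum stride = (size + alignment - 1) / alignment * alignment;

    // Default attributes: the block is scanned, so `text` stays alive.
    auto base = cast(ubyte*) GC.malloc(stride * texts.length);
    auto refs = new Token[texts.length];
    foreach (i, text; texts)
    {
        void[] slot = base[i * stride .. (i + 1) * stride];
        refs[i] = emplace!Token(slot, 0, text);
    }
    return refs;
}
```

This replaces one allocation per token with two allocations per batch, while callers still see ordinary `Token` references.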
| |||
August 02, 2012 Re: What would need to be done to get sdc.lexer to std.lexer quality? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Jakob Ovrum | On 8/1/2012 10:31 PM, Jakob Ovrum wrote:
> On Thursday, 2 August 2012 at 04:38:11 UTC, Walter Bright wrote:
>> That's just not going to produce a high performance lexer.
>>
>> The way to do it is in the Lexer instance, have a value which is the current
>> Token instance. That way, in the normal case, one NEVER has to allocate a
>> token instance.
>>
>> Only when lookahead is done is storage allocation required, and that list
>> should be held by Lexer and recycled as tokens get consumed. This is how the
>> dmd lexer works.
>>
>> Doing one allocation per token is never going to scale to trying to shove
>> millions upon millions of lines of code through it.
>
> Which is exactly why I'm pointing out the current, poor approach. Having a
> single array with contiguous Tokens for lookahead is completely doable even when
> Token is a class with some simple GC.malloc and emplace composition. I think
> SDC's Token class is too big to be useful as a struct, you'd pretty much never
> want to pass it anywhere by value.
Using a class implies an extra level of indirection. The other issue is that the only point of using a class is if you're going to derive from it and override its methods. I don't see that for a Token.
Use pass-by-ref for the Token.
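As an illustration of this suggestion, here is a small sketch with a struct token stored inline in the lexer and handed to helpers by `ref`. The field layout and the names `Location`, `Lexer.current`, `isIdentifier` and `retag` are assumptions made for the example, not SDC's or dmd's actual definitions.

```d
enum TokenType { identifier, number, symbol }

// Hypothetical location and token layouts.
struct Location
{
    string filename;
    uint line;
    uint column;
}

struct Token
{
    TokenType type;
    string value;
    Location location;
}

struct Lexer
{
    Token current; // lives inside the Lexer itself, no separate 'new Token'
}

// Reads the token through a reference: the struct is not copied.
bool isIdentifier(ref const Token t)
{
    return t.type == TokenType.identifier;
}

// Writes through the reference, so the caller's token changes.
void retag(ref Token t, TokenType type)
{
    t.type = type;
}
```

With a class, the lexer would hold a pointer to a separately allocated object; here the only pointer involved is the temporary one created for the duration of a `ref` call.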
| |||
August 02, 2012 Re: What would need to be done to get sdc.lexer to std.lexer quality? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Thursday, 2 August 2012 at 05:36:37 UTC, Walter Bright wrote:
> Using a class implies an extra level of indirection, […]
> Use pass-by-ref for the Token.
How is pass-by-ref not an extra level of indirection?
David
| |||
August 02, 2012 Re: What would need to be done to get sdc.lexer to std.lexer quality? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Jakob Ovrum | On 2012-08-02 07:31, Jakob Ovrum wrote:
> Which is exactly why I'm pointing out the current, poor approach. Having
> a single array with contiguous Tokens for lookahead is completely doable
> even when Token is a class with some simple GC.malloc and emplace
> composition. I think SDC's Token class is too big to be useful as a
> struct, you'd pretty much never want to pass it anywhere by value.
If you change Token to a struct, it takes 64 bytes on an LP64 platform. I don't know if that is too big to be passed around by value.
--
/Jacob Carlborg
| |||
August 02, 2012 Re: What would need to be done to get sdc.lexer to std.lexer quality? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Jacob Carlborg | On 2012-08-02 09:11, Jacob Carlborg wrote:
> If you change Token to a struct it takes 64 bytes on a LP64 platform. I
> don't know if that is too big to be passed around by value.
Just for comparison, the type used for tokens in Clang is only 24 bytes. The main reason is the small source location. It's only 32 bits wide; it uses a uint as some kind of offset or id.
http://clang.llvm.org/doxygen/classclang_1_1Token.html
http://clang.llvm.org/doxygen/classclang_1_1SourceLocation.html
--
/Jacob Carlborg
| |||
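To make the size argument concrete, here is a hedged sketch of a 32-bit source location in D. It is not Clang's actual encoding: `SourceLocation`, `LineTable`, `resolve` and `CompactToken` are hypothetical names, and the location is assumed to be a plain byte offset into a single source buffer, with line and column recovered on demand by binary search.

```d
// A location is just a 32-bit byte offset into the source text.
struct SourceLocation
{
    uint offset;
}

// Hypothetical compact token: 24 bytes on a 64-bit target.
struct CompactToken
{
    uint kind;
    SourceLocation loc;
    string text;
}

struct LineTable
{
    uint[] lineStarts; // offset of the first character of each line

    this(string src)
    {
        lineStarts = [0];
        foreach (i, char c; src)
            if (c == '\n')
                lineStarts ~= cast(uint)(i + 1);
    }

    // Computes the 1-based line and column of a location.
    void resolve(SourceLocation loc, out uint line, out uint column) const
    {
        // Binary search for the last line start that is <= loc.offset.
        size_t lo = 0, hi = lineStarts.length;
        while (hi - lo > 1)
        {
            immutable mid = lo + (hi - lo) / 2;
            if (lineStarts[mid] <= loc.offset)
                lo = mid;
            else
                hi = mid;
        }
        line = cast(uint)(lo + 1);
        column = loc.offset - lineStarts[lo] + 1;
    }
}
```

The trade-off is that line and column are no longer free to read from the token; they cost a lookup, which is usually only paid when a diagnostic is printed.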
Copyright © 1999-2021 by the D Language Foundation