What would need to be done to get sdc.lexer to std.lexer quality?
August 01, 2012
Okay, so I've seen several comments from several people
regarding the need for a D lexer in Phobos. I figure
I should contribute something to this NG other than
misdirected anger, so here it is.

SDC has a lexer, and it's pretty much complete. It handles
Unicode and script lines, and #line and friends.

It's currently MIT, but I've been meaning to relicense it to
Boost, so that's not an issue. It used to have some number
lexing code stolen from DMD, but I removed that when we moved
to MIT.

https://github.com/bhelyer/SDC/blob/master/src/sdc/lexer.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/source.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/tokenstream.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/token.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/location.d

TokenStream would need to become a range; name and specific
interface details are requested from you fine people.
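
For the sake of discussion, here is a minimal sketch of what a token
input range might look like; the names and layout are hypothetical
placeholders, not the actual SDC or Phobos interface:

import std.range : isInputRange;

// Placeholder token type, standing in for sdc.token.Token.
struct Token
{
    string value;
    // type, location, etc. would also live here
}

// A minimal forward-iterating view over an already-lexed buffer.
struct TokenRange
{
    private Token[] tokens;
    private size_t index;

    @property bool empty() const { return index >= tokens.length; }
    @property ref const(Token) front() const { return tokens[index]; }
    void popFront() { ++index; }
}

static assert(isInputRange!TokenRange);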

opKirbyRape will, with great regret, have to go.

Documentation will need to be buffed, and it'll need to be
renamed into Phobos style.

I'm willing to do the work if people think it's worthwhile,
and I can get some directed suggestions.

-Bernard.
August 01, 2012
I have been informed that deadalnix, that wily Frenchman, has
already built a range abstraction on top of it. So that's a
plus.
August 01, 2012
On Wednesday, 1 August 2012 at 23:06:19 UTC, Bernard Helyer wrote:
> [...]

Some of the other comments I brought up on IRC:

 * Currently files are read in their entirety first, then parsed. It is worth exploring the idea of reading them in chunks lazily (a rough sketch follows below).
 * The current result (TokenStream) is a wrapper over a GC-allocated array of Token class instances, each instance with its own GC allocation (new Token). It is worth exploring an alternative allocation strategy for the tokens.

There are a *lot* of little things that need to be done, but everything important is in place.
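
To make the first point concrete, here is a rough sketch (not SDC's
current code) of reading a source file lazily in chunks; the 4 KiB
size and the file name are arbitrary, and real code would still need
proper UTF decoding before lexing:

import std.algorithm : joiner;
import std.stdio : File, writeln;

void main()
{
    // Read lazily, one 4 KiB chunk at a time, instead of slurping the
    // whole file up front; joiner flattens the chunks into one lazy
    // range of bytes that a lexer could consume incrementally.
    auto source = File("example.d").byChunk(4096).joiner;

    size_t bytes;
    foreach (ubyte b; source)
        ++bytes; // a lexer would scan characters/tokens here instead

    writeln(bytes, " bytes read lazily");
}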


August 02, 2012
On 02/08/2012 01:14, Bernard Helyer wrote:
> I have been informed that deadalnix, that wily Frenchman, has
> already built a range abstraction on top of it. So that's a
> plus.

It shouldn't be included in Phobos, but it can be useful for testing things during development.
August 02, 2012
On 8/1/2012 4:18 PM, Jakob Ovrum wrote:
>   * Currently files are read in their entirety first, then parsed. It is worth
> exploring the idea of reading it in chunks lazily.

Using an input range will take care of that nicely.

>   * The current result (TokenStream) is a wrapper over a GC-allocated array of
> Token class instances, each instance with its own GC allocation (new Token). It
> is worth exploring an alternative allocation strategy for the tokens.

That's just not going to produce a high performance lexer.

The way to do it is to have, in the Lexer instance, a value which is the current Token instance. That way, in the normal case, one NEVER has to allocate a token instance.

Only when lookahead is done is storage allocation required, and that list should be held by Lexer and recycled as tokens get consumed. This is how the dmd lexer works.

Doing one allocation per token is never going to scale to trying to shove millions upon millions of lines of code through it.
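
As a rough illustration of that design (invented names, not the dmd
lexer's actual code): the lexer holds the current token by value and
keeps a small, reused buffer for lookahead, so steady-state lexing
allocates nothing.

struct Token
{
    int kind;
    // location, spelling, value, ... would also live here
}

struct Lexer
{
    Token current;       // the token most callers inspect; no allocation
    Token[] lookahead;   // reused buffer, only grows when peeking far ahead
    size_t lookaheadCount;

    // Advance to the next token, consuming any previously peeked one.
    void advance()
    {
        if (lookaheadCount)
        {
            current = lookahead[0];
            foreach (i; 1 .. lookaheadCount)
                lookahead[i - 1] = lookahead[i];
            --lookaheadCount;
        }
        else
        {
            current = scanNextToken();
        }
    }

    // Peek n tokens ahead; the buffer is kept and recycled, so after the
    // first few calls no further allocation happens.
    ref Token peek(size_t n)
    {
        while (lookaheadCount <= n)
        {
            if (lookahead.length == lookaheadCount)
                lookahead.length = lookaheadCount + 4;
            lookahead[lookaheadCount++] = scanNextToken();
        }
        return lookahead[n];
    }

    private Token scanNextToken()
    {
        // Actual scanning of the source text is omitted in this sketch.
        return Token(0);
    }
}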
August 02, 2012
On Thursday, 2 August 2012 at 04:38:11 UTC, Walter Bright wrote:
> That's just not going to produce a high performance lexer.
>
> The way to do it is in the Lexer instance, have a value which is the current Token instance. That way, in the normal case, one NEVER has to allocate a token instance.
>
> Only when lookahead is done is storage allocation required, and that list should be held by Lexer and recycled as tokens get consumed. This is how the dmd lexer works.
>
> Doing one allocation per token is never going to scale to trying to shove millions upon millions of lines of code through it.

Which is exactly why I'm pointing out the current, poor approach. Having a single array with contiguous Tokens for lookahead is completely doable even when Token is a class, with some simple GC.malloc and emplace composition. I think SDC's Token class is too big to be useful as a struct; you'd pretty much never want to pass it anywhere by value.
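
For illustration, a sketch of what that could look like (the Token
class here is a stand-in, not SDC's): construct all the class
instances inside one contiguous GC-allocated block with emplace, so
there is one big allocation instead of one per token.

import core.memory : GC;
import std.conv : emplace;
import std.traits : classInstanceAlignment;

// Stand-in for SDC's Token class.
class Token
{
    int kind;
    this(int kind) { this.kind = kind; }
}

// Construct `count` Token instances inside one contiguous block instead
// of doing a separate `new Token` per token.  The array of references
// is still one separate allocation.
Token[] allocateTokens(size_t count)
{
    enum size = __traits(classInstanceSize, Token);
    enum alignment = classInstanceAlignment!Token;
    enum stride = (size + alignment - 1) / alignment * alignment;

    void[] block = GC.malloc(stride * count)[0 .. stride * count];

    auto tokens = new Token[count];
    foreach (i; 0 .. count)
        tokens[i] = emplace!Token(block[i * stride .. i * stride + size],
                                  cast(int) i);
    return tokens;
}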
August 02, 2012
On 8/1/2012 10:31 PM, Jakob Ovrum wrote:
> On Thursday, 2 August 2012 at 04:38:11 UTC, Walter Bright wrote:
>> That's just not going to produce a high performance lexer.
>>
>> The way to do it is in the Lexer instance, have a value which is the current
>> Token instance. That way, in the normal case, one NEVER has to allocate a
>> token instance.
>>
>> Only when lookahead is done is storage allocation required, and that list
>> should be held by Lexer and recycled as tokens get consumed. This is how the
>> dmd lexer works.
>>
>> Doing one allocation per token is never going to scale to trying to shove
>> millions upon millions of lines of code through it.
>
> Which is exactly why I'm pointing out the current, poor approach. Having a
> single array with contiguous Tokens for lookahead is completely doable even when
> Token is a class with some simple GC.malloc and emplace composition. I think
> SDC's Token class is too big to be useful as a struct, you'd pretty much never
> want to pass it anywhere by value.

Using a class implies an extra level of indirection, and the other issue is that the only point of using a class is if you're going to derive from it and override its methods. I don't see that for a Token.

Use pass-by-ref for the Token.

August 02, 2012
On Thursday, 2 August 2012 at 05:36:37 UTC, Walter Bright wrote:
> Using a class implies an extra level of indirection, […]
> Use pass-by-ref for the Token.

How is pass-by-ref not an extra level of indirection?

David
August 02, 2012
On 2012-08-02 07:31, Jakob Ovrum wrote:

> Which is exactly why I'm pointing out the current, poor approach. Having
> a single array with contiguous Tokens for lookahead is completely doable
> even when Token is a class with some simple GC.malloc and emplace
> composition. I think SDC's Token class is too big to be useful as a
> struct, you'd pretty much never want to pass it anywhere by value.

If you change Token to a struct, it takes 64 bytes on an LP64 platform. I don't know if that is too big to be passed around by value.

-- 
/Jacob Carlborg
August 02, 2012
On 2012-08-02 09:11, Jacob Carlborg wrote:

> If you change Token to a struct it takes 64 bytes on a LP64 platform. I
> don't know if that is too big to be passed around by value.

Just for comparison, the type used for tokens in Clang is only 24 bytes. The main reason is the small source location: it's only 32 bits wide, using a uint as some kind of offset or id.

http://clang.llvm.org/doxygen/classclang_1_1Token.html
http://clang.llvm.org/doxygen/classclang_1_1SourceLocation.html

-- 
/Jacob Carlborg
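
As a rough sketch of that Clang-style layout (field names invented
here, not Clang's or SDC's actual ones), packing the location into a
32-bit offset gets a token down to a handful of bytes; file, line and
column would be recovered on demand from a side table:

enum TokenType : ushort { identifier, intLiteral, eof /* ... */ }

struct CompactToken
{
    uint location;   // 32-bit offset/id into the source, like clang::SourceLocation
    uint length;     // length of the lexeme in the source buffer
    TokenType type;
    ushort flags;
}

static assert(CompactToken.sizeof == 12);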