January 31, 2013
On 2013-01-30 10:49, Brian Schott wrote:

> Results:
>
> $ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d
>
> ------------------------
> Total time (ms): 13861.8
> Repetitions    : 200
> Sample mode    : 69 (90 ocurrences)
> Median time    : 69.0745
> Avg time       : 69.3088
> Std dev.       : 0.670203
> Minimum        : 68.613
> Maximum        : 72.635
> 95% conf.int.  : [67.9952, 70.6223]  e = 1.31357
> 99% conf.int.  : [67.5824, 71.0351]  e = 1.72633
> EstimatedAvg95%: [69.2159, 69.4016]  e = 0.0928836
> EstimatedAvg99%: [69.1867, 69.4308]  e = 0.12207
>
> If my math is right, that means it's getting 4.9 million tokens/second
> now. According to Valgrind the only way to really improve things now is
> to require that the input to the lexer support slicing. (Remember the
> secret of Tango's XML parser...) The bottleneck is now on the calls to
> .idup to construct the token strings from slices of the buffer.
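The .idup bottleneck above is about copying each token's text out of the input buffer. An illustrative sketch (in Python rather than D, with made-up names, not the actual lexer's API) of the difference between materializing token strings and keeping (start, end) slices into the original buffer - in D, a slice of an immutable string really is zero-copy:

```python
# Sketch: tokens as fresh copies vs. tokens as offsets into the buffer.
# All names here are illustrative, not from the lexer under discussion.

SRC = "int x = 42 ;"

def lex_copying(src):
    # Each token's text is a fresh string (analogous to .idup in D).
    return list(src.split())

def lex_slicing(src):
    # Each token is just a (start, end) offset pair into src; the text
    # is only materialized on demand, so no copies happen while lexing.
    tokens, pos = [], 0
    for word in src.split():
        start = src.index(word, pos)
        end = start + len(word)
        tokens.append((start, end))
        pos = end
    return tokens

offsets = lex_slicing(SRC)
texts = [SRC[s:e] for s, e in offsets]  # materialize only when needed
```

(Note that Python string slicing still copies; the point is the API shape - a D lexer over a sliceable input can return views instead of duplicates.)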

How many tokens would that be in total?
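A back-of-envelope estimate, assuming the 4.9 million tokens/second figure and the ~69.3 ms average in the avgtime output above both describe one full lex of std/datetime.d:

```python
# Rough token count implied by the numbers quoted above.
tokens_per_second = 4.9e6          # Brian's stated throughput
avg_run_seconds = 69.3088 / 1000   # avgtime's "Avg time", ms -> s
approx_tokens = tokens_per_second * avg_run_seconds  # roughly 340k
```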

-- 
/Jacob Carlborg
January 31, 2013
On 2013-01-31 13:14, Jacob Carlborg wrote:
> Just thinking out loud here. Would it be possible to lex a file in parallel?
> Cutting it in half (or similar) and lex both pieces simultaneously in parallel.

Do you know where you can safely cut it without having it lexed beforehand? :)
January 31, 2013
Am 31.01.2013 13:14, schrieb Jacob Carlborg:
> On 2013-01-27 10:51, Brian Schott wrote:
>> I'm writing a D lexer for possible inclusion in Phobos.
>>
>> DDOC:
>> http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
>> Code:
>> https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d
>>
>>
>> It's currently able to correctly syntax highlight all of Phobos, but
>> does a fairly bad job at rejecting or notifying users/callers about
>> invalid input.
>>
>> I'd like to hear arguments on the various ways to handle errors in the
>> lexer. In a compiler it would be useful to throw an exception on finding
>> something like a string literal that doesn't stop before EOF, but a text
>> editor or IDE would probably want to be a bit more lenient. Maybe having
>> it run-time (or compile-time configurable) like std.csv would be the
>> best option here.
>>
>> I'm interested in ideas on the API design and other high-level issues at
>> the moment. I don't consider this ready for inclusion. (The current
>> module being reviewed for inclusion in Phobos is the new std.uni.)
>
> Just thinking out loud here. Would it be possible to lex a file in
> parallel? Cutting it in half (or similar) and lex both pieces
> simultaneously in parallel.

Why not? Only the symbols at the border would need to be "connected" then.

The question is: how many blocks (threads) make sense for 1, 2, 3, or 8 cores? That can also depend on the speed of the filesystem.

January 31, 2013
On 2013-01-31 13:34, FG wrote:

> Do you know where you can safely cut it without having it lexed
> beforehand? :)

I was thinking that myself. It would probably be possible to just cut it in the middle and then lex a few characters forwards and backwards until you get a valid token, and from that calculate the correct index at which to cut.
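The idea can be sketched as follows (a naive Python illustration, not real D token rules): cut near the midpoint, then scan forward to something that looks like a token boundary - here just whitespace, as a crude stand-in. It only works if the cut doesn't land inside a multi-line construct such as a string literal or comment.

```python
# Naive illustration of splitting a source buffer for parallel lexing:
# cut at the midpoint, then advance to the next whitespace so the
# second half starts on a (hoped-for) token boundary. Sketch only;
# it breaks if the midpoint lands inside a string literal or comment.

def split_for_lexing(src):
    mid = len(src) // 2
    # Scan forward from the midpoint until whitespace, a crude proxy
    # for "a place where a new token can start".
    while mid < len(src) and not src[mid].isspace():
        mid += 1
    return src[:mid], src[mid:]

first, second = split_for_lexing("int foo = bar + baz; auto s = `text`;")
```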

Although I have no idea how much trouble it would give you, or how much you would gain.

-- 
/Jacob Carlborg
January 31, 2013
On 2013-01-31 13:35, dennis luehring wrote:

> Why not? Only the symbols at the border would need to be "connected" then.
>
> The question is: how many blocks (threads) make sense for 1, 2, 3, or 8
> cores? That can also depend on the speed of the filesystem.

That would require some profiling to figure out.

-- 
/Jacob Carlborg
January 31, 2013
Am 31.01.2013 13:48, schrieb Jacob Carlborg:
> On 2013-01-31 13:35, dennis luehring wrote:
>
>> Why not? Only the symbols at the border would need to be "connected" then.
>>
>> The question is: how many blocks (threads) make sense for 1, 2, 3, or 8
>> cores? That can also depend on the speed of the filesystem.
>
> That would require some profiling to figure out.

I would say it "can" help a lot, so the design should be able to use split parts and combine the symbols at the borders.

File-based lexing should also be threadable, so that 16 files can be handled by my 16-core system fully in parallel :)

But it's also size-dependent - it makes no sense to split into parts that are too small; that could be counter-productive.
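The per-file parallelism is straightforward to sketch: hand each file's contents to a worker from a pool sized to the machine. Python here for illustration (`toy_lex` is a hypothetical stand-in for a real lexer, and note that in CPython the GIL would keep CPU-bound threads from actually running in parallel - the point is the pattern, which D's std.parallelism supports directly):

```python
# Sketch: lex many sources in parallel, one lexer instance per source.
from concurrent.futures import ThreadPoolExecutor
import os

def toy_lex(source):
    # Placeholder "lexer": just returns a token count.
    return len(source.split())

def lex_all(sources, workers=None):
    # Default the pool size to the number of cores on the machine.
    workers = workers or os.cpu_count()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(toy_lex, sources))

counts = lex_all(["int a;", "float b = 1.0;", "void main() {}"])
```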
January 31, 2013
On Thu, Jan 31, 2013 at 01:48:02PM +0100, Jacob Carlborg wrote:
> On 2013-01-31 13:34, FG wrote:
> 
> >Do you know where you can safely cut it without having it lexed beforehand? :)
> 
> I was thinking that myself. It would probably be possible to just cut it in the middle and then lex a few characters forwards and backwards until you get a valid token, and from that calculate the correct index at which to cut.
[...]

Doesn't work if the middle happens to be inside a string literal containing code. Esp. a q{} literal (you wouldn't be able to tell where it starts/ends without scanning the entire file, because the {}'s nest).
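A concrete illustration of why: skipping a D q{...} token string requires tracking brace nesting from its opening "q{", so a cut point in the middle of the literal is indistinguishable from ordinary code. A Python sketch (illustrative only - it ignores strings and comments nested inside the literal):

```python
# Skipping a D q{...} token string means counting nested braces from
# the opener. From a cut point inside it, the text looks like code.

def skip_token_string(src, start):
    # `start` must point at the "q{" opener; returns the index just
    # past the matching "}". Sketch only.
    assert src.startswith("q{", start)
    depth, i = 0, start + 1
    while i < len(src):
        if src[i] == "{":
            depth += 1
        elif src[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unterminated q{} literal")

code = 'auto s = q{ if (x) { y(); } };'
end = skip_token_string(code, code.index("q{"))
```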


T

-- 
Life is unfair. Ask too much from it, and it may decide you don't deserve what you have now either.
January 31, 2013
On 2013-01-31 16:54, H. S. Teoh wrote:

> Doesn't work if the middle happens to be inside a string literal
> containing code. Esp. a q{} literal (you wouldn't be able to tell where
> it starts/ends without scanning the entire file, because the {}'s nest).

That would be a problem.

-- 
/Jacob Carlborg
January 31, 2013
On 2013-01-31 13:57, dennis luehring wrote:

> I would say it "can" help a lot, so the design should be able to use
> split parts and combine the symbols at the borders.
>
> File-based lexing should also be threadable, so that 16 files can be
> handled by my 16-core system fully in parallel :)

This is kind of obvious, I think. That's why I started to think about this less obvious case.

> But it's also size-dependent - it makes no sense to split into parts
> that are too small; that could be counter-productive.

Of course not. That would require some profiling as well to find a sweet spot.

-- 
/Jacob Carlborg
February 01, 2013
On Thursday, 31 January 2013 at 12:48:03 UTC, Jacob Carlborg wrote:
> On 2013-01-31 13:34, FG wrote:
>
>> Do you know where you can safely cut it without having it lexed
>> beforehand? :)
>
> I was thinking that myself. It would probably be possible to just cut it in the middle and then lex a few characters forwards and backwards until you get a valid token, and from that calculate the correct index at which to cut.
>
> Although I have no idea how much trouble it would give you, or how much you would gain.

I don't think it's worth the complexity. You can lex two files in parallel with two lexer instances if you want to make things faster.