January 31, 2013
On 2013-01-30 10:49, Brian Schott wrote:

> Results:
>
> $ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d
>
> ------------------------
> Total time (ms): 13861.8
> Repetitions    : 200
> Sample mode    : 69 (90 ocurrences)
> Median time    : 69.0745
> Avg time       : 69.3088
> Std dev.       : 0.670203
> Minimum        : 68.613
> Maximum        : 72.635
> 95% conf.int.  : [67.9952, 70.6223]  e = 1.31357
> 99% conf.int.  : [67.5824, 71.0351]  e = 1.72633
> EstimatedAvg95%: [69.2159, 69.4016]  e = 0.0928836
> EstimatedAvg99%: [69.1867, 69.4308]  e = 0.12207
>
> If my math is right, that means it's getting 4.9 million tokens/second
> now. According to Valgrind the only way to really improve things now is
> to require that the input to the lexer support slicing. (Remember the
> secret of Tango's XML parser...) The bottleneck is now on the calls to
> .idup to construct the token strings from slices of the buffer.
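The .idup bottleneck above is about copying each token's text out of the input buffer. An illustrative sketch (in Python rather than D, with made-up names, not the actual lexer's API) of the difference between materializing token strings and keeping (start, end) slices into the original buffer - in D, a slice of an immutable string really is zero-copy:

```python
# Sketch: tokens as fresh copies vs. tokens as offsets into the buffer.
# All names here are illustrative, not from the lexer under discussion.

SRC = "int x = 42 ;"

def lex_copying(src):
    # Each token's text is a fresh string (analogous to .idup in D).
    return list(src.split())

def lex_slicing(src):
    # Each token is just a (start, end) offset pair into src; the text
    # is only materialized on demand, so no copies happen while lexing.
    tokens, pos = [], 0
    for word in src.split():
        start = src.index(word, pos)
        end = start + len(word)
        tokens.append((start, end))
        pos = end
    return tokens

offsets = lex_slicing(SRC)
texts = [SRC[s:e] for s, e in offsets]  # materialize only when needed
```

(Note that Python string slicing still copies; the point is the API shape - a D lexer over a sliceable input can return views instead of duplicates.)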

How many tokens would that be in total?
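A back-of-envelope estimate, assuming the 4.9 million tokens/second figure and the ~69.3 ms average in the avgtime output above both describe one full lex of std/datetime.d:

```python
# Rough token count implied by the numbers quoted above.
tokens_per_second = 4.9e6          # Brian's stated throughput
avg_run_seconds = 69.3088 / 1000   # avgtime's "Avg time", ms -> s
approx_tokens = tokens_per_second * avg_run_seconds  # roughly 340k
```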

-- 
/Jacob Carlborg
January 31, 2013
On 2013-01-31 13:14, Jacob Carlborg wrote:
> Just thinking out loud here. Would it be possible to lex a file in parallel?
> Cutting it in half (or similar) and lex both pieces simultaneously in parallel.

Do you know where you can safely cut it without having it lexed beforehand? :)
January 31, 2013
Am 31.01.2013 13:14, schrieb Jacob Carlborg:
> On 2013-01-27 10:51, Brian Schott wrote:
>> I'm writing a D lexer for possible inclusion in Phobos.
>>
>> DDOC:
>> http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
>> Code:
>> https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d
>>
>>
>> It's currently able to correctly syntax highlight all of Phobos, but
>> does a fairly bad job at rejecting or notifying users/callers about
>> invalid input.
>>
>> I'd like to hear arguments on the various ways to handle errors in the
>> lexer. In a compiler it would be useful to throw an exception on finding
>> something like a string literal that doesn't stop before EOF, but a text
>> editor or IDE would probably want to be a bit more lenient. Maybe having
>> it run-time (or compile-time configurable) like std.csv would be the
>> best option here.
>>
>> I'm interested in ideas on the API design and other high-level issues at
>> the moment. I don't consider this ready for inclusion. (The current
>> module being reviewed for inclusion in Phobos is the new std.uni.)
>
> Just thinking out loud here. Would it be possible to lex a file in
> parallel? Cutting it in half (or similar) and lex both pieces
> simultaneously in parallel.

Why not? Only the symbols at the border would need to be "connected" then.

The question is: how many blocks (threads) make sense for 1, 2, 3, or 8 cores? That can also depend on the speed of the filesystem.

January 31, 2013
On 2013-01-31 13:34, FG wrote:

> Do you know where you can safely cut it without having it lexed
> beforehand? :)

I was thinking that myself. It would probably be possible to just cut it in the middle and then lex a few characters forwards and backwards until you get a valid token, and from that calculate the correct index at which to cut.
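The idea can be sketched as follows (a naive Python illustration, not real D token rules): cut near the midpoint, then scan forward to something that looks like a token boundary - here just whitespace, as a crude stand-in. It only works if the cut doesn't land inside a multi-line construct such as a string literal or comment.

```python
# Naive illustration of splitting a source buffer for parallel lexing:
# cut at the midpoint, then advance to the next whitespace so the
# second half starts on a (hoped-for) token boundary. Sketch only;
# it breaks if the midpoint lands inside a string literal or comment.

def split_for_lexing(src):
    mid = len(src) // 2
    # Scan forward from the midpoint until whitespace, a crude proxy
    # for "a place where a new token can start".
    while mid < len(src) and not src[mid].isspace():
        mid += 1
    return src[:mid], src[mid:]

first, second = split_for_lexing("int foo = bar + baz; auto s = `text`;")
```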

Although I have no idea how much trouble it would give you, or how much you would gain.

-- 
/Jacob Carlborg
January 31, 2013
On 2013-01-31 13:35, dennis luehring wrote:

> Why not? Only the symbols at the border would need to be "connected" then.
>
> The question is: how many blocks (threads) make sense for 1, 2, 3, or 8
> cores? That can also depend on the speed of the filesystem.

That would require some profiling to figure out.

-- 
/Jacob Carlborg
January 31, 2013
Am 31.01.2013 13:48, schrieb Jacob Carlborg:
> On 2013-01-31 13:35, dennis luehring wrote:
>
>> Why not? Only the symbols at the border would need to be "connected" then.
>>
>> The question is: how many blocks (threads) make sense for 1, 2, 3, or 8
>> cores? That can also depend on the speed of the filesystem.
>
> That would require some profiling to figure out.

I would say it "can" help a lot, so the design should be able to use split parts and combine the symbols at the borders.

File-based lexing should also be threadable, so that 16 files can be handled by my 16-core system fully in parallel :)

But it's also size-dependent - it makes no sense to split into parts that are too small; that could be counter-productive.
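The per-file parallelism is straightforward to sketch: hand each file's contents to a worker from a pool sized to the machine. Python here for illustration (`toy_lex` is a hypothetical stand-in for a real lexer, and note that in CPython the GIL would keep CPU-bound threads from actually running in parallel - the point is the pattern, which D's std.parallelism supports directly):

```python
# Sketch: lex many sources in parallel, one lexer instance per source.
from concurrent.futures import ThreadPoolExecutor
import os

def toy_lex(source):
    # Placeholder "lexer": just returns a token count.
    return len(source.split())

def lex_all(sources, workers=None):
    # Default the pool size to the number of cores on the machine.
    workers = workers or os.cpu_count()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(toy_lex, sources))

counts = lex_all(["int a;", "float b = 1.0;", "void main() {}"])
```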
January 31, 2013
On Thu, Jan 31, 2013 at 01:48:02PM +0100, Jacob Carlborg wrote:
> On 2013-01-31 13:34, FG wrote:
> 
> >Do you know where you can safely cut it without having it lexed beforehand? :)
> 
> I was thinking that myself. It would probably be possible to just cut it in the middle and then lex a few characters forwards and backwards until you get a valid token, and from that calculate the correct index at which to cut.
[...]

Doesn't work if the middle happens to be inside a string literal containing code. Esp. a q{} literal (you wouldn't be able to tell where it starts/ends without scanning the entire file, because the {}'s nest).
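A concrete illustration of why: skipping a D q{...} token string requires tracking brace nesting from its opening "q{", so a cut point in the middle of the literal is indistinguishable from ordinary code. A Python sketch (illustrative only - it ignores strings and comments nested inside the literal):

```python
# Skipping a D q{...} token string means counting nested braces from
# the opener. From a cut point inside it, the text looks like code.

def skip_token_string(src, start):
    # `start` must point at the "q{" opener; returns the index just
    # past the matching "}". Sketch only.
    assert src.startswith("q{", start)
    depth, i = 0, start + 1
    while i < len(src):
        if src[i] == "{":
            depth += 1
        elif src[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unterminated q{} literal")

code = 'auto s = q{ if (x) { y(); } };'
end = skip_token_string(code, code.index("q{"))
```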


T

-- 
Life is unfair. Ask too much from it, and it may decide you don't deserve what you have now either.
January 31, 2013
On 2013-01-31 16:54, H. S. Teoh wrote:

> Doesn't work if the middle happens to be inside a string literal
> containing code. Esp. a q{} literal (you wouldn't be able to tell where
> it starts/ends without scanning the entire file, because the {}'s nest).

That would be a problem.

-- 
/Jacob Carlborg
January 31, 2013
On 2013-01-31 13:57, dennis luehring wrote:

> I would say it "can" help a lot, so the design should be able to use
> split parts and combine the symbols at the borders.
>
> File-based lexing should also be threadable, so that 16 files can be
> handled by my 16-core system fully in parallel :)

This is kind of obvious, I think. That's why I started to think about this less obvious case.

> But it's also size-dependent - it makes no sense to split into parts
> that are too small; that could be counter-productive.

Of course not. That would require some profiling as well to find a sweet spot.

-- 
/Jacob Carlborg
February 01, 2013
On Thursday, 31 January 2013 at 12:48:03 UTC, Jacob Carlborg wrote:
> On 2013-01-31 13:34, FG wrote:
>
>> Do you know where you can safely cut it without having it lexed
>> beforehand? :)
>
> I was thinking that myself. It would probably be possible to just cut it in the middle and then lex a few characters forwards and backwards until you get a valid token, and from that calculate the correct index at which to cut.
>
> Although I have no idea how much trouble it would give you, or how much you would gain.

I don't think it's worth the complexity. You can lex two files in parallel with two lexer instances if you want to make things faster.