January 28, 2013
On 1/28/2013 1:04 PM, Timon Gehr wrote:
> Maybe. I map identifiers to unique IDs later,

This makes symbol lookups very fast.
January 28, 2013
On Monday, 28 January 2013 at 22:00:30 UTC, Philippe Sigaud wrote:
>
> Owww, C++! My eyes, they're melting! My hairs, on fire! Why did you forfeit your immortal soul?


lol. Mainly because I got tired of running into compiler bugs, to be honest. :|


> Anyway, thanks. I'll look at it more closely. It seems to be the kind of C++ I can still read *and* understand.


Yeah, my goal was to make it mostly C-compatible, so I kept the C++-iness to a minimum.

It's more like "C"++  rather than "C++", I think. :P
January 30, 2013
On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:
> Better, but still slow.

I implemented the various suggestions from a past thread, made the lexer work only on ubyte[] (to avoid Phobos converting everything to dchar all the time), and gave the tokenizer instance a character buffer that it re-uses.

Results:

$ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d

------------------------
Total time (ms): 13861.8
Repetitions    : 200
Sample mode    : 69 (90 ocurrences)
Median time    : 69.0745
Avg time       : 69.3088
Std dev.       : 0.670203
Minimum        : 68.613
Maximum        : 72.635
95% conf.int.  : [67.9952, 70.6223]  e = 1.31357
99% conf.int.  : [67.5824, 71.0351]  e = 1.72633
EstimatedAvg95%: [69.2159, 69.4016]  e = 0.0928836
EstimatedAvg99%: [69.1867, 69.4308]  e = 0.12207

If my math is right, that means it's getting 4.9 million tokens/second now. According to Valgrind, the only way to really improve things now is to require that the input to the lexer support slicing. (Remember the secret of Tango's XML parser...) The bottleneck is now the calls to .idup that construct the token strings from slices of the buffer.
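
To illustrate the difference (a minimal sketch, not the actual dscanner code; Token and the two helper names are made up for illustration):

struct Token
{
    string value;   // the token's text
    size_t offset;  // position in the input
}

// Re-used internal buffer: the text has to be copied out, one GC allocation
// per token. This is the .idup that dominates the profile.
Token copyingToken(const(ubyte)[] buffer, size_t start, size_t end, size_t offset)
{
    return Token(cast(string) buffer[start .. end].idup, offset);
}

// Sliceable input kept in memory: the token just points into the source,
// no allocation at all (the Tango XML trick).
Token slicingToken(string source, size_t start, size_t end)
{
    return Token(source[start .. end], start);
}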

> I guess that at some point
>
> pure nothrow TokenType lookupTokenType(const string input)
>
> might become a bottleneck. (DMD does not generate near-optimal string switches, I think.)

Right now that's a fairly small box on KCachegrind.
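
For context, a trimmed-down sketch of the kind of string switch Timon is referring to (not the actual lexer.d code; the real function covers every keyword and the enum is much larger):

enum TokenType { identifier, if_, else_, while_, foreach_ }

pure nothrow TokenType lookupTokenType(const string input)
{
    switch (input)
    {
        case "if":      return TokenType.if_;
        case "else":    return TokenType.else_;
        case "while":   return TokenType.while_;
        case "foreach": return TokenType.foreach_;
        default:        return TokenType.identifier;
    }
}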

January 30, 2013
On 30.01.2013 10:49, Brian Schott wrote:
> On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:
>> Better, but still slow.
>
> I implemented the various suggestions from a past thread, made
> the lexer work only on ubyte[] (to avoid Phobos converting
> everything to dchar all the time), and gave the tokenizer instance
> a character buffer that it re-uses.
>
> Results:
>
> $ avgtime -q -r 200 ./dscanner --tokenCount
> ../phobos/std/datetime.d
>
> .....
>
> If my math is right, that means it's getting 4.9 million
> tokens/second now. According to Valgrind the only way to really
> improve things now is to require that the input to the lexer
> support slicing. (Remember the secret of Tango's XML parser...)
> The bottleneck is now on the calls to .idup to construct the
> token strings from slices of the buffer.

but you still need to compare that against the current dmd lexer - nothing else can serve as a benchmark reference
January 30, 2013
On 2013-01-30 10:49, Brian Schott wrote:
> If my math is right, that means it's getting 4.9 million tokens/second now.
> According to Valgrind the only way to really improve things now is to require
> that the input to the lexer support slicing. (Remember the secret of Tango's XML
> parser...) The bottleneck is now on the calls to .idup to construct the token
> strings from slices of the buffer.

Yep. I'm eager to see what timings you get with the whole file kept in memory and sliced, without copying token strings.
January 30, 2013
29-Jan-2013 01:04, Timon Gehr wrote:
> On 01/28/2013 12:59 PM, Dmitry Olshansky wrote:
>> 28-Jan-2013 15:48, Johannes Pfau wrote:
>>> ...
>>>
>>> But to be fair that doesn't fit ranges very well. If you don't want to
>>> do any allocation but still keep identifiers etc in memory this
>>> basically means you have to keep the whole source in memory and this is
>>> conceptually an array and not a range.
>>>
>>
>> Not the whole source but to construct a table of all identifiers. The
>> source is awfully redundant because of repeated identifiers, spaces,
>> comments and what not. The set of unique identifiers is rather small.
>>
>
> Source code is usually small. (Even std.datetime has 'only' 1.6 MB.) My
> own lexer-parser combination slices directly into the original source
> code, for every token, in order to remember the exact source location,
> and the last time I have measured, it ran faster than DMD's. I keep the
> source around for error reporting anyway:
>
> tt.d:27:5: error: no member 'n' for type 'A'
>      a.n=2;
>      ^~~
>
> Since the tokens point directly into the source code, it is not
> necessary to construct any other data structures in order to allow fast
> retrieval of the appropriate source code line.
>
> But it's clear that a general-purpose library might not want to impose
> this restriction on storage upon its clients. I think it is somewhat
> helpful for speed though. The other thing I do is buffering tokens in a
> contiguous ring buffer that grows if a lot of lookahead is requested.
>
>> I think the best course of action is to just provide a hook to trigger
>> on every identifier encountered. That could be as discussed earlier a
>> delegate.
>>
>> ...
>
> Maybe. I map identifiers to unique IDs later, in the identifier AST
> node constructor, though.


In the allocation scheme I proposed, that ID could be a 32-bit offset into the chunk of unique identifiers. The identifiers themselves would then have to be stored with sentinels (e.g. zeros) at the end, like in C. Then the 'map' process comes for free.

The only overhead is actually copying the bytes into this buffer, but that happens only once per unique identifier, and in exchange you get to lex anything, not only the 'whole module in one memory block' kind of input.
On the plus side it's also more cache-friendly.

Overall I think it's a good trade-off, but I never had the time to exercise it.
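
A minimal sketch of what I mean, assuming a single growing chunk plus a built-in associative array for the uniqueness check (all names here are made up, none of this is existing code):

struct IdentifierTable
{
    private char[] chunk;                 // all unique identifiers, 0-terminated
    private uint[const(char)[]] offsets;  // identifier -> offset of its first byte

    /// Returns the unique 32-bit ID for ident, copying it into the chunk
    /// the first time it is seen.
    uint intern(const(char)[] ident)
    {
        if (auto existing = ident in offsets)
            return *existing;

        auto id = cast(uint) chunk.length;
        chunk ~= ident;
        chunk ~= '\0';                    // C-style sentinel
        // Key the table on the copy that now lives inside the chunk, so the
        // caller's scratch buffer can be re-used freely. (Old key slices stay
        // correct by content even after the chunk reallocates.)
        offsets[chunk[id .. id + ident.length]] = id;
        return id;
    }

    /// Recover the identifier text from its ID without any allocation.
    const(char)[] text(uint id) const
    {
        size_t end = id;
        while (chunk[end] != '\0')
            ++end;
        return chunk[id .. end];
    }
}

The associative array is just the simplest way to check uniqueness; the point is that after intern() the 32-bit offset alone identifies the name.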

-- 
Dmitry Olshansky
January 30, 2013
30-Jan-2013 13:49, Brian Schott wrote:
> On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:
>> Better, but still slow.
>
> I implemented the various suggestions from a past thread, made the
> lexer work only on ubyte[] (to avoid Phobos converting everything to
> dchar all the time), and gave the tokenizer instance a character buffer
> that it re-uses.
>
> Results:
>
> $ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d
>
> ------------------------
> Total time (ms): 13861.8
> Repetitions    : 200
> Sample mode    : 69 (90 ocurrences)
> Median time    : 69.0745
> Avg time       : 69.3088
> Std dev.       : 0.670203
> Minimum        : 68.613
> Maximum        : 72.635
> 95% conf.int.  : [67.9952, 70.6223]  e = 1.31357
> 99% conf.int.  : [67.5824, 71.0351]  e = 1.72633
> EstimatedAvg95%: [69.2159, 69.4016]  e = 0.0928836
> EstimatedAvg99%: [69.1867, 69.4308]  e = 0.12207
>
> If my math is right, that means it's getting 4.9 million tokens/second
> now. According to Valgrind the only way to really improve things now is
> to require that the input to the lexer support slicing. (Remember the
> secret of Tango's XML parser...) The bottleneck is now on the calls to
> .idup to construct the token strings from slices of the buffer.
>

idup --> allocation

Instead, I suggest trying to allocate a big block of fixed size (say about 16-64K) upfront and copying identifiers into it one by one. When it fills up, just allocate another one and move on.

If an identifier is exceptionally long, you can just idup it as before.

This should bring down the number of calls to the GC significantly.
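
Roughly like this (an illustrative sketch only, not dscanner code; StringPool and put are made-up names):

struct StringPool
{
    enum blockSize = 64 * 1024;

    private char[] current;   // block currently being filled
    private size_t used;

    /// Returns a copy of s that outlives the lexer's scratch buffer.
    string put(const(char)[] s)
    {
        if (s.length > blockSize / 4)
            return s.idup;                    // rare: exceptionally long identifier

        if (current.length == 0 || used + s.length > current.length)
        {
            current = new char[](blockSize);  // one GC allocation per block
            used = 0;
        }

        auto slot = current[used .. used + s.length];
        slot[] = s[];                         // copy the bytes into the block
        used += s.length;
        // The pool never touches these bytes again, so handing them out as an
        // immutable string is safe in practice; the cast just hides that from
        // the type system.
        return cast(string) slot;
    }
}

The GC then sees one allocation per block instead of one per identifier.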

>> I guess that at some point
>>
>> pure nothrow TokenType lookupTokenType(const string input)
>>
>> might become a bottleneck. (DMD does not generate near-optimal string
>> switches, I think.)
>
> Right now that's a fairly small box on KCachegrind.
>


-- 
Dmitry Olshansky
January 30, 2013
On 2013-01-30 17:50, Dmitry Olshansky wrote:
> Instead, I suggest trying to allocate a big block of fixed size (say about
> 16-64K) upfront and copying identifiers into it one by one. When it fills
> up, just allocate another one and move on.

Yeah, similar to what I suggested, except that probably it would be good to also have a look-up structure for identifiers, so that only unique strings go into those blocks.

I wonder what would be a practical data structure for such look-ups:
A trie is great except for the implementation hurdle to keep it also in one or a few memory blocks to prevent frequent allocations.
A simpler approach would be to make it an array of (hash, string_slice) pairs, sorted by hash (of the identifier) - lots of string scanning though.

What do you think?

January 30, 2013
30-Jan-2013 21:21, FG wrote:
> On 2013-01-30 17:50, Dmitry Olshansky wrote:
>> Instead, I suggest trying to allocate a big block of fixed size (say
>> about 16-64K) upfront and copying identifiers into it one by one. When
>> it fills up, just allocate another one and move on.
>
> Yeah, similar to what I suggested, except that probably it would be good
> to also have a look-up structure for identifiers, so that only unique
> strings go into those blocks.
>
> I wonder what would be a practical data structure for such look-ups:
> A trie is great except for the implementation hurdle to keep it also in
> one or a few memory blocks to prevent frequent allocations.

Mm, a trie shouldn't have that many allocations. Typically each node is fixed-size. Also, I don't see keeping it in many memory blocks as a show-stopper. The trick is exactly the same as I mentioned above, but the blocks have to be quite a bit larger (say 8 times :) ).

Truth be told, the trie should be optimal for this, but a lot of effort is required to implement a fast trie compared to rolling out a simple hash table. That's something I hope to change once my Trie template is polished enough for a Phobos proposal.

> A simpler approach would be to make it an array of (hash, string_slice)
> pairs, sorted by hash (of the identifier) - lots of string scanning though.
>

Mm, since you need to find what's unique as you lex the file, I suspect that sorting is out of the window. I'd be modest and, for starters, just use a hash table and tune the hash function a bit.
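
For example, identifiers are short strings of mostly ASCII, so even a dead-simple FNV-1a over the bytes is a reasonable starting point to tune from (hypothetical helper, not taken from any existing lexer):

uint identifierHash(const(ubyte)[] ident) pure nothrow
{
    uint h = 2_166_136_261u;      // FNV offset basis
    foreach (b; ident)
    {
        h ^= b;
        h *= 16_777_619u;         // FNV prime
    }
    return h;
}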


-- 
Dmitry Olshansky
January 31, 2013
On 2013-01-27 10:51, Brian Schott wrote:
> I'm writing a D lexer for possible inclusion in Phobos.
>
> DDOC:
> http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
> Code:
> https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d
>
>
> It's currently able to correctly syntax highlight all of Phobos, but
> does a fairly bad job at rejecting or notifying users/callers about
> invalid input.
>
> I'd like to hear arguments on the various ways to handle errors in the
> lexer. In a compiler it would be useful to throw an exception on finding
> something like a string literal that doesn't stop before EOF, but a text
> editor or IDE would probably want to be a bit more lenient. Maybe having
> it run-time (or compile-time) configurable like std.csv would be the
> best option here.
>
> I'm interested in ideas on the API design and other high-level issues at
> the moment. I don't consider this ready for inclusion. (The current
> module being reviewed for inclusion in Phobos is the new std.uni.)

Just thinking out loud here. Would it be possible to lex a file in parallel? Cutting it in half (or similar) and lexing both pieces simultaneously.

-- 
/Jacob Carlborg