January 28, 2013
On 1/28/2013 1:05 AM, Jacob Carlborg wrote:
> I think slicing the buffer will force the whole buffer to remain in memory and
> not be collected by the GC. If one keeps the whole buffer in memory anyway this
> won't be a problem.

Well, the source buffer can be large, and will span a lot of memory cache lines, making accessing those slices slow.

January 28, 2013
On 28.01.2013 11:13, Walter Bright wrote:
> On 1/28/2013 1:30 AM, Mehrdad wrote:
>> I'm wondering if there's any way to make it independent of the code/grammar
>
> Just run under a profiler, then fix the hot spots.
>

Mehrdad is talking about benchmarking, not optimization :)
January 28, 2013
On 2013-01-28 11:14, Walter Bright wrote:

> Well, the source buffer can be large, and will span a lot of memory
> cache lines, making accessing those slices slow.

Would it be better to .dup the slices? Won't that be just as slow as using the appender?

-- 
/Jacob Carlborg
January 28, 2013
On 28-Jan-2013 14:39, Jacob Carlborg wrote:
> On 2013-01-28 11:14, Walter Bright wrote:
>
>> Well, the source buffer can be large, and will span a lot of memory
>> cache lines, making accessing those slices slow.
>
> Would it be better to .dup the slices? Won't that be just as slow as
> using the appender?
>

It would be better to lay these out compactly, one after another, in a separate buffer, skipping all of the extra slack in the source file(s).
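
Something along these lines, perhaps (a rough sketch with made-up names, not a finished design): tokens keep an (offset, length) pair into a packed buffer instead of a slice of the source, so the source can be dropped once lexing is done, and the offsets stay valid even if the buffer reallocates while growing.

import std.array : Appender;

// Hypothetical compact store: lexemes are appended one after another and
// tokens keep (offset, length) pairs instead of slices into the source.
struct LexemeStore
{
    Appender!(char[]) data;

    // Append a lexeme and return its offset in the packed buffer.
    size_t put(const(char)[] lexeme)
    {
        immutable offset = data.data.length;
        data.put(lexeme);
        return offset;
    }

    // Recover the text once lexing is done and the buffer is stable.
    const(char)[] text(size_t offset, size_t length)
    {
        return data.data[offset .. offset + length];
    }
}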

-- 
Dmitry Olshansky
January 28, 2013
On 1/28/2013 2:16 AM, dennis luehring wrote:
> On 28.01.2013 11:13, Walter Bright wrote:
>> On 1/28/2013 1:30 AM, Mehrdad wrote:
>>> I'm wondering if there's any way to make it independent of the code/grammar
>>
>> Just run under a profiler, then fix the hot spots.
>>
>
> Mehrdad is talking about benchmarking, not optimization :)

I know. But profiling is how you make it fast. Lexers should be as fast as possible, not merely faster than the competition.
January 28, 2013
On Monday, 28 January 2013 at 11:01:49 UTC, Walter Bright wrote:
> On 1/28/2013 2:16 AM, dennis luehring wrote:
>> On 28.01.2013 11:13, Walter Bright wrote:
>>> On 1/28/2013 1:30 AM, Mehrdad wrote:
>>>> I'm wondering if there's any way to make it independent of the code/grammar
>>>
>>> Just run under a profiler, then fix the hot spots.
>>>
>>
>> Mehrdad is talking about benchmarking, not optimization :)
>
> I know. But profiling is how you make it fast. Lexers should be as fast as possible, not merely faster than the competition.



I was talking about parsers, actually :| On second thought, maybe I shouldn't have asked that on a thread about lexing...
January 28, 2013
On Sunday, 27 January 2013 at 20:15:33 UTC, Philippe Sigaud wrote:

>> This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token is not going to do very well at all.
>
> How does one do that? Honest question, I'm not really concerned by extreme speed, most of the time, and have lots to learn in this area.


Here's my (VERY) simple NFA-based "regex" lexer, which performs **no** heap allocations (unless the pattern is very complicated):

https://gist.github.com/b10ae22ab822c87467a3

Yes, the code is ugly and the capabilities are far too basic, but it was good enough for what I needed.

The point it illustrates is how you can lex without any heap allocations.
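
For what it's worth, here's the same idea in a stripped-down form (hypothetical types, not the code from the gist): the token is just a kind plus a slice of the input, so the lexing loop itself never touches the heap.

import std.ascii : isAlpha, isDigit, isWhite;

enum TokKind { identifier, number, other, eof }

// An allocation-free token: a kind and a slice of the source, no copy.
struct Tok
{
    TokKind kind;
    const(char)[] text;
}

// Lex one token starting at `pos`; only advances `pos` and slices `src`.
Tok nextToken(const(char)[] src, ref size_t pos)
{
    while (pos < src.length && isWhite(src[pos])) ++pos;
    if (pos >= src.length) return Tok(TokKind.eof, null);

    immutable start = pos;
    if (isAlpha(src[pos]) || src[pos] == '_')
    {
        while (pos < src.length
            && (isAlpha(src[pos]) || isDigit(src[pos]) || src[pos] == '_')) ++pos;
        return Tok(TokKind.identifier, src[start .. pos]);
    }
    if (isDigit(src[pos]))
    {
        while (pos < src.length && isDigit(src[pos])) ++pos;
        return Tok(TokKind.number, src[start .. pos]);
    }
    ++pos;  // punctuation and everything else, lumped together for brevity
    return Tok(TokKind.other, src[start .. pos]);
}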
January 28, 2013
On Sun, 27 Jan 2013 11:48:23 -0800, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
> > Walter seems to think if a lexer is not able to vomit thousands of tokens a second, then it's not good.
> 
> Speed is critical for a lexer.
> 
> This means, for example, you'll need to squeeze pretty much all storage allocation out of it.

But to be fair, that doesn't fit ranges very well. If you don't want to do any allocation but still keep identifiers etc. in memory, this basically means you have to keep the whole source in memory, and that is conceptually an array, not a range.


But you can of course write a lexer which accepts buffered ranges and does some allocation for those, and which is special-cased for arrays so it doesn't allocate at all. (Unbuffered ranges should be supported via a generic BufferedRange.)
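
Roughly like this, perhaps (only a sketch; the names are made up): slice when the input is an array, copy into freshly allocated storage for a generic range.

import std.range.primitives : ElementType, isInputRange;
import std.traits : isDynamicArray;

// Hypothetical helper: take the next n characters of lexeme text.
// For arrays it returns a slice (no allocation); for other ranges
// the characters have to be copied out.
auto takeLexeme(R)(ref R input, size_t n)
    if (isInputRange!R)
{
    static if (isDynamicArray!R)
    {
        auto lexeme = input[0 .. n];   // just a slice of the source
        input = input[n .. $];
        return lexeme;
    }
    else
    {
        auto lexeme = new ElementType!R[n];
        foreach (ref c; lexeme)
        {
            c = input.front;
            input.popFront();
        }
        return lexeme;
    }
}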
January 28, 2013
On 28-Jan-2013 15:48, Johannes Pfau wrote:
> On Sun, 27 Jan 2013 11:48:23 -0800, Walter Bright <newshound2@digitalmars.com> wrote:
>
>> On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
>>> Walter seems to think if a lexer is not able to vomit thousands
>>> of tokens a second, then it's not good.
>>
>> Speed is critical for a lexer.
>>
>> This means, for example, you'll need to squeeze pretty much all
>> storage allocation out of it.
>
> But to be fair, that doesn't fit ranges very well. If you don't want to
> do any allocation but still keep identifiers etc. in memory, this
> basically means you have to keep the whole source in memory, and that
> is conceptually an array, not a range.
>

Not the whole source; you only need to construct a table of all identifiers. The source is awfully redundant because of repeated identifiers, spaces, comments and whatnot. The set of unique identifiers is rather small.

I think the best course of action is to just provide a hook that triggers on every identifier encountered. That could, as discussed earlier, be a delegate.
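
For instance (names invented just to illustrate the hook):

// Hypothetical lexer configuration: a callback fired for every identifier.
// The user decides whether to intern it, copy it, or ignore it.
struct LexerConfig
{
    const(char)[] delegate(const(char)[] ident) onIdentifier;
}

// Usage sketch: intern identifiers in an associative array so each unique
// spelling is copied exactly once.
void example()
{
    const(char)[][const(char)[]] interned;

    LexerConfig cfg;
    cfg.onIdentifier = (const(char)[] ident)
    {
        if (auto p = ident in interned)
            return *p;
        const(char)[] copy = ident.idup;   // copy once, reuse afterwards
        interned[copy] = copy;
        return copy;
    };
}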

> But you can of course write a lexer which accepts buffered ranges and
> does some allocation for those and is special cased for arrays to not
> allocate at all. (Unbuffered ranges should be supported using a
> generic BufferedRange)
>



-- 
Dmitry Olshansky
January 28, 2013
On 2013-01-28 08:47, Jacob Carlborg wrote:
> If we're talking about dynamic allocation, you can make sure you're just
> using value types. A token could look like:
>
> struct Token
> {
>      TokenKind kind;
>      string value;
>      uint index;
> }
>
> For the "value" you could just slice the buffer. But I think this will prevent
> the whole buffer from being collected.


I was also thinking about using slices to limit string allocation. So far the combined size of the source files in D projects is small enough that it wouldn't hurt to mmap the files and slice them. It is possible, though, that someone would create a huge file, even if only to see this program crash. :)
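
Phobos already has std.mmfile for that. A rough sketch of the idea (the wrapper type is made up):

import std.mmfile : MmFile;

// Keep the mapping object alive together with the slice into it,
// since the file is unmapped when the MmFile is destroyed.
struct MappedSource
{
    MmFile file;
    const(char)[] text;
}

MappedSource mapSource(string path)
{
    auto mmf = new MmFile(path);                     // read-only mapping
    return MappedSource(mmf, cast(const(char)[]) mmf[]);
}

Tokens can then slice MappedSource.text without copying anything; the OS pages the file in on demand.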

In that case something else may be useful. Allocate special arrays for holding value strings, for example 256 kB per array. Token.value will be a slice of such an array. Additionally, have a lookup Trie to help reuse repeated values: if a string is already in an array, the Trie leaf will store its position (a slice), and the token will only have to copy that info. If the lookup doesn't turn up the string, it will be appended to the end of the array using Appender or, if it doesn't fit, a new array will be created.
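
Roughly like this (only a sketch: it uses a plain associative array where the description above has a Trie, plain new'ed blocks instead of Appender, and the block size is arbitrary):

// Hypothetical interning pool: values are packed into large blocks and
// deduplicated, so Token.value can simply be a slice of a block.
struct ValuePool
{
    enum blockSize = 256 * 1024;
    const(char)[][const(char)[]] seen;   // stands in for the lookup Trie
    char[] block;
    size_t used;

    const(char)[] intern(const(char)[] s)
    {
        if (auto p = s in seen)
            return *p;                   // already stored, reuse that slice

        if (block.length - used < s.length)
        {
            // Doesn't fit in the current block: start a new one.
            block = new char[s.length > blockSize ? s.length : blockSize];
            used = 0;
        }
        block[used .. used + s.length] = s[];
        const(char)[] copy = block[used .. used + s.length];
        used += s.length;
        seen[copy] = copy;
        return copy;
    }
}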