January 28, 2013
On 2013-01-27 21:15, Philippe Sigaud wrote:

>> This means, for example, you'll need to squeeze pretty much all storage
>> allocation out of it. A lexer that does an allocation per token is not
>> going to do very well at all.
>
> How does one do that? Honest question, I'm not really concerned by
> extreme speed, most of the time, and have lots to learn in this area.

If we're talking about dynamic allocation, you can make sure you're only using value types. A token could look like:

struct Token
{
    TokenKind kind;   // which kind of token this is
    string value;     // the token's text
    uint index;       // position of the token in the source
}

For the "value" you could just slice the buffer. But I think this will prevent the whole buffer from being collected.

-- 
/Jacob Carlborg
January 28, 2013
On Mon, 28 Jan 2013 01:53:02 +0100, "Brian Schott"
<briancschott@gmail.com> wrote:

> 
> The bottleneck in std.d.lexer as it stands is the appender instances that assemble Token.value during iteration and front() on the array of char[]. (As I'm sure everyone expected)

This sounds like a valid use case for buffered ranges (which we don't have yet, afaik). When used correctly, you could avoid the appender completely: instead, slice the input buffer and copy the slice only if necessary. (If the complete file is kept in memory, copying could be avoided as well.)

January 28, 2013
On Sunday, 27 January 2013 at 19:48:23 UTC, Walter Bright wrote:
> On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
>> Walter seems to think if a lexer is not able to vomit thousands
>> of tokens a second, then it's not good.
>
> Speed is critical for a lexer.
>
> This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token is not going to do very well at all.


Semi-unrelated question -- how would you benchmark a _parser_?

Is there any way to get a _number_ as an answer, or is comparing against a rival parser the only way of benchmarking a parser?
January 28, 2013
On 2013-01-28 09:51, Johannes Pfau wrote:

> This sounds like a valid use case for buffered ranges (which we don't
> have yet, afaik). When used correctly, you could avoid the appender
> completely: instead, slice the input buffer and copy the slice only if
> necessary. (If the complete file is kept in memory, copying could be
> avoided as well.)

I think slicing the buffer will force the whole buffer to remain in memory and prevent it from being collected by the GC. If one keeps the whole buffer in memory anyway, this won't be a problem.
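
If the buffer should be collectable, the values would have to be copied out of it, something like this (sketch, using the Token struct from my earlier post):

Token detach(Token token)
{
    // Copy the value out of the source buffer so the token no longer
    // keeps the whole buffer alive. Costs one allocation per token.
    token.value = token.value.idup;
    return token;
}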

-- 
/Jacob Carlborg
January 28, 2013
On 2013-01-28 10:03, Mehrdad wrote:

> Semi-unrelated question -- how would you benchmark a _parser_?
>
> Is there any way to get a _number_ as an answer, or is comparing against
> a rival parser the only way of benchmarking a parser?

It should always be possible to get a number, but it would be hard to determine if that's a good number or not without comparing with anything else.

-- 
/Jacob Carlborg
January 28, 2013
On Monday, 28 January 2013 at 09:07:15 UTC, Jacob Carlborg wrote:
> On 2013-01-28 10:03, Mehrdad wrote:
>
>> Semi-unrelated question -- how would you benchmark a _parser_?
>>
>> Is there any way to get a _number_ as an answer, or is comparing against
>> a rival parser the only way of benchmarking a parser?
>
> It should always be possible to get a number, but it would be hard to determine if that's a good number or not without comparing with anything else.


Sorry I think my question was vague -- what I meant was, how do you measure it, i.e. in what units?

Lines per second? Tokens per second? Syntax nodes per second? etc.
January 28, 2013
On 2013-01-28 10:09, Mehrdad wrote:

> Sorry I think my question was vague -- what I meant was, how do you
> measure it, i.e. in what units?
>
> Lines per second? Tokens per second? Syntax nodes per second? etc.

The easy way out would be to just measure how long it takes to parse a given piece of code and compare with some other parser using the same code.
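
Something along these lines, as a rough sketch ("parse" is just a stand-in for whatever parser is being measured):

import std.algorithm : count;
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writefln;

void benchmarkParser(string source)
{
    auto sw = StopWatch(AutoStart.yes);
    parse(source);   // the parser under test; result ignored here
    sw.stop();

    // Report elapsed time, and derive a lines-per-second figure from it.
    immutable msecs = sw.peek.total!"msecs";
    immutable lines = count(source, '\n') + 1;
    writefln("%s ms total, %.1f lines/second", msecs, lines * 1000.0 / msecs);
}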

-- 
/Jacob Carlborg
January 28, 2013
On Monday, 28 January 2013 at 09:21:51 UTC, Jacob Carlborg wrote:
> On 2013-01-28 10:09, Mehrdad wrote:
>
>> Sorry I think my question was vague -- what I meant was, how do you
>> measure it, i.e. in what units?
>>
>> Lines per second? Tokens per second? Syntax nodes per second? etc.
>
> The easy way out would be to just measure how long it takes to parse a given piece of code and compare with some other parser using the same code.


Yeah that's exactly what I meant by not comparing xD

I'm wondering if there's any way to make it independent of the code/grammar
January 28, 2013
On 1/27/2013 12:15 PM, Philippe Sigaud wrote:
> On Sun, Jan 27, 2013 at 8:48 PM, Walter Bright
> <newshound2@digitalmars.com> wrote:
>> On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
>>>
>>> Walter seems to think if a lexer is not able to vomit thousands
>>> of tokens a second, then it's not good.
>>
>>
>> Speed is critical for a lexer.
>
> Something I never thought about: given a 10k-line module, how many
> tokens does that represent, roughly? Ten times as many?

I don't know.

>> This means, for example, you'll need to squeeze pretty much all storage
>> allocation out of it. A lexer that does an allocation per token is not
>> going to do very well at all.
>
> How does one do that? Honest question, I'm not really concerned by
> extreme speed, most of the time, and have lots to learn in this area.
>

Just study the dmd lexer.c source code.
January 28, 2013
On 1/28/2013 1:30 AM, Mehrdad wrote:
> I'm wondering if there's any way to make it independent of the code/grammar

Just run under a profiler, then fix the hot spots.