May 14, 2012
On 14/05/2012 17:00, Roman D. Boiko wrote:
> On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
>> I think you are wasting much more memory and performance by storing
>> all the tokens in the lexer.
>>
>> Imagine I want to implement a simple syntax highlighter: just
>> highlight keywords. How can I tell DCT to *not* store all tokens
>> because I need each one in turn? And since I'll be highlighting in the
>> editor I will need column and line information. That means I'll have
>> to do that O(log(n)) operation for every token.
>>
>> So you see, for the simplest use case of a lexer the performance of
>> DCT is awful.
>>
>> Now imagine I want to build an AST. Again, I consume the tokens one by
>> one, probably peeking in some cases. If I want to store line and
>> column information I just copy them to the AST. You say the tokens are
>> discarded but their data is not, and that's why their data is usually
>> copied.
>
> Currently I think about making token a class instead of struct.
>
> A token (from
> https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is:
>
> // Represents a lexed token
> struct Token
> {
>     size_t startIndex;  // position of the first code unit in the source string
>     string spelling;    // characters from which this token has been lexed
>     TokenKind kind;     // enum; each keyword and operator has a dedicated kind
>     ubyte annotations;  // meta information, e.g. whether a token is valid, or
>                         // whether an integer literal is signed, long, hexadecimal, etc.
> }
>
> Making it a class would give several benefits:
>
> * no need to worry about allocating a big array of tokens. E.g., on a
> 64-bit OS the largest module in Phobos (IIRC, std.datetime) consumes
> 13.5MB in an array of almost 500K tokens. It would require a 4 times
> smaller chunk of contiguous memory if it were an array of class
> references, because each would consume only 8 bytes instead of 32.
>

Why is this a benefit?

> * allow subclassing, for example, for storing strongly typed literal
> values; this flexibility could also facilitate future extensibility (but
> it's difficult to predict which kind of extension may be needed)
>

I'm pretty sure that D's tokens will not change that much. If the need isn't identified right now, I'd advocate for YAGNI.

> * there would be no need to copy data from tokens into the AST; passing an
> object would be enough (again, copying 8 bytes instead of 32); the same
> applies to passing into methods - no need to pass by ref to minimise
> overhead
>

Yes, but now you add pressure on the GC and add indirections. I'm not sure it's worth it. It seems to me like a premature optimization.

> It would incur some additional memory overhead (at least 8 bytes per
> token), but that's hardly significant. There is also an additional price
> for accessing token members because of indirection, and, possibly, worse
> cache friendliness (token instances may be allocated anywhere in memory,
> not close to each other).
>
> These considerations are mostly about performance. I think there is also
> some impact on design, but I couldn't find anything significant (given
> that currently I see a token as merely a data structure without
> associated behavior).
>
> Could anybody suggest other pros and cons? Which option would you choose?

You are over-engineering the whole thing.
May 14, 2012
On Monday, 14 May 2012 at 16:30:21 UTC, deadalnix wrote:
> On 14/05/2012 17:00, Roman D. Boiko wrote:
>> Making it a class would give several benefits:
>> * no need to worry about allocating a big array of tokens. E.g., on a
>> 64-bit OS the largest module in Phobos (IIRC, std.datetime) consumes
>> 13.5MB in an array of almost 500K tokens. It would require a 4 times
>> smaller chunk of contiguous memory if it were an array of class
>> references, because each would consume only 8 bytes instead of 32.
> Why is this a benefit?
An NNTP error ("400 load at 23.60, try later") prevented me from answering :)

Because it might be difficult to find a big chunk of available
memory (3.5M vs 14M for this particular case).
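
For the record, a rough size check behind these numbers (an illustrative sketch, not DCT code; it assumes a 64-bit target and a ubyte-backed TokenKind):

// Illustrative only; field sizes assume a 64-bit target.
enum TokenKind : ubyte { identifier, keyword, operator, literal /* ... */ }

struct Token
{
    size_t startIndex; //  8 bytes
    string spelling;   // 16 bytes (pointer + length)
    TokenKind kind;    //  1 byte
    ubyte annotations; //  1 byte, then padding up to 8-byte alignment
}

static assert(Token.sizeof == 32);   // ~500K tokens * 32 bytes ~ 14M contiguous
static assert((Token*).sizeof == 8); // a class reference is pointer-sized, so an
                                     // array of references needs ~500K * 8 ~ 3.5M,
                                     // with the objects themselves allocated separately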

>> * allow subclassing, for example, for storing strongly typed literal
>> values; this flexibility could also facilitate future extensibility (but
>> it's difficult to predict which kind of extension may be needed)
>>
>
> I'm pretty sure that D's tokens will not change that much. If the need isn't identified right now, I'd advocate for YAGNI.
Agree.

>> * there would be no need to copy data from tokens into the AST; passing an
>> object would be enough (again, copying 8 bytes instead of 32); the same
>> applies to passing into methods - no need to pass by ref to minimise
>> overhead
>>
>
> Yes, but now you add pressure on the GC and add indirections. I'm not sure it's worth it. It seems to me like a premature optimization.
It looks so. Thanks.

>> It would incur some additional memory overhead (at least 8 bytes per
>> token), but that's hardly significant. There is also an additional price
>> for accessing token members because of indirection, and, possibly, worse
>> cache friendliness (token instances may be allocated anywhere in memory,
>> not close to each other).
>>
>> These considerations are mostly about performance. I think there is also
>> some impact on design, but I couldn't find anything significant (given
>> that currently I see a token as merely a data structure without
>> associated behavior).
>>
>> Could anybody suggest other pros and cons? Which option would you choose?
>
> You are over-engineering the whole thing.
I'm trying to solve this and other tradeoffs. I'd like to
simplify but satisfy my design goals.
May 14, 2012
On Monday, 14 May 2012 at 16:58:42 UTC, Roman D. Boiko wrote:
>> You are over-engineering the whole thing.
> I'm trying to solve this and other tradeoffs. I'd like to
> simplify but satisfy my design goals.

What if there were two different lexer modes... with different structs?

1. For an IDE with on-the-fly lexing:
  Assumption: the error rate is high (need to keep much info).

2. For the compiler:
  Assumption: the error rate is practically non-existent, and if there is an error it really doesn't matter if it's slow.

So... when choosing the "compiler mode"... and there actually is an error, then just lex it again, to produce a pretty error message ;)

try
{
  lex(mode.compiler);
}
catch (Exception e)
{
  lex(mode.ide); // calculates column etc., whatever info it needs
}

May 14, 2012
On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote:
> On Monday, 14 May 2012 at 16:58:42 UTC, Roman D. Boiko wrote:
>>> You are over-engineering the whole thing.
>> I'm trying to solve this and other tradeoffs. I'd like to
>> simplify but satisfy my design goals.
>
> What if there were two different lexer modes... with different structs?
>
> 1. For an IDE with on-the-fly lexing:
>   Assumption: the error rate is high (need to keep much info).
>
> 2. For the compiler:
>   Assumption: the error rate is practically non-existent, and if there is an error it really doesn't matter if it's slow.
>
> So... when choosing the "compiler mode"... and there actually is an error, then just lex it again, to produce a pretty error message ;)
>
> try
> {
>   lex(mode.compiler);
> }
> catch (Exception e)
> {
>   lex(mode.ide); // calculates column etc., whatever info it needs
> }
So far it doesn't seem expensive to tolerate errors and proceed.
The only thing I miss is some sort of specification of when to stop
including characters in a token's spelling and start a new token. I
don't think I'll use backtracking for that in the near future.
If I did, I would separate that part of the lexer and provide two
implementations for it. Given this, accepting errors and
moving on simply requires some finite set of rules about the
boundaries of invalid tokens. I also think structural code
editing concepts will help here, but I haven't done any research on
this topic yet.

The problem with multiple lexer implementations is that it might
become much more difficult to maintain them.
May 14, 2012
On Monday, 14 May 2012 at 19:13:39 UTC, Roman D. Boiko wrote:
> On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote:
>> What if there were two different lexer modes... with different structs?
>>
>> 1. For an IDE with on-the-fly lexing:
>>   Assumption: the error rate is high (need to keep much info).
>>
>> 2. For the compiler:
>>   Assumption: the error rate is practically non-existent, and if there is an error it really doesn't matter if it's slow.
>>
>> So... when choosing the "compiler mode"... and there actually is an error, then just lex it again, to produce a pretty error message ;)
...
> The problem with multiple lexer implementations is that it might
> become much more difficult to maintain them.
Just to clarify: different modes in the lexer are, in my view, like
two different implementations combined in a non-trivial way (unless
the difference is minor). So complexity comes from two factors: the
different implementations and how to combine them. I try to avoid
this.
May 15, 2012
On 14/05/2012 21:21, Roman D. Boiko wrote:
> On Monday, 14 May 2012 at 19:13:39 UTC, Roman D. Boiko wrote:
>> On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote:
>>> What if there were two different lexer modes... with different
>>> structs?
>>>
>>> 1. For an IDE with on-the-fly lexing:
>>> Assumption: the error rate is high (need to keep much info).
>>>
>>> 2. For the compiler:
>>> Assumption: the error rate is practically non-existent, and if there is
>>> an error it really doesn't matter if it's slow.
>>>
>>> So... when choosing the "compiler mode"... and there actually is an
>>> error, then just lex it again, to produce a pretty error message ;)
> ...
>> The problem with multiple lexer implementations is that it might
>> become much more difficult to maintain them.
> Just to clarify: different modes in the lexer are, in my view, like
> two different implementations combined in a non-trivial way (unless
> the difference is minor). So complexity comes from two factors: the
> different implementations and how to combine them. I try to avoid
> this.

If both have the same interface, metaprogramming can do the magic for you.
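
For what it's worth, a minimal sketch of what that could look like (illustrative names only, not DCT's API): one lexer template whose two modes share the same interface, with the IDE-only bookkeeping compiled in or out via static if.

// Illustrative sketch; Token is a trimmed version of the struct quoted earlier.
enum TokenKind : ubyte { identifier, keyword, operator, literal /* ... */ }
struct Token { size_t startIndex; string spelling; TokenKind kind; ubyte annotations; }

enum LexMode { compiler, ide }

struct Lexer(LexMode mode)
{
    string source;
    size_t pos;

    Token next()
    {
        Token t = scan(); // scanning logic shared by both modes
        static if (mode == LexMode.ide)
        {
            // Only the IDE mode pays for the extra bookkeeping,
            // e.g. recording column/line or error-recovery info.
            recordExtraInfo(t);
        }
        return t;
    }

    private Token scan() { /* ... */ return Token.init; }

    static if (mode == LexMode.ide)
        private void recordExtraInfo(ref Token t) { /* ... */ }
}

// Both instantiations expose exactly the same interface:
alias CompilerLexer = Lexer!(LexMode.compiler);
alias IdeLexer = Lexer!(LexMode.ide);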
May 15, 2012
On Tuesday, 15 May 2012 at 09:33:31 UTC, deadalnix wrote:
> On 14/05/2012 21:21, Roman D. Boiko wrote:
>> Just to clarify: different modes in the lexer are, in my view, like
>> two different implementations combined in a non-trivial way (unless
>> the difference is minor). So complexity comes from two factors: the
>> different implementations and how to combine them. I try to avoid this.
>
> If both have the same interface, metaprogramming can do the magic for you.

Well, I didn't state that there is no way to solve the issue.
Also I added "unless the difference is minor".
May 15, 2012
On 05/14/2012 05:00 PM, Roman D. Boiko wrote:
> On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
>> I think you are wasting much more memory and performance by storing
>> all the tokens in the lexer.
>>
>> Imagine I want to implement a simple syntax highlighter: just
>> highlight keywords. How can I tell DCT to *not* store all tokens
>> because I need each one in turn? And since I'll be highlighting in the
>> editor I will need column and line information. That means I'll have
>> to do that O(log(n)) operation for every token.
>>
>> So you see, for the simplest use case of a lexer the performance of
>> DCT is awful.
>>
>> Now imagine I want to build an AST. Again, I consume the tokens one by
>> one, probably peeking in some cases. If I want to store line and
>> column information I just copy them to the AST. You say the tokens are
>> discarded but their data is not, and that's why their data is usually
>> copied.
>
> Currently I think about making token a class instead of struct.
>
> ...
>
> Could anybody suggest other pros and cons? Which option would you choose?

Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing.
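
For reference, a minimal sketch of what a circular buffer of value-type tokens might look like (illustrative names, not DCT code): the lexer pushes tokens into a fixed-size window, the parser pops and peeks within it, and no per-token allocation happens.

// Illustrative sketch; Token is a trimmed version of the struct quoted earlier.
struct Token { size_t startIndex; string spelling; }

struct TokenBuffer(size_t capacity = 64)
{
    Token[capacity] slots; // storage lives in place, no per-token allocation
    size_t head;           // index of the oldest live token
    size_t count;          // number of live tokens

    bool empty() const { return count == 0; }
    bool full() const  { return count == capacity; }

    void push(Token t) // called by the lexer
    {
        assert(!full, "parser must consume tokens before the window wraps");
        slots[(head + count) % capacity] = t;
        ++count;
    }

    ref Token peek(size_t k) // look k tokens ahead; k = 0 is the front
    {
        assert(k < count);
        return slots[(head + k) % capacity];
    }

    Token pop() // called by the parser when a token is consumed
    {
        assert(!empty);
        Token t = slots[head];
        head = (head + 1) % capacity;
        --count;
        return t;
    }
}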
May 15, 2012
On Tuesday, 15 May 2012 at 19:27:26 UTC, Timon Gehr wrote:
> On 05/14/2012 05:00 PM, Roman D. Boiko wrote:
>> Currently I think about making token a class instead of struct.
>> ...
>>
>> Could anybody suggest other pros and cons? Which option would you choose?
>
> Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing.
I'd like to be able to do efficient token lookup based on its start index in the file (e.g., for autocompletion and for calculating column/line numbers on demand). I also plan to have the ability to update lexer, parser and semantic analysis results as the user or tool makes edits (incrementally, not by restarting), and to keep the history of changes as long as it is needed.

To achieve this, I will use Token[N][] for storing tokens, and an immutable binary tree for efficient lookup and preserving history.

I don't worry much about keeping all tokens in memory, because the information they contain is useful and needed in most scenarios. If, on the contrary, some information is no longer used, it will be garbage collected.

This is a significant redesign of the current implementation. It is based on feedback from this thread and completely satisfies my design goals. It is also going to be efficient.

An alpha release for community review is planned for May 21, with design documentation within 10 days after that.
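
To make the lookup part concrete, here is a rough sketch of the idea (illustrative only, not the planned DCT implementation, which replaces the plain arrays with an immutable tree so that previous versions survive incremental edits): tokens live in fixed-size chunks, and finding the token covering a given start index is a binary search over the flat token index.

// Illustrative sketch; Token is a trimmed version of the struct shown earlier.
struct Token { size_t startIndex; string spelling; }

enum N = 1024; // chunk size

struct TokenStore
{
    Token[N][] chunks; // Token[N][] storage as described above
    size_t tokenCount; // total number of tokens

    // Index of the last token with startIndex <= pos, i.e. the token
    // covering source position pos. O(log n), assuming tokenCount > 0.
    size_t lookup(size_t pos) const
    {
        size_t lo = 0, hi = tokenCount;
        while (hi - lo > 1)
        {
            immutable mid = lo + (hi - lo) / 2;
            if (tokenAt(mid).startIndex <= pos)
                lo = mid;
            else
                hi = mid;
        }
        return lo;
    }

    ref const(Token) tokenAt(size_t i) const
    {
        return chunks[i / N][i % N];
    }
}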

May 16, 2012
On 15/05/2012 21:59, Roman D. Boiko wrote:
> On Tuesday, 15 May 2012 at 19:27:26 UTC, Timon Gehr wrote:
>> On 05/14/2012 05:00 PM, Roman D. Boiko wrote:
>>> Currently I think about making token a class instead of struct.
>>> ...
>>>
>>> Could anybody suggest other pros and cons? Which option would you
>>> choose?
>>
>> Just use a circular buffer of value-type tokens. There is absolutely
>> no excuse for slow parsing.
> I'd like to be able to do efficient token lookup based on its start
> index in the file (e.g., for autocompletion and for calculating
> column/line numbers on demand). I also plan to have the ability to
> update lexer, parser and semantic analysis results as the user or tool
> makes edits (incrementally, not by restarting), and to keep the history
> of changes as long as it is needed.
>
> To achieve this, I will use Token[N][] for storing tokens, and an
> immutable binary tree for efficient lookup and preserving history.
>
> I don't worry much about keeping all tokens in memory, because the
> information they contain is useful and needed in most scenarios. If, on
> the contrary, some information is no longer used, it will be garbage
> collected.
>
> This is a significant redesign of the current implementation. It is
> based on feedback from this thread and completely satisfies my design
> goals. It is also going to be efficient.
>
> An alpha release for community review is planned for May 21, with
> design documentation within 10 days after that.
>

The only information that is useful over time is the position. That is what you want to calculate lazily.

In other words, what you want to keep in memory is of no interest.

A circular buffer, as Timon proposes, is a better idea.
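
As an aside, the lazy column/line calculation could be as simple as the following sketch (illustrative, not DCT code): compute the line start offsets once per source buffer, then binary-search them on demand for any position. The column here is a byte offset within the line, ignoring tabs and multi-byte characters.

struct LineInfo { size_t line, column; } // both zero-based

// One pass over the source, done lazily and cached per buffer.
size_t[] lineStarts(string source)
{
    size_t[] starts = [0];
    foreach (i, ch; source)
        if (ch == '\n')
            starts ~= i + 1;
    return starts;
}

// Binary search for the last line start <= pos.
LineInfo locate(const size_t[] starts, size_t pos)
{
    size_t lo = 0, hi = starts.length;
    while (hi - lo > 1)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (starts[mid] <= pos)
            lo = mid;
        else
            hi = mid;
    }
    return LineInfo(lo, pos - starts[lo]);
}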