May 12, 2012
On Saturday, 12 May 2012 at 05:41:16 UTC, Jonathan M Davis wrote:
> It's great that you're working on this. We definitely need more of this sort of
> stuff. However, I would point out that I'm not sure that it will be acceptable
> for Phobos (it _may_ be, but it's not quite what we've been looking for).
> There are two types of lexers/parsers that we've been looking to add to
> Phobos.
>
> 1. A direct port of the lexer from dmd's C++ front-end. The API would be D-
> ified to use ranges, but the internals would be almost identical to the C++
> front-end so that porting changes back and forth would be very easy. It also
> would make it easier to replace the C++ front-end with a D one later if the D
> one is very close to the C++ one.
>
> 2. A lexer/parser generator using templates and/or CTFE to generate a
> lexer/parser of whatever language using a BNF or EBNF grammar - probably
> similar to what Philippe Sigaud has done to generate a parser using PEGs. Such
> could be used to generate a lexer/parser for D or any other language.
>
> What you seem to be doing is creating a D-specific parser from scratch, which
> is great, but it's not quite what we've been looking to add to Phobos. That
> doesn't necessarily mean that we can't or won't add it to Phobos once it's
> complete (especially if nothing else gets done first), but it also may not be
> acceptable, because it's not either #1 or #2. That would have to be decided
> once it's done and ready to be reviewed for inclusion in Phobos. I'm not
> trying to discourage you at all. I'm just pointing out that what you're doing
> is quite what we're looking for. Regardless, it _would_ be interesting to be
> able to compare implementations of #1 and #2 against what you're doing and
> seeing which performs better.
>
> - Jonathan M Davis
Yes, my project has different goals from #1 or #2. The primary need I'm trying to address is making development of tools for D development much easier. Tools are needed for D to become more widely used in practise. Secondary (a nice-to-have) would be to build a compiler on top of that. I plan to do both, but I don't intend to be backwards-compatible with DMD backend. I hope that at least part of my code will either influence or be reused in the reference D compiler (whatever that will be in the future).

It is important to have a reference frontend implementation which can be used in many tools, otherwise a lot of effort would be completely wasted and no tool would comply with the D specification and compilers.
May 12, 2012
On 5/12/12 12:17 PM, Roman D. Boiko wrote:
> On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
>> As deadalnix says, I think you are over-complicating things.
>>
>> I mean, to store the column and line information it's just:
>>
>> if (isNewLine(c)) {
>> line++;
>> column = 0;
>> } else {
>> column++;
>> }
>>
>> (I think you need to add that to the SourceRange class. Then copy line
>> and column to token on the Lexer#lex() method)
>>
>> Do you really think it's that costly in terms of performance?
>>
>> I think you are wasting much more memory and performance by storing
>> all the tokens in the lexer.
>>
>> Imagine I want to implement a simple syntax highlighter: just
>> highlight keywords. How can I tell DCT to *not* store all tokens
>> because I need each one in turn? And since I'll be highlighting in the
>> editor I will need column and line information. That means I'll have
>> to do that O(log(n)) operation for every token.
>>
>> So you see, for the simplest use case of a lexer the performance of
>> DCT is awful.
>>
>> Now imagine I want to build an AST. Again, I consume the tokens one by
>> one, probably peeking in some cases. If I want to store line and
>> column information I just copy them to the AST. You say the tokens are
>> discarded but their data is not, and that's why their data is usually
>> copied.
>
> Would it be possible for you to fork my code and tweak it for
> comparison? You will definitely discover more problems this way, and
> such feedback would really help me. That doesn't seem to be inadequately
> much work to do.
>
> Any other volunteers for this?

Sure, I'll do it and provide some benchmarks. Thanks for creating the issue.
May 12, 2012
Le 11/05/2012 13:50, Ary Manzana a écrit :
> On 5/11/12 4:22 PM, Roman D. Boiko wrote:
>>> What about line and column information?
>> Indices of the first code unit of each line are stored inside lexer and
>> a function will compute Location (line number, column number, file
>> specification) for any index. This way size of Token instance is reduced
>> to the minimum. It is assumed that Location can be computed on demand,
>> and is not needed frequently. So column is calculated by reverse walk
>> till previous end of line, etc. Locations will possible to calculate
>> both taking into account special token sequences (e.g., #line 3
>> "ab/c.d"), or discarding them.
>
> But then how do you do to efficiently (if reverse walk is any efficient)
> compute line numbers?
>
> Usually tokens are used and discarded. I mean, somebody that uses the
> lexer asks tokens, process them (for example to highlight code or to
> build an AST) and then discards them. So you can reuse the same Token
> instance. If you want to peek the next token, or have a buffer of token,
> you can use a freelist ( http://dlang.org/memory.html#freelists , one of
> the many nice things I learned by looking at DMD's source code ).
>
> So adding line and column information is not like wasting a lot of
> memory: just 8 bytes more for each token in the freelist.

SDC uses struct Location to store such data.

Every token and AST element have a member data of type location.

As it is value type, no need for free list all thoses sort of thing. When the token is discarded, the location is discarded too. The same goes for AST nodes.
May 12, 2012
On Saturday, 12 May 2012 at 13:24:11 UTC, deadalnix wrote:
> Le 11/05/2012 13:50, Ary Manzana a écrit :
>> Usually tokens are used and discarded. I mean, somebody that uses the
>> lexer asks tokens, process them (for example to highlight code or to
>> build an AST) and then discards them. So you can reuse the same Token
>> instance. If you want to peek the next token, or have a buffer of token,
>> you can use a freelist ( http://dlang.org/memory.html#freelists , one of
>> the many nice things I learned by looking at DMD's source code ).
>>
>> So adding line and column information is not like wasting a lot of
>> memory: just 8 bytes more for each token in the freelist.
>
> SDC uses struct Location to store such data.
>
> Every token and AST element have a member data of type location.
>
> As it is value type, no need for free list all thoses sort of thing. When the token is discarded, the location is discarded too. The same goes for AST nodes.
1. If a struct is a field of heap allocated object, it will be
allocated and garbage collected. Only if it only exists on stack
(i.e., in method body), GC is not used.

2. If struct which is often allocated and deallocated, free list
would allow to skip its initialization, which is especially
important if size of struct is large. This is true for both heap
and stack allocated data.
May 12, 2012
> 1. If a struct is a field of heap allocated object, it will be
> allocated and garbage collected. Only if it only exists on stack
> (i.e., in method body), GC is not used.

As far as I can tell, it won't be allocated on it's own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object.
May 12, 2012
On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote:
>> 1. If a struct is a field of heap allocated object, it will be
>> allocated and garbage collected. Only if it only exists on stack
>> (i.e., in method body), GC is not used.
>
> As far as I can tell, it won't be allocated on it's own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object.
Sorry for confusion. In the first point I was trying to clarify that using a struct doesn't mean that it is stack allocated, but that statement was off-topic for the given context. Currently I consider my first point to be useless, and the second one is not relevant in *my* context, since I don't do many allocations of Location, as well as in a majority of other contexts. As it is often the case, performance optimization should be performed only when benchmarks show it is needed.

May 12, 2012
On 05/12/2012 11:08 PM, Roman D. Boiko wrote:
> On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote:
>>> 1. If a struct is a field of heap allocated object, it will be
>>> allocated and garbage collected. Only if it only exists on stack
>>> (i.e., in method body), GC is not used.
>>
>> As far as I can tell, it won't be allocated on it's own, since it is
>> stored in the garbage collected object as a value field. So using a
>> freelist would actually increase the overhead. You have to manage the
>> freelist and do the allocation of the containing object.
> Sorry for confusion. In the first point I was trying to clarify that
> using a struct doesn't mean that it is stack allocated, but that
> statement was off-topic for the given context. Currently I consider my
> first point to be useless, and the second one is not relevant in *my*
> context, since I don't do many allocations of Location, as well as in a
> majority of other contexts. As it is often the case, performance
> optimization should be performed only when benchmarks show it is needed.
>

Well yes, but it seems to me that you are deliberately complicating the design in order to make it less efficient?
May 13, 2012
On Saturday, 12 May 2012 at 22:19:55 UTC, Timon Gehr wrote:
> On 05/12/2012 11:08 PM, Roman D. Boiko wrote:
>> On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote:
>>>> 1. If a struct is a field of heap allocated object, it will be
>>>> allocated and garbage collected. Only if it only exists on stack
>>>> (i.e., in method body), GC is not used.
>>>
>>> As far as I can tell, it won't be allocated on it's own, since it is
>>> stored in the garbage collected object as a value field. So using a
>>> freelist would actually increase the overhead. You have to manage the
>>> freelist and do the allocation of the containing object.
>> Sorry for confusion. In the first point I was trying to clarify that
>> using a struct doesn't mean that it is stack allocated, but that
>> statement was off-topic for the given context. Currently I consider my
>> first point to be useless, and the second one is not relevant in *my*
>> context, since I don't do many allocations of Location, as well as in a
>> majority of other contexts. As it is often the case, performance
>> optimization should be performed only when benchmarks show it is needed.
>>
>
> Well yes, but it seems to me that you are deliberately complicating the design in order to make it less efficient?

Do you mean something like what I summarized from Ary:
http://forum.dlang.org/post/pggkobonzmgdwjigyaac@forum.dlang.org ?
If that is the case, could you please add your concerns to the
dedicated issue:
https://github.com/roman-d-boiko/dct/issues/15 ?

Otherwise, if by overcomplicating you meant using a free list,
I'd like to clarify that I don't plan to introduce it yet because I see no reason to use it in my context.

Thanks
May 14, 2012
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
> I think you are wasting much more memory and performance by storing all the tokens in the lexer.
>
> Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token.
>
> So you see, for the simplest use case of a lexer the performance of DCT is awful.
>
> Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.

Currently I think about making token a class instead of struct.

A token (from https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is:

// Represents lexed token
struct Token
{
    size_t startIndex; // position of the first code unit in the source string
    string spelling; // characters from which this token has been lexed
    TokenKind kind; // enum; each keyword and operator, have a dedicated kind
    ubyte annotations; // meta information like whether a token is valid, or an integer literal is signed, long, hexadecimal, etc.
}

Making it a class would give several benefits:

* allow not to worry about allocating a big array of tokens. E.g., on 64-bit OS the largest module in Phobos (IIRC, the std.datetime) consumes 13.5MB in an array of almost 500K tokens. It would require 4 times smaller chunk of contiguous memory if it was an array of class objects, because each would consume only 8 bytes instead of 32.

* allow subclassing, for example, for storing strongly typed literal values; this flexibility could also facilitate future extensibility (but it's difficult to predict which kind of extension may be needed)

* there would be no need to copy data from tokens into AST, passing an object would be enough (again, copy 8 instead of 32 bytes); the same applies to passing into methods - no need to pass by ref to minimise overhead

It would incur some additional memory overhead (at least 8 bytes per token), but that's hardly significant. Also there is additional price for accessing token members because of indirection, and, possibly, worse cache friendliness (token instances may be allocated anywhere in memory, not close to each other).

These considerations are mostly about performance. I think there is also some impact on design, but couldn't find anything significant (given that currently I see a token as merely a datastructure without associated behavior).

Could anybody suggest other pros and cons? Which option would you choose?
May 14, 2012
On Monday, 14 May 2012 at 15:00:37 UTC, Roman D. Boiko wrote:
> Could anybody suggest other pros and cons? Which option would you choose?
Further discussion on this topic (struct vs class) is at http://forum.dlang.org/thread/asdrqlaydzcdpqwsbtiv@forum.dlang.org