May 11, 2012
Am 11.05.2012 13:50, schrieb Ary Manzana:
> On 5/11/12 4:22 PM, Roman D. Boiko wrote:
>>>  What about line and column information?
>>  Indices of the first code unit of each line are stored inside the
>>  lexer, and a function will compute a Location (line number, column
>>  number, file specification) for any index. This way the size of a
>>  Token instance is reduced to the minimum. It is assumed that a
>>  Location can be computed on demand, and is not needed frequently.
>>  So the column is calculated by a reverse walk to the previous end
>>  of line, etc. It will be possible to calculate Locations either
>>  taking into account special token sequences (e.g., #line 3
>>  "ab/c.d") or ignoring them.
>
> But then how do you efficiently compute line numbers (if a reverse
> walk is efficient at all)?
>
> Usually tokens are used and discarded. I mean, somebody that uses the
> lexer asks for tokens, processes them (for example to highlight code
> or to build an AST) and then discards them. So you can reuse the same
> Token instance. If you want to peek at the next token, or have a
> buffer of tokens, you can use a freelist ( http://dlang.org/memory.html#freelists , one of
> the many nice things I learned by looking at DMD's source code ).
>
> So adding line and column information doesn't waste a lot of memory:
> just 8 more bytes for each token in the freelist.

It would be better to add something like a column/line collector: a component that, if applied, is able to hold this information, so there is no waste if it's not needed.

I think there are several parts that could work like that.
May 11, 2012
On Friday, 11 May 2012 at 11:55:47 UTC, Roman D. Boiko wrote:
> On Friday, 11 May 2012 at 11:41:34 UTC, alex wrote:
>> Ever thought of asking the VisualD developer to integrate your library into his IDE extension? Might be cool to do so because of extended completion abilities etc. (lol I'm the Mono-D dev -- but why not? ;D)
> Didn't think about that yet, because I don't use VisualD.
> I actually planned to analyse whether DCT could be integrated into Mono-D, so your feedback is welcome :)

Mono-D is written in C#, VisualD uses D -- so it actually should be easier to integrate into the second one :)
May 11, 2012
On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:
> From the beginning, I've been thinking of AST macros using CTFE.
Could you please elaborate?

I plan to strictly follow the published D specification.
Exceptions to this rule are possible provided one of the following is true:
* new functionality has been implemented in DMD but is not yet included in the specification
* the specification is incorrect (has a bug) or incomplete, especially if DMD behavior differs from it
* the change is compatible with the specification and brings a significant improvement (e.g., this seems to be the case for my decision to introduce a post-processor after the lexer)

Some more exceptions might be added later, but the goal is to minimize differences.


May 11, 2012
On Friday, 11 May 2012 at 12:13:53 UTC, alex wrote:
> Mono-D is written in C#, VisualD uses D -- so it actually should be easier to integrate into the second one :)
Sorry, I meant D-IDE. But there might also be a reason to consume the D implementation from C#. I would happily collaborate to make it usable for that.

I have nothing against VisualD, just didn't think about it yet.

May 11, 2012
Le 11/05/2012 14:14, Roman D. Boiko a écrit :
> On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:
>> From the beginning, I've been thinking of AST macros using CTFE.
> Could you please elaborate?
>
> I plan to strictly follow the published D specification.
> Exceptions to this rule are possible provided one of the following
> is true:
> * new functionality has been implemented in DMD but is not yet
> included in the specification
> * the specification is incorrect (has a bug) or incomplete,
> especially if DMD behavior differs from it
> * the change is compatible with the specification and brings a
> significant improvement (e.g., this seems to be the case for my
> decision to introduce a post-processor after the lexer)
>
> Some more exceptions might be added later, but the goal is to minimize
> differences.
>

More explicitly, the goal isn't to implement a language different from D.

It is simply to do the parsing/AST building in a way that would allow AST macros to be introduced later.

Your 3 points seem reasonable. Mine were:
 * Implement something that can parse D as it is currently defined/implemented (if DMD's behavior and the spec differ, it is handled on a per-case basis).
 * Discard all deprecated features. Don't even try to implement them, even if DMD currently supports them.
 * Do the parsing in several steps to allow different tools to work with it.

I think we both have very compatible goals. Let me do a clean package of it and write about the design goals. I don't have much time right now, so I will do it this weekend.
May 11, 2012
On Friday, 11 May 2012 at 11:50:29 UTC, Ary Manzana wrote:
> On 5/11/12 4:22 PM, Roman D. Boiko wrote:
>>> What about line and column information?
>> Indices of the first code unit of each line are stored inside the
>> lexer, and a function will compute a Location (line number, column
>> number, file specification) for any index. This way the size of a
>> Token instance is reduced to the minimum. It is assumed that a
>> Location can be computed on demand, and is not needed frequently.
>> So the column is calculated by a reverse walk to the previous end
>> of line, etc. It will be possible to calculate Locations either
>> taking into account special token sequences (e.g., #line 3
>> "ab/c.d") or ignoring them.
>
> But then how do you efficiently compute line numbers (if a reverse walk is efficient at all)?
I borrowed the trick from Brian Schott: tokens will be stored in
an array sorted by their indices in the source code (they are
naturally produced in sorted order, so there is no need to run
any sorting algorithm). The same is true for the information
about line start indices.

Now, given an index, we can do a binary search in such a sorted
array and get the corresponding line number or token (whatever we
need). Walking along UTF code points, starting from the index
which corresponds to the beginning of the line and taking tab
characters into account, makes it possible to calculate the
column reasonably fast.
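In sketch form, the lookup works like this (an illustrative Python sketch, not DCT's D code; the tab width of 4 is an assumption for the example):

```python
import bisect

TAB_WIDTH = 4  # assumed tab width; the real setting may differ

def build_line_starts(source):
    """Indices of the first code unit of each line (naturally sorted)."""
    starts = [0]
    for i, ch in enumerate(source):
        if ch == "\n":
            starts.append(i + 1)
    return starts

def locate(line_starts, source, index):
    """Return (line, column), both 1-based, for a code unit index."""
    line = bisect.bisect_right(line_starts, index)  # binary search
    column = 1
    for ch in source[line_starts[line - 1]:index]:  # short forward walk
        column += TAB_WIDTH - (column - 1) % TAB_WIDTH if ch == "\t" else 1
    return line, column

src = "int x;\n\tx = 1;\n"
starts = build_line_starts(src)
print(locate(starts, src, 8))  # the 'x' on line 2, after one tab -> (2, 5)
```

The binary search is O(log n) in the number of lines, and the forward walk is bounded by the length of one line, which is why the cost stays acceptable for on-demand use.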

My assumption is that column numbers are needed for use cases
like error reporting or other messages for the user, and thus
there is no need to pre-calculate them for each token (or
character). For such use cases this efficiency should be quite
enough.

In general, I precalculate and store only information which is
needed frequently. Other information is evaluated lazily.

> Usually tokens are used and discarded. I mean, somebody that uses the lexer asks for tokens, processes them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek at the next token, or have a buffer of tokens, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ).
But the information from tokens is not discarded (at least, that
is the requirement for DCT). So my choice is to keep it in tokens
instead of converting it to some other form. That also implies
that Token is free of any auxiliary information which is not
necessary for the common use cases.

> So adding line and column information doesn't waste a lot of memory: just 8 more bytes for each token in the freelist.
It is inefficient to calculate it during lexing: the scanning
algorithm becomes less robust and more complicated, and in most
cases we won't need it anyway. I will add a hook to plug in such
functionality (when needed) once I know why it is necessary.
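Such a hook could look roughly like this (an illustrative Python sketch with invented names, not DCT's actual API): the lexer takes an optional collector, and when none is attached no per-token line/column cost is paid at all.

```python
class LineCollector:
    """Opt-in collector: records line start indices as the lexer scans."""
    def __init__(self):
        self.line_starts = [0]

    def on_newline(self, index):
        self.line_starts.append(index + 1)

def lex(source, collector=None):
    """Toy whitespace lexer; yields (index, text) tokens."""
    tokens, start = [], None
    for i, ch in enumerate(source):
        if ch == "\n" and collector is not None:
            collector.on_newline(i)  # hook fires only when plugged in
        if ch.isspace():
            if start is not None:
                tokens.append((start, source[start:i]))
                start = None
        elif start is None:
            start = i
    if start is not None:
        tokens.append((start, source[start:]))
    return tokens

col = LineCollector()
toks = lex("a b\nc\n", col)
print(toks)             # [(0, 'a'), (2, 'b'), (4, 'c')]
print(col.line_starts)  # [0, 4, 6]
```

Tools that never ask for line/column information simply call `lex(source)` and the scanning loop stays as simple as before.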
May 11, 2012
On Friday, 11 May 2012 at 12:12:01 UTC, dennis luehring wrote:
> It would be better to add something like a column/line collector: a component that, if applied, is able to hold this information, so there is no waste if it's not needed.
>
> I think there are several parts that could work like that.

This looks like what I called a hook for plugging in the respective functionality in my previous message. But I'm not sure it is needed at all (in addition to what I have already designed with the calculateFor() method).
May 11, 2012
On Friday, 11 May 2012 at 12:30:01 UTC, deadalnix wrote:
> Your 3 points seem reasonable. Mine were :
>  * Implement something that can parse D as it is currently defined/implemented (if DMD's behavior and the spec differ, it is handled on a per-case basis).
All differences should be documented.
>  * Discard all deprecated features. Don't even try to implement them, even if DMD currently supports them.
Yes, I forgot this one. Actually, I didn't discard imaginary floats, because I don't know what exactly will be done instead, and it is easy to keep them.

>  * Do the parsing in several steps to allow different tools to work with it.
I was thinking about a pool of analysers, each of which would add some information. This could be more than is needed for semantic analysis. An analyser would be created each time information (e.g., some indexing) is needed for a particular use case.
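The "pool of analysers" idea can be sketched as follows (a hedged Python illustration with invented names, not DCT's design): each analyser computes one kind of derived information on demand and caches it, so a tool pays only for the analyses it actually uses.

```python
class Document:
    """Holds tokens plus lazily computed analyser results."""
    def __init__(self, tokens):
        self.tokens = tokens
        self._results = {}  # analyser name -> cached result

    def get(self, analyser):
        if analyser.name not in self._results:
            self._results[analyser.name] = analyser.run(self)
        return self._results[analyser.name]

class IdentifierIndex:
    """Example analyser: index of identifier occurrences by name."""
    name = "identifier-index"

    def run(self, doc):
        index = {}
        for pos, text in doc.tokens:
            index.setdefault(text, []).append(pos)
        return index

doc = Document([(0, "x"), (4, "y"), (8, "x")])
print(doc.get(IdentifierIndex()))  # {'x': [0, 8], 'y': [4]}
```

A semantic analyser, a code formatter, and an indexer could each register their own analyser without the core data structures knowing about any of them.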

> I think we both have very compatible goals. Let me do a clean package of it and write about the design goals. I don't have much time right now, so I will do it this weekend.
I saw your ast_as_lib branch of SDC, but didn't dig deeper.
May 11, 2012
On 2012-05-11 14:07, Roman D. Boiko wrote:
> On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:

>> Found it now, "calculateFor". I'm not sure it's the most intuitive
>> name though. I get the feeling: "calculate what?".

> calculateLocation was the original name, but I don't like repeating
> the return type in method names, so I decided to change it so that it
> is clear that another renaming is needed ;) Any suggestions?
>

My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that, I'll have to give it some thought.

>> I guess I'll have to wait for that then :)
> I'll try to do that ahead of roadmap, it is important.

Cool.

-- 
/Jacob Carlborg
May 11, 2012
On Friday, 11 May 2012 at 12:55:58 UTC, Jacob Carlborg wrote:
> On 2012-05-11 14:07, Roman D. Boiko wrote:
>> On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:
>
>>> Found it now, "calculateFor". I'm not sure it's the most intuitive
>>> name though. I get the feeling: "calculate what?".
>
>> calculateLocation was the original name, but I don't like repeating
>> the return type in method names, so I decided to change it so that
>> it is clear that another renaming is needed ;) Any suggestions?
>>
>
> My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that, I'll have to give it some thought.

What about the following signature: Location locate(size_t index)?
Or even better:
alias size_t CodeUnitIndex;
Location locateFor(CodeUnitIndex position);

The problem with placing it in Token is that Token should not know anything about the source as a whole.
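The proposed separation could be sketched like this (illustrative Python only; apart from locate/Location, all names are invented and this is not DCT's actual code): Token carries nothing but an index, and the object that owns the whole source answers location queries.

```python
import bisect
from typing import NamedTuple

class Location(NamedTuple):
    line: int
    column: int
    file: str

class Token(NamedTuple):
    index: int  # a code unit index only: no knowledge of the whole source
    text: str

class SourceText:
    """Owns the whole source, so it (not Token) can answer locate()."""
    def __init__(self, file, text):
        self.file, self.text = file, text
        self.line_starts = [0] + [i + 1 for i, c in enumerate(text) if c == "\n"]

    def locate(self, position):
        line = bisect.bisect_right(self.line_starts, position)
        return Location(line, position - self.line_starts[line - 1] + 1, self.file)

src = SourceText("a.d", "int x;\nx = 1;\n")
tok = Token(index=7, text="x")
print(src.locate(tok.index))  # Location(line=2, column=1, file='a.d')
```

This keeps Token minimal while still making line, column and file available whenever a message for the user has to be produced.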