May 11, 2012
Re: DCT: D compiler as a collection of libraries
Le 11/05/2012 16:02, Roman D. Boiko a écrit :
> On Friday, 11 May 2012 at 13:28:21 UTC, deadalnix wrote:
>> Le 11/05/2012 15:01, Roman D. Boiko a écrit :
>>> The problem with placing it in Token is that Token should not know
>>> anything about source as a whole.
>>
>> I don't really see the benefit of this. You are trading an O(1)
>> operation for an O(log(n)) one. It can only be faster in specific
>> cases, which should be measured.
> Technically, I'm trading N*O(1) operations needed to track line and
> column while consuming each character for M*O(log(n)) operations when
> calculating them on demand. N = number of characters, n is the number
> of lines, and M is the number of actual usages of Location. My
> assumption is that M << N (M is much smaller than N).

N can easily be the number of tokens.
May 11, 2012
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 14:45:45 UTC, Jacob Carlborg wrote:
> On 2012-05-11 15:30, Rory McGuire wrote:
>> ? my guess is that the spec is TDPL + TDPL errata. dlang.org
>> <http://dlang.org> should be updated as people notice 
>> inaccuracies.
>> This project would be an ideal time to update dlang.org
>> <http://dlang.org> as people notice it's not in sync with TDPL.
>> If TDPL doesn't cover it then the community should review it.
>>
>
> None of these are in sync.

I suggest choosing a dedicated place where it is possible to 
discuss differences and other problems with the specification. 
This forum?
May 11, 2012
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 15:05:19 UTC, deadalnix wrote:
> Le 11/05/2012 16:02, Roman D. Boiko a écrit :
>> Technically, I'm trading N*0(1) operations needed to track 
>> line and
>> column while consuming each character to M*0(log(n)) 
>> operations when
>> calculating them on demand. N = number of characters, n is 
>> number of
>> lines and M is number of actual usages of Location. My 
>> assumption is
>> that M << N (M is much smaller than N).
>
> N can easily be number of tokens.
Yes, if you are looking up a token by its index, not by its 
location. E.g., autocompletion needs to find the last token 
before the cursor position. But that is unrelated to location 
search.

Also, please note that I oversimplified the complexity formula; 
it also has other components. My reply was just a minor 
additional comment. The motivation is design, not performance.
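
The on-demand calculation I describe can be sketched as a binary 
search over precomputed line-start offsets (a hypothetical 
illustration only; `LineIndex`, `Location`, and their members are 
assumed names, not DCT's actual code):

```d
import std.range : assumeSorted;

struct Location { size_t line; size_t column; }

// Hypothetical index: lineStarts[i] is the offset of the first
// character of line i + 1, kept in ascending order, so a Location
// can be computed on demand instead of being tracked per character.
struct LineIndex
{
    size_t[] lineStarts;

    // O(log n) in the number of lines: binary search for the last
    // line start that is <= offset.
    Location locate(size_t offset)
    {
        auto greater = assumeSorted(lineStarts).upperBound(offset);
        auto line = lineStarts.length - greater.length; // 1-based line
        return Location(line, offset - lineStarts[line - 1] + 1); // 1-based column
    }
}

unittest
{
    // "ab\ncd\nef" -> lines start at offsets 0, 3, 6
    auto index = LineIndex([0, 3, 6]);
    assert(index.locate(4) == Location(2, 2)); // 'd'
    assert(index.locate(6) == Location(3, 1)); // 'e'
}
```

The index itself is built in one O(N) pass over the source, but only 
once per file rather than per token.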

One additional reason for this separation: with my approach it is 
possible to calculate Location either taking into account 
information from special token sequences (i.e., #line directives) 
or ignoring it. How would you do that with eager calculation? 
Calculate only one category? Or track and store both?

Unnecessary complexity will eventually find a way to shoot you in 
the foot :) Real-world usage (at least, anticipated scenarios) 
should be the basis for design. Sometimes this contradicts 
intuition and requires looking at the problem from a different 
side.
May 11, 2012
Re: DCT: D compiler as a collection of libraries
Yip, but TDPL still has to take precedence, because that is the one that
Walter + Andrei + the community put the most focused effort into.


On Fri, May 11, 2012 at 4:45 PM, Jacob Carlborg <doob@me.com> wrote:

> On 2012-05-11 15:30, Rory McGuire wrote:
>
>> ? my guess is that the spec is TDPL + TDPL errata. dlang.org
>> <http://dlang.org> should be updated as people notice inaccuracies.
>>
>> This project would be an ideal time to update dlang.org
>> <http://dlang.org> as people notice it's not in sync with TDPL.
>>
>> If TDPL doesn't cover it then the community should review it.
>>
>>
> None of these are in sync.
>
> --
> /Jacob Carlborg
>
May 11, 2012
Re: DCT: D compiler as a collection of libraries
On Friday, May 11, 2012 23:00:15 Rory McGuire wrote:
> Yip, but TDPL still has to take precedence, because that is the one that
> Walter + Andrei + the community put the most focused effort into.

It doesn't necessarily. Each place that they differ is examined individually 
and decided on its own merits. There is no automatic "TDPL says it's so, so 
it's so." There _are_ cases where things have even been explicitly changed 
after TDPL was released, making TDPL out-of-date for that particular item (e.g. 
strong vs. weak purity is not discussed in TDPL at all, because it didn't exist 
at the time). In general, however, TDPL _is_ correct, and when there's a 
difference, it's simply a case of the documentation or compiler not having been 
updated accordingly. We try to avoid making any changes that would conflict 
with TDPL.

- Jonathan M Davis
May 12, 2012
Re: DCT: D compiler as a collection of libraries
On 5/11/12 10:14 PM, Roman D. Boiko wrote:
> On Friday, 11 May 2012 at 15:05:19 UTC, deadalnix wrote:
>> Le 11/05/2012 16:02, Roman D. Boiko a écrit :
>>> Technically, I'm trading N*O(1) operations needed to track line and
>>> column while consuming each character for M*O(log(n)) operations when
>>> calculating them on demand. N = number of characters, n is the number
>>> of lines, and M is the number of actual usages of Location. My
>>> assumption is that M << N (M is much smaller than N).
>>
>> N can easily be number of tokens.
> Yes, if you are looking for the token by its index, not for location.
> E.g., for autocompletion it is needed to find the last token before
> cursor location. But that is not related to location search.
>
> Also please note that I oversimplified formula for complexity. It also
> has other components. My reply was just an additional minor comment.
> Motivation is design, not performance.
>
> One additional reason for such separation: with my approach it is
> possible to calculate Location both taking into account information from
> special token sequences, or ignoring it. How would you do that for eager
> calculation? Calculate only one category? Or track and store both?
>
> Unnecessary complexity will eventually find a way to shoot you in the foot :)
> Real-world usage (at least, anticipated scenarios) should be the basis
> for designing. Sometimes this contradicts intuition and requires you to
> look at the problem from a different side.

As deadalnix says, I think you are over-complicating things.

I mean, to store the column and line information it's just:

if (isNewLine(c)) {
  line++;
  column = 0;
} else {
  column++;
}

(I think you need to add that to the SourceRange class, then copy 
line and column to the token in the Lexer#lex() method.)

Do you really think it's that costly in terms of performance?

I think you are wasting much more memory and performance by storing all 
the tokens in the lexer.

Imagine I want to implement a simple syntax highlighter: just 
highlight keywords. How can I tell DCT *not* to store all tokens, 
given that I only need each one in turn? And since I'll be 
highlighting in the editor, I will need column and line 
information. That means I'll have to do that O(log(n)) operation 
for every token.

So you see, for the simplest use case of a lexer the performance of DCT 
is awful.
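
The streaming style I have in mind can be sketched as a D input 
range that tracks line and column eagerly and never holds more 
than one token at a time (a toy sketch: the whitespace-splitting 
"lexer" and all names here are illustrative, not DCT's actual API):

```d
struct Token { string text; size_t line; size_t column; }

// Hypothetical streaming lexer: produces one Token at a time, so a
// highlighter can consume tokens in turn without any token buffer.
// Line/column are updated per character, as in the snippet above.
struct TokenRange
{
    private string source;
    private size_t pos;
    private size_t line = 1, column = 1;
    private Token current;
    private bool exhausted;

    this(string src) { source = src; popFront(); }

    @property bool empty() const { return exhausted; }
    @property Token front() const { return current; }

    void popFront()
    {
        // skip whitespace, tracking line/column eagerly
        while (pos < source.length
               && (source[pos] == ' ' || source[pos] == '\n'))
        {
            if (source[pos] == '\n') { line++; column = 1; }
            else column++;
            pos++;
        }
        if (pos == source.length) { exhausted = true; return; }
        immutable startPos = pos, startLine = line, startColumn = column;
        while (pos < source.length
               && source[pos] != ' ' && source[pos] != '\n')
        {
            pos++;
            column++;
        }
        current = Token(source[startPos .. pos], startLine, startColumn);
    }
}
```

A highlighter would then just do 
`foreach (tok; TokenRange(src)) { ... }`, reading `tok.line` and 
`tok.column` as each token arrives, with no per-token O(log(n)) 
lookup and no stored token array.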

Now imagine I want to build an AST. Again, I consume the tokens one by 
one, probably peeking in some cases. If I want to store line and column 
information I just copy them to the AST. You say the tokens are 
discarded but their data is not, and that's why their data is usually 
copied.
May 12, 2012
Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
> As deadalnix says, I think you are over-complicating things.
>
> I mean, to store the column and line information it's just:
>
> if (isNewLine(c)) {
>   line++;
>   column = 0;
> } else {
>   column++;
> }
>
> (I think you need to add that to the SourceRange class. Then 
> copy line and column to token on the Lexer#lex() method)
>
> Do you really think it's that costly in terms of performance?
>
> I think you are wasting much more memory and performance by 
> storing all the tokens in the lexer.
>
> Imagine I want to implement a simple syntax highlighter: just 
> highlight keywords. How can I tell DCT to *not* store all 
> tokens because I need each one in turn? And since I'll be 
> highlighting in the editor I will need column and line 
> information. That means I'll have to do that O(log(n)) 
> operation for every token.
>
> So you see, for the simplest use case of a lexer the 
> performance of DCT is awful.
>
> Now imagine I want to build an AST. Again, I consume the tokens 
> one by one, probably peeking in some cases. If I want to store 
> line and column information I just copy them to the AST. You 
> say the tokens are discarded but their data is not, and that's 
> why their data is usually copied.
Summary of your points (I deliberately emphasize some of them 
more than you did; please correct me if I misinterpreted 
anything):
* storing location for each token is simple and cheap
* SourceRange is needed, one instance per token; it should be a 
class, not a struct
* syntax highlighting doesn't need to keep all tokens in memory
* but it must know the column and line of each token
* for this use case DCT's performance is awful
* to build an AST, lines and columns are required
* information from tokens must be copied, and possibly 
transformed, in order to put it into the AST; after such 
transformation the tokens are no longer needed

These are all valid points, especially given that we don't have 
any benchmarks or usage examples yet. The good news is that use 
cases like highlighting, parsing, autocompletion, etc. are the 
core functionality for which DCT should be designed. So if it 
fails any of them, the design needs to be revisited.

But I will need some time to thoroughly address each of these 
issues (either prove that it is not relevant or fix the problem). 
I will definitely complete this work during May, and I will 
regularly report my progress here.

In general, I tend to disagree with most of them, because I have 
already thought about each one and designed accordingly, but it 
is important to give them enough attention. What I am failing at 
miserably right now is providing actual usage code and 
benchmarks, so I'm going to focus on that.

Thanks for your feedback!
May 12, 2012
Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
> As deadalnix says, I think you are over-complicating things.
>
> I mean, to store the column and line information it's just:
>
> if (isNewLine(c)) {
>   line++;
>   column = 0;
> } else {
>   column++;
> }
>
> (I think you need to add that to the SourceRange class. Then 
> copy line and column to token on the Lexer#lex() method)
>
> Do you really think it's that costly in terms of performance?
>
> I think you are wasting much more memory and performance by 
> storing all the tokens in the lexer.
>
> Imagine I want to implement a simple syntax highlighter: just 
> highlight keywords. How can I tell DCT to *not* store all 
> tokens because I need each one in turn? And since I'll be 
> highlighting in the editor I will need column and line 
> information. That means I'll have to do that O(log(n)) 
> operation for every token.
>
> So you see, for the simplest use case of a lexer the 
> performance of DCT is awful.
>
> Now imagine I want to build an AST. Again, I consume the tokens 
> one by one, probably peeking in some cases. If I want to store 
> line and column information I just copy them to the AST. You 
> say the tokens are discarded but their data is not, and that's 
> why their data is usually copied.

Would it be possible for you to fork my code and tweak it for 
comparison? You would definitely discover more problems this way, 
and such feedback would really help me. That doesn't seem like an 
unreasonable amount of work.

Any other volunteers for this?
May 12, 2012
Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 05:13:36 UTC, Roman D. Boiko wrote:
> On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
>> As deadalnix says, I think you are over-complicating things.
...
> ... I will need some time to thoroughly address each of these 
> issues (either prove that it is not relevant or fix the 
> problem). I will definitely complete this work during May. I 
> will regularly report my progress here.
>
> In general, I tend to disagree with most of them, because I 
> already thought about each and designed accordingly, but it is 
> important to give them enough attention.
https://github.com/roman-d-boiko/dct/issues/15
May 12, 2012
Re: DCT: D compiler as a collection of libraries
On Friday, May 11, 2012 10:01:28 Roman D. Boiko wrote:
> There were several discussions about the need for a D compiler
> library.
> 
> I propose my draft implementation of lexer for community review:
> https://github.com/roman-d-boiko/dct
> 
> Lexer is based on Brian Schott's project
> https://github.com/Hackerpilot/Dscanner, but it has been
> refactored and extended (and more changes are on the way).
> 
> The goal is to have source code loading, lexer, parser and
> semantic analysis available as parts of Phobos. These libraries
> should be designed to be usable in multiple scenarios (e.g.,
> refactoring, code analysis, etc.).
> 
> My commitment is to have at least front end built this year (and
> conforming to the D2 specification unless explicitly stated
> otherwise for some particular aspect).
> 
> Please post any feedback here. A dedicated project web-site will be
> created later.

It's great that you're working on this. We definitely need more of this sort of 
stuff. However, I would point out that I'm not sure that it will be acceptable 
for Phobos (it _may_ be, but it's not quite what we've been looking for). 
There are two types of lexers/parsers that we've been looking to add to 
Phobos.

1. A direct port of the lexer from dmd's C++ front-end. The API would be D-
ified to use ranges, but the internals would be almost identical to the C++ 
front-end so that porting changes back and forth would be very easy. It also 
would make it easier to replace the C++ front-end with a D one later if the D 
one is very close to the C++ one.

2. A lexer/parser generator using templates and/or CTFE to generate a 
lexer/parser of whatever language using a BNF or EBNF grammar - probably 
similar to what Philippe Sigaud has done to generate a parser using PEGs. Such 
could be used to generate a lexer/parser for D or any other language.
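
As a small flavor of what #2 means in practice (a toy sketch, not 
Philippe Sigaud's actual API; `makeKeywordMatcher` and `isKeyword` 
are invented names): CTFE lets you generate recognizer code at 
compile time from a data description and mix it in. A real 
generator would take a full (E)BNF or PEG grammar instead of a 
keyword list.

```d
// Builds the source of a keyword-recognizer function. Because this
// runs via CTFE inside mixin(), the switch is generated and compiled
// entirely at compile time.
string makeKeywordMatcher(string[] keywords)
{
    string code = "bool isKeyword(string s) {\n    switch (s) {\n";
    foreach (kw; keywords)
        code ~= `    case "` ~ kw ~ `": return true;` ~ "\n";
    code ~= "    default: return false;\n    }\n}\n";
    return code;
}

mixin(makeKeywordMatcher(["if", "else", "while"]));

unittest
{
    assert(isKeyword("if"));
    assert(!isKeyword("foo"));
}
```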

What you seem to be doing is creating a D-specific parser from scratch, which 
is great, but it's not quite what we've been looking to add to Phobos. That 
doesn't necessarily mean that we can't or won't add it to Phobos once it's 
complete (especially if nothing else gets done first), but it also may not be 
acceptable, because it's not either #1 or #2. That would have to be decided 
once it's done and ready to be reviewed for inclusion in Phobos. I'm not 
trying to discourage you at all; I'm just pointing out that what you're doing 
isn't quite what we're looking for. Regardless, it _would_ be interesting to 
compare implementations of #1 and #2 against what you're doing and see which 
performs better.

- Jonathan M Davis