May 19, 2012
Am Fri, 11 May 2012 10:01:28 +0200
schrieb "Roman D. Boiko" <rb@d-coding.com>:

> There were several discussions about the need for a D compiler library.
> 
> I propose my draft implementation of a lexer for community review: https://github.com/roman-d-boiko/dct
> 
> The lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way).
> 
> The goal is to have source code loading, a lexer, a parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.).
> 
> My commitment is to have at least the front end built this year (and conforming to the D2 specification unless explicitly stated otherwise for some particular aspect).
> 
> Please post any feedback here. A dedicated project web-site will be created later.

A general purpose D front-end library has been discussed several times. So at least as a basis for day-dreaming about better IDE support and code analysis tools it has huge value. It is good to see that someone finally takes the time to implement one. My only concern is that not enough planning went into the design. I'm thinking of things like brainstorming and collecting possible use cases from the community, or looking at how some C++ tools do their job and what infrastructure they are built on. Although it is really difficult to tell from other people's code which decisions were 'important' and which were just the author's way of doing things.

Inclusion into Phobos I would not see as a priority. As Jonathan said, there are already some clear visions of what such modules should look like. Also, if any project seeks to replace the DMD front-end, Walter should be the technical advisor. Like anyone putting a lot of time and effort into a design, he could have strong feelings about certain decisions and about seeing them implemented in a seemingly worse way.
That said, you give the impression of being really dedicated to the project, even setting yourself a timeline, which is a good sign. I wish you all the best for the project and hope that - even without any 'official' recognition - it will see a lot of use as the base of D tools.

To learn about parsing I wrote a syntax highlighter for the DCPU-16 assembly of Minecraft author Markus Persson. (Which was a good exercise.) Interestingly I ended up with a design for the Token struct similar to yours:
- separate array for line # lookup
- TokenType/TokenKind enum
- Trie for matching token kinds (why do you expand it into nested switch-case through CTFE mixins though?)
Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;).
One notable difference is that I only check for isEoF in one location, since I append "\0" to the source code as a stop token (that can also serve as an end-of-line indicator). (The general idea is to move checks outside of loops.)
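Roughly like this, stripped to the essentials (a simplified sketch with made-up names, not my actual code):

  import std.ascii : isAlphaNum, isDigit, isWhite;

  enum TokenKind { identifier, number }

  struct Token
  {
      TokenKind kind;
      uint start;   // byte offset; the line number lives in a separate array
      uint length;
  }

  Token[] lex(string source)
  {
      auto text = source ~ '\0';              // sentinel: EoF is tested in one place
      auto tokens = new Token[source.length]; // preallocate the worst case
      size_t n;
      for (size_t i = 0; text[i] != '\0'; )
      {
          if (isWhite(text[i])) { ++i; continue; }
          immutable start = i;
          immutable kind = isDigit(text[i]) ? TokenKind.number : TokenKind.identifier;
          while (isAlphaNum(text[i])) ++i;
          if (i == start) ++i;                // unknown character: consume it anyway
          tokens[n++] = Token(kind, cast(uint) start, cast(uint)(i - start));
      }
      return tokens[0 .. n];                  // shrink to the actual count
  }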

** Errors  **
I generally keep the start and end column, in case someone wants that. A real world example:

  ubyte x = ...;
  if (x >= 0 && x <= 8) ...
  Line 2, warning: Comparison is always true.

At first glance you say "No, a byte can become larger than 8.", but the compiler is just complaining about the first part of the expression here, which would be clarified by e.g. some ASCII/Unicode art:

  Line 2, warning: Comparison is always true:
  if (x >= 0 && x <= 8) ...
      ^----^
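Producing that underline is cheap once you keep the columns around; a quick hypothetical sketch:

  import std.array : replicate;
  import std.stdio : writeln;

  // Underline columns [startCol, endCol] of a line (1-based, inclusive;
  // assumes endCol > startCol).
  void printSquiggle(string line, size_t startCol, size_t endCol)
  {
      writeln(line);
      writeln(" ".replicate(startCol - 1) ~ "^"
              ~ "-".replicate(endCol - startCol - 1) ~ "^");
  }

  // printSquiggle("if (x >= 0 && x <= 8) ...", 5, 10) prints the art above.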

** Code highlighting **
Especially Gtk's TextView (which GtkSourceView is based on) isn't made for several thousand tokens. The text view will take a second per 20000 tokens or so. The source view works around that by highlighting in chunks, but you can still feel the delay when you jump to the end of the file in gedit, and even MonoDevelop suffers from the bad performance. If I understood your posts correctly, you are already planning for minimal change sets. It is *very* important to only update changed parts in a syntax highlighting editor. So for now I ended up writing an 'insertText' and a 'deleteText' method for the parser that take 'start index', 'text' and 'start index', 'end index' respectively and return a list of removed and added tokens.
If possible, make sure that your library can be used with GtkSourceView, Scintilla and QSyntaxHighlighter. Unfortunately the bindings for the last two could use an update, but the API documentation should give you an idea of how to use them most efficiently.
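In code, the interface I described boils down to something like this (simplified; Token as in the sketch above):

  // Indices are byte offsets into the current source text.
  struct TokenDelta
  {
      Token[] removed; // tokens invalidated by the edit
      Token[] added;   // tokens re-lexed from the damaged region
  }

  interface IncrementalLexer
  {
      TokenDelta insertText(size_t start, string text);
      TokenDelta deleteText(size_t start, size_t end);
  }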

-- 
Marco
May 20, 2012
On Saturday, 19 May 2012 at 20:35:10 UTC, Marco Leise wrote:
> Am Fri, 11 May 2012 10:01:28 +0200
> schrieb "Roman D. Boiko" <rb@d-coding.com>:
>
>> There were several discussions about the need for a D compiler library.
>> 
>> I propose my draft implementation of a lexer for community review:
>> https://github.com/roman-d-boiko/dct
(I decided to comment on both my post and your reply.)

I've got *a lot* of feedback from the community, which caused a significant redesign (but also a delay of yet unknown duration). I'll write more on that later. Thanks a lot to everybody!

>> The lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way).
Looks like I'm going to replace this implementation almost completely.

>> The goal is to have source code loading, a lexer, a parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.).
See below.

>> My commitment is to have at least the front end built this year (and conforming to the D2 specification unless explicitly stated otherwise for some particular aspect).
By the D specification I mean http://dlang.org/language-reference.html, or (when the specification is not clear to me) one of the following:
* TDPL,
* the current DMD implementation,
* community feedback.

(Note that I may assume that I understand some aspect, but later revisit it if needed.)

>> Please post any feedback here. A dedicated project web-site will be created later.
It is going to be http://d-coding.com, but it is not usable yet (I don't have much web-design practice). It is hosted at https://github.com/roman-d-boiko/roman-d-boiko.github.com; pull requests are welcome.

> A general purpose D front-end library has been discussed several times. So at least as a basis for day-dreaming about better IDE support and code analysis tools it has huge value. It is good to see that someone finally takes the time to implement one. My only concern is that not enough planning went into the design.
Could you name a few specific concerns? The reason for this thread was to reflect on the design early. OTOH, I didn't spend time documenting my design goals and tradeoffs, so the discussion turned into a brainstorm and wasn't always productive. But now that I see how much value I can get even without doing my homework, I'm much more likely to provide better documentation and ask more specific questions.

>I'm thinking of things like brainstorming and collecting
> possible use cases from the community, or looking at how some C++ tools do their job and what infrastructure they are built on. Although it is really difficult to tell from other people's code which decisions were 'important' and which were just the author's way of doing things.
I'm going to pick up several use cases and prioritize them according to my judgement. Feel free to suggest any cases that you think are needed (with motivation). Prioritizing is necessary to define what is out of scope and plan work into milestones, in order to ensure the project is feasible.

> Inclusion into Phobos I would not see as a priority. As Jonathan said, there are already some clear visions of what such modules should look like.
Well, *considering* such inclusion seriously would help improve the quality of the project. But it is not necessarily what will happen. Currently I think that my goals are close to those of use case 2 from Jonathan's reply. But until the project is reasonably complete it is not the time to argue whether to include it (or part of it).

> Also, if any project seeks to replace the DMD front-end, Walter should be the technical advisor. Like anyone putting a lot of time and effort into a design, he could have strong feelings about certain decisions and about seeing them implemented in a seemingly worse way.
Not a goal currently, because that would make the project significantly less likely ever to be completed.

> That said, you give the impression of being really dedicated to the project, even setting yourself a timeline, which is a good sign. I wish you all the best for the project and hope that - even without any 'official' recognition - it will see a lot of use as the base of D tools.
Well, I hope that some more people will join the project once it stabilizes and delivers something useful.

By the way, is anybody *potentially* interested to join?

> To learn about parsing I wrote a syntax highlighter for the DCPU-16 assembly of Minecraft author Markus Persson. (Which was a good exercise.) Interestingly I ended up with a design for the Token struct similar to yours:
> - separate array for line # lookup
> - TokenType/TokenKind enum
> - Trie for matching token kinds (why do you expand it into nested switch-case through CTFE mixins though?)
The switch code generation is something temporary that works now and helps me evaluate possible problems with the future design, which is planned to be a compile-time finite automaton (likely deterministic).
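For illustration, here is a simplified sketch of the mixin approach (not the actual DCT code; it assumes 'text', 'i' and 'TokenKind.kw_*' members exist at the mixin site, and that no keyword is a prefix of another):

  import std.algorithm : filter;
  import std.array : array;
  import std.conv : to;

  // One trie level becomes one switch over the character at `depth`;
  // shared prefixes turn into nested switches.
  string genTrie(string[] words, size_t depth = 0)
  {
      string code = "switch (text[i + " ~ depth.to!string ~ "]) {\n";
      bool[char] seen;
      foreach (w; words)
      {
          immutable c = w[depth];
          if (c in seen) continue;
          seen[c] = true;
          auto sub = words.filter!(x => x[depth] == c).array;
          code ~= "case '" ~ c ~ "':\n";
          if (sub.length == 1 && sub[0].length == depth + 1)
              code ~= "return TokenKind.kw_" ~ sub[0] ~ ";\n";
          else
              code ~= genTrie(sub, depth + 1) ~ "break;\n";
      }
      return code ~ "default: break;\n}\n";
  }

  // Usage at the match site: mixin(genTrie(["break", "case", "cast", "catch"]));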

> Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;).
I'm sure it will :) (I'm going to elaborate on this some time later).

> One notable difference is that I only check for isEoF in one location, since I append "\0" to the source code as a stop token (that can also serve as an end-of-line indicator). (The general idea is to move checks outside of loops.)
There are several EoF conditions: \0, \x1A, __EOF__ and physical EOF. And any loop would need to check for all of them. Preallocation could eliminate the need to check for physical EoF, but it would make it impossible to avoid copying the input string, and it would also rule out passing a function a slice of a string to be lexed.
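So without a sentinel, the per-character test looks roughly like this (sketch; __EOF__ is a whole token, so it is handled where identifiers are lexed):

  bool atEndOfFile(string text, size_t i)
  {
      return i >= text.length       // physical end of input
          || text[i] == '\0'        // NUL
          || text[i] == '\x1A';     // Ctrl+Z (substitute character)
  }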

> ** Errors  **
> I generally keep the start and end column, in case someone wants that. A real world example:
>
>   ubyte x = ...;
>   if (x >= 0 && x <= 8) ...
>   Line 2, warning: Comparison is always true.
>
> At first glance you say "No, a byte can become larger than 8.", but the compiler is just complaining about the first part of the expression here, which would be clarified by e.g. some ASCII/Unicode art:
>
>   Line 2, warning: Comparison is always true:
>   if (x >= 0 && x <= 8) ...
>       ^----^
This functionality has been a priority from the beginning. It is not implemented yet, but designed for. Evaluating column and line only on demand stems from the assumption that such information is rarely needed (primarily to display information to the user). My new design also addresses the use cases where it is needed frequently.
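The idea, simplified (not the final DCT data structures): a token stores only its start offset, and one sorted array of line-start offsets per file turns the lookup into a binary search:

  import std.range : assumeSorted;
  import std.typecons : Tuple, tuple;

  // lineStarts[k] holds the offset of the first character of line k+1,
  // recorded once while scanning; tokens store only their start offset.
  Tuple!(size_t, "line", size_t, "column")
  lineAndColumn(const(size_t)[] lineStarts, size_t offset)
  {
      // the number of line starts <= offset gives the 1-based line number
      immutable line = lineStarts.assumeSorted.lowerBound(offset + 1).length;
      return tuple!("line", "column")(line, offset - lineStarts[line - 1] + 1);
  }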

> ** Code highlighting **
> Especially Gtk's TextView (which GtkSourceView is based on) isn't made for several thousand tokens. The text view will take a second per 20000 tokens or so. The source view works around that by highlighting in chunks, but you can still feel the delay when you jump to the end of the file in gedit, and even MonoDevelop suffers from the bad performance. If I understood your posts correctly, you are already planning for minimal change sets. It is *very* important to only update changed parts in a syntax highlighting editor. So for now I ended up writing an 'insertText' and a 'deleteText' method for the parser that take 'start index', 'text' and 'start index', 'end index' respectively and return a list of removed and added tokens.
> If possible, make sure that your library can be used with GtkSourceView, Scintilla and QSyntaxHighlighter. Unfortunately the bindings for the last two could use an update, but the API documentation should give you an idea of how to use them most efficiently.
Currently I plan to support GtkSourceView, Scintilla, and a custom editor API which I plan to define so that it is most efficient in this respect. I haven't worked on that thoroughly yet.

Incremental changes are the key to efficiency, and I'm going to invest a lot of effort into getting them right. Also, the immutability of the data structures will enable many optimizations.
May 20, 2012
Am Sun, 20 May 2012 10:09:34 +0200
schrieb "Roman D. Boiko" <rb@d-coding.com>:

> Could you name a few specific concerns?

Mostly my own gut feeling that things which sound great in my head turn out to bite me in the end. Things that one just doesn't think of, because of the limited horizon everyone has and, probably, a lack of experience in the field of compiler/tools writing.
On the other hand I have good experience with working by community feedback. So the rough edges in the design may already be ironed out by now.

> I'm going to pick up several use cases and prioritize them according to my judgement. Feel free to suggest any cases that you think are needed (with motivation). Prioritizing is necessary to define what is out of scope and plan work into milestones, in order to ensure the project is feasible.

There is one feature I remember caused some headaches for Code::Blocks. They used a separate parser instance for every project in the IDE, which meant that all the standard include files would be parsed and kept in memory multiple times. When they later switched to a common parser for all projects, they ran into stability issues. If you can arrange it, it would be great for multi-project IDEs to be able to add projects to and remove them from your parser without reparsing Phobos/druntime (which may have its .di files replaced by .d files, looking at the past discussion).

C bindings could be an option. (As in: the smallest common denominator.) They allow existing tools (written in Java, C#, Python, ...) to use your library.

> > Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;).
> I'm sure it will :) (I'm going to elaborate on this some time later).

I'm curious.

> There are several EoF conditions: \0, \x1A, __EOF__ and physicas EOF. And any loop would need to check for all. Preallocation could eliminate the need to check for physical EoF, but would make it impossible to avoid input string copying and also would not allow passing a function a slice of string to be lexed.

I found that I usually either load from file into a newly allocated buffer (where a copy occurs, only because I forgot about assumeUnique in std.exception) or I am editing the file in which case I recreate the source string after every key stroke anyway. I can still pass slices of that string to functions though. Not sure what you mean.
It probably doesn't work for D as well as for ASM code, but I could also check for \x1A and __EOF__ in the same fashion. (By the way, is it \x1A (substitute, ^Z) or did you mean \x04 (end-of-transmission, ^D)?) The way it works is: Parser states like 'in expression' can safely peek at the next character at any time. If it doesn't match what they expect they emit an error and drop back to the "surrounding" parser state. When they reach the "file" level, that's the only place where an EOF (which will only occur once per file anyway) will be consumed.
In theory one can drop all string length checks and work on char* directly with a known terminator char that is distinct from anything else.
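That is, the inner scanning loops shrink to plain pointer walks, e.g. (sketch):

  // With a known terminator no length comparison is needed: '\0' fails
  // every "expected character" test and ends the loop by itself.
  size_t skipDigits(const(char)* p)
  {
      auto start = p;
      while (*p >= '0' && *p <= '9')
          ++p;
      return cast(size_t)(p - start);
  }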

> > ** Errors  **
> > I generally keep the start and end column, in case someone
> > wants that.
>
> This functionality has been a priority from the beginning. It is not implemented yet, but designed for. Evaluating column and line only on demand stems from the assumption that such information is rarely needed (primarily to display information to the user). My new design also addresses the use cases where it is needed frequently.
>
> [...]
>
> Incremental changes are the key to efficiency, and I'm going to invest a lot of effort into making them. Also immutability of data structures will enable many optimizations.

No more questions!

-- 
Marco

May 20, 2012
On Sunday, 20 May 2012 at 17:42:59 UTC, Marco Leise wrote:
> There is one feature I remember caused some headaches for Code::Blocks. They used a separate parser instance for every project in the IDE, which meant that all the standard include files would be parsed and kept in memory multiple times. When they later switched to a common parser for all projects, they ran into stability issues. If you can arrange it, it would be great for multi-project IDEs to be able to add projects to and remove them from your parser without reparsing Phobos/druntime (which may have its .di files replaced by .d files, looking at the past discussion).
The opposite situation: I didn't think about a parser per project :)
So I guess no need to worry here.

> C bindings could be an option. (As in: the smallest common denominator.) They allow existing tools (written in Java, C#, Python, ...) to use your library.
Yeah, but I'm far from there yet.

>> > Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;).
>> I'm sure it will :) (I'm going to elaborate on this some time later).
>
> I'm curious.
Maybe I don't understand your use case, but the idea is that if
you parse as you type, it should be possible to avoid parsing and
allocating memory for those lines which have not changed. But
that is not compatible with storing tokens in an array, since it
would mean reallocating memory on each change, so another data
structure should be used (e.g., a linked list, or, if efficient
lookup is needed, a red-black tree). Only benchmarks can show
whether (and how much) my approach would be faster for a specific
situation (input patterns like average size of data, complexity
of parsing algorithms, requirements, etc.).
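For example (just a sketch of one possible shape, not the actual DCT design), tokens could be grouped per line so that an edit replaces only the entries for the touched lines:

  // Token and lex as in your earlier sketch. Multi-line tokens (block
  // comments, strings) need extra care; an RB-tree or rope over these
  // entries would make mid-file inserts/removals O(log n).
  struct LineTokens
  {
      string text;    // the line's source
      Token[] tokens; // tokens lexed from that line only
  }

  LineTokens[] document;

  void replaceLine(size_t lineNo, string newText)
  {
      document[lineNo] = LineTokens(newText, lex(newText)); // re-lex one line
  }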

> I found that I usually either load from file into a newly allocated buffer (where a copy occurs, only because I forgot about assumeUnique in std.exception) or I am editing the file in which case I recreate the source string after every key stroke anyway. I can still pass slices of that string to functions though. Not sure what you mean.
Answered below.

> It probably doesn't work for D as well as for ASM code, but I could also check for \x1A and __EOF__ in the same fashion.
> (By the way, is it \x1A (substitute, ^Z) or did you mean \x04 (end-of-transmission, ^D)?)
D has the following EoF cases: \0, \x1A, physical EoF and the
__EOF__ special token when not inside a comment or certain string
literals. ^D is not in this list.

> The way it works is: Parser states like 'in expression' can safely peek at the next character at any time. If it doesn't match what they expect they emit an error and drop back to the "surrounding" parser state. When they reach the "file" level, that's the only place where an EOF (which will only occur once per file anyway) will be consumed.
> In theory one can drop all string length checks and work on char* directly with a known terminator char that is distinct from anything else.
If you want to pass a slice of the input string to a function, you
cannot append \0 to it without copying data. If you don't append
some pre-defined character, you must check for length *and* all
supported terminating characters. On the other hand, your design
might not require passing slices, and if the language syntax allows
deterministic parsing (when you know what to expect next), there
is no need for checking EoF.
May 20, 2012
Am Sun, 20 May 2012 20:37:07 +0200
schrieb "Roman D. Boiko" <rb@d-coding.com>:

> >> > Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;).
> >> I'm sure it will :) (I'm going to elaborate on this some time later).
> >
> > I'm curious.
> Maybe I don't understand your use case, but the idea is that if you parse as you type, it should be possible to avoid parsing and allocating memory for those lines which have not changed. But that is not compatible with storing tokens in an array, since it would mean reallocating memory on each change, so another data structure should be used (e.g., a linked list, or, if efficient lookup is needed, a red-black tree). Only benchmarks can show whether (and how much) my approach would be faster for a specific situation (input patterns like average size of data, complexity of parsing algorithms, requirements, etc.).

I had only thought about the single-use situation, not the situation when editing files. Now the idea has evolved a bit, and you probably thought about storing the scope hierarchy in a tree, too. It is still difficult to write a fast parser when someone can add/remove a '{' somewhere at the top of datetime.d, which changes the interpretation of the rest of the file. Mono-D has some hiccups worse than those (several seconds even) - maybe a bug. I'll keep my fingers crossed.

> If you want to pass a slice of the input string to a function, you cannot append \0 to it without copying data. If you don't append some pre-defined character, you must check for length *and* all supported terminating characters. On the other hand, your design might not require passing slices, and if the language syntax allows deterministic parsing (when you know what to expect next), there is no need for checking EoF.

Now I get it!

-- 
Marco