August 01, 2012 Dscanner - It exists
First: This is not a release announcement.

I want to let people know that Dscanner *exists*.

https://github.com/Hackerpilot/Dscanner/

It's a utility that I designed to be used by text editors such as VIM, Textadept, etc., for getting information about D source code.

I've held off on announcing this in the past because I don't think that it's really ready for a release, but after seeing several of the threads about lexers in the D newsgroup I decided I should make some sort of announcement.

What it does:
* Has a D lexer
* Can syntax-highlight D source files as HTML
* Can generate CTAGS files from D code
* VERY BASIC autocomplete <- The reason I don't consider it "done"
* Can generate a JSON summary of D code.
* Line of code counter. Basically just a filter on the range of tokens that looks for things like semicolons.

It's Boost licensed, so feel free to use (or submit improvements for) the tokenizer.

August 01, 2012 Re: Dscanner - It exists
Posted in reply to Brian Schott

On 8/1/2012 10:30 AM, Brian Schott wrote:
> First: This is not a release announcement.
>
> I want to let people know that Dscanner *exists*.
>
> https://github.com/Hackerpilot/Dscanner/
>
> It's a utility that I designed to be used by text editors such as VIM,
> Textadept, etc., for getting information about D source code.
>
> I've held off on announcing this in the past because I don't think that it's
> really ready for a release, but after seeing several of the threads about lexers
> in the D newsgroup I decided I should make some sort of announcement.
>
> What it does:
> * Has a D lexer
> * Can syntax-highlight D source files as HTML
> * Can generate CTAGS files from D code
> * VERY BASIC autocomplete <- The reason I don't consider it "done"
> * Can generate a JSON summary of D code.
> * Line of code counter. Basically just a filter on the range of tokens that
> looks for things like semicolons.
>
> It's Boost licensed, so feel free to use (or submit improvements for) the
> tokenizer.
I suggest proposing the D lexer as an addition to Phobos. But if that is done, its interface would need to accept a range as input, and its output should be a range of tokens.

August 01, 2012 Re: Dscanner - It exists
Posted in reply to Walter Bright

On Wednesday, 1 August 2012 at 17:36:16 UTC, Walter Bright wrote:
>
> I suggest proposing the D lexer as an addition to Phobos. But if that is done, its interface would need to accept a range as input, and its output should be a range of tokens.
It used to be range-based, but the performance was terrible. The inability to use slicing on a forward-range of characters and the gigantic block on KCachegrind labeled "std.utf.decode" were the reasons that I chose this approach. I wish I had saved the measurements on this....

August 01, 2012 Re: Dscanner - It exists
Posted in reply to Brian Schott

On Wednesday, August 01, 2012 19:58:46 Brian Schott wrote:
> On Wednesday, 1 August 2012 at 17:36:16 UTC, Walter Bright wrote:
> > I suggest proposing the D lexer as an addition to Phobos. But if that is done, its interface would need to accept a range as input, and its output should be a range of tokens.
>
> It used to be range-based, but the performance was terrible. The inability to use slicing on a forward-range of characters and the gigantic block on KCachegrind labeled "std.utf.decode" were the reasons that I chose this approach. I wish I had saved the measurements on this....

If you want really good performance out of a range-based solution operating on ranges of dchar, then you need to special-case for the built-in string types all over the place, and if you have to wrap them in other range types (generally because of calling another range-based function), then there's a good chance that you will indeed get a performance hit. D's range-based approach is really nice from the perspective of usability, but you have to work at it a bit if you want it to be efficient when operating on strings. It _can_ be done though.

The D lexer that I'm currently writing special-cases strings pretty much _everywhere_ (string mixins really help reduce the cost of that in terms of code duplication). The result is that if I do it right, its performance for strings should be very close to what dmd can do (it probably won't quite reach dmd's performance simply because of some extra stuff it does to make it more usable for stuff other than compilers - e.g. syntax highlighters).

But you'll still likely get a performance hit if you did something like

    string source = getSource();
    auto result = tokenRange(filter!"true"(source));

instead of

    string source = getSource();
    auto result = tokenRange(source);

It won't be quite as bad a performance hit with 2.060 thanks to some recent optimizations to string's popFront, but you're going to lose out on some performance regardless, because nothing can special-case for every possible range type, and one of the keys to fast string processing is to minimize how much you decode characters, which generally requires special-casing.

- Jonathan M Davis
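To illustrate the special-casing described above, here is a minimal sketch. The helper `countSpaces` is hypothetical (it is not taken from Dscanner, Phobos, or the lexer mentioned in the post); it only shows how a template can take a no-decoding fast path for built-in strings, and why wrapping the string in `filter` forces the generic path.

```d
import std.algorithm : filter;
import std.range;
import std.traits : isSomeString;

/// Hypothetical helper (an assumption for illustration): counts ASCII spaces.
size_t countSpaces(R)(R input)
    if (isInputRange!R)
{
    size_t n = 0;
    static if (isSomeString!R)
    {
        // Fast path for built-in strings: index code units directly.
        // Nothing is decoded, and ' ' never occurs inside a multi-unit
        // UTF-8 or UTF-16 sequence, so the count is still correct.
        foreach (i; 0 .. input.length)
            if (input[i] == ' ')
                ++n;
    }
    else
    {
        // Generic path: one element at a time through front/popFront.
        for (; !input.empty; input.popFront())
            if (input.front == ' ')
                ++n;
    }
    return n;
}

unittest
{
    string source = "int x = 1;";
    auto wrapped = filter!"true"(source);

    // Wrapping hides the string: the result is a range of decoded dchar.
    static assert(isSomeString!(typeof(source)));
    static assert(!isSomeString!(typeof(wrapped)));
    static assert(is(ElementType!(typeof(wrapped)) == dchar));

    assert(countSpaces(source) == 3);  // takes the fast path
    assert(countSpaces(wrapped) == 3); // takes the generic path
}
```

Both calls give the same answer; only the second one has to go through the wrapper's front/popFront, which is where the decoding cost shows up.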

August 01, 2012 Re: Dscanner - It exists
Posted in reply to Brian Schott

On Wed, 01 Aug 2012 19:58:46 +0200, "Brian Schott" <briancschott@gmail.com> wrote:
> On Wednesday, 1 August 2012 at 17:36:16 UTC, Walter Bright wrote:
> >
> > I suggest proposing the D lexer as an addition to Phobos. But if that is done, its interface would need to accept a range as input, and its output should be a range of tokens.
>
> It used to be range-based, but the performance was terrible. The inability to use slicing on a forward-range of characters and the gigantic block on KCachegrind labeled "std.utf.decode" were the reasons that I chose this approach. I wish I had saved the measurements on this....

I can understand that. I was reading a dictionary file with readText().splitLines(); and wondering why Unicode decoding was being performed. Unfortunately, ranges work on Unicode units and all structured text files are structured by ASCII characters. While these file formats are probably just old or were designed with some false sense of compatibility in mind, it is also clear to their inventors that parsing them is easier and faster with single-byte characters to delimit tokens.

But we have talked about UTF-8 vs. ASCII and foreach vs. ranges before. I still hope for some super-smart solution that doesn't need a book of documentation and allows some kind of ASCII-equivalent range.

I've heard that foreach over UTF-8 with a dchar loop variable does an implicit decoding of the UTF-8 string. While this is useful, it is also not self-explanatory and needs some reading into the topic.

-- Marco
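The foreach behaviour mentioned above can be seen in a few lines. This is a minimal sketch; the two helper names are hypothetical and exist only to make the difference observable.

```d
import std.range : walkLength;

/// Hypothetical helper: iterates with a char loop variable (no decoding).
size_t countCodeUnits(string s)
{
    size_t n = 0;
    foreach (char c; s)
        ++n;
    return n;
}

/// Hypothetical helper: iterates with a dchar loop variable,
/// which makes foreach decode the UTF-8 implicitly.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

unittest
{
    string s = "h\u00E9llo"; // U+00E9 takes two bytes in UTF-8

    assert(s.length == 6);           // length is in code units
    assert(countCodeUnits(s) == 6);
    assert(countCodePoints(s) == 5);
    assert(s.walkLength == 5);       // range primitives decode as well
}
```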

August 01, 2012 Re: Dscanner - It exists
Posted in reply to Brian Schott

On Wed, Aug 1, 2012 at 7:30 PM, Brian Schott <briancschott@gmail.com> wrote:
> First: This is not a release announcement.
>
> I want to let people know that Dscanner *exists*.
>
> https://github.com/Hackerpilot/Dscanner/

> What it does:
> * Has a D lexer
(...)
> * Can generate a JSON summary of D code.

I just tested the JSON output and it works nicely. Finally, a way to get imports!

I have two remarks (not criticisms!)

- there seem to be two "structs" objects in the JSON, unless I'm mistaken.
- alias declarations are not parsed, seemingly (as in "alias int MyInt;").

Also, do you think comments could be included in the JSON?

Nice work, keep going!

August 01, 2012 Re: Dscanner - It exists
Posted in reply to Marco Leise

On Wednesday, August 01, 2012 22:34:14 Marco Leise wrote:
> I still hope for some
> super-smart solution, that doesn't need a book of documentation and allows
> some kind of ASCII-equivalent range.
If you want pure ASCII, then just cast to ubyte[] (or const(ubyte)[] or immutable(ubyte)[], depending on the constness involved). string functions won't work, because they require UTF-8 (or UTF-16 or UTF-32 if they're templatized on string type), but other range-based and array functions will work just fine.
- Jonathan M Davis
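As a minimal sketch of the cast suggested above: the helpers `countNewlines` and `firstLine` are hypothetical names chosen for illustration, and the example assumes the input really is ASCII-delimited text.

```d
import std.algorithm : count, splitter;
import std.range;

/// Hypothetical helper: counts '\n' bytes without decoding any UTF-8.
size_t countNewlines(string text)
{
    // Reinterprets the same memory as bytes; nothing is copied.
    auto bytes = cast(immutable(ubyte)[]) text;
    return bytes.count(cast(ubyte) '\n');
}

/// Hypothetical helper: returns the first line, split on the byte level.
string firstLine(string text)
{
    auto bytes = cast(immutable(ubyte)[]) text;
    // splitter works on the byte array like on any other array.
    return cast(string) bytes.splitter(cast(ubyte) '\n').front;
}

unittest
{
    // A string is a range of dchar, the byte view is a range of ubyte.
    static assert(is(ElementType!string == dchar));
    static assert(is(ElementType!(immutable(ubyte)[]) == immutable(ubyte)));

    assert(countNewlines("alpha\nbeta\ngamma") == 2);
    assert(firstLine("alpha\nbeta\ngamma") == "alpha");
}
```

Casting the slices back to string is only safe as long as the split points fall on ASCII delimiters, which is the case for '\n'.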

August 01, 2012 Re: Dscanner - It exists
Posted in reply to Philippe Sigaud

On Wednesday, 1 August 2012 at 20:39:49 UTC, Philippe Sigaud wrote:
> I have two remarks (not criticisms!)
>
> - there seem to be two "structs" objects in the JSON, unless I'm mistaken.
> - alias declarations are not parsed, seemingly (as in "alias int MyInt;").
>
> Also, do you think comments could be included in the JSON?
>
> Nice work, keep going!
It's more likely that I'll remember things if they're enhancement requests/bugs on GitHub.
Structs: I'll look into it
Alias: Not implemented yet.
Comments: It's planned. I want to be able to give doc comments in the autocomplete information.

August 01, 2012 Re: Dscanner - It exists
Posted in reply to Brian Schott

On 01/08/2012 19:58, Brian Schott wrote:
> On Wednesday, 1 August 2012 at 17:36:16 UTC, Walter Bright wrote:
>>
>> I suggest proposing the D lexer as an addition to Phobos. But if that
>> is done, its interface would need to accept a range as input, and its
>> output should be a range of tokens.
>
> It used to be range-based, but the performance was terrible. The
> inability to use slicing on a forward-range of characters and the
> gigantic block on KCachegrind labeled "std.utf.decode" were the reasons
> that I chose this approach. I wish I had saved the measurements on this....
Maybe a RandomAccessRange could do the trick?

August 01, 2012 Re: Dscanner - It exists
Posted in reply to deadalnix

On 8/1/12 5:09 PM, deadalnix wrote:
> On 01/08/2012 19:58, Brian Schott wrote:
>> On Wednesday, 1 August 2012 at 17:36:16 UTC, Walter Bright wrote:
>>>
>>> I suggest proposing the D lexer as an addition to Phobos. But if that
>>> is done, its interface would need to accept a range as input, and its
>>> output should be a range of tokens.
>>
>> It used to be range-based, but the performance was terrible. The
>> inability to use slicing on a forward-range of characters and the
>> gigantic block on KCachegrind labeled "std.utf.decode" were the reasons
>> that I chose this approach. I wish I had saved the measurements on
>> this....
>
> Maybe a RandomAccessRange could do the trick?
I think the best way here is to define a BufferedRange that takes any other range and supplies a buffer for it (with the appropriate primitives) in a native array.
Andrei
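One possible reading of the BufferedRange idea, as a rough sketch only. No such type is defined in the thread; the struct layout, the `window` primitive, and the `buffered` helper are all assumptions made for illustration, and the sketch only handles simple value-type elements.

```d
import std.algorithm : equal, filter;
import std.range;
import std.traits : Unqual;

/// Hypothetical sketch: wraps any input range and exposes a chunk of it
/// as a native array that can be indexed and sliced.
struct BufferedRange(R)
    if (isInputRange!R)
{
    alias E = Unqual!(ElementType!R);

    private R source;
    private E[] buffer;
    private size_t pos, len;

    this(R source, size_t capacity)
    {
        assert(capacity > 0);
        this.source = source;
        buffer = new E[capacity];
        fill();
    }

    // Refills the buffer from the underlying range.
    private void fill()
    {
        pos = 0;
        len = 0;
        while (len < buffer.length && !source.empty)
        {
            buffer[len++] = source.front;
            source.popFront();
        }
    }

    @property bool empty() { return pos == len; }
    @property E front() { return buffer[pos]; }

    void popFront()
    {
        if (++pos == len)
            fill();
    }

    /// Assumed extra primitive: the buffered, not yet consumed elements.
    E[] window() { return buffer[pos .. len]; }
}

/// Hypothetical convenience constructor.
auto buffered(R)(R source, size_t capacity)
{
    return BufferedRange!R(source, capacity);
}

unittest
{
    auto evens = filter!"a % 2 == 0"(iota(10)); // 0, 2, 4, 6, 8
    auto r = buffered(evens, 4);

    // The filter result has no slicing, but the window is a plain array.
    assert(r.window == [0, 2, 4, 6]);
    assert(r.window[1 .. 3] == [2, 4]);
    assert(equal(r, [0, 2, 4, 6, 8]));
}
```

The sketch is input-range only: copies share the same buffer array, so a copy should not be used after the original has advanced.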