D language lexer written in D - dlexer.d - D Programming Language Discussion Forum

December 18, 2004

D language lexer written in D - dlexer.d

Posted by James Dunne

Attachments:

dlexer.d

Hello all,

I thought as a nice Christmas present for you all working so hard on D, I would contribute something some of you might find useful.  I've been working hard on a lexer (tokenizer) for the D language.  Attached to this post is the lexer module written in D!!

I began work on it for code-completion support for my D IDE called Orion (over on dsource.org).  Don't bother checking anything out there yet, as it's all in ruins right now. ;)

I tested this module somewhat, and it successfully lexes its own source code! This module could be useful for a few things, like:

+ code-completion database for D modules
+ generating CTAGS for the D language
+ indent-like program for D

...and any other D source code tools you can think of.

Merry Christmas all, enjoy!

Regards,
James Dunne

December 18, 2004

Re: D language lexer written in D - dlexer.d

Posted by Matthew in reply to James Dunne

Excellent! I shall be checking this out in the new year.

:-)

Matthew

December 18, 2004

Re: D language lexer written in D - dlexer.d

Posted by Ivan Senji in reply to James Dunne

"James Dunne" <jdunne4@bradley.edu> wrote in message news:cq0t39$18b8$1@digitaldaemon.com...
> Hello all,
>
> I thought as a nice Christmas present for you all working so hard on D, I would contribute something some of you might find useful.  I've been working hard on a lexer (tokenizer) for the D language.  Attached to this post is the lexer module written in D!!
>

Thanks! This is just what one of my projects needs. It looks great, and it looks like it isn't going to be hard to use. You scared me for a moment when I saw TOKdotvar, TOKdotti, TOKdotexp, and TOKdottype, but you don't seem to use them :)

> I began work on it for code-completion support for my D IDE called Orion (over on dsource.org).  Don't bother checking anything out there yet, as its all in ruins right now. ;)
>
> I tested this module somewhat, and it successfully lexes its own source code! This module could be useful for a few things, like:
>
> + code-completion database for D modules
> + generating CTAGS for the D language
> + indent-like program for D
>
> ..and any other D source code tools you can think of.

Hooray!

>
> Merry Christmas all, enjoy!

To you too!

>
> Regards,
> James Dunne
>

December 18, 2004

Re: D language lexer written in D [updated dlexer.d]

Posted by James Dunne in reply to James Dunne

Attachments:

dlexer.d

I have updated the dlexer.d module, so please use this version.

New features:
- line number tracking (line property of DLexer class)
- DLexerException class: constructs error message containing filename and current line number
- independent module now
- more comments! :)

TODO:
- correct wysiwyg string parsing
- correct hex string parsing
- correct numeric literal parsing (ints, floats, etc.)

I'm glad I could help you all in your project endeavours!  If anyone has some extra webspace that they wouldn't mind hosting D code snippets on, please let me know!  I've got a bunch of useful ones :)

Regards,
James Dunne

December 19, 2004

Re: D language lexer written in D [updated dlexer.d]

Posted by J C Calvarese in reply to James Dunne

James Dunne wrote:
> I'm glad I could help you all in your project endeavours!  If anyone has some
> extra webspace that they wouldn't mind hosting D code snippets on, please let me
> know!  I've got a bunch of useful ones :)
> 
> Regards,
> James Dunne

It sounds like you want to check out dsource.org.

To start a new project, just post a request in the "Potential Projects" forum (http://www.dsource.org/forums/viewforum.php?f=13).

If you just want to make some examples available to the community, you can post an example in the Tutorials section (http://www.dsource.org/tutorials/).

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/

December 19, 2004

Re: D language lexer written in D [updated dlexer.d]

Posted by Matthew in reply to James Dunne

James

I've not yet had a chance to look at the code, but I was wondering whether you could give a quick two-paragraph précis of the interface for your module? A bit of a sample program might be useful. (I've just had a sneak peek and seen nextToken(), so I reckon I could work it out, but it'd be nicer and quicker if you could just give us a bit of usage info.)

Cheers

Matthew

P.S. FYI, I wrote a source parser / processor about 4 yrs ago which I used to great effect on a very poorly programmed Java project I was brought in on. I was able to write filters to effect changes to hundreds of thousands of lines of code automatically, which was of great benefit when clearing up after a band of careless programmers. Now I'm sure that such programs will not be needed with D, being as how we're all so cool and all, but it'd be jolly nice to be able to write a server plug-in for D, and use it in the same tool (it uses COM). I shall investigate this in a couple of weeks' time. :-)

December 19, 2004

Re: D language lexer written in D [updated dlexer.d]

Posted by James Dunne in reply to Matthew

Sure thing Matthew,

As you have guessed already, nextToken() is your main guy to call.  This returns a Token *, representing the current language token.  A Token is a structure defined to have only 2 members: 'ident', and 'token'.  'token' is any value from the TOK enumeration which enumerates all the D language tokens available.  This lexer works just like the D compiler's lexer, making the longest possible token at all times (greedy).  The 'ident' member is used only if the token being parsed has a special meaning, like an identifier, string literal, or a numeric literal.
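A minimal sketch of the token layout and the greedy ("longest possible token") rule described above. This is not the code from dlexer.d: it is written for a current D compiler, returns Token by value rather than Token*, and the TOK members other than TOKidentity (TOKassign, TOKequal, TOKidentifier, TOKeof) are names assumed for illustration.

```d
import std.algorithm.searching : startsWith;
import std.ascii : isAlpha, isAlphaNum, isWhite;
import std.stdio : writeln;

// Hypothetical miniature of the TOK enumeration; the real one is far larger.
enum TOK { TOKassign, TOKequal, TOKidentity, TOKidentifier, TOKeof }

struct Token
{
    TOK token;     // which kind of token this is
    string ident;  // text payload, filled in only for identifiers here
}

// Lex one token starting at src[pos], always taking the longest match.
Token nextToken(string src, ref size_t pos)
{
    while (pos < src.length && isWhite(src[pos]))
        ++pos;
    if (pos >= src.length)
        return Token(TOK.TOKeof);

    if (isAlpha(src[pos]) || src[pos] == '_')
    {
        immutable start = pos;
        while (pos < src.length && (isAlphaNum(src[pos]) || src[pos] == '_'))
            ++pos;
        return Token(TOK.TOKidentifier, src[start .. pos]);
    }

    // Greedy matching: try "===" before "==" before "=".
    auto rest = src[pos .. $];
    if (rest.startsWith("===")) { pos += 3; return Token(TOK.TOKidentity); }
    if (rest.startsWith("=="))  { pos += 2; return Token(TOK.TOKequal); }
    if (rest.startsWith("="))   { pos += 1; return Token(TOK.TOKassign); }
    throw new Exception("unexpected character: " ~ src[pos]);
}

void main()
{
    size_t pos = 0;
    string src = "a === b == c = d";
    // Print each token kind with its payload until the end of input.
    for (auto t = nextToken(src, pos); t.token != TOK.TOKeof; t = nextToken(src, pos))
        writeln(t.token, " ", t.ident);
}
```

Checking the longest operator first is what keeps "===" from being split into "==" followed by "=".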

As of right now, the lexer is somewhat limited.  It successfully lexes most D language tokens, but does not have support for wysiwyg or hex strings, or for proper numeric literal parsing.  All numeric literals are assumed to be int32v tokens.  There is no official hexadecimal, octal, or floating-point parsing. However, a cheap hack is in effect to successfully half-assedly parse these, since it ensures the characters in the literal are either alpha or numeric. This means that 0x000 will be parsed as one numeric literal, since it starts with a numeric ('0') and follows with alphas ('x') and numerics ('0').
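The "cheap hack" for numeric literals might look roughly like the sketch below: start on a digit, then swallow every following letter or digit. The function name scanNumberish is hypothetical, and the real module may differ in details.

```d
import std.ascii : isAlphaNum, isDigit;
import std.stdio : writeln;

// Returns the literal text starting at src[pos], or null if src[pos]
// is not a digit. No validation of hex, octal or float syntax is done.
string scanNumberish(string src, ref size_t pos)
{
    if (pos >= src.length || !isDigit(src[pos]))
        return null;
    immutable start = pos;
    while (pos < src.length && isAlphaNum(src[pos]))
        ++pos;
    return src[start .. pos];
}

void main()
{
    size_t pos = 0;
    // The whole "0x000" run comes back as a single literal.
    writeln(scanNumberish("0x000 + 1", pos)); // prints 0x000
}
```

In this sketch a literal such as 1.5 would stop at the '.', which is consistent with the lack of floating-point support mentioned above.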

There are two ways to parse double-quoted strings ("string"): preserving the escape sequences, or interpreting them.  Code is in place to perform both methods for convenience.  Setting the version identifier 'interpret_slashes' makes the lexer return the string with its escape sequences interpreted (for example, \n becomes a newline character).  If this version identifier is not set, the string is simply copied with its escape sequences left exactly as written (this is the default).
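As a rough illustration of the two string modes, here is a sketch that switches on the same version identifier. The helper name stringBody and the handful of escapes it handles are assumptions made for this example, not the module's actual code.

```d
import std.stdio : writeln;

// raw is the text found between the double quotes.
string stringBody(string raw)
{
    version (interpret_slashes)
    {
        string result;
        for (size_t i = 0; i < raw.length; ++i)
        {
            if (raw[i] == '\\' && i + 1 < raw.length)
            {
                ++i;
                switch (raw[i])
                {
                    case 'n':  result ~= '\n'; break;
                    case 't':  result ~= '\t'; break;
                    case '"':  result ~= '"';  break;
                    case '\\': result ~= '\\'; break;
                    default:   result ~= raw[i]; break; // unknown escape: keep the character
                }
            }
            else
                result ~= raw[i];
        }
        return result;
    }
    else
    {
        return raw; // default: copy verbatim, backslashes included
    }
}

void main()
{
    // 4 characters by default; 3 when built with -version=interpret_slashes.
    writeln(stringBody(`a\tb`).length);
}
```

With dmd the identifier can be set on the command line via -version=interpret_slashes.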

In order to fully utilize the parser, you would ideally create a child class based on the DLexer class and use its methods.  If you encounter errors during your parsing, you may throw a DLexerException(), which will provide you with a fully detailed error message including the filename being parsed, and the current line number being parsed.  You must supply the actual error message yourself.  Here's an example:

# module dparser;
# import dlexer;
# class DParser : DLexer {
#   public:
#     this(char[] filename, char[] src) {
#       super(filename, src);
#     }
#
#     void parse() {
#       Token*   tok;
#
#       // Start parsing from the beginning of the module:
#       restart();
#
#       tok = nextToken();
#       // Throw some nicely formatted bogus error:
#       if (tok.token != TOK.TOKlcurly)
#         throw new DLexerException(this, "Some bogus error here.");
#     }
# }

You'll find this type of activity common when parsing source code.  Writing a convenience wrapper function to 'expect' tokens is a good idea.  Here's an example of what that would look like:

(within "class DParser : DLexer {" scope)

# Token* expect(TOK value, char[] msg) {
#   Token* tok = nextToken();
#   if (tok.token != value)
#     throw new DLexerException(this, msg);
#   return tok;
# }
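For anyone who wants to try the expect() pattern without dlexer.d at hand, here is a self-contained sketch for a current D compiler. MiniLexer and MiniLexerException are stand-ins invented for this example; in particular the "file(line): message" layout is an assumption, since the exact format DLexerException produces is not shown in this post.

```d
import std.conv : to;
import std.stdio : writeln;

enum TOK { TOKlcurly, TOKrcurly, TOKeof }
struct Token { TOK token; string ident; }

// Stand-in for DLexer: walks a pre-built token list instead of lexing text.
class MiniLexer
{
    string filename;
    int line = 1;
    private Token[] tokens;
    private size_t cursor;

    this(string filename, Token[] tokens)
    {
        this.filename = filename;
        this.tokens = tokens ~ Token(TOK.TOKeof); // always end with EOF
    }

    Token* nextToken()
    {
        Token* tok = &tokens[cursor];
        if (cursor + 1 < tokens.length)
            ++cursor; // stay parked on the final EOF token
        return tok;
    }

    // Consume one token and complain if it is not the one we wanted.
    Token* expect(TOK value, string msg)
    {
        Token* tok = nextToken();
        if (tok.token != value)
            throw new MiniLexerException(this, msg);
        return tok;
    }
}

// Prefixes the caller's message with the file name and current line.
class MiniLexerException : Exception
{
    this(MiniLexer lexer, string msg)
    {
        super(lexer.filename ~ "(" ~ lexer.line.to!string ~ "): " ~ msg);
    }
}

void main()
{
    auto lexer = new MiniLexer("test.d", [Token(TOK.TOKrcurly)]);
    try
    {
        lexer.expect(TOK.TOKlcurly, "'{' expected");
    }
    catch (MiniLexerException e)
    {
        writeln(e.msg); // prints test.d(1): '{' expected
    }
}
```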

Calling nextToken() will always consume a token and will place the cursor at the next token to be consumed.  If you wish to peek ahead of the current token to see what the next token will be without consuming it, call peekToken().  It works exactly like nextToken() but resets the cursor back to its original position.
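The peek behaviour described above boils down to "save the cursor, lex, put the cursor back". A minimal sketch, using a hypothetical WordLexer that splits on whitespace so the example stays short (the real lexer returns Token* values, not strings):

```d
import std.ascii : isWhite;

class WordLexer
{
    private string src;
    private size_t pos; // the cursor
    int line = 1;

    this(string src) { this.src = src; }

    // Consume and return the next whitespace-separated word ("" at end of input).
    string nextToken()
    {
        while (pos < src.length && isWhite(src[pos]))
        {
            if (src[pos] == '\n')
                ++line;
            ++pos;
        }
        immutable start = pos;
        while (pos < src.length && !isWhite(src[pos]))
            ++pos;
        return src[start .. pos];
    }

    // Same as nextToken(), but the cursor and line count are restored afterwards.
    string peekToken()
    {
        immutable savedPos = pos;
        immutable savedLine = line;
        scope (exit) { pos = savedPos; line = savedLine; }
        return nextToken();
    }
}

void main()
{
    auto lx = new WordLexer("class Foo {");
    assert(lx.peekToken() == "class"); // cursor unchanged
    assert(lx.nextToken() == "class"); // now consumed
    assert(lx.peekToken() == "Foo");
}
```

Restoring the line counter together with the position matters once the peeked token lies on a later line, otherwise error messages would report the wrong line.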

The DLexer class provides the following public methods and variables:

this(char[] filename, char[] src) -- give the lexer the name of the file being parsed (filename), and the contents of the entire file as a single string (src)
Token* nextToken();    --   consumes next token
Token* peekToken();    --   peeks at next token without consuming it
void restart();        --   restart parsing from the beginning of the module
char[] filename        --   name of the file being parsed
int line               --   current line number

Hope that helps a bit!  The code is pretty self-explanatory and somewhat well-documented.  Take a good long gander at it before you try to really use it.

Regards,
James Dunne

December 19, 2004

D parser for code-completion - dtags.d

Posted by James Dunne in reply to James Dunne

Attachments:

dtags.d

Hey all again,

I thought I'd release my D parser now.  This puppy is still very much a work in progress, but will successfully parse relatively simple D programs.  Writing a parser isn't very difficult, it's just a lot of work ;).  This module uses my dlexer module I released in this thread.  And so, it is a good(?) example of how to use the dlexer module.

The point of this module is to parse D programs for code-completion purposes.  A few simple structures are used to represent the module's information.  An example program is supplied which dumps out the parsed module's structs, enums, classes, functions, and variables.

I should mention that functionality is somewhat limited: version blocks aren't parsed correctly, and the type parsing on identifiers and functions is somewhat lacking.  I do intend to fix these issues in the next couple of days.  Test it out on some D programs you have lying around (or the phobos library for a good kick).

Regards,
James Dunne

December 19, 2004

Re: D language lexer written in D - dlexer.d

Posted by Simon Buchan in reply to James Dunne

Any particular reason you kept the whole TOK.TOK... naming system? Just seemed confusing to me...

Plus, it seems you still have the === identity token in there, wasn't that replaced with 'in'?

Just a small hint, those debug format strings would make a little more sense with wysiwyg strings (ie `"%s"` instead of "\"%s\"")

Looks like a good start, though, and works as a beautifier, to boot! :D


December 20, 2004

Re: D language lexer written in D - dlexer.d

Posted by James Dunne in reply to Simon Buchan

>Any particular reason you kept the whole TOK.TOK... naming system? Just
>seemed
>confusing to me...


I kept the TOK.TOK naming scheme out of pure dumb ignorance on my part ;)  I didn't realize we had anonymous enumerations.  Also, D doesn't allow you to use reserved keywords to declare enumeration values, so the TOK prefix was kept. Tonight I'll run through and remove the redundant TOK.TOK to be just TOK.
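For readers unfamiliar with the difference, a small sketch of named versus anonymous enumerations. The member names are picked for illustration only and do not reproduce the module's full token list.

```d
// Named enum: members are qualified, so the prefix shows up twice at use sites.
enum TOK { TOKif, TOKwhile }

// Anonymous enum: members land in the enclosing scope and are used unqualified.
// The prefix still has to stay, because `if` and `while` are reserved words
// and cannot be used as member names.
enum { TOKlcurly, TOKrcurly }

void main()
{
    auto a = TOK.TOKif; // the "TOK.TOK..." spelling
    auto b = TOKlcurly; // the shorter spelling
    assert(a == TOK.TOKif);
    assert(b == 0);     // first member of an anonymous enum defaults to 0
}
```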

>Plus, it seems you still have the === identity token in there, wasn't that
>replaced
>with 'in'?

The === operator, AFAIK, is indeed the same as 'is', which I'm sure you meant rather than 'in' (just clarifying for newbies, no offense meant).  In the static this() constructor of the DLexer class, you can see the line `keywords["is"] = TOK.TOKidentity` (this is used to convert identifier tokens to reserved keyword tokens).  Conversely, if you use toktostr[TOK.TOKidentity], you get "===" back.  Strange, but that's exactly the behavior of the D front-end, and I mindlessly copied it over to my code.  I'll fix it if it bugs you ;)
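To make the two lookup tables concrete, here is a sketch of how such a keyword map and its reverse table could be set up. It assumes, as the discussion of 'is' suggests, that the keyword text mapped onto TOKidentity is "is"; the class name MiniLexer, the classify() helper and the TOKwhile entry are invented for the example.

```d
import std.stdio : writeln;

enum TOK { TOKidentifier, TOKidentity, TOKwhile }

class MiniLexer
{
    static TOK[string] keywords; // identifier text -> keyword token
    static string[TOK] toktostr; // token -> canonical spelling

    static this()
    {
        keywords["is"] = TOK.TOKidentity; // assumed mapping, see lead-in
        keywords["while"] = TOK.TOKwhile;
        toktostr[TOK.TOKidentity] = "===";
        toktostr[TOK.TOKwhile] = "while";
    }

    // Reclassify an identifier as a keyword token when its text is reserved.
    static TOK classify(string ident)
    {
        if (auto p = ident in keywords)
            return *p;
        return TOK.TOKidentifier;
    }
}

void main()
{
    auto tok = MiniLexer.classify("is");
    writeln(tok, " -> ", MiniLexer.toktostr[tok]); // prints TOKidentity -> ===
}
```

The round trip is lossy by design here: the keyword goes in as "is" and comes back out spelled "===", which is the oddity described above.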

>Just a small hint, those debug format strings would make a little more
>sense with
>wysiwyg strings (ie `"%s"` instead of "\"%s\"")

About the wysiwyg strings as a formatting suggestion for the code itself, I'd like to claim ignorance on my part as well.  :-D.  I often forget about the nice new features of D, since I'm so used to doing things "the old way."  I'm sure you can relate ;).  Besides, I'm growing accustomed to seeing all the horrible backslashes in strings.
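For reference, the two spellings really are the same string; a quick check with a current D compiler (the format text and argument are made up for the example):

```d
import std.stdio : writefln;

void main()
{
    // Backquoted and r"..." strings are wysiwyg: no escape processing happens.
    static assert(`"%s"` == "\"%s\"");
    static assert(r"C:\temp" == "C:\\temp");

    writefln(`token text: "%s"`, "nextToken"); // prints token text: "nextToken"
}
```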

>
>Looks like a good start, though, and works as a beutifier, to boot! :D
>

And yes, it does work quite famously as a beautifier doesn't it?  Except the small side-effect of REMOVING ALL COMMENTS :-D.  Of course, that's a small change if this module is to be used as the basis for a beautifier/indenter project.  Someone should really take that up, as I've got my hands full right now trying to come up with some preliminary code-completion support for my D IDE.  That, and Christmas shopping still...

I'll make my changes tonight and ship them out ASAP.

I don't want to host these projects on dsource.org due to the overhead of the SVN repository (since they're only single modules) and that dsource is (I'm sure) intended for medium-sized to large-scale projects.  If there was a code-snippets section, that would be perfect!  I think Brad (admin of dsource) is in the process of evaluating replacement PHP systems right now, so he's got enough to do at the moment.

Regards,
James Dunne
