More lexer questions - D Programming Language Discussion Forum

February 11, 2012

More lexer questions

Posted by H. S. Teoh

According to the online specs, the lexer tries to tokenize by maximal matching (except for one exception in the case of ranges like "1..2"). The fact that this exception is stated seems to indicate that it's permitted to have two literals side-by-side without an intervening space.

So does that mean "1e2" should be tokenized as (float lit: 1e2) and
"1f2" should be tokenized as (int lit: 1)(identifier: f2)?

Or, for that matter, "123abcdefg" should be tokenized as (int lit:
123)(identifier: abcdefg) whereas "0x123abcdefg" should be tokenized as
(int lit: 0x123abcdef)(identifier: g)?

Or worse, if we still allow octals, "0129" should be tokenized as (octal
lit: 012)(int lit: 9)?

Or do we expect that any integer/float literal will always span the longest string that has characters permitted in any numerical literal, and then after the fact the lexer will give an error if the string cannot be interpreted as a legal literal? IOW, "0129" will first be scanned in its entirety as a numerical literal, then afterwards the lexer decides that '9' doesn't belong in an octal so it throws an error (as opposed to maximally matching "012" as an octal literal followed by a decimal literal "9").  Or, for that matter, "0123xel.u123" will be scanned as a numerical literal (since all the characters in it occur in some kind of numerical literal), and then an error generated after the fact when the lexer realizes that this string isn't a legal numerical literal?


T

-- 
All men are mortal. Socrates is mortal. Therefore all men are Socrates.

February 11, 2012

Re: More lexer questions

Posted by Timon Gehr
in reply to H. S. Teoh

On 02/11/2012 07:42 PM, H. S. Teoh wrote:
> According to the online specs, the lexer tries to tokenize by maximal
> matching (except for one exception in the case of ranges like "1..2").
> The fact that this exception is stated seems to indicate that it's
> permitted to have two literals side-by-side without an intervening
> space.
>
> So does that mean "1e2" should be tokenized as (float lit: 1e2) and

Yes.

> "1f2" should be tokenized as (int lit: 1)(identifier: f2)?
>

No. maximal munch:

(float lit: 1f)(int lit: 2)


> Or, for that matter, "123abcdefg" should be tokenized as (int lit:
> 123)(identifier: abcdefg)

Yes.

> whereas "0x123abcdefg" should be tokenized as
> (int lit: 0x123abcdef)(identifier: g)?
>
> Or worse, if we still allow octals, "0129" should be tokenized as (octal
> lit: 012)(int lit: 9)?
>

DMD views 0129 as an error. Therefore, the best way to handle integer literals with a leading 0 is to scan them as decimal and then reject them if any digit exceeds 7.
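A minimal sketch of that approach: consume the whole digit run first, then validate it. The names `DigitRun` and `scanDigits` are hypothetical and are not taken from DMD or from any lexer linked in this thread.

```d
import std.ascii : isDigit;
import std.stdio : writeln;

/// Result of scanning one run of decimal digits (hypothetical type).
struct DigitRun
{
    string text; // the whole digit run, e.g. "0129"
    bool valid;  // false if the run has a leading 0 and contains 8 or 9
}

/// Consumes every decimal digit starting at `pos`, then checks the run.
DigitRun scanDigits(string src, size_t pos = 0)
{
    size_t end = pos;
    while (end < src.length && isDigit(src[end]))
        ++end;

    string text = src[pos .. end];
    bool valid = true;
    if (text.length > 1 && text[0] == '0')
    {
        // Leading 0: every digit must be in the range 0 to 7.
        foreach (c; text)
        {
            if (c > '7')
                valid = false;
        }
    }
    return DigitRun(text, valid);
}

void main()
{
    writeln(scanDigits("0129")); // one token, flagged as invalid
    writeln(scanDigits("012"));  // one token, accepted
}
```

The point of the design is that "0129" stays a single token that is then rejected, rather than being split into "012" and "9".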

> Or do we expect that any integer/float literal will always span the
> longest string that has characters permitted in any numerical literal,
> and then after the fact the lexer will give an error if the string
> cannot be interpreted as a legal literal? IOW, "0129" will first be
> scanned in its entirety as a numerical literal, then afterwards the
> lexer decides that '9' doesn't belong in an octal so it throws an error
> (as opposed to maximally matching "012" as an octal literal followed by
> a decimal literal "9").  Or, for that matter, "0123xel.u123" will be


(int lit: 0123)(identifier: xel)(token: '.')(identifier: u123)

> scanned as a numerical literal (since all the characters in it occur in
> some kind of numerical literal), and then an error generated after the
> fact when the lexer realizes that this string isn't a legal numerical
> literal?
>
>
> T
>

No. As an example, that kind of processing would reject the valid token string q{0123xel.u123}.
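The answers above can be reproduced with a toy maximal-munch lexer. This is only a sketch for an assumed subset of D: decimal and hex integers, an optional exponent, an optional `f` suffix, identifiers, and single-character punctuation. It ignores fractional parts, the "1..2" exception, leading-zero checks, and every other suffix, and the names `Kind`, `Token`, and `lex` are made up for illustration.

```d
import std.ascii : isAlpha, isAlphaNum, isDigit, isHexDigit, isWhite;
import std.stdio : writeln;

enum Kind { intLit, floatLit, identifier, punct }

struct Token
{
    Kind kind;
    string text;
}

/// Splits `src` into tokens, always taking the longest match at each position.
Token[] lex(string src)
{
    Token[] tokens;
    size_t i = 0;
    while (i < src.length)
    {
        immutable start = i;
        immutable c = src[i];
        if (isWhite(c))
        {
            ++i; // whitespace only separates tokens
        }
        else if (isDigit(c))
        {
            auto kind = Kind.intLit;
            if (c == '0' && i + 2 < src.length
                && (src[i + 1] == 'x' || src[i + 1] == 'X')
                && isHexDigit(src[i + 2]))
            {
                // Hex literal: take every hex digit after the prefix.
                i += 2;
                while (i < src.length && isHexDigit(src[i]))
                    ++i;
            }
            else
            {
                while (i < src.length && isDigit(src[i]))
                    ++i;
                // Take an exponent only if a digit follows the 'e'.
                if (i + 1 < src.length && src[i] == 'e' && isDigit(src[i + 1]))
                {
                    kind = Kind.floatLit;
                    ++i;
                    while (i < src.length && isDigit(src[i]))
                        ++i;
                }
                // A float suffix extends the literal by one character.
                if (i < src.length && src[i] == 'f')
                {
                    kind = Kind.floatLit;
                    ++i;
                }
            }
            tokens ~= Token(kind, src[start .. i]);
        }
        else if (isAlpha(c) || c == '_')
        {
            while (i < src.length && (isAlphaNum(src[i]) || src[i] == '_'))
                ++i;
            tokens ~= Token(Kind.identifier, src[start .. i]);
        }
        else
        {
            ++i; // any other character is a one-character token
            tokens ~= Token(Kind.punct, src[start .. i]);
        }
    }
    return tokens;
}

void main()
{
    foreach (s; ["1e2", "1f2", "123abcdefg", "0x123abcdefg", "0123xel.u123"])
        writeln(s, " -> ", lex(s));
}
```

No step ever looks back to check whether the pieces form one legal numeric literal. Each token simply ends at the first character that cannot extend it.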

February 11, 2012

Re: More lexer questions

Posted by Martin Nowak
in reply to Timon Gehr

Just wanted to point you to my working D lexer (needs a CTFE bugfix http://d.puremagic.com/issues/show_bug.cgi?id=6815).

https://gist.github.com/1262321 D part
https://gist.github.com/1255439 Generic part

February 11, 2012

Re: More lexer questions

Posted by H. S. Teoh

On Sat, Feb 11, 2012 at 09:59:06PM +0100, Martin Nowak wrote:
> Just wanted to point you to my working D lexer (needs a CTFE bugfix http://d.puremagic.com/issues/show_bug.cgi?id=6815).
> 
> https://gist.github.com/1262321 D part
> https://gist.github.com/1255439 Generic part

Cool, thanks!

Looks like you've gone far beyond what I'm doing. :-) But it's still a good learning exercise for me to get comfortable with coding in D.


T

-- 
"The whole problem with the world is that fools and fanatics are always
so certain of themselves, but wiser people so full of doubts." --
Bertrand Russell.
"How come he didn't put 'I think' at the end of it?" -- Anonymous

February 11, 2012

Re: More lexer questions

Posted by Timon Gehr
in reply to Martin Nowak

On 02/11/2012 09:59 PM, Martin Nowak wrote:
> Just wanted to point you to my working D lexer (needs a CTFE bugfix
> http://d.puremagic.com/issues/show_bug.cgi?id=6815).
>

This seems to do the job:
constfold.c:1566
-        if (tn->ty == Tchar || tn->ty == Twchar || tn->ty == Tdchar)
+        if (tn->isImmutable() && (tn->ty == Tchar || tn->ty == Twchar || tn->ty == Tdchar))

However, I don't know the compiler's internals at all, therefore it is quite possible that the fix is incorrect.


> https://gist.github.com/1262321 D part
> https://gist.github.com/1255439 Generic part

Bug: The lexer cannot handle /++/ and /**/ (without new line character at the end).

February 12, 2012

Re: More lexer questions

Posted by simendsjo
in reply to Timon Gehr

On 02/12/2012 12:35 AM, Timon Gehr wrote:
> On 02/11/2012 09:59 PM, Martin Nowak wrote:
>> Just wanted to point you to my working D lexer (needs a CTFE bugfix
>> http://d.puremagic.com/issues/show_bug.cgi?id=6815).
>>
>
> This seems to do the job:
> constfold.c:1566
> - if (tn->ty == Tchar || tn->ty == Twchar || tn->ty == Tdchar)
> + if (tn->isImmutable() && (tn->ty == Tchar || tn->ty == Twchar ||
> tn->ty == Tdchar))
>
> However, I don't know the compiler's internals at all, therefore it is
> quite possible that the fix is incorrect.
>
>
>> https://gist.github.com/1262321 D part
>> https://gist.github.com/1255439 Generic part
>
> Bug: The lexer cannot handle /++/ and /**/ (without new line character
> at the end).

Another thing.. Using /+ and +/ in strings gives unexpected results when commented out:
/+
auto a = "/+";
+/
everything from this point is commented out.

/+
auto a = "+/";
+/ // already terminated by the string value.

Is this a bug, or as designed? /++/ is meant to comment out code, so it would have been nice if it was able to handle this, but I guess it would complicate the lexer a great deal.

February 12, 2012

Re: More lexer questions

Posted by H. S. Teoh
in reply to simendsjo

On Sun, Feb 12, 2012 at 01:00:07AM +0100, simendsjo wrote: [...]
> Another thing.. Using /+ and +/ in strings gives unexpected results
> when commented out:
> /+
> auto a = "/+";
> +/
> everything from this point is commented out.
> 
> /+
> auto a = "+/";
> +/ // already terminated by the string value.
> 
> Is this a bug, or as designed? /++/ is meant to comment out code, so it would have been nice if it was able to handle this, but I guess it would complicate the lexer a great deal.

It's designed. At least according to the online specs:

	The contents of strings and comments are not tokenized.
	Consequently, comment openings occurring within a string do not
	begin a comment, and string delimiters within a comment do not
	affect the recognition of comment closings and nested "/+"
	comment openings. With the exception of "/+" occurring within a
	"/+" comment, comment openings within a comment are ignored.

		a = /+ // +/ 1;    // parses as if 'a = 1;'
		a = /+ "+/" +/ 1"; // parses as if 'a = " +/ 1";'
		a = /+ /* +/ */ 3; // parses as if 'a = */ 3;'

For commenting out code, a much better way is to use version(none){...}.
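A small sketch of the version(none) approach, using the two string literals from the question. The function name `disabledBlockDemo` is made up for illustration. The disabled block is still lexed and parsed, so it must be syntactically valid, but inside it "/+" and "+/" are ordinary string tokens.

```d
import std.stdio : writeln;

int disabledBlockDemo()
{
    int x = 1;
    version (none)
    {
        // Never compiled in, but still tokenized: these literals
        // neither open nor close a comment.
        auto a = "/+";
        auto b = "+/";
        x = 2;
    }
    return x;
}

void main()
{
    writeln(disabledBlockDemo()); // the assignment x = 2 is not compiled in
}
```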


T

-- 
No! I'm not in denial!

February 12, 2012

Re: More lexer questions

Posted by Jonathan M Davis
in reply to simendsjo

On Sunday, February 12, 2012 01:00:07 simendsjo wrote:
> Another thing.. Using /+ and +/ in strings gives unexpected results when
> commented out:
> /+
> auto a = "/+";
> +/
> everything from this point is commented out.
> 
> /+
> auto a = "+/";
> +/ // already terminated by the string value.
> 
> Is this a bug, or as designed? /++/ is meant to comment out code, so it would have been nice if it was able to handle this, but I guess it would complicate the lexer a great deal.

It's by design. Everything between /+ and +/ is a comment. It doesn't matter what it is. There's nothing special about " which would make it ignore the characters following it when looking for the +/ to end the comment.

- Jonathan M Davis

February 12, 2012

Re: More lexer questions

Posted by Martin Nowak
in reply to simendsjo

On Sun, 12 Feb 2012 01:00:07 +0100, simendsjo <simendsjo@gmail.com> wrote:

> On 02/12/2012 12:35 AM, Timon Gehr wrote:
>> On 02/11/2012 09:59 PM, Martin Nowak wrote:
>>> Just wanted to point you to my working D lexer (needs a CTFE bugfix
>>> http://d.puremagic.com/issues/show_bug.cgi?id=6815).
>>>
>>
>> This seems to do the job:
>> constfold.c:1566
>> - if (tn->ty == Tchar || tn->ty == Twchar || tn->ty == Tdchar)
>> + if (tn->isImmutable() && (tn->ty == Tchar || tn->ty == Twchar ||
>> tn->ty == Tdchar))
>>
>> However, I don't know the compiler's internals at all, therefore it is
>> quite possible that the fix is incorrect.
>>
>>
>>> https://gist.github.com/1262321 D part
>>> https://gist.github.com/1255439 Generic part
>>
>> Bug: The lexer cannot handle /++/ and /**/ (without new line character
>> at the end).
>
> Another thing.. Using /+ and +/ in strings gives unexpected results when commented out:
> /+
> auto a = "/+";

/+ comments do nest. So you have opened two levels, and the comment ends only after two matching +/.
/* comments do not nest.

> +/
> everything from this point is commented out.
>
> /+
> auto a = "+/";
> +/ // already terminated by the string value.
>
> Is this a bug, or as designed? /++/ is meant to comment out code, so it would have been nice if it was able to handle this, but I guess it would complicate the lexer a great deal.
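The nesting rule can be sketched as a depth counter that pays no attention to string quotes. The function name `skipNestingComment` is hypothetical, and this is an illustration of the rule as described in the thread, not code from DMD or from the linked lexer.

```d
import std.stdio : writeln;

/// Returns the index just past the "+/" that closes the nesting comment
/// opening at `start`, or `src.length` if the comment is never closed.
size_t skipNestingComment(string src, size_t start)
{
    assert(start + 2 <= src.length && src[start .. start + 2] == "/+");
    size_t depth = 1;
    size_t i = start + 2;
    while (i + 1 < src.length)
    {
        if (src[i] == '/' && src[i + 1] == '+')
        {
            ++depth; // nested opening, even inside quotes
            i += 2;
        }
        else if (src[i] == '+' && src[i + 1] == '/')
        {
            i += 2;
            if (--depth == 0)
                return i; // outermost comment is closed
        }
        else
        {
            ++i;
        }
    }
    return src.length; // unterminated comment
}

void main()
{
    // First example from the question: the quoted "/+" opens a second
    // level, so the single "+/" does not end the comment.
    string first = "/+\nauto a = \"/+\";\n+/\nint x;";
    writeln(skipNestingComment(first, 0) == first.length);

    // Second example: the comment ends at the "+/" inside the quotes.
    string second = "/+\nauto a = \"+/\";\n+/";
    writeln(second[skipNestingComment(second, 0) .. $]);
}
```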

February 12, 2012

Re: More lexer questions

Posted by Alex_Dovhal
in reply to Martin Nowak

"Martin Nowak" <dawg@dawgfoto.de> wrote:
> Just wanted to point you to my working D lexer (needs a CTFE bugfix http://d.puremagic.com/issues/show_bug.cgi?id=6815).
>
> https://gist.github.com/1262321 D part
> https://gist.github.com/1255439 Generic part

Hi, how should it be compiled? I tried with DMD 2.057:
>dmd dlexer.d
and got
>c:\Programs\Programming\Lang\dmd2\windows\bin\..\..\src\phobos\std\conv.d(94): Error: template instance std.format.formatValue!(Appender!(string),defineToken,char) recursive expansion
