std.d.lexer: pre-voting review / discussion (page 11)

On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:
> I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.

Doing that would require making the identifier pool part of the public API, which is not something that I want to do at the moment. Let's wait until the allocators are figured out first.

On Tuesday, 17 September 2013 at 20:14:36 UTC, Brian Schott wrote:
> I've been busy with things that aren't D-related recently, but I should have time the rest of this week to address the lexer.

Changes since last time:

https://github.com/Hackerpilot/phobos/compare/D-Programming-Language:df38839...master

Test coverage is up to 85% now and a few bugs have been fixed.

On Wednesday, 25 September 2013 at 02:23:36 UTC, Brian Schott wrote:
> On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:
>> I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.
>
> Doing that would require making the identifier pool part of the public API, which is not something that I want to do at the moment. Let's wait until the allocators are figured out first.

Yes, ideally as an alias parameter. This is a show stopper for many usages.
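
To make the request concrete, here is a minimal sketch of what "several lexer instances with the same identifier pool" could look like. It is written in Python for brevity and is not the std.d.lexer API: `IdentifierPool` and `ToyLexer` are hypothetical names, and the pool is passed as a constructor argument where the D version under discussion would presumably take it as an alias/template parameter.

```python
# Illustrative sketch only: IdentifierPool and ToyLexer are hypothetical
# names and do not correspond to anything in std.d.lexer.
import re


class IdentifierPool:
    """Interns identifier strings so equal spellings share one object."""

    def __init__(self):
        self._table = {}

    def intern(self, text):
        # Return the instance stored for this spelling, storing it on first use.
        return self._table.setdefault(text, text)

    def __len__(self):
        return len(self._table)


class ToyLexer:
    """Extracts identifier-like words and interns each one through the pool."""

    _IDENT = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

    def __init__(self, pool):
        self.pool = pool  # injected, so several lexers can share one pool

    def identifiers(self, source):
        return [self.pool.intern(m.group()) for m in self._IDENT.finditer(source)]


shared = IdentifierPool()
a = ToyLexer(shared).identifiers("int foo = bar;")
b = ToyLexer(shared).identifiers("bar = foo + 1;")

# Identifiers from different lexer instances can be compared by identity.
print(a[1] is b[1], len(shared))
```

With a shared pool, later stages can compare identifiers from different files by identity instead of by content. That is only possible if the pool type is visible to the caller, which is the public API concern raised in the reply.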

September 25, 2013

Re: std.d.lexer: pre-voting review / discussion

Posted by Jacob Carlborg
in reply to Brian Schott

On 2013-09-25 04:48, Brian Schott wrote:

> Changes since last time:
>
> https://github.com/Hackerpilot/phobos/compare/D-Programming-Language:df38839...master

I had some comments and a couple of minor things like spelling errors:

* I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token.

* Could we get some unit tests for string literals, comments and identifiers outside of the ASCII table?

* Personally I would like to see a short description for each unit test, what it's testing

* Could you remove debug code and other code that is commented out:

- 344
- 1172
- 1226, is that needed?
- 3165-3166
- 3197-3198
- 3392
- 3410
- 3434

Spelling errors:

* "forwarad" - 292
* "commemnt" - 2031
* "sentenels" - 299
* "messsage" - 301
* "underliying" - 2454
* "alloctors" - 3230
* "strightforward" - 2276

I guess these line numbers might be off now. My original comments were made on September 12.

For reference see:

http://forum.dlang.org/thread/jsnhlcbulwyjuqcqoepe@forum.dlang.org?page=7#post-l0rsje:24jf9:241:40digitalmars.com

-- 
/Jacob Carlborg

On Wednesday, 25 September 2013 at 09:36:43 UTC, Jacob Carlborg wrote:
> * I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token.

This is the default behavior that happens when you don't configure an error callback.

> * Could we get some unit tests for string literals, comments and identifiers outside of the ASCII table?

I've added one.

> * Could you remove debug code and other code that is commented out:

Most of this is now gone.

> Spelling errors:

These were fixed weeks ago.

On Wednesday, 25 September 2013 at 16:52:43 UTC, Brian Schott wrote:
> This is the default behavior that happens when you don't
> configure an error callback.

I see.

> I've added one.

Thanks.

> Most of this is now gone.

That's good.

> These were fixed weeks ago.

Great, I just never got a response to that.

-- 
/Jacob Carlborg
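
The behavior discussed in this exchange (throw by default, but let the caller configure an error callback) can be sketched as follows. This is a Python illustration under stated assumptions, not the std.d.lexer API: `LexError`, `default_on_error` and `lex_digits` are hypothetical names, and the "lexer" only recognizes whitespace-separated decimal numbers.

```python
# Hypothetical names throughout; this is not the std.d.lexer API.
class LexError(Exception):
    """Raised by the default error handler."""


def default_on_error(message, line):
    # Default behavior when the caller configures no callback: throw.
    raise LexError(f"line {line}: {message}")


def lex_digits(source, on_error=default_on_error):
    """Return ('number', text) tokens; report anything else via on_error.

    If on_error returns instead of raising, lexing continues and the
    offending text is emitted as an ('invalid', text) token.
    """
    tokens = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for word in line.split():
            if word.isdigit():
                tokens.append(("number", word))
            else:
                on_error(f"unexpected text {word!r}", line_no)
                tokens.append(("invalid", word))
    return tokens


# A caller that prefers invalid tokens over exceptions collects the errors.
errors = []
tokens = lex_digits("1 x 2", on_error=lambda msg, line: errors.append((line, msg)))
print(tokens)
print(errors)
```

With a collecting callback the caller gets the "invalid token" behavior asked for in the review, while callers that configure nothing still fail loudly.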

Hello.

I'm not sure if this belongs here, but I think there is a bug at the very start of the Lexer chapter:

Is U+001A really meant to end the source file? According to the Unicode specification this is a "replacement character", like the newer U+FFFC. Or is it simply a spelling error and U+0019 was intended to end the source (this would fit, as it means "end of media").

I don't know if anybody has ever ended their source in that way or if it was tested.

More important to me is that all the space characters beyond ASCII are not considered whitespace (starting with U+00A0 NBSP, the different wide spaces U+2000 to U+200B, up to the exotic stuff U+202F, U+205F, U+2060, U+3000 and the famous U+FEFF). Why? OK, the set is much larger, but for end-of-line the Unicode versions (U+2028 and U+2029) are also added. This seems inconsistent to me.

On 26-9-2013 17:41, Dominikus Dittes Scherkl wrote:
> Hello.
>
> I'm not sure if this belongs here, but I think there is a bug at the very start of the Lexer chapter:
>
> Is U+001A really meant to end the source file?
> According to the Unicode specification this is a "replacement character", like the newer U+FFFC. Or is it simply a spelling error and U+0019 was intended to
> end the source (this would fit, as it means "end of media").
>
> I don't know if anybody has ever ended their source in that way or if it was tested.
>
> More important to me is that all the space characters beyond ASCII are not
> considered whitespace (starting with U+00A0 NBSP, the different wide spaces
> U+2000 to U+200B, up to the exotic stuff U+202F, U+205F, U+2060, U+3000 and
> the famous U+FEFF). Why?
> OK, the set is much larger, but for end-of-line the Unicode versions (U+2028 and U+2029) are also added. This seems inconsistent to me.

I imagine the lexer follows the language specification:

http://dlang.org/lex.html#EndOfFile

On Thursday, 26 September 2013 at 16:47:09 UTC, Jos van Uden wrote:
>> Is U+001A really meant to end the source file?
>> According to the Unicode specification this is a "replacement character", like the newer U+FFFC. Or is it simply a spelling error and U+0019 was intended to
>> end the source (this would fit, as it means "end of media").
>>
>> More important to me is that all the space characters beyond ASCII are not considered whitespace
>
> I imagine the lexer follows the language specification:
>
> http://dlang.org/lex.html#EndOfFile

I know. What I wanted to say is: the language specification has a bug here (at least it is strange to interpret "replacement character" as end of file and "end of media" not), and the handling of Unicode space characters is not nice.

If this is not the right place to discuss that matter, please point me to a better place.
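
For anyone who wants to check the code points in question, the short Python script below prints the Unicode general category of each one, using the standard `unicodedata` module. Some background that may help: U+001A is the ASCII control character SUB (substitute), which CP/M and DOS text files also used as a Ctrl-Z end-of-file marker, and U+0019 is EM (end of medium). The three `D_*` sets reflect my reading of http://dlang.org/lex.html and are an assumption that should be checked against the spec.

```python
import unicodedata

# Assumption: these sets paraphrase the D lexical grammar as I read it;
# verify against http://dlang.org/lex.html before relying on them.
D_END_OF_FILE = {"\u0000", "\u001A"}
D_WHITESPACE = {"\u0020", "\u0009", "\u000B", "\u000C"}
D_END_OF_LINE = {"\u000D", "\u000A", "\u2028", "\u2029"}

# Code points mentioned in the posts above.
CANDIDATES = [
    "\u0019", "\u001A", "\u00A0", "\u2000", "\u200B", "\u202F",
    "\u205F", "\u2060", "\u3000", "\uFEFF", "\u2028", "\u2029",
]


def d_role(ch):
    """Name the D lexical class a character falls in, per the sets above."""
    if ch in D_END_OF_FILE:
        return "EndOfFile"
    if ch in D_WHITESPACE:
        return "WhiteSpace"
    if ch in D_END_OF_LINE:
        return "EndOfLine"
    return "-"


# Map "U+XXXX" to (general category, D role).
report = {
    f"U+{ord(ch):04X}": (unicodedata.category(ch), d_role(ch))
    for ch in CANDIDATES
}

for code_point, (category, role) in report.items():
    print(code_point, category, role)
```

Categories beginning with `Z` are separators (`Zs` space, `Zl` line, `Zp` paragraph), `Cc` is a control character and `Cf` a format character. Note that U+200B, U+2060 and U+FEFF are format characters in Unicode rather than space separators, so a rule of "treat every `Zs` character as whitespace" would still not cover them.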

On Thursday, 12 September 2013 at 05:00:11 UTC, deadalnix wrote:
> The problem is that it can cause an exponential (and I literally mean exponential here) amount of complexity.
>
> The alternative is to go for some ambiguous function/template parameters parsing and resolve at the end, but as template arguments are themselves ambiguous type/expression/symbols, the amount of complexity in the parser is doomed to explode.

Pretty sure a GLR parser handles that well within O(n^2) space. Nothing exponential necessary...
