std.d.lexer performance (WAS: std.d.lexer : voting thread) (page 3)

On Thursday, 3 October 2013 at 22:18:13 UTC, Andrei Alexandrescu wrote: > On 10/3/13 3:03 PM, Dicebot wrote: >> Please express your opinion in a clear "Yes", "No" or "Yes, if" form. I >> can't really interpret discussions into voting results. >> >> Of course, you and Walter also have "veto" votes in addition but it >> needs to be said explicitly. > > That's why I renamed the thread! I didn't vote. > > Andrei I mean I will be forced to ignore your opinion in current form when making voting summary and I will feel very uneasy about it :) (damn, that review manager thingy gets much more stressful by the end!)

On Thursday, 3 October 2013 at 19:47:28 UTC, Brian Schott wrote: > The most recent set of timings that I have can be found here: https://raw.github.com/Hackerpilot/hackerpilot.github.com/master/experimental/std_lexer/images/times4.png How exactly were these figures obtained? Based on the graphs, I'd guess that you measured execution time of a complete program (as LDC, which has a slightly higher startup overhead in druntime, overtakes GDC for larger inputs). If that's the case, DMD might be at an unfair advantage for this benchmark as it doesn't need to run all the druntime startup code – which is not a lot, but still. And indeed, its advantage seems to shrink for large inputs, although I don't want to imply that this could be the only reason. David

On 10/02/2013 04:41 PM, Dicebot wrote: > After brief discussion with Brian and gathering data from the review > thread, I have decided to start voting for `std.d.lexer` inclusion into > Phobos. I also have to vote with no for now. My biggest concern is that the lexer incorporates a string pool, something that isn't strictly part of lexing. IMO this is a major design flaw and possible performance/memory issue. It is buried into the API because byToken takes const(byte)[], i.e. mutable data, but each Token carries a string value, so it always requires a copy. For stream oriented lexing, e.g. token highlighting, no string pool is required at all. Instead the value type of Token should be something like take(input.save, lengthOfToken). Why was the Tok!">>=", Tok!"default" idea turned down. This leaves us with undesirable names like Tok.shiftRightAssign, Tok.default_. There are a few smaller issues that haven't yet been addressed, but of course this can be done during the merge code review. Adding it as experimental module would be a good idea.

On Friday, 4 October 2013 at 02:57:41 UTC, Martin Nowak wrote: > Adding it as experimental module would be a good idea. I would be in favor of adding such community-reviewed but not-quite-there-yet libraries to a special category on the DUB registry instead. It would also solve the visibility problem, and apart from the fact that it isn't really clear what being an »experimental« module would entail, having it as a package also allows for faster updates not reliant on the core release schedule. David

On Wednesday, 2 October 2013 at 14:41:56 UTC, Dicebot wrote: > After brief discussion with Brian and gathering data from the review thread, I have decided to start voting for `std.d.lexer` inclusion into Phobos. No. Let's iron out the issues first, both interface and possible performance issues.

On Thursday, 3 October 2013 at 20:11:02 UTC, Andrei Alexandrescu wrote: > On 10/3/13 12:47 PM, Brian Schott wrote: >> On Thursday, 3 October 2013 at 19:07:03 UTC, nazriel wrote: >>> (Btw, someone got benchmarks of std.d.lexer? >>> I remember that Brain was benchmarking his module quite a lot in order >>> to catch up with DMD's lexer but I can't find links in IRC logs. I >>> wonder if he achieved his goal in this regard) >> >> The most recent set of timings that I have can be found here: >> https://raw.github.com/Hackerpilot/hackerpilot.github.com/master/experimental/std_lexer/images/times4.png >> >> >> They're a bit old at this point, but not much has changed in the lexer >> internals. I can try running another set of benchmarks soon. (The >> hardest part is hacking DMD to just do the lexing) >> >> The times on the X-axis are milliseconds. > > I see we're considerably behind dmd. If improving performance would come at the price of changing the API, it may be sensible to hold off adoption for a bit. > > Andrei Quite frankly, I (or better say many of us) need a COMPLETE D lexer that is UP TO DATE. std.lexer should be, if it is a Phobos module, and that is all that matters. Performance optimizations can come later. So what if it's API will change? We, who use D2 since the very beginning, are used to it! API changes can be done smoothly, with phase-out stages. People would be informed what pieces of the API will become deprecated, and it is their responsibility to fix their code to reflect such changes. All that is needed is little bit of planning...

On Wednesday, 2 October 2013 at 18:41:32 UTC, Jacob Carlborg wrote: > On 2013-10-02 16:41, Dicebot wrote: >> After brief discussion with Brian and gathering data from the review >> thread, I have decided to start voting for `std.d.lexer` inclusion into >> Phobos. > > Yes. > > Not a condition but I would prefer the default exception being thrown not to be Exception but a subclass. Yes, I agree with Jacob. Btw, you have a "Yes, if" vote here. :)

> Why was the Tok!">>=", Tok!"default" idea turned down. This leaves us with undesirable names like Tok.shiftRightAssign, Tok.default_. Martin, that is truly a matter of taste. I, for an instance, do not like Tok!">>=" - too many special characters there for my taste. To me it looks like some part of a weird Perl script.

October 04, 2013

Re: std.d.lexer : voting thread

Posted by ilya-stromberg
in reply to Dicebot

Permalink

ilya-stromberg

Posted in reply to Dicebot

Permalink

On Wednesday, 2 October 2013 at 14:41:56 UTC, Dicebot wrote:
> After brief discussion with Brian and gathering data from the review thread, I have decided to start voting for `std.d.lexer` inclusion into Phobos.

No.

I really want to see `std.d.lexer` in Phobos, but have too many conditions.

Documentation issues:

- please specify the parser algorithm that you used for `std.d.lexer`. As I understand from review thread, you implement `GLR parser` - please document it (correct me if I wrong). Also, add link to the algorithm description, for example to the wikipedia:
http://en.wikipedia.org/wiki/GLR_parser
It helps to understand how `std.d.lexer` works.
Also, please add best-case and worst-case time complexity (for example, from O(n) to O(n^3)), and best-case and worst-case memory complexity.

- please add more usage examples. Currently you have only one big example how generate HTML markup of D code. Try to add a simple example for every function.

- explicitly specify functions that can throw: add `Throws:` block for it and specify conditions when they can throw.

UTF-16/UTF-32 support:
- why standart `std.d.lexer` supports only UTF-8, but not a UTF-16/UTF-32? The official lexing specification allows all of them. The conversion from UTF-16/UTF-32 to UTF-8 is not a option due performance issues.
If Phobos string functions too slow, please add a bug. If Phobos haven't got necessary functions, please add enhancement request.
I think it's serious issue that affects all string utilities (like std.xml or std.json), not only `std.d.lexer`.

Exception handling
- please use `ParseException` as a default exception, not the `Exception`.

Codestyle:
- I don't like `TokenType` enum. You can use Tok!">>=" and `static if` to compare the token string to the `TokenType` enum. So, you will not lose performance, because string parsing will be done at compile time.

Not a condition, but wishlist:
- implement low-level API, not only high-level range-based API. I hope it can help increase performance for applications that really need it.

- add ability to use `std.d.lexer` at the compile time.

Forums