February 29, 2012
On Wed, 29 Feb 2012 21:30:57 +0100, Timon Gehr <timon.gehr@gmx.ch> wrote:

> On 02/29/2012 09:03 PM, Martin Nowak wrote:
>> On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr@gmx.ch> wrote:
>>
>>> On 02/29/2012 07:28 PM, Martin Nowak wrote:
>>>> On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr@gmx.ch>
>>>> wrote:
>>>>
>>>>> On 02/28/2012 07:46 PM, Martin Nowak wrote:
>>>>>>
>>>>>> https://gist.github.com/1255439 - lexer generator
>>>>>> https://gist.github.com/1262321 - complete and fast D lexer
>>>>>>
>>>>>
>>>>> Well, it is slower at lexing than DMD at parsing. What is the
>>>>> bottleneck?
>>>>
>>>> No, it's as fast as dmd's lexer.
>>>>
>>>> Writing the tokens to stdout takes a lot of time though.
>>>> Just disable the "writeln(tok);" in the main loop.
>>>
>>> I did that.
>>
>> Interesting, I've commented it out https://gist.github.com/1262321#L1559
>> and get the following.
>>
>> <<<
>> PHOBOS=~/Code/D/DPL/phobos
>> mkdir test_lexer
>> cd test_lexer
>> curl https://raw.github.com/gist/1255439/lexer.d > lexer.d
>> curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d
>> curl https://raw.github.com/gist/1262321/entity.d > entity.d
>> dmd -O -release -inline dlexer lexer entity
>> wc -l ${PHOBOS}/std/*.d
>> time ./dlexer ${PHOBOS}/std/*.d
>> >>>
>> ./dlexer ${PHOBOS}/std/*.d 0.21s user 0.00s system 99% cpu 0.211 total
>
> I get 0.160s for lexing using your lexer.
> Parsing the same file with DMD's parser takes 0.155 seconds. The difference grows with larger files.

Mmh, I've retested and you're right, dmd's lexer is about 2x faster.
The main overhead stems from using ranges and enforce.

Quick profiling shows that 25% is spent in popFront and std.utf.stride.
Last time I worked on this I rewrote std.utf.decode to be much faster,
but UTF characters are still "decoded" twice: once for front
and then again for popFront. Also, stride uses a table lookup and
can't be inlined.
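
For illustration, this is roughly what the range primitives do today for a
char[] (a simplified sketch, not the exact Phobos code): front pays for a
full decode, and popFront pays for the length computation again.

	import std.utf : decode, stride;

	@property dchar front(const(char)[] s)
	{
		size_t i = 0;
		return decode(s, i);          // full decode #1
	}

	void popFront(ref const(char)[] s)
	{
		s = s[stride(s, 0) .. $];     // table lookup for the code point length, again
	}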

If switch tables were implemented on x64, one could use them for
ranges with an integral ElementType.
February 29, 2012
On Thu, Mar 01, 2012 at 12:04:39AM +0100, Martin Nowak wrote: [...]
> Mmh, I've retested and you're right dmd's lexer is about 2x faster. The main overhead stems from using ranges and enforce.
> 
> Quick profiling shows that 25% is spent in popFront and std.utf.stride.  Last time I worked on this I rewrote std.utf.decode to be much faster.  But utf characters are still "decoded" twice, once for front and then again for popFront. Also stride uses table lookup and can't be inlined.
[...]

One way to avoid decoding characters twice is to use a single-dchar buffer in your range. Something like:

	import std.stdio : File;
	import std.utf : decode;

	struct MyRange {
		private File src;
		char[] buf;        // raw UTF-8 bytes read from src
		dchar readahead;   // the already-decoded front character

		this(File _src) {
			src = _src;
			// ... fill up buf from src here
			popFront();
		}

		@property pure dchar front() {
			return readahead;
		}

		void popFront() {
			size_t idx = 0;
			readahead = decode(buf, idx);  // decode exactly once
			buf = buf[idx .. $];
			// ... fill up buf more if needed
		}
	}


T

-- 
"A man's wife has more power over him than the state has." -- Ralph Emerson
February 29, 2012
On Wed, Feb 29, 2012 at 10:41:22AM -0600, Andrei Alexandrescu wrote: [...]
> An efficient, integrated parser generator would lower the barrier of entry dramatically - if we play our cards right, even a sprintf specifier string could be parsed simpler and faster using an embedded grammar, instead of painfully writing the recognizer by hand.

I see that std.stdio.writef already parses format strings at compile-time, though the code is somewhat ugly. :) It would be ideal if that could be done just with a series of EBNF/regex declarations.
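
Just to sketch the idea (hypothetical names, nothing like the actual
std.stdio code): a plain CTFE function can already validate a format string
against the argument count via static assert; a grammar-driven version would
replace the hand-written scanning loop below.

	import std.stdio : writef;

	// count conversion specifiers, treating "%%" as a literal percent (very simplistic)
	bool matchesArgs(string fmt, size_t nargs)
	{
		size_t specs = 0, i = 0;
		while (i < fmt.length)
		{
			if (fmt[i] == '%')
			{
				if (i + 1 < fmt.length && fmt[i + 1] == '%')
					i += 2;        // "%%" is a literal percent sign
				else
				{
					++specs;
					++i;
				}
			}
			else
				++i;
		}
		return specs == nargs;
	}

	void checkedWritef(string fmt, A...)(A args)
	{
		static assert(matchesArgs(fmt, A.length), "format string does not match argument count");
		writef(fmt, args);
	}

	// usage: checkedWritef!"%s took %d ms\n"("lexing", 211);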


> Parsing config files, XML, JSON, CSV, various custom file formats and many others - all would be a few lines away. Ideally a user who has a basic understanding of grammars should have an easier time using a small grammar to parse simple custom formats, than writing the parsing code by hand.
[...]

This is definitely something worth having. It has always been my dream that an ideal programming language should let you write grammar rules and pretty much autogenerate the code for you. There shouldn't be any need for external utilities with the associated messy build rules, auxiliary files and other such things, especially if you just need a small one-time parser for a very specific task.


T

-- 
Why ask rhetorical questions? -- JC
March 01, 2012
Why do you need grammar rules for JSON? Just use a custom-defined struct and pass it to a toJson function, like ae's json module does. :)
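
For what it's worth, here is a rough sketch of that approach (a hypothetical
toJson, not ae's actual module): compile-time introspection over the struct's
fields is enough, no grammar involved.

	import std.conv : to;
	import std.string : format;

	string toJson(T)(T value)
	{
		static if (is(T == struct))
		{
			string s = "{";
			foreach (i, member; value.tupleof)
			{
				if (i) s ~= ",";
				s ~= format(`"%s":%s`, __traits(identifier, T.tupleof[i]), toJson(member));
			}
			return s ~ "}";
		}
		else static if (is(T : const(char)[]))
			return format(`"%s"`, value);   // no escaping, it's only a sketch
		else
			return to!string(value);
	}

	struct Point { int x; double y; string label; }
	// toJson(Point(1, 2.5, "p")) gives {"x":1,"y":2.5,"label":"p"}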
March 01, 2012
On 29.02.2012 02:30, Piotr Szturmaj <bncrbme@jadamspam.pl> wrote:

> CTFE code can be much slower than native one, and I would like to see
> some kind of compiler cache for such code.

I second this. As a fan of clean optimizations, this is one of the things I've been tossing around in my head for a while. It could use a cache file, or the compiler could be started as a daemon process that keeps an in-memory cache. All code that is called through CTFE would go into the cache, indexed by the internal representation of the function body and its parameters.
But there are a few open questions, like how seamlessly this could be integrated. Is it easy to get a hash of a function body and its parameters? How should the cache be limited: N functions, N bytes, or maybe one cache per module (stored together with the .o files)? The latter would mean that if a.d uses CTFE that executes code from b.d, the results would all be cached together with a.o, because a.d was the entry point. And if there was a module c.d that does the same as a.d, it would duplicate the b.d part in its own cache. The benefit is that the cache works with both single-module and whole-program compilation and doesn't need any limits; instead the cache for each module is always refreshed with what was last used in the compilation.
In addition to caching the last compilation, the caches could be aware of versioned compilation. I usually want to compile debug versions and Linux/Windows release versions at least, so I wouldn't want those to invalidate the caches. For 32-bit vs. 64-bit I assume it is best to just cache them separately, since it could prove difficult to distinguish two versions of code that use (void*).sizeof or something else that isn't wrapped in a version statement the way size_t is.
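
To make that a bit more concrete, the lookup side could be as simple as an
associative array keyed by a hash of the function body, the argument values
and the relevant version flags (all names below are made up, nothing to do
with dmd's actual internals):

	struct CtfeCache
	{
		// serialized results, keyed by hash(function body ~ arguments ~ version flags)
		string[ulong] results;

		bool lookup(ulong key, out string result)
		{
			if (auto p = key in results)
			{
				result = *p;
				return true;
			}
			return false;
		}

		void store(ulong key, string result)
		{
			results[key] = result;
		}
	}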
March 01, 2012
On 29/02/2012 17:45, Andrei Alexandrescu wrote:
> On 2/29/12 3:45 AM, Philippe Sigaud wrote:
>> mixin(Grammar!("Doc <- Node*"
>> "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
>> "OpeningTag <- '<' Identifier '>'", OpeningAction,
>> "ClosingTag <- `</` Identifier '>'", ClosingAction,
>> "Text <- (!(OpeningTag / ClosingTag) _)+"));
>
> That looks about right, but still has a fair amount of noise. I think
> the approach of putting the entire grammar in one string is best.
>
> Andrei
>

It makes sense, since an external file could then be imported as a string and processed at compile time.
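
Something along those lines already works with a string import (assuming a
Grammar template like the one quoted above, and that the .peg file sits on
the -J import path):

	// the grammar file is pulled in as a compile-time string via -Jpath
	enum grammarSource = import("grammar.peg");
	mixin(Grammar!(grammarSource));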
March 01, 2012
>> mixin(Grammar!("Doc <- Node*"
>> "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
>> "OpeningTag <- '<' Identifier '>'", OpeningAction,
>> "ClosingTag <-  `</` Identifier '>'", ClosingAction,
>> "Text <- (!(OpeningTag / ClosingTag) _)+"));
>
>
> That looks about right, but still has a fair amount of noise. I think the
> approach of putting the entire grammar in one string is best.

Yes, using one string is indeed better. That won't be too difficult to code.

But how do we associate actions with a rule in that case? I mean, some rules will have actions, some won't.

In my case, a parse tree is automatically generated (indeed, it's almost the only thing returned by a parse call, along with the recognized string and named captures). I have some additional syntax to 'drop' a node and such, similar to what you proposed a few posts ago.


March 01, 2012
On Thu, Mar 01, 2012 at 04:10:26PM +0100, Philippe Sigaud wrote:
> >> mixin(Grammar!("Doc <- Node*"
> >> "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
> >> "OpeningTag <- '<' Identifier '>'", OpeningAction,
> >> "ClosingTag <-  `</` Identifier '>'", ClosingAction,
> >> "Text <- (!(OpeningTag / ClosingTag) _)+"));
> >
> >
> > That looks about right, but still has a fair amount of noise. I think the approach of putting the entire grammar in one string is best.
> 
> Yes, using one string is indeed better. That won't be too difficult to code.
> 
> But how to associate actions with a rule, in that case? I mean, some rules will have actions, some not.

You could maybe just put D code in the grammar string, which gets compiled as a string mixin by CTFE?
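
Purely hypothetical syntax, but something like this: the semantic action
travels inside the grammar string and the generator mixes it back in
verbatim (makeNode, captures and friends are made-up names):

	// everything inside the string is hypothetical grammar + action syntax
	mixin(Grammar!(`
	    Doc        <- Node*
	    Node       <- OpeningTag (Text / Node)* ClosingTag  { nodes ~= makeNode(captures); }
	    OpeningTag <- '<' Identifier '>'                    { currentTag = captures[0]; }
	    ClosingTag <- "</" Identifier '>'                   { closeTag(captures[0]); }
	    Text       <- (!(OpeningTag / ClosingTag) _)+
	`));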


T

-- 
"A man's wife has more power over him than the state has." -- Ralph Emerson
March 01, 2012
> > But how to associate actions with a rule, in that case? I mean, some rules will have actions, some not.
>
> You could maybe just put D code in the grammar string, which gets compiled as a string mixin by CTFE?

I could, but one of my driving goals when beginning this project a month ago was to use the shiny new lambda syntax directly :)

"rule1", o => o
"rule2", o => o

etc.


March 01, 2012
On 3/1/12 1:54 AM, Marco Leise wrote:
> On 29.02.2012 02:30, Piotr Szturmaj <bncrbme@jadamspam.pl> wrote:
>
>> CTFE code can be much slower than native one, and I would like to see
>> some kind of compiler cache for such code.
>
> I second this. As a fan of clean optimizations this is one of the things
> I tossed around my head for a while. It could use a cache file or the
> compiler could be started as a daemon process keeping a memory cache.
> All code that is called through CTFE would go into the cache, indexed by
> the internal representation of the function body and parameters.

I think this is the kind of development that will be naturally motivated and pushed by serious CTFE applications pushing the boundary. Right now I'm not as worried about it as I am about investigating CTFE's potential and blockers.

Andrei