January 28, 2013
On Sunday, 27 January 2013 at 09:51:10 UTC, Brian Schott wrote:
> I'm writing a D lexer for possible inclusion in Phobos.
>
> DDOC: http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
> Code: https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d
>
> It's currently able to correctly syntax highlight all of Phobos, but does a fairly bad job at rejecting or notifying users/callers about invalid input.
>
> I'd like to hear arguments on the various ways to handle errors in the lexer. In a compiler it would be useful to throw an exception on finding something like a string literal that isn't terminated before EOF, but a text editor or IDE would probably want to be a bit more lenient. Maybe having it run-time (or compile-time) configurable, like std.csv, would be the best option here.
>
> I'm interested in ideas on the API design and other high-level issues at the moment. I don't consider this ready for inclusion. (The current module being reviewed for inclusion in Phobos is the new std.uni.)

Having a "standard" lexer module, that many people work on, and has up-to-date rules, is priceless! Thank you for doing this.
January 28, 2013
On Monday, 28 January 2013 at 11:00:09 UTC, Dmitry Olshansky wrote:
> 28-Jan-2013 14:39, Jacob Carlborg wrote:
>> On 2013-01-28 11:14, Walter Bright wrote:
>>
>>> Well, the source buffer can be large, and will span a lot of memory
>>> cache lines, making accessing those slices slow.
>>
>> Would it be better to .dup the slices? Won't that be just as slow as
>> using the appender?
>>
>
> It would be better to lay these out compactly, one by one, in a separate buffer, skipping all of the extra slack in the source file(s).

Would it also make sense to do small string optimization? On 64 bit, you could use the memory used by the string field itself to store strings of length up to 15 characters.
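
For illustration, a rough sketch of what such a small-string optimization could look like on a 64-bit target; this Token layout is hypothetical, not the one in the proposed lexer:

    struct Token
    {
        // ...type, line, column fields omitted...

        // A string field is 16 bytes on 64 bit (pointer + length), so values
        // of up to 15 bytes can live inline; longer values stay a slice.
        private union
        {
            struct
            {
                immutable(char)* longPtr;
                size_t longLength;
            }
            char[16] small; // small[15] holds the inline length
        }
        private bool isSmall;

        void setValue(string s)
        {
            if (s.length <= 15)
            {
                small[0 .. s.length] = s[];
                small[15] = cast(char) s.length;
                isSmall = true;
            }
            else
            {
                longPtr = s.ptr;
                longLength = s.length;
                isSmall = false;
            }
        }

        string value() const
        {
            // The inline case copies out; the long case just reconstructs the slice.
            return isSmall ? small[0 .. small[15]].idup
                           : longPtr[0 .. longLength];
        }
    }
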
January 28, 2013
28-Jan-2013 18:08, jerro wrote:
> On Monday, 28 January 2013 at 11:00:09 UTC, Dmitry Olshansky wrote:
>> 28-Jan-2013 14:39, Jacob Carlborg wrote:
>>> On 2013-01-28 11:14, Walter Bright wrote:
>>>
>>>> Well, the source buffer can be large, and will span a lot of memory
>>>> cache lines, making accessing those slices slow.
>>>
>>> Would it be better to .dup the slices? Won't that be just as slow as
>>> using the appender?
>>>
>>
>> It would be better to lay these out compactly, one by one, in a separate
>> buffer, skipping all of the extra slack in the source file(s).
>
> Would it also make sense to do small string optimization? On 64 bit, you
> could use the memory used by the string field itself to store strings of
> length up to 15 characters.

It very well could be, iff the slice is a field in the Token struct.

-- 
Dmitry Olshansky
January 28, 2013
On Sun, Jan 27, 2013 at 10:39:13PM +0100, Brian Schott wrote:
> On Sunday, 27 January 2013 at 19:46:12 UTC, Walter Bright wrote:
[...]
> >Just a quick comment: byToken() should not accept a filename. Its input should be via an InputRange, not a file.
> 
> The file name is accepted for eventual error reporting purposes. The actual input for the lexer is the parameter called "range".
[...]

FWIW, I've developed this little idiom in my code when it comes to dealing with error reporting in lexing/parsing code (for my own DSLs, not D):

The main problem I have is that my lexer/parser accepts an input range, but input ranges don't (necessarily) have any filename/line number associated with them. Moreover, the code that throws the exception may be quite far down the call chain and may not have access to the context that knows which filename/line number the error occurred at. For example, I may have a generic function called parseInt(), which can be called from the lexer, the parser, or a whole bunch of other places. It wouldn't make sense to force parseInt() to take a filename and line number just so it can produce nicer error reports.

So I decided to move the inclusion of filename/line number information to where they belong: in the code that knows about them. So here's a sketch of my approach:

	import std.ascii  : isDigit;
	import std.format : format;
	import std.stdio  : File, stdin;

	class SyntaxError : Exception {
		string filename;
		int linenum;
		this(string msg) { super(msg); }
	}
	...
	int parseInt(R)(R inputRange) {
		...
		// N.B.: no filename/line number info here
		if (!isDigit(inputRange.front))
			throw new SyntaxError(format("Invalid digit: %s",
				inputRange.front));
	}
	...
	Expr parseExpr(R)(R inputRange) {
		...
		// N.B.: any exception just unrolls past this call, 'cos
		// we have no filename/line number info to insert anyway
		if (tokenType == IntLiteral) {
			value = parseInt(inputRange);
		}
		...
	}
	...
	Expr parseFileInput(string filename) {
		auto f = File(filename);
		// Wrapper range that counts line numbers (declared outside
		// the try block so the catch handler can read r.linenum)
		auto r = NumberedSrc(f);
		try {
			return parseExpr(r);
		} catch(SyntaxError e) {
			// Insert filename/line number info into message
			e.filename = filename;
			e.linenum = r.linenum;
			e.msg = format("%s:%d %s", filename, r.linenum, e.msg);
			throw e;
		}
	}
	...
	Expr parseConsoleInput() {
		// No filename/line number info here
		return parseExpr(stdin.byLine());
	}
	...
	Expr parseStringInput(string input) {
		auto r = NumberedSrc(input); // outside the try, for the same reason
		try {
			return parseExpr(r);
		} catch(SyntaxError e) {
			// We don't have a filename here, but we do have a
			// line number, so use that.
			e.linenum = r.linenum;
			e.msg = format("Line %d: %s", r.linenum, e.msg);
			throw e;
		}
	}

Notice that I have different wrapper functions for dealing with different kinds of input; the underlying parser doesn't even care about filenames/line numbers, but the wrapper functions catch any parsing exceptions thrown from underneath and prepend this info as appropriate. This simplifies the parsing code (it doesn't have to keep track of line numbers or propagate filenames) and also produces output that makes sense:

- Console input doesn't need line numbers; the user doesn't care if this
  is the 500th command he typed, or the 701st.

- Internal strings don't get a nonsensical "filename", 'cos they don't
  *have* a filename in the first place. Just a single line number so you
  can locate the problem in, say, the string literal or something.

- File input has filename and line number.

- Other kinds of input contexts can be handled in the same way.

- The use of NumberedSrc (maybe better named LineNumberedRange or
  something) makes line numbers available to each of these contexts at
  the top level; a rough sketch of such a wrapper range is below. Of
  course, the lexer itself can also track this, but that adds
  complications if you have to keep detecting newlines when the lexer
  is in a different state, say inside a string literal.
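
Here is roughly what such a line-counting wrapper range might look like; handling of \r\n, column tracking, and adapting File input to a character range are left out. (The sketch above writes NumberedSrc(f) for brevity; in real code you would go through a helper like numberedSrc so the wrapped range type can be inferred.)

    struct NumberedSrc(R)
    {
        R src;
        int linenum = 1;

        @property bool empty() { return src.empty; }
        @property auto front() { return src.front; }

        void popFront()
        {
            if (src.front == '\n')
                ++linenum;
            src.popFront();
        }
    }

    // Helper so the element type can be inferred: numberedSrc("a\nb")
    auto numberedSrc(R)(R src) { return NumberedSrc!R(src); }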

The cleaner code does come at a price, though: it is probably a bit less efficient due to the number of layers involved. But I just thought I'd share this idea.


T

-- 
You are only young once, but you can stay immature indefinitely. -- azephrahel
January 28, 2013
On 1/28/2013 3:34 AM, Mehrdad wrote:
> On Monday, 28 January 2013 at 11:01:49 UTC, Walter Bright wrote:
>> On 1/28/2013 2:16 AM, dennis luehring wrote:
>>> Am 28.01.2013 11:13, schrieb Walter Bright:
>>>> On 1/28/2013 1:30 AM, Mehrdad wrote:
>>>>> I'm wondering if there's any way to make it independent of the code/grammar
>>>>
>>>> Just run under a profiler, then fix the hot spots.
>>>>
>>>
>>> mehrdad speaks about benchmarking not optimization :)
>>
>> I know. But profiling is how you make it fast. Lexers should be as fast as
>> possible, not merely faster than the competition.
>
>
>
> I was talking about parsers, actually :| On second thought, maybe I shouldn't
> have asked that in a thread about lexing...

Same comment applies.
January 28, 2013
On Mon, Jan 28, 2013 at 12:43 PM, Mehrdad <wfunction@hotmail.com> wrote:
> https://gist.github.com/b10ae22ab822c87467a3

This link does not work for me :(
January 28, 2013
On Monday, 28 January 2013 at 20:35:51 UTC, Philippe Sigaud wrote:
> On Mon, Jan 28, 2013 at 12:43 PM, Mehrdad <wfunction@hotmail.com> wrote:
>> https://gist.github.com/b10ae22ab822c87467a3
>
> This link does not work for me :(


Yeah sorry, I found a bug in it, so I took it off and fixed it (... or so I think).
I wouldn't be surprised if it's still buggy.

It's more of a proof of concept than anything... it definitely has room for improvement, but the speed is still decent.

Here's another link (grab it before it's gone!):

https://gist.github.com/addfe830acf9785d01ef
January 28, 2013
On 01/28/2013 01:53 AM, Brian Schott wrote:
> ...
>
> On the topic of performance, I realized that the numbers posted
> previously were actually for a debug build. Fail.
>
> For whatever reason, the current version of the lexer code isn't
> triggering my heisenbug[1] and I was able to build with -release -inline
> -O.
>
> Here's what avgtime has to say:
>
> $ avgtime -q -h -r 200 dscanner --tokenCount ../phobos/std/datetime.d
>
> ------------------------
> Total time (ms): 51409.8
> Repetitions    : 200
> Sample mode    : 250 (169 ocurrences)
> Median time    : 255.57
> Avg time       : 257.049
> Std dev.       : 4.39338
> Minimum        : 252.931
> Maximum        : 278.658
> 95% conf.int.  : [248.438, 265.66]  e = 8.61087
> 99% conf.int.  : [245.733, 268.366]  e = 11.3166
> EstimatedAvg95%: [256.44, 257.658]  e = 0.608881
> EstimatedAvg99%: [256.249, 257.849]  e = 0.800205
> Histogram      :
>      msecs: count  normalized bar
>        250:   169  ########################################
>        260:    22  #####
>        270:     9  ##
>
> Which works out to 1,327,784 tokens per second on my Ivy Bridge i7.
>

Better, but still slow.

> I created a small program that demangles the output of valgrind so that
> tools like KCachegrind can display profiling information more clearly.
> It's now on the wiki[2]
>
> The bottleneck in std.d.lexer as it stands is the appender instances
> that assemble Token.value during iteration and front() on the array of
> char[]. (As I'm sure everyone expected)
>

I see; there should probably be an option to do this by slicing instead.
Also, try to treat narrow strings in such a way that they do not incur undue decoding overhead.
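
As a rough illustration of the narrow-string point (not code from std.d.lexer): scan the raw code units and only decode when a non-ASCII lead byte turns up.

    import std.ascii : isAlphaNum;
    import std.uni   : isAlpha;
    import std.utf   : decode;

    // Advance i past the tail of an identifier, touching each ASCII byte
    // exactly once and decoding only non-ASCII sequences.
    size_t skipIdentifierTail(string src, size_t i)
    {
        while (i < src.length)
        {
            immutable char b = src[i]; // a code unit, no auto-decoding
            if (b < 0x80)
            {
                if (isAlphaNum(b) || b == '_')
                    ++i;
                else
                    break;
            }
            else
            {
                size_t j = i;
                if (!isAlpha(decode(src, j))) // decode this one code point only
                    break;
                i = j;
            }
        }
        return i;
    }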

I guess that at some point

pure nothrow TokenType lookupTokenType(const string input)

might become a bottleneck. (DMD does not generate near-optimal string switches, I think.)
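
For reference, that function presumably ends up as one big string switch, along these lines (the TokenType members are made up for the example); the code DMD generates for exactly this kind of switch is what the parenthetical above worries about:

    enum TokenType { identifier, if_, else_, while_, return_ /* ... */ }

    pure nothrow TokenType lookupTokenType(const string input)
    {
        switch (input)
        {
            case "if":     return TokenType.if_;
            case "else":   return TokenType.else_;
            case "while":  return TokenType.while_;
            case "return": return TokenType.return_;
            // ...one case per keyword...
            default:       return TokenType.identifier;
        }
    }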

January 28, 2013
On 01/28/2013 12:59 PM, Dmitry Olshansky wrote:
> 28-Jan-2013 15:48, Johannes Pfau wrote:
>> ...
>>
>> But to be fair, that doesn't fit ranges very well. If you don't want to
>> do any allocation but still keep identifiers etc. in memory, this
>> basically means you have to keep the whole source in memory, and that is
>> conceptually an array, not a range.
>>
>
> Not the whole source, just a table of all identifiers. The source is
> awfully redundant because of repeated identifiers, spaces, comments and
> whatnot. The set of unique identifiers is rather small.
>

Source code is usually small. (Even std.datetime is 'only' 1.6 MB.) My own lexer/parser combination slices directly into the original source code for every token, in order to remember the exact source location, and the last time I measured, it ran faster than DMD's. I keep the source around for error reporting anyway:

tt.d:27:5: error: no member 'n' for type 'A'
    a.n=2;
    ^~~

Since the tokens point directly into the source code, it is not necessary to construct any other data structures in order to allow fast retrieval of the appropriate source code line.
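
A minimal sketch of that retrieval, assuming a token whose text is a slice into source (the names here are invented for the example):

    import std.algorithm : count;
    import std.string : lastIndexOf;

    struct SourceLocation { size_t line, column; string lineText; }

    // Recover line, column and the enclosing source line for a token whose
    // text slices directly into `source`.
    SourceLocation locate(string source, string tokenText)
    {
        immutable size_t offset = tokenText.ptr - source.ptr;
        auto before = source[0 .. offset];

        immutable size_t line = count(before, '\n') + 1;
        immutable size_t lineStart = before.lastIndexOf('\n') + 1; // 0 on the first line
        size_t lineEnd = offset;
        while (lineEnd < source.length && source[lineEnd] != '\n')
            ++lineEnd;

        return SourceLocation(line, offset - lineStart + 1,
                              source[lineStart .. lineEnd]);
    }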

But it's clear that a general-purpose library might not want to impose this storage restriction on its clients. I think it is somewhat helpful for speed, though. The other thing I do is buffer tokens in a contiguous ring buffer that grows if a lot of lookahead is requested.
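
And a rough sketch of such a growable lookahead buffer (assuming some lexer with a nextToken() method; not the actual implementation):

    struct TokenBuffer(Token)
    {
        Token[] buf;   // capacity is kept at a power of two
        size_t first;  // index of the current token
        size_t count;  // number of buffered tokens

        // Look ahead n tokens past the current one, pulling more from the
        // lexer as needed and growing when lookahead exceeds the capacity.
        ref Token peek(Lexer)(ref Lexer lexer, size_t n)
        {
            if (buf.length == 0)
                buf = new Token[](16);
            while (count <= n)
            {
                if (count == buf.length)
                    grow();
                buf[(first + count) & (buf.length - 1)] = lexer.nextToken();
                ++count;
            }
            return buf[(first + n) & (buf.length - 1)];
        }

        void popFront()
        {
            assert(count > 0);
            first = (first + 1) & (buf.length - 1);
            --count;
        }

        private void grow()
        {
            auto bigger = new Token[](buf.length * 2);
            foreach (i; 0 .. count)
                bigger[i] = buf[(first + i) & (buf.length - 1)];
            buf = bigger;
            first = 0;
        }
    }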

> I think the best course of action is to just provide a hook to trigger
> on every identifier encountered. That could be as discussed earlier a
> delegate.
>
> ...

Maybe. I map identifiers to unique IDs later, in the identifier AST node constructor, though.
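
For what it's worth, the mapping itself can be as simple as an associative array keyed on the sliced identifier text; a sketch (not the actual code):

    // Interning table: each distinct identifier string gets a small integer id.
    struct IdentifierTable
    {
        uint[string] ids;
        string[] names;   // id -> original spelling

        uint intern(string name)
        {
            if (auto p = name in ids)
                return *p;
            immutable id = cast(uint) names.length;
            names ~= name;
            ids[name] = id;
            return id;
        }
    }
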
January 28, 2013
>> This link does not work for me :(
>
>
>
> Yeah sorry, I found a bug in it, so I took it off and fixed it (... or so I
> think).
> I wouldn't be surprised if it's still buggy.
>
> It's more of a proof of concept than anything... it definitely has room for improvement, but the speed is still decent.
>
> Here's another link (grab it before it's gone!):
>
> https://gist.github.com/addfe830acf9785d01ef

Grabby grab!

Owww, C++! My eyes, they're melting! My hairs, on fire! Why did you forfeit your immortal soul?

Anyway, thanks. I'll look at it more closely. It seems to be the kind of C++ I can still read *and* understand.