Writing a really fast lexer

Dec 11, 2020

vnr

Dec 11, 2020

H. S. Teoh

Dec 12, 2020

Dec 12, 2020

Dec 12, 2020

Dec 12, 2020

Dec 12, 2020

For a project with good performance, I would need to be able to analyse text. To do so, I would write a parser by hand using the recursive descent algorithm, based on a stream of tokens. I started writing a lexer with the d-lex package (https://code.dlang.org/packages/d-lex), it works really well, unfortunately, it's quite slow for the number of lines I'm aiming to analyse (I did a test, for a million lines, it lasted about 3 minutes). As the parser will only have to manipulate tokens, I think that the performance of the lexer will be more important to consider. Therefore, I wonder what resources there are, in D, for writing an efficient lexer. I could of course write it by hand, but I would be interested to know what already exists, so as not to reinvent the wheel. Of course, if the standard library (or its derivatives, such as mir) has features that could be of interest to me for this lexer, I am interested. Thanks to you :)

December 11, 2020

Re: Writing a really fast lexer

Posted by H. S. Teoh
in reply to vnr

Permalink

H. S. Teoh

Posted in reply to vnr

Permalink

On Fri, Dec 11, 2020 at 07:49:12PM +0000, vnr via Digitalmars-d-learn wrote:
> For a project with good performance, I would need to be able to analyse text. To do so, I would write a parser by hand using the recursive descent algorithm, based on a stream of tokens. I started writing a lexer with the d-lex package (https://code.dlang.org/packages/d-lex), it works really well, unfortunately, it's quite slow for the number of lines I'm aiming to analyse (I did a test, for a million lines, it lasted about 3 minutes). As the parser will only have to manipulate tokens, I think that the performance of the lexer will be more important to consider. Therefore, I wonder what resources there are, in D, for writing an efficient lexer. I could of course write it by hand, but I would be interested to know what already exists, so as not to reinvent the wheel. Of course, if the standard library (or its derivatives, such as mir) has features that could be of interest to me for this lexer, I am interested. Thanks to you :)

If you want a *really* fast lexer, I recommend using GNU Flex (https://github.com/westes/flex/).  Unfortunately, AFAIK it does not support D directly; it generates a lexer in C that you then compile. Fortunately, you can interface with the generated C code quite easily from D.

I took a quick glance at d-lex, and immediately noticed that every rule match incurs a GC allocation, which is sure to cause a performance slowdown.  It does have a nice API, but for a truly fast lexer at the very least you need to compile the lexing rules down into a fast IR, which d-lex does not do.  So it's unsurprising that performance isn't the best.

Some things to watch out for when writing high-performance string-processing in D:

1) Avoid GC allocations as much as possible.  If some object needs to be created frequently (e.g., tokens), try your best to make it a struct rather than a class, to avoid incurring allocation overhead and to decrease GC pressure. (This is a common mistake by people who write lexers/parsers in D.)

2) Use slices of the input (e.g. for returning tokens) as much as possible, avoid .dup/.idup as much as you can.  This may not always be possible if you're reading from a file; if that's the case, consider using std.mmfile to memory-map the input into memory so that you can freely slice over the entire input without needing to handle caching/paging yourself.  Ideally, your lexer should simply return slices over the input, wrapped in a struct with an enum tag to indicate what type of token it is.  Don't bother with things like converting integers/floats in the input until semantic analysis, or when your parser needs to create AST nodes that store the actual values. This eliminates the need for polymorphic token types, which tends to slow lexers down.

3) Avoid autodecoding unless you need it. Use .byCodeUnit when processing strings with Phobos algorithms, unless the correctness of what you need to do happens to depend on decoding Unicode code points.


T

-- 
"A one-question geek test. If you get the joke, you're a geek: Seen on a California license plate on a VW Beetle: 'FEATURE'..." -- Joshua D. Wachs - Natural Intelligence, Inc.

On Friday, 11 December 2020 at 19:49:12 UTC, vnr wrote: > For a project with good performance, I would need to be able to analyse text. To do so, I would write a parser by hand using the recursive descent algorithm, based on a stream of tokens. I started writing a lexer with the d-lex package (https://code.dlang.org/packages/d-lex), it works really well, unfortunately, it's quite slow for the number of lines I'm aiming to analyse (I did a test, for a million lines, it lasted about 3 minutes). As the parser will only have to manipulate tokens, I think that the performance of the lexer will be more important to consider. Therefore, I wonder what resources there are, in D, for writing an efficient lexer. Have you looked at Pegged [1]? It will give you the lexer and parser in one go. I'd be very interested to see how it performs on that kind of input. -- Bastiaan. [1] https://code.dlang.org/packages/pegged

On Friday, 11 December 2020 at 20:19:49 UTC, H. S. Teoh wrote: > On Fri, Dec 11, 2020 at 07:49:12PM +0000, vnr via Digitalmars-d-learn wrote: >> [...] > > If you want a *really* fast lexer, I recommend using GNU Flex (https://github.com/westes/flex/). Unfortunately, AFAIK it does not support D directly; it generates a lexer in C that you then compile. Fortunately, you can interface with the generated C code quite easily from D. > > [...] Thank you for this answer :) I hadn't imagined using Flex, but it's true that the integration of C in D is quite impressive.

On Saturday, 12 December 2020 at 16:43:43 UTC, Bastiaan Veelo wrote: > On Friday, 11 December 2020 at 19:49:12 UTC, vnr wrote: >> For a project with good performance, I would need to be able to analyse text. To do so, I would write a parser by hand using the recursive descent algorithm, based on a stream of tokens. I started writing a lexer with the d-lex package (https://code.dlang.org/packages/d-lex), it works really well, unfortunately, it's quite slow for the number of lines I'm aiming to analyse (I did a test, for a million lines, it lasted about 3 minutes). As the parser will only have to manipulate tokens, I think that the performance of the lexer will be more important to consider. Therefore, I wonder what resources there are, in D, for writing an efficient lexer. > > Have you looked at Pegged [1]? It will give you the lexer and parser in one go. I'd be very interested to see how it performs on that kind of input. > > -- Bastiaan. > > [1] https://code.dlang.org/packages/pegged Yes, I know Pegged, it's a really interesting parser generator engine, nevertheless, the grammar of what I would like to analyse is not a PEG. But I am also curious to know the performances of this tool for very large inputs.

On Saturday, 12 December 2020 at 18:15:11 UTC, vnr wrote: > On Saturday, 12 December 2020 at 16:43:43 UTC, Bastiaan Veelo wrote: >> Have you looked at Pegged [1]? It will give you the lexer and parser in one go. I'd be very interested to see how it performs on that kind of input. >> >> -- Bastiaan. >> >> [1] https://code.dlang.org/packages/pegged > > Yes, I know Pegged, it's a really interesting parser generator engine, nevertheless, the grammar of what I would like to analyse is not a PEG. But I am also curious to know the performances of this tool for very large inputs. Are you able to share the grammar? Since you plan to parse using recursive descent, I think there is a good chance that the language can be defined as a PEG. I am using it to parse Pascal, whose grammar was defined long before PEG was a thing. — Bastiaan.

On Saturday, 12 December 2020 at 18:15:11 UTC, vnr wrote: > Yes, I know Pegged, it's a really interesting parser generator engine, nevertheless, the grammar of what I would like to analyse is not a PEG. But I am also curious to know the performances of this tool for very large inputs. The performance of pegged isn't great. Neither memory nor cpu wise. In many cases that is fine though, but if you are performance sensitive, you should look elsewhere. I did a full JS grammar in pegged, but switched to a handwritten one when I realised its poor performance.

Forums