August 02, 2012
On Thu, 02 Aug 2012 14:26:58 +0200,
"Adam D. Ruppe" <destructionator@gmail.com> wrote:

> On Thursday, 2 August 2012 at 11:47:20 UTC, deadalnix wrote:
> > The lexer really isn't the performance bottleneck of dmd (or of any compiler of a non-trivial language).
> 
> What if we're just using this lexer in something like a syntax highlighting text editor? I'd be annoyed if it stopped typing for a little bit cuz of processing.

My pet peeve with editors is getting the lexer to work on only what I just edited. It is really not trivial, but it is eased, for example, by editors that insert the closing } automatically, so the scopes don't get messed up. A missing curly bracket results in the parser reinterpreting the whole file. The same goes for string terminators:

If I type `` at the top of std.datetime in Mono-D right after "module std.datetime;", I get a 12 (!) second wait until the IDE becomes responsive again.
But if I use ' or ", the editor inserts the second terminator and is done with reparsing in 1-2 seconds.

Can we really have a good lexer that is equally well suited to compilers and to editors?

-- 
Marco

August 02, 2012
On 2012-08-02 12:28:03 +0000, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> Regarding the problem at hand, it's becoming painfully obvious to me that the lexer MUST do its own decoding internally.

That's not a great surprise to me. I hit the same issues when writing my XML parser, which is why I invented functions called frontUnit and popFrontUnit. I'm glad you're realizing this.

> Hence, a very simple thing to do is have the entire lexer only deal with ranges of ubyte. If someone passes a char[], the lexer's front end can simply call s.representation and obtain the underlying ubyte[].

That's ugly, but it could work (assuming s.representation returns the cast range by ref). I still prefer my frontUnit and popFrontUnit approach though.
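
For illustration, the front end could look roughly like this (a sketch only; lex and lexImpl are made-up names, not an actual API):

    import std.string : representation;

    // Hypothetical public entry point: accepts char[] (or string), but the
    // core lexer never sees auto-decoded characters.
    auto lex(const(char)[] source)
    {
        // representation() reinterprets the char[] as a ubyte[] without
        // copying, so the core can operate on raw code units.
        return lexImpl(source.representation);
    }

    // Core working purely on ubytes; a placeholder standing in for the
    // real tokenizer.
    const(ubyte)[] lexImpl(const(ubyte)[] input)
    {
        return input;
    }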

In fact, any parser for which speed is important will have to bypass std.range's clever handling of UTF characters. Dealing simply with ubytes isn't enough, since in some cases you'll want to fire up the UTF decoder.

The next issue, which I haven't seen discussed here, is that for a parser to be efficient it should operate on buffers. You can make it work with arbitrary ranges, but if you don't have a buffer you can slice from, then whenever you need to preserve a string you're going to have to build it character by character, which is not efficient at all. And you can only really return slices if the underlying representation is the same as the output representation, so unless your API has a templated output type, you're going to special-case a lot of things.
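
Roughly, the two strategies look like this (again just a sketch with made-up names, assuming the input elements are ubytes):

    import std.array : appender;
    import std.range : isInputRange;

    // With a sliceable buffer, preserving a token is just a slice: no
    // allocation, no copying.
    const(ubyte)[] tokenBySlice(const(ubyte)[] buffer, size_t start, size_t end)
    {
        return buffer[start .. end];
    }

    // With an arbitrary input range there is nothing to slice, so the token
    // has to be accumulated element by element.
    ubyte[] tokenByCopy(R)(R range, size_t length)
        if (isInputRange!R)
    {
        auto result = appender!(ubyte[])();
        foreach (i; 0 .. length)
        {
            result.put(cast(ubyte) range.front);  // assumes elements fit in a ubyte
            range.popFront();
        }
        return result.data;
    }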

After having attempted an XML parser with ranges, I'm not sure parsing using generic ranges can be made very efficient. Automatic conversion to UTF-32 is a nuisance for performance, and if the output needs to return parts of the input, you'll need to create an inefficient special case just to allocate many new strings in the correct format.

I wonder how your call with Walter will turn out.

-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca/

August 02, 2012
On 8/2/2012 11:14 AM, Marco Leise wrote:
> If I type `` at the top of std.datetime in Mono-D right after "module
> std.datetime;", I get a 12 (!) second wait until the IDE becomes responsive
> again. But if I use ' or ", the editor inserts the second terminator and
> is done with reparsing in 1-2 seconds.

A good IDE should do its parsing in a separate thread, so the main user input thread remains crisp and responsive.

If the user edits the text while the parsing is in progress, the background parsing thread simply abandons the current parse and starts over.
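
A minimal sketch of that abandon-and-restart scheme (the names are made up, and checking once per line is an oversimplification; a real parser would check between larger work units):

    import core.atomic : atomicLoad, atomicStore;
    import std.string : splitLines;

    shared bool editSeen;   // set by the UI thread on every keystroke

    // Called from the UI thread whenever the buffer changes.
    void onEdit()
    {
        atomicStore(editSeen, true);
    }

    // Runs in the background thread; returns false if it was abandoned.
    bool parsePass(string source)
    {
        atomicStore(editSeen, false);
        foreach (line; source.splitLines())
        {
            if (atomicLoad(editSeen))
                return false;   // abandon this parse; the caller just starts over
            // ... lex/parse `line` and accumulate results here ...
        }
        return true;            // finished without interruption
    }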

August 02, 2012
On 8/2/12, Walter Bright <newshound2@digitalmars.com> wrote:
> A good IDE should do its parsing in a separate thread, so the main user
> input thread remains crisp and responsive.
>
> If the user edits the text while the parsing is in progress, the
> background parsing thread simply abandons the current parse and starts over.

Agreed, this is an editor/IDE domain problem. And there are some obvious optimizations an editor could use, such as only lexing the visible portion of text on the screen (and then lexing the rest in a background thread which doesn't block the interface).
August 02, 2012
On 02-Aug-12 22:06, Walter Bright wrote:
> On 8/2/2012 4:47 AM, deadalnix wrote:
>> On 02/08/2012 10:13, Walter Bright wrote:
>>> As fast as the dmd one would be best.
>>>
>>
>> That'd be great but . . .
>>
>> The lexer really isn't the performance bottleneck of dmd (or of any
>> compiler of a non-trivial language). Additionally, anybody who has
>> touched the dmd source code can agree that its usability/maintainability
>> isn't at its best.
>>
>> Sacrificing some performance in a non-bottleneck area to increase ease
>> of use makes perfect sense.
>
> A lexer is like a device driver, or a memory allocator. Having it be as
> fast as possible pays off dividends endlessly for whoever uses it.
>

+1. Another point is that it's crystal clear *how* to optimize a lexer and gain some precious time. It is a well-trodden path with a well-known destination.
The time saved in the lexer gives you some headroom in, say, the optimizer, where you can always find more ways to improve the result by sacrificing speed.

In the long run I totally expect millions of lines of D code in the wild.

> And in my profiling, lexing speed is a big factor in the debug build times.
>
> The Digital Mars C++ compiler (and its antecedents) made a lot of hay by
> being the fastest compiler available by a wide margin. A big advantage D
> has is also compile speed.
>
> Please do not underestimate its importance.


-- 
Dmitry Olshansky
August 02, 2012
On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
> On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
> > On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
> >> On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
> >>> It is for ranges in general. In the general case, a range of UTF-8 or
> >>> UTF-16 makes no sense whatsoever. Having range-based functions which
> >>> understand the encodings and optimize accordingly can be very beneficial
> >>> (which happens with strings but can't happen with general ranges without
> >>> the concept of a variably-length encoded range like we have with forward
> >>> range or random access range), but to actually have a range of UTF-8 or
> >>> UTF-16 just wouldn't work. Range-based functions operate on elements,
> >>> and
> >>> doing stuff like filter or map or reduce on code units doesn't make any
> >>> sense at all.
> >> 
> >> Yes, it can work.
> > 
> > How?
> 
> Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.

And how on earth is that going to work as a range? Range-based functions operate on elements. They use empty, front, popFront, etc. If front doesn't return an element that a range-based function can operate on without caring what it is, then that type isn't going to work as a range. If you need the consumer to be doing something special, then that means you need to special-case it for that range type. And that's what you're doing when you special-case range-based functions for strings.

So, because of how front and popFront work, you either have to have a range of code units or a range of code points. With a range of code units, the element type is a code unit, so any operations that you do will be operating on individual code units, not code points. With a range of code points, any operations being done will operate on code points, which _will_ require decoding as long as you're actually using the range API. You only make strings more efficient by special-casing the function for them such that it understands Unicode and will operate on the string in the most efficient way according to how that string's encoding works.
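
The difference is easy to see in a few lines (illustration only, not lexer code):

    import std.stdio : writeln;
    import std.string : representation;

    void main()
    {
        string s = "héllo";              // 'é' is two UTF-8 code units

        // A range of code units: the element type is ubyte, nothing is decoded.
        foreach (ubyte u; s.representation)
            writeln(u);                  // 6 iterations; 'é' shows up as 195, 169

        // A range of code points: each step decodes to a dchar.
        foreach (dchar c; s)
            writeln(c);                  // 5 iterations; 'é' is a single element
    }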

You seem to be arguing that we can somehow have a generic range API which operates on strings, and yet what you're saying a function using those ranges must do (e.g. having a buffer of multiple code units) requires that a range-based function operate in a non-generic manner for strings. If the consumer has to do _anything_ special for strings, then what it's doing is non-generic with regard to strings.

I agree that we should be making string operations more efficient by taking code units into account, but I completely disagree that we can do that generically. At best, we could add the concept of a variably-length encoded range such that a range-based function could special-case them and use the encoding where appropriate, but all that buys us is the ability to special-case variably-length encoded ranges _other_ than strings (since we can already special-case strings), and I don't think that it's even possible to deal with a variably-length encoded range properly without understanding what the encoding is, in which case we'd be dealing with special range types which were specifically UTF-8 encoded or UTF-16 encoded, and range-based functions would be special-casing _those_ rather than a generic variably-length encoded range.

In either case, because the consumer must do something other than simply operate on front, popFront, empty, etc., you're _not_ dealing with the range API but rather working around it.

- Jonathan M Davis
August 02, 2012
On 8/2/12 2:17 PM, Michel Fortin wrote:
> On 2012-08-02 12:28:03 +0000, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>> Hence, a very simple thing to do is have the entire lexer only deal
>> with ranges of ubyte. If someone passes a char[], the lexer's front
>> end can simply call s.representation and obtain the underlying ubyte[].
>
> That's ugly, but it could work (assuming s.representation returns the
> cast range by ref). I still prefer my frontUnit and popFrontUnit
> approach though.

I agree frontUnit and popFrontUnit are more generic because they allow other ranges to define them.
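
To make that concrete, a range carrying both views might look roughly like this (a sketch; frontUnit and popFrontUnit are the names proposed earlier in the thread, not anything in Phobos):

    import std.utf : decode, stride;

    // A string-backed range exposing both the decoded view (front/popFront)
    // and the proposed raw view (frontUnit/popFrontUnit).
    struct CodeUnitView
    {
        string data;

        @property bool empty() { return data.length == 0; }

        // Decoded view, as the range API works today: one dchar at a time.
        @property dchar front()
        {
            size_t i = 0;
            return decode(data, i);
        }

        void popFront()
        {
            data = data[stride(data, 0) .. $];
        }

        // Proposed raw view: one code unit, no decoding involved.
        @property char frontUnit() { return data[0]; }

        void popFrontUnit() { data = data[1 .. $]; }
    }

A lexer that knows it is scanning ASCII punctuation could then step by code unit and only fall back to front/popFront where a full code point matters.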

> In fact, any parser for which speed is important will have to bypass
> std.range's clever handling of UTF characters. Dealing simply with
> ubytes isn't enough, since in some cases you'll want to fire up the UTF
> decoder.
>
> The next issue, which I haven't seen discussed here, is that for a parser
> to be efficient it should operate on buffers. You can make it work with
> arbitrary ranges, but if you don't have a buffer you can slice when you
> need to preserve a string, you're going to have to build the string
> character by character, which is not efficient at all. But then you can
> only really return slices if the underlying representation is the same
> as the output representation, and unless your API has a templated output
> type, you're going to special case a lot of things.

I think a BufferedRange could go a long way here.
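
Something vaguely along these lines (purely a sketch of the idea; no such type exists in Phobos, and the name and members are made up), wrapping an input range of ubytes so a lexer can slice tokens back out of the retained buffer:

    import std.range : isInputRange, ElementType;

    struct Buffered(R)
        if (isInputRange!R && is(ElementType!R : ubyte))
    {
        R source;
        ubyte[] buf;   // everything consumed from `source` so far
        size_t pos;    // current read position within `buf`

        @property bool empty() { return pos == buf.length && source.empty; }

        @property ubyte front() { fill(); return buf[pos]; }

        void popFront() { fill(); ++pos; }

        // The part a slicing lexer needs: a window into the retained buffer.
        const(ubyte)[] slice(size_t from, size_t to) { return buf[from .. to]; }

        private void fill()
        {
            if (pos == buf.length && !source.empty)
            {
                buf ~= source.front;
                source.popFront();
            }
        }
    }

The obvious cost is that the buffer keeps growing; a real implementation would need a policy for discarding data the lexer no longer needs.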

> After having attempted an XML parser with ranges, I'm not sure parsing
> using generic ranges can be made very efficient. Automatic conversion to
> UTF-32 is a nuisance for performance, and if the output needs to return
> parts of the input, you'll need to create an inefficient special case
> just to allocate many new strings in the correct format.

I'm not so sure, but I'll measure.

> I wonder how your call with Walter will turn out.

What call?


Thanks,

Andrei
August 02, 2012
On 8/2/12, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
> In the long run I totally expect millions of lines of D code in the wild.

In a single project or globally? I hope not the former; D should inspire people to write less code that does more. :)
August 02, 2012
On 2012-08-02 21:35, Walter Bright wrote:

> A good IDE should do its parsing in a separate thread, so the main user
> input thread remains crisp and responsive.
>
> If the user edits the text while the parsing is in progress, the
> background parsing thread simply abandons the current parse and starts
> over.

It still needs to update the editor view with the correct syntax highlighting, and that needs to be done in the same thread as the rest of the GUI.

-- 
/Jacob Carlborg
August 02, 2012
On Thursday, August 02, 2012 11:06:39 Walter Bright wrote:
> On 8/2/2012 4:47 AM, deadalnix wrote:
> > On 02/08/2012 10:13, Walter Bright wrote:
> >> As fast as the dmd one would be best.
> > 
> > That'd be great but . . .
> > 
> > The lexer really isn't the performance bottleneck of dmd (or of any compiler of a non-trivial language). Additionally, anybody who has touched the dmd source code can agree that its usability/maintainability isn't at its best.
> > 
> > Sacrificing some performance in a non-bottleneck area to increase ease of use makes perfect sense.
> 
> A lexer is like a device driver, or a memory allocator. Having it be as fast as possible pays off dividends endlessly for whoever uses it.
> 
> And in my profiling, lexing speed is a big factor in the debug build times.
> 
> The Digital Mars C++ compiler (and its antecedents) made a lot of hay by being the fastest compiler available by a wide margin. A big advantage D has is also compile speed.
> 
> Please do not underestimate its importance.

Recent benchmarks by Iain Buclaw show that lexing is not a bottleneck in dmd, so it's natural for some folks to think that lexing is not all that critical in comparison to other compiler components. However, the reason that lexing is not a bottleneck is almost certainly because the lexer is so thoroughly optimized. So, if it _didn't_ try to be as efficient as possible, it _would_ be a bottleneck.

However, given that we're providing an API for a lexer rather than having the lexer be an internal component of a compiler, that _does_ put more pressure on making the API user-friendly. Hopefully that can be done without sacrificing any performance, but there may be a point where the only way to make it faster would be to make the API very unfriendly. If that's the case, then we're going to have some tough decisions to make. But we'll cross that bridge if/when we get to it.

- Jonathan M Davis