August 02, 2012
On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
> On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
> > It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no sense whatsoever. Having range-based functions which understand the encodings and optimize accordingly can be very beneficial (which happens with strings but can't happen with general ranges without the concept of a variably-length encoded range like we have with forward range or random access range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff like filter or map or reduce on code units doesn't make any sense at all.
> 
> Yes, it can work.

How? If you operate on a range of code units, then you're operating on individual code units, which almost never makes sense. There are plenty of cases where a function which understands the encoding can avoid some of the costs associated with decoding and whatnot, but since range-based functions operate on their elements, if the elements are code units, a range-based function will operate on individual code units with _no_ understanding of the encoding at all. Ranges have no concept of encoding.

Do you really think that it makes sense for a function like map or filter to operate on individual code units? Because that's what would end up happening with a range of code units. Your average range-based function only makes sense with _characters_, not code units. Functions which can operate on ranges of code units without screwing up the encoding are a rarity.

Unless a range-based function special-cases a range type which is variably-length encoded (e.g. string), it just isn't going to deal with the encoding properly. Either it operates on the encoding or on the actual value, depending on what its element type is.

I concur that operating on strings as code units is better from the standpoint of efficiency, but it just doesn't work with a generic function unless it has a special case, which therefore _isn't_ generic.
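
To make that concrete, here is a small sketch (the string and predicates are purely illustrative) of what happens when a generic filter is applied to code units instead of code points:

import std.algorithm : filter;
import std.array : array;

void main()
{
    string s = "héllo"; // 'é' is the two UTF-8 code units 0xC3 0xA9

    // filter over the string's code points (dchar): 'é' survives intact.
    auto ok = s.filter!(c => c != 'l').array;
    assert(ok.length == 3); // h, é, o

    // filter over the raw code units: the predicate sees 0xC3 and 0xA9
    // separately, so it can split the multi-byte sequence and leave the
    // result as invalid UTF-8.
    auto units = cast(const(ubyte)[]) s;
    auto broken = units.filter!(u => u != 0xA9).array;
    assert(broken.length == 5); // 0xC3 is left dangling: no longer valid UTF-8
}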

- Jonathan M Davis
August 02, 2012
On 8/2/2012 1:21 AM, Jonathan M Davis wrote:
> How would we measure that? dmd's lexer is tied to dmd, so how would we test
> the speed of only its lexer?

Easy. Just make a special version of dmd that lexes only, and time it.
August 02, 2012
On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
> On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
>> On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
>>> It is for ranges in general. In the general case, a range of UTF-8 or
>>> UTF-16 makes no sense whatsoever. Having range-based functions which
>>> understand the encodings and optimize accordingly can be very beneficial
>>> (which happens with strings but can't happen with general ranges without
>>> the concept of a variably-length encoded range like we have with forward
>>> range or random access range), but to actually have a range of UTF-8 or
>>> UTF-16 just wouldn't work. Range-based functions operate on elements, and
>>> doing stuff like filter or map or reduce on code units doesn't make any
>>> sense at all.
>>
>> Yes, it can work.
>
> How?

Keep a 6-character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.
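
Something along those lines, as a rough sketch (the function name is mine, and it leans on std.utf.decode for the multi-byte case; the input is assumed to yield raw UTF-8 code units, e.g. ubyte, not an auto-decoded string):

import std.range.primitives : isInputRange, empty, front, popFront;
import std.utf : decode;

// Pulls one code point out of a range of UTF-8 code units, buffering a
// multi-byte sequence only when the lead byte has its high bit set.
dchar nextCodePoint(R)(ref R r) if (isInputRange!R)
{
    char c = cast(char) r.front;
    r.popFront();
    if (!(c & 0x80))
        return c; // ASCII fast path: no decoding at all

    char[6] buf = void;
    buf[0] = c;
    size_t n = 1;
    // Continuation bytes are 10xxxxxx; collect them into the buffer.
    while (!r.empty && (r.front & 0xC0) == 0x80 && n < buf.length)
    {
        buf[n++] = cast(char) r.front;
        r.popFront();
    }
    size_t index = 0;
    return decode(buf[0 .. n], index); // throws UTFException on malformed input
}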


> Do you really think that it makes sense for a function like map or filter to
> operate on individual code units? Because that's what would end up happening
> with a range of code units. Your average, range-based function only makes
> sense with _characters_, not code units. Functions which can operate on ranges
> of code units without screwing up the encoding are a rarity.

Rare or not, they are certainly possible, and the early versions of std.string did just that (although they weren't using ranges, the same techniques apply).
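
A classic example of such a function: searching for an ASCII character can scan the code units directly, because in UTF-8 every byte of a multi-byte sequence has its high bit set and so can never collide with an ASCII value. A sketch of the idea (not the actual early std.string code):

// Returns the index of the first occurrence of an ASCII character c in a
// UTF-8 string, or -1. No decoding needed: ASCII code units never occur
// inside a multi-byte sequence.
ptrdiff_t indexOfAscii(const(char)[] s, char c)
{
    assert(c < 0x80);
    foreach (i, u; s)
        if (u == c)
            return cast(ptrdiff_t) i;
    return -1;
}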
August 02, 2012
On 8/2/2012 12:34 AM, Bernard Helyer wrote:
> Just that I'm slaving away over a hot IDE here. :P

Ah! Well, keep on keepin' on, then!

August 02, 2012
Walter Bright, in message (digitalmars.D:174015), wrote:
> On 8/2/2012 12:49 AM, Jacob Carlborg wrote:
>> But what I still don't understand is how a UTF-8 range is going to be usable by other range based functions in Phobos.
> 
> Worst case use an adapter range.
> 
> 

Yes

auto r = myString.byChar();

after implementing a byChar adapter range or just

auto r = cast(const(ubyte)[]) myString;

And it's a range of code units, not code points.
And it's usable in Phobos.
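
A minimal byChar adapter could look something like this (the name and shape are just an assumption; nothing like it exists in Phobos yet):

struct ByChar
{
    const(char)[] s;
    @property bool empty() const { return s.length == 0; }
    @property ubyte front() const { return cast(ubyte) s[0]; }
    void popFront() { s = s[1 .. $]; }
}

auto byChar(const(char)[] s) { return ByChar(s); }

With that, myString.byChar() iterates the raw code units and never triggers the automatic char-to-dchar decoding that the range primitives for narrow strings perform.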
August 02, 2012
Walter Bright wrote:
> 1. It should accept as input an input range of UTF8. I feel it is a
> mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
> UTF16 should use an 'adapter' range to convert the input to UTF8. (This
> is what component programming is all about.)

Why is it a mistake? I think the lexer should accept any UTF range and return token strings of a matching type. That is, it should provide strings for UTF8 input, wstrings for UTF16 input, and so on.
August 02, 2012
On 8/2/2012 2:27 AM, Piotr Szturmaj wrote:
> Walter Bright wrote:
>> 1. It should accept as input an input range of UTF8. I feel it is a
>> mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
>> UTF16 should use an 'adapter' range to convert the input to UTF8. (This
>> is what component programming is all about.)
>
> Why is it a mistake?

Because the lexer is large and it would have to have a lot of special case code inserted here and there to make that work.

> I think the lexer should accept any UTF range and return token strings of a
> matching type. That is, it should provide strings for UTF8 input, wstrings for
> UTF16 input, and so on.

Why? I've never seen any UTF16 or UTF32 D source in the wild.

Besides, if it is not templated then it doesn't need to be recompiled by every user of it - it can exist as object code in the library.


August 02, 2012
On 02/08/2012 06:48, Walter Bright wrote:
> On 8/1/2012 9:41 PM, H. S. Teoh wrote:
>> Whether it's part of the range type or a separate lexer type,
>> *definitely* make it possible to have multiple instances. One of the
>> biggest flaws of otherwise-good lexer generators like lex and flex
>> (C/C++) is that the core code assumes a single instance, and
>> multi-instances were glued on after the fact, making it a royal pain to
>> work with anything that needs lexing multiple things at the same time.
>
> Yup. I keep trying to think of a way to lex multiple files at the same
> time in separate threads, but the problem is serializing access to the
> identifier table will likely kill off any perf gain.
>

That was exactly my reaction to the « let's reuse the identifier table » comment of yours.

The future is multicore.
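
The contention point, roughly sketched (the names are illustrative, not dmd's actual data structures, and shared qualifiers are elided for brevity): every lexer thread has to funnel every identifier through one lock.

import core.sync.mutex : Mutex;

final class IdentifierTable
{
    private uint[string] ids;
    private Mutex mtx;

    this() { mtx = new Mutex; }

    // Every identifier from every thread serializes here; identifiers are
    // so frequent in lexing that this lock dominates the parallel speedup.
    uint intern(string name)
    {
        mtx.lock();
        scope (exit) mtx.unlock();
        if (auto p = name in ids)
            return *p;
        const id = cast(uint) ids.length;
        ids[name] = id;
        return id;
    }
}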
August 02, 2012
>> 7. It should accept a callback delegate for errors. That delegate should
>> decide whether to:
>>      1. ignore the error (and "Lexer" will try to recover and continue)
>>      2. print an error message (and "Lexer" will try to recover and continue)
>>      3. throw an exception, "Lexer" is done with that input range
>
> I'm currently treating errors as tokens. It then becomes easy for the code
> using the lexer to just ignore the errors, to process them immediately, or to
> put off dealing with them until the lexing is complete. So, the code using the
> lexer can handle errors however and whenever it likes without having to worry
> about delegates or exceptions. Since tokens are lexed lazily, the fact that an
> error is reported as a token doesn't require that the lexing continue, but it
> also makes it _easy_ to continue lexing, ignoring or saving the error. I'm
> inclined to think that that's a superior approach to using delegates and
> exceptions.
>

Really nice idea. It is still easy to wrap the range in another range that processes errors in a custom way.
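
For instance, a wrapper along these lines (Token and TokenType are assumptions about the eventual API, not the actual std.d.lexer design) would report error tokens through a callback and hide them from the code downstream:

import std.range.primitives : isInputRange, ElementType, empty, front, popFront;

enum TokenType { identifier, number, error /* ... */ }

struct Token
{
    TokenType type;
    string text;
}

// Forwards the wrapped token range, passing every error token to onError
// and skipping it, so the consumer only ever sees well-formed tokens.
struct SkipErrors(R) if (isInputRange!R && is(ElementType!R == Token))
{
    R source;
    void delegate(Token) onError;

    this(R source, void delegate(Token) onError)
    {
        this.source = source;
        this.onError = onError;
        skip();
    }

    @property bool empty() { return source.empty; }
    @property Token front() { return source.front; }
    void popFront() { source.popFront(); skip(); }

    private void skip()
    {
        while (!source.empty && source.front.type == TokenType.error)
        {
            onError(source.front);
            source.popFront();
        }
    }
}

auto skipErrors(R)(R r, void delegate(Token) onError) { return SkipErrors!R(r, onError); }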
August 02, 2012
On 02/08/2012 10:13, Walter Bright wrote:
> On 8/2/2012 12:33 AM, Bernard Helyer wrote:
>> On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
>>> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful.
>>> If it
>>> isn't fast, serious users will eschew it and will cook up their own.
>>> You'll
>>> have a nice, pretty, useless toy of std.d.lexer.
>>
>> If you want to throw out some target times, that would be useful.
>
> As fast as the dmd one would be best.
>

That'd be great, but...

the lexer really isn't the performance bottleneck of dmd (or of any compiler for a non-trivial language). Additionally, anybody who has touched the dmd source code can agree that its usability/maintainability isn't at its best.

Sacrificing some performance in a non-bottleneck area to increase ease of use makes perfect sense.