August 02, 2012
http://i.imgur.com/oSXTc.png

Posted without comment.
August 02, 2012
On Wednesday, August 01, 2012 23:39:42 Walter Bright wrote:
> >> Somebody has to convert the input files into dchars, and then back into chars. That blows for performance. Think billions and billions of characters going through, not just a few random strings.
> > 
> > Why is there any converting to dchar going on here?
> 
> Because your input range is a range of dchar?

I think that we're misunderstanding each other here. A typical, well-written, range-based function which operates on ranges of dchar will use static if or overloads to special-case strings. This means that it will function with any range of dchar, but it _also_ will be as efficient with strings as if it just operated on strings. It won't decode anything in the string unless it has to. So, having a lexer which operates on ranges of dchar does _not_ make string processing less efficient. It just makes it so that it can _also_ operate on ranges of dchar which aren't strings.

For instance, my lexer uses this whenever it needs to get at the first character in the range:

static if(isNarrowString!R)
    // strings and wstrings: grab the first code unit without decoding
    Unqual!(ElementEncodingType!R) first = range[0];
else
    // any other range of dchar: front is already a decoded code point
    dchar first = range.front;

I use string mixins to reduce the code duplication that goes with this, but notice that strings and wstrings do _not_ get decoded. Their first code unit is used, and then almost everything operates on that code unit. e.g.

switch(first)
{
    case '\n': //...
    case ' ': //...
    case '+': //...
    //...
    default: //...
}
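
Roughly, the mixin approach looks something like this (an illustrative sketch with assumed names, not the exact code):

import std.range;
import std.traits;

// The shared snippet lives in a token string and gets pasted into each
// lexing function with mixin(), so it only has to be written once.
enum getFirst = q{
    static if(isNarrowString!R)
        Unqual!(ElementEncodingType!R) first = range[0];
    else
        dchar first = range.front;
};

void skipWhitespace(R)(ref R range)
{
    mixin(getFirst);
    // ... operate on `first` without decoding narrow strings ...
}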

It's only when the code actually needs to decode the first code point that it does so (e.g. when all of the possible ASCII characters have been exhausted, it needs to check whether the first code point is a Unicode alpha character, since that could be the beginning of an identifier). At that point, I end up doing something like

static if(isNarrowString!R)
    // narrow string: front decodes the first code point
    dchar front = range.front;
else
    // the element is already a dchar; just give it the other name
    alias first front;

or, if I need to know the number of code units that make up the code point, I explicitly call decode in the case of a narrow string. In either case, code units are _not_ converted to dchar unless they absolutely have to be.
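
To make that concrete, the decode path looks something like this rough sketch (helper and variable names are illustrative, not the exact code):

import std.range;                   // front, popFront for array-backed ranges
import std.traits : isNarrowString;
import std.utf : decode;

// Pop one code point off the range and report how many code units it used.
dchar popCodePoint(R)(ref R range, out size_t unitsConsumed)
{
    static if(isNarrowString!R)
    {
        size_t index = 0;
        dchar c = decode(range, index); // advances index past the code point
        unitsConsumed = index;
        range = range[index .. $];      // slice off exactly those code units
        return c;
    }
    else
    {
        dchar c = range.front;          // already a decoded code point
        range.popFront();
        unitsConsumed = 1;
        return c;
    }
}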

The only thing that suffers here performance-wise is ranges that aren't actually strings. For instance, filter!"true"(source) would have _much_ worse performance, because it has to decode every character. But without the concept of a variably-length encoded range, we can't really get around that.
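
A minimal illustration of why (an assumed example, not from the lexer itself): wrapping a string in filter hides its string-ness, so the special-casing above is no longer possible and every .front pays the decode cost.

import std.algorithm : filter;
import std.range : ElementType;
import std.traits : isNarrowString;

void main()
{
    string src = "int x = 42;";
    auto wrapped = filter!"true"(src);

    static assert(isNarrowString!(typeof(src)));                // fast path available
    static assert(!isNarrowString!(typeof(wrapped)));           // fast path gone
    static assert(is(ElementType!(typeof(wrapped)) == dchar));  // every front decodes
}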

> > My point is that it's the sort of thing that _only_ a parser would care about. So, unless it _needs_ to be in the lexer for some reason, it shouldn't be.
> I think you are misunderstanding. The lexer doesn't have a *symbol* table in it. It has a mapping from identifiers to unique handles. It needs to be there, otherwise the semantic analysis has to scan identifier strings a second time.

Yes. I understand. It has a mapping of pointers to identifiers. My point is that nothing but parsers will need that. From the standpoint of functionality, it's a parser feature, not a lexer feature. So, if it can be done just fine in the parser, then that's where it should be. If on the other hand, it _needs_ to be in the lexer for some reason (e.g. performance), then that's a reason to put it there.

It sounds like you're arguing that performance requirements dictate that it be put in the lexer even if nothing but the parser cares about it. In that case, the lexer is indeed where it should be; but if performance isn't affected, then I'd still argue that it belongs in the parser.
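
For concreteness, the identifier-to-handle mapping Walter describes might look roughly like this (a hedged sketch; the names and the use of a built-in associative array are assumptions, not the actual implementation):

// Each distinct identifier gets one unique handle during lexing, so later
// passes never have to rescan the identifier's characters.
struct IdentifierTable
{
    private uint[string] handles;   // identifier text -> unique handle

    uint intern(const(char)[] ident)
    {
        // Copy the text so the (possibly huge) source buffer isn't retained.
        auto key = ident.idup;
        if (auto p = key in handles)
            return *p;
        auto handle = cast(uint) handles.length;
        handles[key] = handle;
        return handle;
    }
}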

- Jonathan M Davis
August 02, 2012
On 8/1/2012 11:59 PM, Jacob Carlborg wrote:
> Your first requirement is a bit strange and confusing:
>
> "1. It should accept as input an input range of UTF8."
>
> To my understanding, ranges in Phobos are designed to operate on dchars, not
> chars, regardless of the original input type. So if you create a range that
> operates on UTF8, I don't know whether it will be usable with the rest of the
> Phobos functions expecting ranges of dchars, which would basically make this
> type of range useless, since it cannot use any other functionality in Phobos.
>
> The current state of Phobos: you either have a range of dchars, or you don't
> have a range at all; you have an array or something else.
>

I answered this point a few posts up in the thread.

August 02, 2012
On 2012-08-02 09:21, Walter Bright wrote:

> I answered this point a few posts up in the thread.

I've read a few posts up, and the only answer I found is that the lexer needs to operate on chars. But it does not answer the question of how that range type would be used by all the other range-based functions in Phobos.

Have I missed this? Could you please link to your post?

-- 
/Jacob Carlborg
August 02, 2012
On 8/1/2012 11:56 PM, Jonathan M Davis wrote:
> Another thing that I should point out is that a range of UTF-8 or UTF-16
> wouldn't work with many range-based functions at all. Most of std.algorithm
> and its ilk would be completely useless. Range-based functions operate on a
> range's elements, so operating on a range of code units would mean operating on
> code units, which is going to be _wrong_ almost all the time. Think about what
> would happen if you used a function like map or filter on a range of code
> units. The resultant range would be completely invalid as far as Unicode goes.

My experience writing fast string-based code that works on UTF8 and correctly handles multibyte characters is that such code is entirely possible and practical, and it is faster.
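
The usual technique is something along these lines (a hedged sketch with assumed names, not the actual code): operate on UTF8 code units directly and only decode when a non-ASCII lead byte shows up.

import std.utf : decode;

// Return the next character, advancing i by the number of code units consumed.
dchar nextChar(const(char)[] src, ref size_t i)
{
    char c = src[i];
    if(c < 0x80)            // ASCII fast path: the code unit is the code point
    {
        ++i;
        return c;
    }
    return decode(src, i);  // multibyte character: decode advances i for us
}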

The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it isn't fast, serious users will eschew it and will cook up their own. You'll have a nice, pretty, useless toy of std.d.lexer.

I think there's some serious underestimation of how critical this is.



> Range-based functions need to be operating on _characters_. Technically, not
> even code points get us there, so it's _still_ buggy. It's just a _lot_
> closer to being correct and works 99.99+% of the time.

Multi-code point characters are quite irrelevant to the correctness of a D lexer.


> If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to add
> a concept of variably-length encoded ranges so that it's possible to treat
> them as both their encoding and whatever they represent (e.g. code point or
> grapheme in the case of ranges of code units).

No, this is not necessary.

August 02, 2012
On 8/1/2012 11:49 PM, Jacob Carlborg wrote:
> On 2012-08-02 02:10, Walter Bright wrote:
>
>> 1. It should accept as input an input range of UTF8. I feel it is a
>> mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
>> UTF16 should use an 'adapter' range to convert the input to UTF8. (This
>> is what component programming is all about.)
>
> I'm no expert on ranges but won't that prevent slicing? Slicing is one of the
> main reasons for why the Tango XML parser is so amazingly fast.
>

You don't want to use slicing in the lexer. The reason is that your slices will be spread all over memory; since source files can be huge, all that memory will be retained and never released. What you want is a compact representation after lexing. Compactness also helps a lot with memory caching.
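
For concreteness, a compact post-lexing representation might look something like this (a hypothetical sketch; the field names and layout are assumptions, not the actual design):

// A compact token keeps no slice into the source buffer, so the source
// file's memory can be released after lexing.
enum TokenKind : ubyte { identifier, intLiteral, plus, eof /* ... */ }

struct Token
{
    TokenKind kind;   // which token this is
    uint line;        // line number for diagnostics
    uint id;          // handle into an identifier table (for identifiers)
}

static assert(Token.sizeof <= 12);  // small and cache-friendly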

August 02, 2012
On 8/2/2012 12:09 AM, Bernard Helyer wrote:
> http://i.imgur.com/oSXTc.png
>
> Posted without comment.

I'm afraid I'm mystified as to your point.

August 02, 2012
On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it isn't fast, serious users will eschew it and will cook up their own. You'll have a nice, pretty, useless toy of std.d.lexer.

If you want to throw out some target times, that would be useful.
August 02, 2012
On Thursday, 2 August 2012 at 07:32:59 UTC, Walter Bright wrote:
> On 8/2/2012 12:09 AM, Bernard Helyer wrote:
>> http://i.imgur.com/oSXTc.png
>>
>> Posted without comment.
>
> I'm afraid I'm mystified as to your point.

Just that I'm slaving away over a hot IDE here. :P
August 02, 2012
On Thursday, August 02, 2012 00:29:09 Walter Bright wrote:
> > If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to add a concept of variably-length encoded ranges so that it's possible to treat them as both their encoding and whatever they represent (e.g. code point or grapheme in the case of ranges of code units).
> 
> No, this is not necessary.

It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no sense whatsoever. Having range-based functions which understand the encodings and optimize accordingly can be very beneficial (that happens with strings, but it can't happen with general ranges without the concept of a variably-length encoded range, just as we have the concepts of forward ranges and random-access ranges), but actually having a range of UTF-8 or UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff like filter or map or reduce on code units doesn't make any sense at all.
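
A minimal illustration of the problem (an assumed example, not from any real code): filtering raw code units can split a multibyte sequence, leaving invalid UTF-8 behind.

import std.algorithm : filter;
import std.array : array;
import std.exception : assertThrown;
import std.utf : validate, UTFException;

void main()
{
    ubyte[] bytes = cast(ubyte[]) "é".dup;               // two code units: 0xC3 0xA9
    auto mangled = bytes.filter!(b => b != 0xA9).array;  // drops only the trail byte
    assertThrown!UTFException(validate(cast(char[]) mangled));
}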

- Jonathan M Davis