August 02, 2012
On 2012-08-02 09:29, Walter Bright wrote:

> My experience in writing fast string based code that worked on UTF8 and
> correctly handled multibyte characters was that they are very possible
> and practical, and they are faster.
>
> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If
> it isn't fast, serious users will eschew it and will cook up their own.
> You'll have a nice, pretty, useless toy of std.d.lexer.
>
> I think there's some serious underestimation of how critical this is.

I do understand that the lexer needs to be insanely fast and it needs to operate on UTF-8 and not UTF-32 or anything else.

But what I still don't understand is how a UTF-8 range is going to be usable by other range based functions in Phobos.

-- 
/Jacob Carlborg
August 02, 2012
> 10. High speed matters a lot

Then add a benchmark suite to the list - the lexer should be benchmarked from the very beginning.

And it should be designed for multithreading - there is no need for on-the-fly hash-table updating; a single update at the end of each lexing thread may be enough.
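A minimal sketch of that scheme (all names hypothetical, not an actual std.d.lexer API): each lexing thread interns identifiers into its own private table with no locking, and merges into the global table exactly once when it finishes:

```d
/// Hypothetical per-thread identifier table: no shared state
/// while lexing, so the hot path takes no locks.
struct ThreadLocalIdents
{
    string[string] local;  // spelling -> canonical instance

    string intern(string id)
    {
        if (auto p = id in local)
            return *p;
        return local[id] = id;
    }
}

/// One coarse synchronized merge per thread, off the hot path,
/// instead of per-identifier locking during lexing.
void mergeInto(ref string[string] global, ThreadLocalIdents t)
{
    synchronized
    {
        foreach (k, v; t.local)
            global[k] = v;
    }
}
```

The point of the design is that synchronization cost is paid once per thread rather than once per identifier.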
August 02, 2012
On 8/2/2012 12:33 AM, Bernard Helyer wrote:
> On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
>> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
>> isn't fast, serious users will eschew it and will cook up their own. You'll
>> have a nice, pretty, useless toy of std.d.lexer.
>
> If you want to throw out some target times, that would be useful.

As fast as the dmd one would be best.

August 02, 2012
On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
> It is for ranges in general. In the general case, a range of UTF-8 or UTF-16
> makes no sense whatsoever. Having range-based functions which understand the
> encodings and optimize accordingly can be very beneficial (which happens with
> strings but can't happen with general ranges without the concept of a
> variably-length encoded range like we have with forward range or random access
> range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work.
> Range-based functions operate on elements, and doing stuff like filter or map or
> reduce on code units doesn't make any sense at all.

Yes, it can work.

August 02, 2012
On 8/2/2012 12:49 AM, Jacob Carlborg wrote:
> But what I still don't understand is how a UTF-8 range is going to be usable by
> other range based functions in Phobos.

Worst case, use an adapter range.
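A rough sketch of what such an adapter could look like (hypothetical, assuming `std.utf.decodeFront` is available): it wraps a UTF-8 code-unit range and presents it as a range of dchar, so generic Phobos algorithms can consume it while the lexer keeps working on raw code units:

```d
import std.range.primitives;
import std.utf : decodeFront;

/// Hypothetical adapter: wraps a range of char (UTF-8 code units)
/// and presents it as an input range of dchar.
struct Utf8Adapter(R)
{
    R src;           // underlying range of UTF-8 code units
    dchar cached;    // decoded front, filled lazily
    bool haveCached;

    @property bool empty() { return !haveCached && src.empty; }

    @property dchar front()
    {
        if (!haveCached)
        {
            cached = src.decodeFront();  // consumes 1..4 code units
            haveCached = true;
        }
        return cached;
    }

    void popFront()
    {
        if (!haveCached)
            src.decodeFront();  // skip the code point we never looked at
        haveCached = false;
    }
}
```

Decoding happens only when a dchar is actually requested, so the adapter costs nothing for code paths that never leave the lexer.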


August 02, 2012
On 8/2/2012 12:29 AM, Jacob Carlborg wrote:
> On 2012-08-02 09:21, Walter Bright wrote:
>
>> I answered this point a few posts up in the thread.
>
> I've read a few posts up and the only answer I found is that the lexer needs to
> operates on chars. But it does not answer the question how that range type would
> be used by all other range based functions in Phobos.
>
> Have I missed this? Can you please link to your post.
>

You can use an adapter range.

August 02, 2012
On Thursday, August 02, 2012 01:13:04 Walter Bright wrote:
> On 8/2/2012 12:33 AM, Bernard Helyer wrote:
> > On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
> >> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If
> >> it
> >> isn't fast, serious users will eschew it and will cook up their own.
> >> You'll
> >> have a nice, pretty, useless toy of std.d.lexer.
> > 
> > If you want to throw out some target times, that would be useful.
> 
> As fast as the dmd one would be best.

How would we measure that? dmd's lexer is tied to dmd, so how would we test the speed of only its lexer?

- Jonathan M Davis
August 02, 2012
On 02.08.2012 10:13, Walter Bright wrote:
> On 8/2/2012 12:33 AM, Bernard Helyer wrote:
>> On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
>>> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
>>> isn't fast, serious users will eschew it and will cook up their own. You'll
>>> have a nice, pretty, useless toy of std.d.lexer.
>>
>> If you want to throw out some target times, that would be useful.
>
> As fast as the dmd one would be best.
>

Would it be (easily) possible to "extract" the dmd lexer code (and the needed interface) to use it as a separate benchmark reference?

August 02, 2012
On 02.08.2012 10:13, Walter Bright wrote:
> On 8/2/2012 12:33 AM, Bernard Helyer wrote:
>> On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
>>> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
>>> isn't fast, serious users will eschew it and will cook up their own. You'll
>>> have a nice, pretty, useless toy of std.d.lexer.
>>
>> If you want to throw out some target times, that would be useful.
>
> As fast as the dmd one would be best.
>

Can the dmd lexer be separated out as a library and made usable from the outside, as the benchmark reference?
August 02, 2012
On 8/2/2012 12:21 AM, Jonathan M Davis wrote:
>> Because your input range is a range of dchar?
> I think that we're misunderstanding each other here. A typical, well-written,
> range-based function which operates on ranges of dchar will use static if or
> overloads to special-case strings. This means that it will function with any
> range of dchar, but it _also_ will be as efficient with strings as if it just
> operated on strings.

It *still* must convert UTF8 to dchars before presenting them to the consumer of the dchar elements.


> It won't decode anything in the string unless it has to.
> So, having a lexer which operates on ranges of dchar does _not_ make string
> processing less efficient. It just makes it so that it can _also_ operate on
> ranges of dchar which aren't strings.
>
> For instance, my lexer uses this whenever it needs to get at the first
> character in the range:
>
> static if(isNarrowString!R)
>      Unqual!(ElementEncodingType!R) first = range[0];
> else
>      dchar first = range.front;

You're requiring a random access input range whose indexing yields something other than the range's element type? And you're requiring isNarrowString to work on an arbitrary range?


> if I need to know the number of code units that make up the code point, I
> explicitly call decode in the case of a narrow string. In either case, code
> units are _not_ being converted to dchar unless they absolutely have to be.

Or you could do away with requiring a special range type and just have it be a UTF8 range.
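A sketch of what lexing directly over a UTF-8 range looks like (hypothetical code, assuming `std.utf.stride` for multibyte lengths): the hot path compares single bytes, and multibyte sequences only matter in the few places where non-ASCII can legally appear:

```d
import std.utf : stride;

/// ASCII fast path: a single byte comparison decides identifier starts.
/// Lead bytes >= 0x80 are deferred to the (rare) non-ASCII slow path.
bool isIdentStart(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || c == '_' || c >= 0x80;
}

/// Scans an identifier starting at index i; returns the index just
/// past it. No decoding to dchar on the ASCII path.
size_t lexIdentifier(const(char)[] src, size_t i)
{
    while (i < src.length)
    {
        immutable c = src[i];
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_')
            ++i;                    // one byte, no decode
        else if (c >= 0x80)
            i += stride(src, i);    // multibyte: skip 2..4 code units
                                    // (a real lexer would also validate
                                    // the code point here)
        else
            break;                  // delimiter ends the identifier
    }
    return i;
}
```

Since D source is overwhelmingly ASCII, almost every iteration takes the single-byte branch and never pays for decoding.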

What I wasn't realizing earlier was that you were positing a range type that has two different kinds of elements. I don't think this is a proper component type.


> Yes. I understand. It has a mapping of pointers to identifiers. My point is
> that nothing but parsers will need that.
> From the standpoint of functionality,
> it's a parser feature, not a lexer feature. So, if it can be done just fine in
> the parser, then that's where it should be. If on the other hand, it _needs_
> to be in the lexer for some reason (e.g. performance), then that's a reason to
> put it there.

If you take it out of the lexer, then:

1. the lexer must allocate storage for every identifier, rather than only for unique identifiers

2. and then the parser must scan the identifier string *again*

3. there must be two hash lookups of each identifier rather than one

It's a suboptimal design.
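A sketch of the combined scheme described above (hypothetical names, not dmd's actual implementation): the lexer does the one hash lookup, stores only unique spellings, and hands the parser a canonical instance it can compare by identity:

```d
/// Hypothetical lexer-side interning table: one hash lookup per
/// identifier, storage only for unique spellings. Keys are slices
/// of the (immutable) source buffer, so unique identifiers cost
/// no extra allocation either.
struct StringTable
{
    string[string] pool;  // spelling -> canonical instance

    string intern(string spelling)
    {
        if (auto p = spelling in pool)
            return *p;              // seen before: reuse, no copy
        pool[spelling] = spelling;  // first sighting: one table insert
        return spelling;
    }
}
```

With this in the lexer, the parser never rescans identifier text and never does a second lookup; it compares interned identifiers by pointer.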