August 02, 2012 Re: std.d.lexer requirements | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On 2012-08-02 09:29, Walter Bright wrote:
> My experience in writing fast string based code that worked on UTF8 and
> correctly handled multibyte characters was that they are very possible
> and practical, and they are faster.
>
> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If
> it isn't fast, serious users will eschew it and will cook up their own.
> You'll have a nice, pretty, useless toy of std.d.lexer.
>
> I think there's some serious underestimation of how critical this is.
I do understand that the lexer needs to be insanely fast and it needs to operate on UTF-8 and not UTF-32 or anything else.
But what I still don't understand is how a UTF-8 range is going to be usable by other range based functions in Phobos.
--
/Jacob Carlborg
|
August 02, 2012 Re: std.d.lexer requirements | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | > 10. High speed matters a lot
Then add a benchmark "suite" to the list. The lexer should be benchmarked from the very beginning,
and it should be designed for multithreading. There is no need for
on-the-fly hash-table updating; maybe just one update at the end of each lexer thread.
|
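As a rough illustration of the benchmark suggestion above, here is a minimal sketch of a timing harness. It assumes a current Phobos with `std.datetime.stopwatch` (which postdates this thread), and `countTokens` is a hypothetical stand-in for a real lexer entry point, not anything from dmd or std.d.lexer.

```d
import std.array : replicate;
import std.datetime.stopwatch : benchmark;
import std.stdio : writefln;

/// Hypothetical stand-in for a lexer: counts whitespace-separated chunks
/// by scanning UTF-8 code units directly, without decoding to dchar.
size_t countTokens(const(char)[] source)
{
    size_t tokens;
    bool inToken;
    foreach (char c; source) // iterates code units, not code points
    {
        immutable isSpace = c == ' ' || c == '\t' || c == '\n' || c == '\r';
        if (!isSpace && !inToken)
            ++tokens;
        inToken = !isSpace;
    }
    return tokens;
}

void main()
{
    // Synthetic input; a real suite would load actual D source files.
    auto source = "int x = 42; // comment\n".replicate(10_000);

    size_t sink; // accumulates results so the work is observable
    void run() { sink += countTokens(source); }

    // benchmark returns one Duration per function passed in.
    auto results = benchmark!run(100);
    writefln("100 runs took %s (%s tokens per run)", results[0], sink / 100);
}
```

Running the same harness against an extracted reference lexer would give the comparison point asked for elsewhere in the thread; the numbers only mean something relative to each other on the same machine and input.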
August 02, 2012 Re: std.d.lexer requirements | ||||
---|---|---|---|---|
| ||||
Posted in reply to Bernard Helyer | On 8/2/2012 12:33 AM, Bernard Helyer wrote:
> On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
>> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
>> isn't fast, serious users will eschew it and will cook up their own. You'll
>> have a nice, pretty, useless toy of std.d.lexer.
>
> If you want to throw out some target times, that would be useful.
As fast as the dmd one would be best.
|
August 02, 2012 Re: std.d.lexer requirements | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
> It is for ranges in general. In the general case, a range of UTF-8 or UTF-16
> makes no sense whatsoever. Having range-based functions which understand the
> encodings and optimize accordingly can be very beneficial (which happens with
> strings but can't happen with general ranges without the concept of a
> variably-length encoded range like we have with forward range or random access
> range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work.
> Range-based functions operate on elements, and doing stuff like filter or map or
> reduce on code units doesn't make any sense at all.
Yes, it can work.
|
August 02, 2012 Re: std.d.lexer requirements | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jacob Carlborg | On 8/2/2012 12:49 AM, Jacob Carlborg wrote:
> But what I still don't understand is how a UTF-8 range is going to be usable by
> other range based functions in Phobos.
Worst case use an adapter range.
|
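One way to picture the adapter-range answer is the sketch below. It uses `std.utf.byCodeUnit` and `std.utf.byDchar`, which exist in current Phobos but were added after this thread, so this is an illustration of the idea and not what was available or meant in 2012. The sample string is made up.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter;
import std.range : walkLength;
import std.uni : isAlpha;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string source = "naïve x";

    // Lexer-facing view: elements are UTF-8 code units, nothing is decoded.
    auto codeUnits = source.byCodeUnit;
    static assert(is(typeof(codeUnits.front) : char));
    assert(codeUnits.length == 8); // 'ï' occupies two code units

    // Adapter: decode to code points only for consumers that want dchar.
    auto codePoints = source.byCodeUnit.byDchar;
    static assert(is(typeof(codePoints.front) : dchar));
    assert(codePoints.walkLength == 7);

    // Generic range algorithms then work on the adapted range as usual.
    auto letters = source.byCodeUnit.byDchar.filter!isAlpha;
    assert(letters.equal("naïvex"d));
}
```

The point of the layering is that decoding cost is paid only by the code that asks for code points, while the fast path keeps working on raw code units.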
August 02, 2012 Re: std.d.lexer requirements | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jacob Carlborg | On 8/2/2012 12:29 AM, Jacob Carlborg wrote:
> On 2012-08-02 09:21, Walter Bright wrote:
>
>> I answered this point a few posts up in the thread.
>
> I've read a few posts up and the only answer I found is that the lexer needs to
> operate on chars. But it does not answer the question of how that range type would
> be used by all other range based functions in Phobos.
>
> Have I missed this? Can you please link to your post.
>
You can use an adapter range.
|
August 02, 2012 Re: std.d.lexer requirements | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Thursday, August 02, 2012 01:13:04 Walter Bright wrote:
> On 8/2/2012 12:33 AM, Bernard Helyer wrote:
> > On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
> >> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If
> >> it
> >> isn't fast, serious users will eschew it and will cook up their own.
> >> You'll
> >> have a nice, pretty, useless toy of std.d.lexer.
> >
> > If you want to throw out some target times, that would be useful.
>
> As fast as the dmd one would be best.
How would we measure that? dmd's lexer is tied to dmd, so how would we test the speed of only its lexer?
- Jonathan M Davis
|
August 02, 2012 Re: std.d.lexer requirements | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On 02.08.2012 10:13, Walter Bright wrote:
> On 8/2/2012 12:33 AM, Bernard Helyer wrote:
>> On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
>>> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
>>> isn't fast, serious users will eschew it and will cook up their own. You'll
>>> have a nice, pretty, useless toy of std.d.lexer.
>>
>> If you want to throw out some target times, that would be useful.
>
> As fast as the dmd one would be best.
>
Would it be (easily) possible to "extract" the dmd lexer code (and the needed interface) to use it as a separate benchmark reference?
|
August 02, 2012 Re: std.d.lexer requirements | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On 02.08.2012 10:13, Walter Bright wrote:
> On 8/2/2012 12:33 AM, Bernard Helyer wrote:
>> On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
>>> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
>>> isn't fast, serious users will eschew it and will cook up their own. You'll
>>> have a nice, pretty, useless toy of std.d.lexer.
>>
>> If you want to throw out some target times, that would be useful.
>
> As fast as the dmd one would be best.
>
Can the dmd lexer be separated out as a lib and made usable from outside, as the benchmark reference?
|
August 02, 2012 Re: std.d.lexer requirements | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | On 8/2/2012 12:21 AM, Jonathan M Davis wrote:
>> Because your input range is a range of dchar?
> I think that we're misunderstanding each other here. A typical, well-written,
> range-based function which operates on ranges of dchar will use static if or
> overloads to special-case strings. This means that it will function with any
> range of dchar, but it _also_ will be as efficient with strings as if it just
> operated on strings.
It *still* must convert UTF8 to dchars before presenting them to the consumer of the dchar elements.
> It won't decode anything in the string unless it has to.
> So, having a lexer which operates on ranges of dchar does _not_ make string
> processing less efficient. It just makes it so that it can _also_ operate on
> ranges of dchar which aren't strings.
>
> For instance, my lexer uses this whenever it needs to get at the first
> character in the range:
>
> static if(isNarrowString!R)
>     Unqual!(ElementEncodingType!R) first = range[0];
> else
>     dchar first = range.front;
You're requiring a random access input range that has random access to something other than the range element type?? and you're requiring an isNarrowString to work on an arbitrary range?
> if I need to know the number of code units that make up the code point, I
> explicitly call decode in the case of a narrow string. In either case, code
> units are _not_ being converted to dchar unless they absolutely have to be.
Or you could do away with requiring a special range type and just have it be a UTF8 range.
What I wasn't realizing earlier was that you were positing a range type that has two different kinds of elements. I don't think this is a proper component type.
> Yes. I understand. It has a mapping of pointers to identifiers. My point is
> that nothing but parsers will need that.
> From the standpoint of functionality,
> it's a parser feature, not a lexer feature. So, if it can be done just fine in
> the parser, then that's where it should be. If on the other hand, it _needs_
> to be in the lexer for some reason (e.g. performance), then that's a reason to
> put it there.
If you take it out of the lexer, then:
1. the lexer must allocate storage for every identifier, rather than only for unique identifiers
2. and then the parser must scan the identifier string *again*
3. there must be two hash lookups of each identifier rather than one
It's a suboptimal design.
|
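The identifier-table argument in the post above can be sketched roughly as follows. The `Identifier` and `IdentifierTable` names are hypothetical and this is not dmd's actual implementation; it only illustrates interning identifiers in the lexer so that storage is allocated once per unique name and later stages compare references instead of rescanning strings.

```d
/// Hypothetical interned identifier (not dmd's real type).
final class Identifier
{
    string name;
    this(string name) { this.name = name; }
}

/// Hypothetical string table owned by the lexer.
struct IdentifierTable
{
    private Identifier[string] table;

    /// Returns the unique Identifier for `text`.
    /// A repeated identifier costs one hash lookup and no allocation;
    /// only the first occurrence copies the text and inserts an entry.
    Identifier intern(const(char)[] text)
    {
        if (auto existing = text in table)
            return *existing;
        auto id = new Identifier(text.idup);
        table[id.name] = id;
        return id;
    }

    /// Number of unique identifiers seen so far.
    size_t length() const { return table.length; }
}

void main()
{
    IdentifierTable ids;

    auto a = ids.intern("foo");
    auto b = ids.intern("foo"); // same spelling, same object
    auto c = ids.intern("bar");

    // A parser can compare identifiers by reference, with no string scan.
    assert(a is b);
    assert(a !is c);
    assert(ids.length == 2);
}
```

If the table lived in the parser instead, the lexer would have to hand over a freshly stored string for every occurrence and the parser would hash it again, which is the duplication the numbered list describes.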