August 02, 2012
On 2012-08-02 09:29, Walter Bright wrote:

> My experience in writing fast string based code that worked on UTF8 and
> correctly handled multibyte characters was that they are very possible
> and practical, and they are faster.
>
> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If
> it isn't fast, serious users will eschew it and will cook up their own.
> You'll have a nice, pretty, useless toy of std.d.lexer.
>
> I think there's some serious underestimation of how critical this is.

I do understand that the lexer needs to be insanely fast and it needs to operate on UTF-8 and not UTF-32 or anything else.

But what I still don't understand is how a UTF-8 range is going to be usable by other range based functions in Phobos.

-- 
/Jacob Carlborg
August 02, 2012
> 10. High speed matters a lot

Then add a benchmark suite to the list - the lexer should be benchmarked from the very beginning.

And it should be designed for multithreading - there is no need for on-the-fly hash-table updating; a single update at the end of each lexing thread may be enough.
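A minimal sketch of that scheme (all names hypothetical, not an actual std.d.lexer API): each lexing thread interns identifiers into its own private table with no locking, and merges into the global table exactly once when it finishes:

```d
/// Hypothetical per-thread identifier table: no shared state
/// while lexing, so the hot path takes no locks.
struct ThreadLocalIdents
{
    string[string] local;  // spelling -> canonical instance

    string intern(string id)
    {
        if (auto p = id in local)
            return *p;
        return local[id] = id;
    }
}

/// One coarse synchronized merge per thread, off the hot path,
/// instead of per-identifier locking during lexing.
void mergeInto(ref string[string] global, ThreadLocalIdents t)
{
    synchronized
    {
        foreach (k, v; t.local)
            global[k] = v;
    }
}
```

The point of the design is that synchronization cost is paid once per thread rather than once per identifier.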
August 02, 2012
On 8/2/2012 12:33 AM, Bernard Helyer wrote:
> On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
>> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
>> isn't fast, serious users will eschew it and will cook up their own. You'll
>> have a nice, pretty, useless toy of std.d.lexer.
>
> If you want to throw out some target times, that would be useful.

As fast as the dmd one would be best.

August 02, 2012
On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
> It is for ranges in general. In the general case, a range of UTF-8 or UTF-16
> makes no sense whatsoever. Having range-based functions which understand the
> encodings and optimize accordingly can be very beneficial (which happens with
> strings but can't happen with general ranges without the concept of a
> variably-length encoded range like we have with forward range or random access
> range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work.
> Range-based functions operate on elements, and doing stuff like filter or map or
> reduce on code units doesn't make any sense at all.

Yes, it can work.

August 02, 2012
On 8/2/2012 12:49 AM, Jacob Carlborg wrote:
> But what I still don't understand is how a UTF-8 range is going to be usable by
> other range based functions in Phobos.

Worst case, use an adapter range.
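A rough sketch of what such an adapter could look like (hypothetical, assuming `std.utf.decodeFront` is available): it wraps a UTF-8 code-unit range and presents it as a range of dchar, so generic Phobos algorithms can consume it while the lexer keeps working on raw code units:

```d
import std.range.primitives;
import std.utf : decodeFront;

/// Hypothetical adapter: wraps a range of char (UTF-8 code units)
/// and presents it as an input range of dchar.
struct Utf8Adapter(R)
{
    R src;           // underlying range of UTF-8 code units
    dchar cached;    // decoded front, filled lazily
    bool haveCached;

    @property bool empty() { return !haveCached && src.empty; }

    @property dchar front()
    {
        if (!haveCached)
        {
            cached = src.decodeFront();  // consumes 1..4 code units
            haveCached = true;
        }
        return cached;
    }

    void popFront()
    {
        if (!haveCached)
            src.decodeFront();  // skip the code point we never looked at
        haveCached = false;
    }
}
```

Decoding happens only when a dchar is actually requested, so the adapter costs nothing for code paths that never leave the lexer.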


August 02, 2012
On 8/2/2012 12:29 AM, Jacob Carlborg wrote:
> On 2012-08-02 09:21, Walter Bright wrote:
>
>> I answered this point a few posts up in the thread.
>
> I've read a few posts up and the only answer I found is that the lexer needs to
> operates on chars. But it does not answer the question how that range type would
> be used by all other range based functions in Phobos.
>
> Have I missed this? Can you please link to your post.
>

You can use an adapter range.

August 02, 2012
On Thursday, August 02, 2012 01:13:04 Walter Bright wrote:
> On 8/2/2012 12:33 AM, Bernard Helyer wrote:
> > On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
> >> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If
> >> it
> >> isn't fast, serious users will eschew it and will cook up their own.
> >> You'll
> >> have a nice, pretty, useless toy of std.d.lexer.
> > 
> > If you want to throw out some target times, that would be useful.
> 
> As fast as the dmd one would be best.

How would we measure that? dmd's lexer is tied to dmd, so how would we test the speed of only its lexer?

- Jonathan M Davis
August 02, 2012
On 02.08.2012 10:13, Walter Bright wrote:
> On 8/2/2012 12:33 AM, Bernard Helyer wrote:
>> On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
>>> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
>>> isn't fast, serious users will eschew it and will cook up their own. You'll
>>> have a nice, pretty, useless toy of std.d.lexer.
>>
>> If you want to throw out some target times, that would be useful.
>
> As fast as the dmd one would be best.
>

Would it be (easily) possible to "extract" the dmd lexer code (and the needed interface) to use it as a separate benchmark reference?

August 02, 2012
On 02.08.2012 10:13, Walter Bright wrote:
> On 8/2/2012 12:33 AM, Bernard Helyer wrote:
>> On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
>>> The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
>>> isn't fast, serious users will eschew it and will cook up their own. You'll
>>> have a nice, pretty, useless toy of std.d.lexer.
>>
>> If you want to throw out some target times, that would be useful.
>
> As fast as the dmd one would be best.
>

Can the dmd lexer be separated out as a library and made usable from the outside, as the benchmark reference?
August 02, 2012
On 8/2/2012 12:21 AM, Jonathan M Davis wrote:
>> Because your input range is a range of dchar?
> I think that we're misunderstanding each other here. A typical, well-written,
> range-based function which operates on ranges of dchar will use static if or
> overloads to special-case strings. This means that it will function with any
> range of dchar, but it _also_ will be as efficient with strings as if it just
> operated on strings.

It *still* must convert UTF8 to dchars before presenting them to the consumer of the dchar elements.


> It won't decode anything in the string unless it has to.
> So, having a lexer which operates on ranges of dchar does _not_ make string
> processing less efficient. It just makes it so that it can _also_ operate on
> ranges of dchar which aren't strings.
>
> For instance, my lexer uses this whenever it needs to get at the first
> character in the range:
>
> static if(isNarrowString!R)
>      Unqual!(ElementEncodingType!R) first = range[0];
> else
>      dchar first = range.front;

You're requiring a random access input range whose indexing yields something other than the range's element type? And you're requiring isNarrowString to work on an arbitrary range?


> if I need to know the number of code units that make up the code point, I
> explicitly call decode in the case of a narrow string. In either case, code
> units are _not_ being converted to dchar unless they absolutely have to be.

Or you could do away with requiring a special range type and just have it be a UTF8 range.
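A sketch of what lexing directly over a UTF-8 range looks like (hypothetical code, assuming `std.utf.stride` for multibyte lengths): the hot path compares single bytes, and multibyte sequences only matter in the few places where non-ASCII can legally appear:

```d
import std.utf : stride;

/// ASCII fast path: a single byte comparison decides identifier starts.
/// Lead bytes >= 0x80 are deferred to the (rare) non-ASCII slow path.
bool isIdentStart(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || c == '_' || c >= 0x80;
}

/// Scans an identifier starting at index i; returns the index just
/// past it. No decoding to dchar on the ASCII path.
size_t lexIdentifier(const(char)[] src, size_t i)
{
    while (i < src.length)
    {
        immutable c = src[i];
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_')
            ++i;                    // one byte, no decode
        else if (c >= 0x80)
            i += stride(src, i);    // multibyte: skip 2..4 code units
                                    // (a real lexer would also validate
                                    // the code point here)
        else
            break;                  // delimiter ends the identifier
    }
    return i;
}
```

Since D source is overwhelmingly ASCII, almost every iteration takes the single-byte branch and never pays for decoding.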

What I wasn't realizing earlier was that you were positing a range type that has two different kinds of elements. I don't think this is a proper component type.


> Yes. I understand. It has a mapping of pointers to identifiers. My point is
> that nothing but parsers will need that.
> From the standpoint of functionality,
> it's a parser feature, not a lexer feature. So, if it can be done just fine in
> the parser, then that's where it should be. If on the other hand, it _needs_
> to be in the lexer for some reason (e.g. performance), then that's a reason to
> put it there.

If you take it out of the lexer, then:

1. the lexer must allocate storage for every identifier, rather than only for unique identifiers

2. and then the parser must scan the identifier string *again*

3. there must be two hash lookups of each identifier rather than one

It's a suboptimal design.
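A sketch of the combined scheme described above (hypothetical names, not dmd's actual implementation): the lexer does the one hash lookup, stores only unique spellings, and hands the parser a canonical instance it can compare by identity:

```d
/// Hypothetical lexer-side interning table: one hash lookup per
/// identifier, storage only for unique spellings. Keys are slices
/// of the (immutable) source buffer, so unique identifiers cost
/// no extra allocation either.
struct StringTable
{
    string[string] pool;  // spelling -> canonical instance

    string intern(string spelling)
    {
        if (auto p = spelling in pool)
            return *p;              // seen before: reuse, no copy
        pool[spelling] = spelling;  // first sighting: one table insert
        return spelling;
    }
}
```

With this in the lexer, the parser never rescans identifier text and never does a second lookup; it compares interned identifiers by pointer.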