February 09, 2013
On 2013-02-09 16:10, Andrei Alexandrescu wrote:

> Requiring a random-access range of ubyte with a terminating zero may be
> the most general approach to a fast lexer - and that's fine.

Requiring a random-access range probably makes it easier. People here seems to try to support ranges with less functionality, like input or forward ranges.

-- 
/Jacob Carlborg
February 09, 2013
On 2013-02-09 15:37, Andrei Alexandrescu wrote:

> That's not a problem. You may require that the input has a terminating
> zero.

You cannot just append a terminating zero in the lexer if it's a range?

-- 
/Jacob Carlborg
February 09, 2013
On 2/9/13 10:37 AM, Jacob Carlborg wrote:
> On 2013-02-09 16:10, Andrei Alexandrescu wrote:
>
>> Requiring a random-access range of ubyte with a terminating zero may be
>> the most general approach to a fast lexer - and that's fine.
>
> Requiring a random-access range probably makes it easier. People here
> seems to try to support ranges with less functionality, like input or
> forward ranges.

Yah. The way I see it is, start with a random-access range and then see what the use patterns are. Then possibly refine.

Andrei

February 09, 2013
On 2/9/13 10:38 AM, Jacob Carlborg wrote:
> On 2013-02-09 15:37, Andrei Alexandrescu wrote:
>
>> That's not a problem. You may require that the input has a terminating
>> zero.
>
> You cannot just append a terminating zero in the lexer if it's a range?

You require it. It's client's burden.

Andrei
February 09, 2013
On 2013-02-09 16:46, Andrei Alexandrescu wrote:

> You require it. It's client's burden.

Ok, I see.

-- 
/Jacob Carlborg
February 09, 2013
On Friday, 8 February 2013 at 16:00:05 UTC, Dmitry Olshansky wrote:
> 08-Feb-2013 19:52, Iain Buclaw пишет:

>> That's still lies. :o)
>>
>
> g++ ? :)
>
>
>>
>> --
>> Iain Buclaw
>>
>> *(p < e ? p++ : p) = (c & 0x0f) + '0';

I'd think it's built by DMC++, Digital Mars C++ compiler.
February 09, 2013
09-Feb-2013 19:46, Andrei Alexandrescu пишет:
> On 2/9/13 10:37 AM, Jacob Carlborg wrote:
>> On 2013-02-09 16:10, Andrei Alexandrescu wrote:
>>
>>> Requiring a random-access range of ubyte with a terminating zero may be
>>> the most general approach to a fast lexer - and that's fine.
>>
>> Requiring a random-access range probably makes it easier. People here
>> seems to try to support ranges with less functionality, like input or
>> forward ranges.
>
> Yah. The way I see it is, start with a random-access range and then see
> what the use patterns are. Then possibly refine.
>

I don't get this. There is no sensible requirement to forbid non-random access. A couple of static ifs and you are done.

And I can confidently say that std.d.lexer has quite some room for optimization in both cases and it doesn't have to sacrifice the generic path.

I intent to continue hacking on Brain's implementation and to help him refine it. Any real help (as in work and analysis) is appreciated, thanks.

-- 
Dmitry Olshansky
February 09, 2013
09-Feb-2013 23:20, SomeDude пишет:
> On Friday, 8 February 2013 at 16:00:05 UTC, Dmitry Olshansky wrote:
>> 08-Feb-2013 19:52, Iain Buclaw пишет:
>
>>> That's still lies. :o)
>>>
>>
>> g++ ? :)
>>
>>
>>>
>>> --
>>> Iain Buclaw
>>>
>>> *(p < e ? p++ : p) = (c & 0x0f) + '0';
>
> I'd think it's built by DMC++, Digital Mars C++ compiler.

On linux, sure ;)

-- 
Dmitry Olshansky
February 09, 2013
On 2/9/2013 6:37 AM, Andrei Alexandrescu wrote:
> On 2/9/13 3:07 AM, Jonathan M Davis wrote:
>> On Friday, February 08, 2013 23:48:50 Walter Bright wrote:
>>> On 2/8/2013 12:01 AM, Jonathan M Davis wrote:
>>>> It's not quite a use case where
>>>> ranges shine - especially when efficiency is a top priority.
>>>
>>> A more problematic case is dmd's lexer relies on a 0 byte at the end to be a
>>> "sentinel" for the end of file. Without such a sentinel, every consumption
>>> of a source character requires two operations rather than one.
>>
>> I didn't know that. That's a cute trick. But yeah, without controlling the
>> input, it's not possible, and that won't work with a general implementation in
>> a library. It would be trivial enough to put a wrapper around the input to add
>> the 0 byte at the end, but the wrapper would still have to keep checking for
>> empty, so you wouldn't gain anything.
>
> That's not a problem. You may require that the input has a terminating zero.

Perhaps we can formalize this a bit. Define a SentinelInputRange, which has an additional manifest constant with the name 'sentinel'. empty becomes defined as:

    empty = front == sentinel;

popFront() does not advance if empty. Advancing over a SentinelInputRange can be done the usual:

    for (r = range; !r.empty; r.popFront()) { auto e = r.front; }

or can be done as:

    for (r = range; r.front != sentinel; r.popFront()) { auto e = r.front; }

This should work well with 0 terminated strings. The output of the lexer should also be a SentinelInputRange, with TOKeof as the sentinel.
February 09, 2013
On 2/9/2013 12:43 PM, Walter Bright wrote:> Perhaps we can formalize this a bit. Define a SentinelInputRange, which has an
> additional manifest constant with the name 'sentinel'. empty becomes defined as:
>
>      empty = front == sentinel;
>
> popFront() does not advance if empty. Advancing over a SentinelInputRange can be
> done the usual:
>
>      for (r = range; !r.empty; r.popFront()) { auto e = r.front; }
>
> or can be done as:
>
>      for (r = range; r.front != sentinel; r.popFront()) { auto e = r.front; }

If it's not clear, SentinelInputRange.front can be called *before* checking empty. If it is empty, then the sentinel is returned.