February 28, 2013
28-Feb-2013 19:31, Andrei Alexandrescu пишет:
> On 2/28/13 5:54 AM, Walter Bright wrote:
>> On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
>>>> Again, please see how lexer.c works. I assure you, there is no double
>>>> copying going on, nor is there a double test for the terminating 0.
>>>
>>> I know what the lexer does, and remember that it _doesn't_ operate on
>>> ranges,
>>> and there are subtle differences between being able to just use char*
>>> and
>>> trying to handle generic ranges.
>>
>> Hence the need to invent SentinelInputRange.
>
> I don't think the sentinel input range is a blocker for redoing the
> parser (with ranges) in D. This discussion has probably run its course.
> The right thing to do at this point is port the lexer and figure what
> primitives are necessary from its input.
>

And it wasn't the problem that std.d.lexer had anyway. Check the latest results.

The merit of sentinel range IMHO is largely in tapping into C-strings and the like more naturally.



-- 
Dmitry Olshansky
February 28, 2013
On 2/28/13 8:44 AM, Jacob Carlborg wrote:
> On 2013-02-28 12:29, Jonathan M Davis wrote:
>
>> And you were just claiming that the lexer checked the sentinel type in
>> only
>> one place. If that's indeed the case (and I think that it's quite
>> close to
>> being true if it isn't true), then you _wouldn't_ need to use static
>> ifs like
>> this in many places. So, which is it? If you need to check the
>> sentinel often
>> enough that using static ifs is a problem, then it's probably not
>> buying you
>> much of anything over checking empty anyway.
>
> You pick a sentinel that you need to check for anyway, i.e. null or eof.
> But if you don't manually add the sentinel there's nothing that says
> that the sentinel will be there, and therefore you weed to check for
> empty as well.

auto range = assumeWithSentinel(input);

:o)

I'm only half-joking - the awesomeness of assumeSorted suggests we could reuse the pattern elsewhere.


Andrei
February 28, 2013
On 2/28/2013 5:40 AM, Jacob Carlborg wrote:
> I have no doubt in what you're saying. The spec just didn't look accurate
> according to what you said.

I don't know what you mean.

February 28, 2013
On 2/28/2013 7:38 AM, Dmitry Olshansky wrote:
> 28-Feb-2013 14:44, Walter Bright пишет:
>> On 2/28/2013 12:25 AM, Jacob Carlborg wrote:
>>> On 2013-02-28 08:23, Walter Bright wrote:
>>>
>>>> There is always a terminating 0, even if the file ends in 0x1a.
>>>>
>>>> (The 0x1A comes from some old editors that end a file with a control Z.)
>>>
>>> http://dlang.org/lex.html, "End of File" shows this:
>>>
>>> EndOfFile:
>>>      physical end of the file
>>>      \u0000
>>>      \u001A
>>>
>>> Doesn't that mean that the input is empty if the file is empty, \u0000
>>> or \u001A
>>> is encountered?
>>>
>>
>
>
>> Please follow the source code from file to input to the lexer.
>
> Source as spec is no good. Either change the spec or  admit that having a tuple
> of sentinels as manifest constant is fine. In any case constant tuples are
> easily unrolled into case statements.

The spec is correct, so is the code, and tuple sentinels are entirely unnecessary.

February 28, 2013
On 2/28/2013 6:48 AM, Ary Borenszweig wrote:
> On 2/28/13 1:57 AM, Walter Bright wrote:
>> On 2/27/2013 8:01 PM, John Colvin wrote:
>>> Why must sentinel be known at compile time? I don't see what's in the
>>> way of it
>>> being a runtime argument.
>>
>> Performance!
>>
>> It should be usable as a case in a switch statement.
>
> Isn't it possible for the optimizer to inline the function call and then combine
> the next ifs?
>
> if (isSentinel(value)) {
> } else {
>    switch(value) {
>    case ...
>    }
> }

1. I don't know of any C compiler that would do that.

2. There are other cases, as I pointed out to deadalnix.

3. You still can't do lookahead without extra checks

Once again, guys, have a look at lexer.c at the lines I pointed out.
February 28, 2013
On 2/28/2013 6:59 AM, Andrei Alexandrescu wrote:
> On 2/28/13 2:37 AM, deadalnix wrote:
>> I don't see how defining a specific sentinel range here helps.
>
> On first blush I agree. It may as well be a range that by convention is
> sentinel-terminated, and there's calls to front and popFront but never to empty.


Consider the following code from lexer.c:

   p++;
   switch (*p)

Written using an InputRange:

   popFront();
   switch (front)

That code is INVALID. This is why a SentinelInputRange is necessary. You can't just use an InputRange in an invalid manner by convention.

February 28, 2013
28-Feb-2013 20:33, Walter Bright пишет:
> On 2/28/2013 7:38 AM, Dmitry Olshansky wrote:
>> 28-Feb-2013 14:44, Walter Bright пишет:
>>> On 2/28/2013 12:25 AM, Jacob Carlborg wrote:
>>>> On 2013-02-28 08:23, Walter Bright wrote:
>>>>
>>>>> There is always a terminating 0, even if the file ends in 0x1a.
>>>>>
>>>>> (The 0x1A comes from some old editors that end a file with a
>>>>> control Z.)
>>>>
>>>> http://dlang.org/lex.html, "End of File" shows this:
>>>>
>>>> EndOfFile:
>>>>      physical end of the file
>>>>      \u0000
>>>>      \u001A
>>>>
>>>> Doesn't that mean that the input is empty if the file is empty, \u0000
>>>> or \u001A
>>>> is encountered?
>>>>
>>>
>>
>>
>>> Please follow the source code from file to input to the lexer.
>>
>> Source as spec is no good. Either change the spec or  admit that
>> having a tuple
>> of sentinels as manifest constant is fine. In any case constant tuples
>> are
>> easily unrolled into case statements.
>
> The spec is correct, so is the code, and tuple sentinels are entirely
> unnecessary.
>

line 300:
    case 0x1A:          // ^Z means end of file
    case 0:
                        break;

On the lines you noted it claimed that that 0x1a is outdate. Along with the fact that you allocate filesize+2 and fill the last 2 bytes with zeros.

In any case I see 0 and 0x1a as 2 values that act like sentinels i.e. a tuple. And this is what spec says - any one of them is a sentinel. Correct me if I'm wrong.

-- 
Dmitry Olshansky
February 28, 2013
On 2/28/2013 3:29 AM, Jonathan M Davis wrote:
> And you were just claiming that the lexer checked the sentinel type in only
> one place. If that's indeed the case (and I think that it's quite close to
> being true if it isn't true), then you _wouldn't_ need to use static ifs like
> this in many places. So, which is it? If you need to check the sentinel often
> enough that using static ifs is a problem, then it's probably not buying you
> much of anything over checking empty anyway.

Please, again, examine lexer.c.

For example, look at what happens after line 719. Try 794 in particular.

Now look at line 996 and what follows. Note that there is not a single null check there, yet the code is correct and does not run off the end of the data.


> But my point is that outside of strings or arrays, you're almost certainly
> stuck with that.

I've given you two examples (lexer and regexp) where you are certainly not stuck with that, and those two cases matter.


> Pure input ranges fail utterly as you can't save them, so you get _zero_
> lookahead [...]

Yet the lexer will work efficiently and correctly with a SentinelInputRange that is also a ForwardRange. It fits in nicely with the range concepts.

February 28, 2013
On 2/28/2013 7:31 AM, Andrei Alexandrescu wrote:
> I don't think the sentinel input range is a blocker for redoing the parser (with
> ranges) in D. This discussion has probably run its course. The right thing to do
> at this point is port the lexer and figure what primitives are necessary from
> its input.

I don't agree. It is a blocker for getting a fast lexer.

February 28, 2013
On 2/28/2013 7:37 AM, monarch_dodra wrote:
> Just provide it with a disclaimer than it will only ever be useful for a select
> class of algorithms (parser), but that the optimization opportunity will be
> ignored by the rest of phobos.

1. parsing text is very commonplace

2. it is used in std.regexp in the interpreter engine, a quite different use case, but also common to bytecode engines