February 28, 2013
On 2013-02-28 11:44, Walter Bright wrote:

> Please follow the source code from file to input to the lexer.
>
> https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c
>
> line 1012 and 1038

I have no doubt in what you're saying. The spec just didn't look accurate according to what you said.

-- 
/Jacob Carlborg
February 28, 2013
On 2013-02-28 12:29, Jonathan M Davis wrote:

> And you were just claiming that the lexer checked the sentinel type in only
> one place. If that's indeed the case (and I think that it's quite close to
> being true if it isn't true), then you _wouldn't_ need to use static ifs like
> this in many places. So, which is it? If you need to check the sentinel often
> enough that using static ifs is a problem, then it's probably not buying you
> much of anything over checking empty anyway.

You pick a sentinel that you need to check for anyway, i.e. null or eof. But if you don't manually add the sentinel there's nothing that says that the sentinel will be there, and therefore you weed to check for empty as well.

-- 
/Jacob Carlborg
February 28, 2013
On 02/28/13 02:11, Walter Bright wrote:
> A SentinelInputRange is an InputRange with the following additions:
> 
> 1. a compile time property 'sentinel' that is the terminating value of the range
> 2. empty is defined as: empty = (front == sentinel)
> 3. it is not necessary for empty to be called before front
> 
> A C style 0-terminated string is an example of a SentinelInputRange.
> 
> The additions to std.range would be:
> 
> 1. isSentinelInputRange(T) which returns true if T is a SentinelInputRange
> 2. a unittest
> 3. documentation of this
> 
> An addition to std.string would be a function that takes a char* and returns a SentinelInputRange.
> 
> Motivation:
> 1. easy conversion of C strings to ranges
> 2. necessary for a fast implementation of a lexer

LookAheadRange, that not only defines 'sentinel', but also 'lookahead' which defines how far one can safely look ahead. When 'lookahead==0' it becomes equivalent to your SentinelInputRange, where 'front' and '[0]' is always valid and can contain the sentinel. When 'lookahead==N', accessing '[N]' is always valid.

Having said that, I've used this approach in a D lexer, and it does not really
matter in practice - avoiding the length (or '\0' sentinel) check makes a
<~1ms difference when lexing "datetime.d" sized objects (1.5Mbytes+, 460k+ tokens).
Which is practically irrelevant both in an IDE context and a compiler context
- other processing will be be orders of magnitude more expensive. An IDE doesn't
need to re-lex the whole file after every key press and 1ms won't make any
difference for a compiler run.

"Necessary for a fast implementation of a lexer" is an exaggeration, but yes,
it is an obvious optimization that has a measurable impact. Before implementing
it I was expecting a larger improvement too; and was somewhat surprised when the
resulting gain was in the noise (~1ms +/- ~3ms). /Relatively/ it is significant
(10% perf gain is never insignificant), but the /absolute/ perf difference isn't
that large.
Which of course isn't an argument against (SentinelInputLookAhead)Range; a std
API like that would be a good thing.

The advantage of SentinelInputRange/LookAheadRange is that it, unlike an array, can be lazy. Which starts to matter when you're dealing with /token/ streams.

artur
February 28, 2013
On 2013-02-28 15:08, Artur Skawina wrote:

> Having said that, I've used this approach in a D lexer, and it does not really
> matter in practice - avoiding the length (or '\0' sentinel) check makes a
> <~1ms difference when lexing "datetime.d" sized objects (1.5Mbytes+, 460k+ tokens).
> Which is practically irrelevant both in an IDE context and a compiler context
> - other processing will be be orders of magnitude more expensive. An IDE doesn't
> need to re-lex the whole file after every key press and 1ms won't make any
> difference for a compiler run.

It's not about lexing a single file like std.datetime. We're takling be able to fast lex, I don't know, 100 or 1000 of files like std.datetime.

-- 
/Jacob Carlborg
February 28, 2013
On Thu, 28 Feb 2013 06:23:45 -0500, deadalnix <deadalnix@gmail.com> wrote:

> On Thursday, 28 February 2013 at 10:57:29 UTC, Walter Bright wrote:
>> Notice the double read of *p. This is what SentinelInputRange is trying to fix.
>>
>
> But it will be optimized away.

I have to say I agree with deadalnix.

you have essentially in lexer.c:

outer:
while(1)
{
   switch(r.front)
   {
      case 0:
        break outer;
      ...
   }
}

whereas a range where empty is defined as r.front == 0:

while(!r.empty)  // inlined to r.front != 0
{
   switch(r.front) // why would another load occur here?
   {
      // no need to check for 0, already done
      ...
   }
}

If this doesn't translate to the same code, I don't know why not.  The second implementation looks more elegant/straightforward.  I think others have pointed out that other compilers already do this.  I hope we aren't introducing library components to get around DMD optimizer shortcomings.

Note, r cannot be a D string in order to achieve the desired efficiency.  With a sentinel range, length is never checked, and should not have to be updated.

I think it's not necessary to add extra isSentinelRange, etc. checks, just define your sentinel range like any other input range.

-Steve
February 28, 2013
On 02/28/13 15:20, Jacob Carlborg wrote:
> On 2013-02-28 15:08, Artur Skawina wrote:
> 
>> Having said that, I've used this approach in a D lexer, and it does not really
>> matter in practice - avoiding the length (or '\0' sentinel) check makes a
>> <~1ms difference when lexing "datetime.d" sized objects (1.5Mbytes+, 460k+ tokens).
>> Which is practically irrelevant both in an IDE context and a compiler context
>> - other processing will be be orders of magnitude more expensive. An IDE doesn't
>> need to re-lex the whole file after every key press and 1ms won't make any
>> difference for a compiler run.
> 
> It's not about lexing a single file like std.datetime. We're takling be able to fast lex, I don't know, 100 or 1000 of files like std.datetime.

Define "fast". Lexing std.datetime takes at most ~10-20ms (possibly a single-digit
ms number, but i'd need to write some code to check the actual number). Smaller
objects take proportionally less. Meaning you'll be I/O bound, even /one/ (disk)
cache miss will have more impact then these kind of optimizations.
Lexing a hundred small files or one 100x as big file is basically the same operation;
the difference will be in I/O + setup/teardown costs, which will be /outside/ the
lexer, so aren't affected by how it accesses input.

artur
February 28, 2013
On 2013-02-28 15:31, Steven Schveighoffer wrote:

> while(!r.empty)  // inlined to r.front != 0
> {
>     switch(r.front) // why would another load occur here?
>     {
>        // no need to check for 0, already done
>        ...
>     }
> }

Don't you have to check for 0 anyway. You could still have more data in the buffer? I doesn't have to be the manually added sentential that is encountered.

-- 
/Jacob Carlborg
February 28, 2013
On 2/28/13 1:57 AM, Walter Bright wrote:
> On 2/27/2013 8:01 PM, John Colvin wrote:
>> Why must sentinel be known at compile time? I don't see what's in the
>> way of it
>> being a runtime argument.
>
> Performance!
>
> It should be usable as a case in a switch statement.

Isn't it possible for the optimizer to inline the function call and then combine the next ifs?

if (isSentinel(value)) {
} else {
  switch(value) {
  case ...
  }
}
February 28, 2013
On Thu, 28 Feb 2013 09:48:02 -0500, Jacob Carlborg <doob@me.com> wrote:

> On 2013-02-28 15:31, Steven Schveighoffer wrote:
>
>> while(!r.empty)  // inlined to r.front != 0
>> {
>>     switch(r.front) // why would another load occur here?
>>     {
>>        // no need to check for 0, already done
>>        ...
>>     }
>> }
>
> Don't you have to check for 0 anyway. You could still have more data in the buffer? I doesn't have to be the manually added sentential that is encountered.

You are missing the point.  Empty is DEFINED as r.front == 0.  Adding a check for 0, would essentially lead to dead code (and technically, the compiler could trim it out).

An important thing about a sentinel input range is that it is not an array -- you cannot maintain or process length, because length is defined by the sentinel.  This becomes an advantage when your goal is simply to process every element -- no time wasted updating length.

-Steve
February 28, 2013
On 2/28/13 2:37 AM, deadalnix wrote:
> I don't see how defining a specific sentinel range here helps.

On first blush I agree. It may as well be a range that by convention is sentinel-terminated, and there's calls to front and popFront but never to empty.

Andrei