View mode: basic / threaded / horizontal-split · Log in · Help
February 28, 2013
Re: Proposal for SentinelInputRange
On 2013-02-28 11:44, Walter Bright wrote:

> Please follow the source code from file to input to the lexer.
>
> https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c
>
> line 1012 and 1038

I have no doubt in what you're saying. The spec just didn't look 
accurate according to what you said.

-- 
/Jacob Carlborg
February 28, 2013
Re: Proposal for SentinelInputRange
On 2013-02-28 12:29, Jonathan M Davis wrote:

> And you were just claiming that the lexer checked the sentinel type in only
> one place. If that's indeed the case (and I think that it's quite close to
> being true if it isn't true), then you _wouldn't_ need to use static ifs like
> this in many places. So, which is it? If you need to check the sentinel often
> enough that using static ifs is a problem, then it's probably not buying you
> much of anything over checking empty anyway.

You pick a sentinel that you need to check for anyway, i.e. null or eof. 
But if you don't manually add the sentinel there's nothing that says 
that the sentinel will be there, and therefore you weed to check for 
empty as well.

-- 
/Jacob Carlborg
February 28, 2013
Re: Proposal for SentinelInputRange
On 02/28/13 02:11, Walter Bright wrote:
> A SentinelInputRange is an InputRange with the following additions:
> 
> 1. a compile time property 'sentinel' that is the terminating value of the range
> 2. empty is defined as: empty = (front == sentinel)
> 3. it is not necessary for empty to be called before front
> 
> A C style 0-terminated string is an example of a SentinelInputRange.
> 
> The additions to std.range would be:
> 
> 1. isSentinelInputRange(T) which returns true if T is a SentinelInputRange
> 2. a unittest
> 3. documentation of this
> 
> An addition to std.string would be a function that takes a char* and returns a SentinelInputRange.
> 
> Motivation:
> 1. easy conversion of C strings to ranges
> 2. necessary for a fast implementation of a lexer

LookAheadRange, that not only defines 'sentinel', but also 'lookahead' which
defines how far one can safely look ahead. When 'lookahead==0' it becomes
equivalent to your SentinelInputRange, where 'front' and '[0]' is always
valid and can contain the sentinel. When 'lookahead==N', accessing '[N]' is
always valid.

Having said that, I've used this approach in a D lexer, and it does not really
matter in practice - avoiding the length (or '\0' sentinel) check makes a
<~1ms difference when lexing "datetime.d" sized objects (1.5Mbytes+, 460k+ tokens).
Which is practically irrelevant both in an IDE context and a compiler context
- other processing will be be orders of magnitude more expensive. An IDE doesn't
need to re-lex the whole file after every key press and 1ms won't make any
difference for a compiler run.

"Necessary for a fast implementation of a lexer" is an exaggeration, but yes,
it is an obvious optimization that has a measurable impact. Before implementing
it I was expecting a larger improvement too; and was somewhat surprised when the
resulting gain was in the noise (~1ms +/- ~3ms). /Relatively/ it is significant
(10% perf gain is never insignificant), but the /absolute/ perf difference isn't
that large.
Which of course isn't an argument against (SentinelInputLookAhead)Range; a std
API like that would be a good thing.

The advantage of SentinelInputRange/LookAheadRange is that it, unlike an array,
can be lazy. Which starts to matter when you're dealing with /token/ streams.

artur
February 28, 2013
Re: Proposal for SentinelInputRange
On 2013-02-28 15:08, Artur Skawina wrote:

> Having said that, I've used this approach in a D lexer, and it does not really
> matter in practice - avoiding the length (or '\0' sentinel) check makes a
> <~1ms difference when lexing "datetime.d" sized objects (1.5Mbytes+, 460k+ tokens).
> Which is practically irrelevant both in an IDE context and a compiler context
> - other processing will be be orders of magnitude more expensive. An IDE doesn't
> need to re-lex the whole file after every key press and 1ms won't make any
> difference for a compiler run.

It's not about lexing a single file like std.datetime. We're takling be 
able to fast lex, I don't know, 100 or 1000 of files like std.datetime.

-- 
/Jacob Carlborg
February 28, 2013
Re: Proposal for SentinelInputRange
On Thu, 28 Feb 2013 06:23:45 -0500, deadalnix <deadalnix@gmail.com> wrote:

> On Thursday, 28 February 2013 at 10:57:29 UTC, Walter Bright wrote:
>> Notice the double read of *p. This is what SentinelInputRange is trying  
>> to fix.
>>
>
> But it will be optimized away.

I have to say I agree with deadalnix.

you have essentially in lexer.c:

outer:
while(1)
{
   switch(r.front)
   {
      case 0:
        break outer;
      ...
   }
}

whereas a range where empty is defined as r.front == 0:

while(!r.empty)  // inlined to r.front != 0
{
   switch(r.front) // why would another load occur here?
   {
      // no need to check for 0, already done
      ...
   }
}

If this doesn't translate to the same code, I don't know why not.  The  
second implementation looks more elegant/straightforward.  I think others  
have pointed out that other compilers already do this.  I hope we aren't  
introducing library components to get around DMD optimizer shortcomings.

Note, r cannot be a D string in order to achieve the desired efficiency.   
With a sentinel range, length is never checked, and should not have to be  
updated.

I think it's not necessary to add extra isSentinelRange, etc. checks, just  
define your sentinel range like any other input range.

-Steve
February 28, 2013
Re: Proposal for SentinelInputRange
On 02/28/13 15:20, Jacob Carlborg wrote:
> On 2013-02-28 15:08, Artur Skawina wrote:
> 
>> Having said that, I've used this approach in a D lexer, and it does not really
>> matter in practice - avoiding the length (or '\0' sentinel) check makes a
>> <~1ms difference when lexing "datetime.d" sized objects (1.5Mbytes+, 460k+ tokens).
>> Which is practically irrelevant both in an IDE context and a compiler context
>> - other processing will be be orders of magnitude more expensive. An IDE doesn't
>> need to re-lex the whole file after every key press and 1ms won't make any
>> difference for a compiler run.
> 
> It's not about lexing a single file like std.datetime. We're takling be able to fast lex, I don't know, 100 or 1000 of files like std.datetime.

Define "fast". Lexing std.datetime takes at most ~10-20ms (possibly a single-digit
ms number, but i'd need to write some code to check the actual number). Smaller
objects take proportionally less. Meaning you'll be I/O bound, even /one/ (disk)
cache miss will have more impact then these kind of optimizations. 
Lexing a hundred small files or one 100x as big file is basically the same operation;
the difference will be in I/O + setup/teardown costs, which will be /outside/ the
lexer, so aren't affected by how it accesses input.

artur
February 28, 2013
Re: Proposal for SentinelInputRange
On 2013-02-28 15:31, Steven Schveighoffer wrote:

> while(!r.empty)  // inlined to r.front != 0
> {
>     switch(r.front) // why would another load occur here?
>     {
>        // no need to check for 0, already done
>        ...
>     }
> }

Don't you have to check for 0 anyway. You could still have more data in 
the buffer? I doesn't have to be the manually added sentential that is 
encountered.

-- 
/Jacob Carlborg
February 28, 2013
Re: Proposal for SentinelInputRange
On 2/28/13 1:57 AM, Walter Bright wrote:
> On 2/27/2013 8:01 PM, John Colvin wrote:
>> Why must sentinel be known at compile time? I don't see what's in the
>> way of it
>> being a runtime argument.
>
> Performance!
>
> It should be usable as a case in a switch statement.

Isn't it possible for the optimizer to inline the function call and then 
combine the next ifs?

if (isSentinel(value)) {
} else {
  switch(value) {
  case ...
  }
}
February 28, 2013
Re: Proposal for SentinelInputRange
On Thu, 28 Feb 2013 09:48:02 -0500, Jacob Carlborg <doob@me.com> wrote:

> On 2013-02-28 15:31, Steven Schveighoffer wrote:
>
>> while(!r.empty)  // inlined to r.front != 0
>> {
>>     switch(r.front) // why would another load occur here?
>>     {
>>        // no need to check for 0, already done
>>        ...
>>     }
>> }
>
> Don't you have to check for 0 anyway. You could still have more data in  
> the buffer? I doesn't have to be the manually added sentential that is  
> encountered.

You are missing the point.  Empty is DEFINED as r.front == 0.  Adding a  
check for 0, would essentially lead to dead code (and technically, the  
compiler could trim it out).

An important thing about a sentinel input range is that it is not an array  
-- you cannot maintain or process length, because length is defined by the  
sentinel.  This becomes an advantage when your goal is simply to process  
every element -- no time wasted updating length.

-Steve
February 28, 2013
Re: Proposal for SentinelInputRange
On 2/28/13 2:37 AM, deadalnix wrote:
> I don't see how defining a specific sentinel range here helps.

On first blush I agree. It may as well be a range that by convention is 
sentinel-terminated, and there's calls to front and popFront but never 
to empty.

Andrei
2 3 4 5 6 7 8 9 10
Top | Discussion index | About this forum | D home