February 28, 2013
On 2/28/2013 12:20 AM, Jacob Carlborg wrote:
> On 2013-02-28 06:13, Walter Bright wrote:
>
>> No. Look at how lexer.c is fed data.
>
> C strings have a sentinel from the beginning, '\0'.
>

Look at how the source file is transformed into a 0 terminated input. Files are not C strings.

There are no extra copies and no extra tests.
February 28, 2013
On Thursday, 28 February 2013 at 10:37:35 UTC, Walter Bright wrote:
> On 2/27/2013 11:37 PM, deadalnix wrote:
>> It does check for sentinel in many places (and it has to).
>
> No, it doesn't. I suggest examining line 481 of lexer.c. Then examine line 558 and 559.
>
> https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c

escapeSequence does check for it.
February 28, 2013
On 2/28/2013 12:25 AM, Jacob Carlborg wrote:
> On 2013-02-28 08:23, Walter Bright wrote:
>
>> There is always a terminating 0, even if the file ends in 0x1a.
>>
>> (The 0x1A comes from some old editors that end a file with a control Z.)
>
> http://dlang.org/lex.html, "End of File" shows this:
>
> EndOfFile:
>      physical end of the file
>      \u0000
>      \u001A
>
> Doesn't that mean that the input is empty if the file is empty, \u0000 or \u001A
> is encountered?
>

Please follow the source code from file to input to the lexer.

https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c

line 1012 and 1038
February 28, 2013
On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
>> Again, please see how lexer.c works. I assure you, there is no double
>> copying going on, nor is there a double test for the terminating 0.
>
> I know what the lexer does, and remember that it _doesn't_ operate on ranges,
> and there are subtle differences between being able to just use char* and
> trying to handle generic ranges.

Hence the need to invent SentinelInputRange.


> Given how a lexer works (and I have been working on a lexer off and on
> recently), the only real difference is that you'd just use a couple of static
> ifs like
>
> static if(!isSomeString!R)
> {
>      if(range.empty)
>          break; //or whatever you do at the end
> }
>
> static if(isSomeString!R)
> {
>      case 0:
>          break; //or whatever you do at the end
> }

There are so many places where this would occur, it cries out for a new type.


> So, in the case of a lexer, I don't see sentinel ranges as buying us much. You
> end up having to wrap most any range that you pass to the lexer or whatever
> (including strings so that they'll pass isSentinelRange), you lose out on any
> optimizations of any functions that you call which special-case strings
> (though there probably wouldn't be many of those in a lexer), and all you
> avoid is a couple of static ifs.

And NO, THE SOURCE FILE INPUT IS NEITHER WRAPPED NOR DOUBLE COPIED. Here's how it's done:

https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c

line 1012 and 1038

> The idea of sentinels certainly isn't useless, but anything caring about that
> sort of speed is likely to just use strings or arrays, and those can trivially
> be special cased to avoid unnecessary empty checks and to add the check for
> the sentinel, making the whole sentinel range idea an unnecessary complication
> IMHO.

You can't do efficient lookahead without sentinels, either. Lexers are sensitive to every instruction executed per character read. No sentinels mean double the number of instructions per source character.

InputRanges are an abject failure if "anyone caring about speed" is not going to use them. And yes, I care very much about the D lexing speed.


February 28, 2013
On 2/28/2013 1:19 AM, Peter Alexander wrote:
> On Thursday, 28 February 2013 at 01:12:19 UTC, Walter Bright wrote:
>> Motivation:
>> 1. easy conversion of C strings to ranges
>
> This is already easy.
>
> struct CString
> {
>      @property char front() { return *p; }
>      void popFront() { ++p; }
>      @property bool empty() const { return !*p; }
>      char* p;
> }
>
> (Yep, rubbish implementation, missing save, constructors etc., but you get the
> idea).

Notice the double read of *p. This is what SentinelInputRange is trying to fix.


>> 2. necessary for a fast implementation of a lexer
>
> This is very domain specific. Is a whole new range concept in the standard
> library necessary?

I've posted in this thread several use cases.


> My worry here is that we are setting a precedent for adding new range concepts.
> If the only justification needed is that is saves a single operation in some
> niche area of computing then we are opening the door to a LOT of different range
> concepts.

There are many cases where speed is a very big deal.


> Are there any other REAL uses for this other than in one line of a lexer
> implementation? I think inclusions into the standard library should require at
> least several distinct and realistic use cases.

I've posted several - including one in Phobos.
February 28, 2013
On 2/28/2013 2:42 AM, deadalnix wrote:
> On Thursday, 28 February 2013 at 10:37:35 UTC, Walter Bright wrote:
>> On 2/27/2013 11:37 PM, deadalnix wrote:
>>> It does check for sentinel in many places (and it has to).
>>
>> No, it doesn't. I suggest examining line 481 of lexer.c. Then examine line 558
>> and 559.
>>
>> https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c
>
> escapeSequence does check for it.

Note the lookahead is done without checking first. This is the salient point. A sentinel means you are guaranteed to always be able to look one character ahead without checking.

This is used all over the lexer.

The other InputRange forms do not allow peeking ahead without testing *first*.
February 28, 2013
On Thursday, 28 February 2013 at 10:57:29 UTC, Walter Bright wrote:
> Notice the double read of *p. This is what SentinelInputRange is trying to fix.
>

But it will be optimized away.
February 28, 2013
On Thursday, February 28, 2013 02:54:24 Walter Bright wrote:
> On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
> >> Again, please see how lexer.c works. I assure you, there is no double copying going on, nor is there a double test for the terminating 0.
> > 
> > I know what the lexer does, and remember that it _doesn't_ operate on ranges, and there are subtle differences between being able to just use char* and trying to handle generic ranges.
> 
> Hence the need to invent SentinelInputRange.
> 
> > Given how a lexer works (and I have been working on a lexer off and on recently), the only real difference is that you'd just use a couple of static ifs like
> > 
> > static if(!isSomeString!R)
> > {
> > 
> >      if(range.empty)
> > 
> >          break; //or whatever you do at the end
> > 
> > }
> > 
> > static if(isSomeString!R)
> > {
> > 
> >      case 0:
> >          break; //or whatever you do at the end
> > 
> > }
> 
> There are so many places where this would occur, it cries out for a new type.

And you were just claiming that the lexer checked the sentinel type in only one place. If that's indeed the case (and I think that it's quite close to being true if it isn't true), then you _wouldn't_ need to use static ifs like this in many places. So, which is it? If you need to check the sentinel often enough that using static ifs is a problem, then it's probably not buying you much of anything over checking empty anyway.

> > The idea of sentinels certainly isn't useless, but anything caring about that sort of speed is likely to just use strings or arrays, and those can trivially be special cased to avoid unnecessary empty checks and to add the check for the sentinel, making the whole sentinel range idea an unnecessary complication IMHO.
> 
> You can't do efficient lookahead without sentinels, either. Lexers are sensitive to every instruction executed per character read. No sentinels mean double the number of instructions per source character.

But my point is that outside of strings or arrays, you're almost certainly stuck with that. Most ranges would have to be wrapped in order to act as a sentinel range, and the sentinel range would have to check empty on the internal range on every popFront. The only time that a sentinel range would buy you anything outside of strings or arrays is when you have a range type written specifically to be a sentinel range and which doesn't have to call empty on a range internally. But that means that it's going to have to manage whatever it's holding internally itself and can't wrap another range - and that memory will almost certainly be an array of some sort. And if that's the case, what did we buy by wrapping it in a sentinel range? Why not just use the array directly?

Pretty much the only type of range which isn't wrapping another range which wouldn't have an array internally would be one that was either reading from a file (or some other input) as it iterates or one which was generative. In either case, you would be unable to put a zero on the end without checking whether it was empty with each popFront, meaning that making it a sentinel range was pointless. So, pretty much the only ranges which could gain efficiency as sentinel ranges are those that hold arrays internally, meaning that the only gain from having the sentinel range over special casing the code on strings or arrays is the fact that you won't have to special case strings or arrays at all with static ifs or template constraints, but I'm quite convinced that in most cases, the number of static ifs involved would be quite few, making it quite trivial to special case on strings, meaning that there's pretty much no gain in having the sentinel ranges. It just incurs more overhead that the compiler must optimize away, because you have to wrap strings for them to work as sentinel ranges, or no template constraint will be able to detect that they're sentinel ranges. Of course, you could special case them so that they're treated as sentinel ranges in spite of that, but if you're special casing, why use the sentinel ranges in the first place instead of just special casing on strings?

> InputRanges are an abject failure if "anyone caring about speed" is not going to use them. And yes, I care very much about the D lexing speed.

Pure input ranges fail utterly as you can't save them, so you get _zero_ lookahead unless you start copying their contents somewhere else - which is quite expensive in comparison to slicing an array and which complicates the code. Forward ranges do better, but the lack of random access can be a killer in some cases. For instance, anything involving unicode (which a lexer mostly avoids, because it rarely has to decode unicode, but it still happens sometimes) requires random access for speed at least some of the time (especially when skipping code units), and it makes comparisons faster, because you can compare slices at once rather than compare a character and then pop it, compare, pop, compare, pop... For the fastest results, you need random-access ranges.

There's tons of stuff that's fine with forward ranges (though I seriously question the viability of pure input ranges given how insanely limiting is to be unable to save their state), but there's also plenty of stuff that needs random access, and if you _really_ care about speed, there's a decent chance that you need random access (if nothing else, to be able to pop off multiple elements at a time). That doesn't necessarily mean that you're using an array, but odds are whatever you're using wraps an array if it isn't one. I don't think that there's much of anything which is random access without having an array or pointer underneath the hood. Reading from a file, I'd be very inclined to use something like MmFile or reading the whole file into an array rather than trying to wrap in something that's a pure input range. They're just too limiting.

- Jonathan M Davis
February 28, 2013
On Thursday, 28 February 2013 at 10:57:29 UTC, Walter Bright wrote:
>> My worry here is that we are setting a precedent for adding new range concepts.
>> If the only justification needed is that is saves a single operation in some
>> niche area of computing then we are opening the door to a LOT of different range
>> concepts.
>
> There are many cases where speed is a very big deal.

Yes, are we going to add new range categories every time there's a performance benefit to doing so?

has16ByteAlignedElements
hasUniqueElements
hasContiguousMemory
isMonotonic
isUnimodal
...
?
February 28, 2013
On 2013-02-28 11:37, Walter Bright wrote:

> No, it doesn't. I suggest examining line 481 of lexer.c. Then examine
> line 558 and 559.
>
> https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c

You know that you can link to a specific line by clicking on it in the margin.

-- 
/Jacob Carlborg