View mode: basic / threaded / horizontal-split · Log in · Help
February 28, 2013
Re: Proposal for SentinelInputRange
On 2/28/2013 12:20 AM, Jacob Carlborg wrote:
> On 2013-02-28 06:13, Walter Bright wrote:
>
>> No. Look at how lexer.c is fed data.
>
> C strings have a sentinel from the beginning, '\0'.
>

Look at how the source file is transformed into a 0 terminated input. Files are 
not C strings.

There are no extra copies and no extra tests.
February 28, 2013
Re: Proposal for SentinelInputRange
On Thursday, 28 February 2013 at 10:37:35 UTC, Walter Bright 
wrote:
> On 2/27/2013 11:37 PM, deadalnix wrote:
>> It does check for sentinel in many places (and it has to).
>
> No, it doesn't. I suggest examining line 481 of lexer.c. Then 
> examine line 558 and 559.
>
> https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c

escapeSequence does check for it.
February 28, 2013
Re: Proposal for SentinelInputRange
On 2/28/2013 12:25 AM, Jacob Carlborg wrote:
> On 2013-02-28 08:23, Walter Bright wrote:
>
>> There is always a terminating 0, even if the file ends in 0x1a.
>>
>> (The 0x1A comes from some old editors that end a file with a control Z.)
>
> http://dlang.org/lex.html, "End of File" shows this:
>
> EndOfFile:
>      physical end of the file
>      \u0000
>      \u001A
>
> Doesn't that mean that the input is empty if the file is empty, \u0000 or \u001A
> is encountered?
>

Please follow the source code from file to input to the lexer.

https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c

line 1012 and 1038
February 28, 2013
Re: Proposal for SentinelInputRange
On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
>> Again, please see how lexer.c works. I assure you, there is no double
>> copying going on, nor is there a double test for the terminating 0.
>
> I know what the lexer does, and remember that it _doesn't_ operate on ranges,
> and there are subtle differences between being able to just use char* and
> trying to handle generic ranges.

Hence the need to invent SentinelInputRange.


> Given how a lexer works (and I have been working on a lexer off and on
> recently), the only real difference is that you'd just use a couple of static
> ifs like
>
> static if(!isSomeString!R)
> {
>      if(range.empty)
>          break; //or whatever you do at the end
> }
>
> static if(isSomeString!R)
> {
>      case 0:
>          break; //or whatever you do at the end
> }

There are so many places where this would occur, it cries out for a new type.


> So, in the case of a lexer, I don't see sentinel ranges as buying us much. You
> end up having to wrap most any range that you pass to the lexer or whatever
> (including strings so that they'll pass isSentinelRange), you lose out on any
> optimizations of any functions that you call which special-case strings
> (though there probably wouldn't be many of those in a lexer), and all you
> avoid is a couple of static ifs.

And NO, THE SOURCE FILE INPUT IS NEITHER WRAPPED NOR DOUBLE COPIED. Here's how 
it's done:

https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c

line 1012 and 1038

> The idea of sentinels certainly isn't useless, but anything caring about that
> sort of speed is likely to just use strings or arrays, and those can trivially
> be special cased to avoid unnecessary empty checks and to add the check for
> the sentinel, making the whole sentinel range idea an unnecessary complication
> IMHO.

You can't do efficient lookahead without sentinels, either. Lexers are sensitive 
to every instruction executed per character read. No sentinels mean double the 
number of instructions per source character.

InputRanges are an abject failure if "anyone caring about speed" is not going to 
use them. And yes, I care very much about the D lexing speed.
February 28, 2013
Re: Proposal for SentinelInputRange
On 2/28/2013 1:19 AM, Peter Alexander wrote:
> On Thursday, 28 February 2013 at 01:12:19 UTC, Walter Bright wrote:
>> Motivation:
>> 1. easy conversion of C strings to ranges
>
> This is already easy.
>
> struct CString
> {
>      @property char front() { return *p; }
>      void popFront() { ++p; }
>      @property bool empty() const { return !*p; }
>      char* p;
> }
>
> (Yep, rubbish implementation, missing save, constructors etc., but you get the
> idea).

Notice the double read of *p. This is what SentinelInputRange is trying to fix.


>> 2. necessary for a fast implementation of a lexer
>
> This is very domain specific. Is a whole new range concept in the standard
> library necessary?

I've posted in this thread several use cases.


> My worry here is that we are setting a precedent for adding new range concepts.
> If the only justification needed is that is saves a single operation in some
> niche area of computing then we are opening the door to a LOT of different range
> concepts.

There are many cases where speed is a very big deal.


> Are there any other REAL uses for this other than in one line of a lexer
> implementation? I think inclusions into the standard library should require at
> least several distinct and realistic use cases.

I've posted several - including one in Phobos.
February 28, 2013
Re: Proposal for SentinelInputRange
On 2/28/2013 2:42 AM, deadalnix wrote:
> On Thursday, 28 February 2013 at 10:37:35 UTC, Walter Bright wrote:
>> On 2/27/2013 11:37 PM, deadalnix wrote:
>>> It does check for sentinel in many places (and it has to).
>>
>> No, it doesn't. I suggest examining line 481 of lexer.c. Then examine line 558
>> and 559.
>>
>> https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c
>
> escapeSequence does check for it.

Note the lookahead is done without checking first. This is the salient point. A 
sentinel means you are guaranteed to always be able to look one character ahead 
without checking.

This is used all over the lexer.

The other InputRange forms do not allow peeking ahead without testing *first*.
February 28, 2013
Re: Proposal for SentinelInputRange
On Thursday, 28 February 2013 at 10:57:29 UTC, Walter Bright 
wrote:
> Notice the double read of *p. This is what SentinelInputRange 
> is trying to fix.
>

But it will be optimized away.
February 28, 2013
Re: Proposal for SentinelInputRange
On Thursday, February 28, 2013 02:54:24 Walter Bright wrote:
> On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
> >> Again, please see how lexer.c works. I assure you, there is no double
> >> copying going on, nor is there a double test for the terminating 0.
> > 
> > I know what the lexer does, and remember that it _doesn't_ operate on
> > ranges, and there are subtle differences between being able to just use
> > char* and trying to handle generic ranges.
> 
> Hence the need to invent SentinelInputRange.
> 
> > Given how a lexer works (and I have been working on a lexer off and on
> > recently), the only real difference is that you'd just use a couple of
> > static ifs like
> > 
> > static if(!isSomeString!R)
> > {
> > 
> >      if(range.empty)
> >      
> >          break; //or whatever you do at the end
> > 
> > }
> > 
> > static if(isSomeString!R)
> > {
> > 
> >      case 0:
> >          break; //or whatever you do at the end
> > 
> > }
> 
> There are so many places where this would occur, it cries out for a new
> type.

And you were just claiming that the lexer checked the sentinel type in only 
one place. If that's indeed the case (and I think that it's quite close to 
being true if it isn't true), then you _wouldn't_ need to use static ifs like 
this in many places. So, which is it? If you need to check the sentinel often 
enough that using static ifs is a problem, then it's probably not buying you 
much of anything over checking empty anyway.

> > The idea of sentinels certainly isn't useless, but anything caring about
> > that sort of speed is likely to just use strings or arrays, and those can
> > trivially be special cased to avoid unnecessary empty checks and to add
> > the check for the sentinel, making the whole sentinel range idea an
> > unnecessary complication IMHO.
> 
> You can't do efficient lookahead without sentinels, either. Lexers are
> sensitive to every instruction executed per character read. No sentinels
> mean double the number of instructions per source character.

But my point is that outside of strings or arrays, you're almost certainly 
stuck with that. Most ranges would have to be wrapped in order to act as a 
sentinel range, and the sentinel range would have to check empty on the 
internal range on every popFront. The only time that a sentinel range would 
buy you anything outside of strings or arrays is when you have a range type 
written specifically to be a sentinel range and which doesn't have to call 
empty on a range internally. But that means that it's going to have to manage 
whatever it's holding internally itself and can't wrap another range - and 
that memory will almost certainly be an array of some sort. And if that's the 
case, what did we buy by wrapping it in a sentinel range? Why not just use the 
array directly?

Pretty much the only type of range which isn't wrapping another range which 
wouldn't have an array internally would be one that was either reading from a 
file (or some other input) as it iterates or one which was generative. In 
either case, you would be unable to put a zero on the end without checking 
whether it was empty with each popFront, meaning that making it a sentinel 
range was pointless. So, pretty much the only ranges which could gain efficiency 
as sentinel ranges are those that hold arrays internally, meaning that the 
only gain from having the sentinel range over special casing the code on 
strings or arrays is the fact that you won't have to special case strings or 
arrays at all with static ifs or template constraints, but I'm quite convinced 
that in most cases, the number of static ifs involved would be quite few, 
making it quite trivial to special case on strings, meaning that there's 
pretty much no gain in having the sentinel ranges. It just incurs more 
overhead that the compiler must optimize away, because you have to wrap 
strings for them to work as sentinel ranges, or no template constraint will be 
able to detect that they're sentinel ranges. Of course, you could special case 
them so that they're treated as sentinel ranges in spite of that, but if 
you're special casing, why use the sentinel ranges in the first place instead 
of just special casing on strings?

> InputRanges are an abject failure if "anyone caring about speed" is not
> going to use them. And yes, I care very much about the D lexing speed.

Pure input ranges fail utterly as you can't save them, so you get _zero_ 
lookahead unless you start copying their contents somewhere else - which is 
quite expensive in comparison to slicing an array and which complicates the 
code. Forward ranges do better, but the lack of random access can be a killer 
in some cases. For instance, anything involving unicode (which a lexer mostly 
avoids, because it rarely has to decode unicode, but it still happens 
sometimes) requires random access for speed at least some of the time 
(especially when skipping code units), and it makes comparisons faster, 
because you can compare slices at once rather than compare a character and 
then pop it, compare, pop, compare, pop... For the fastest results, you need 
random-access ranges.

There's tons of stuff that's fine with forward ranges (though I seriously 
question the viability of pure input ranges given how insanely limiting is to 
be unable to save their state), but there's also plenty of stuff that needs 
random access, and if you _really_ care about speed, there's a decent chance 
that you need random access (if nothing else, to be able to pop off multiple 
elements at a time). That doesn't necessarily mean that you're using an array, 
but odds are whatever you're using wraps an array if it isn't one. I don't 
think that there's much of anything which is random access without having an 
array or pointer underneath the hood. Reading from a file, I'd be very inclined 
to use something like MmFile or reading the whole file into an array rather 
than trying to wrap in something that's a pure input range. They're just too 
limiting.

- Jonathan M Davis
February 28, 2013
Re: Proposal for SentinelInputRange
On Thursday, 28 February 2013 at 10:57:29 UTC, Walter Bright 
wrote:
>> My worry here is that we are setting a precedent for adding 
>> new range concepts.
>> If the only justification needed is that is saves a single 
>> operation in some
>> niche area of computing then we are opening the door to a LOT 
>> of different range
>> concepts.
>
> There are many cases where speed is a very big deal.

Yes, are we going to add new range categories every time there's 
a performance benefit to doing so?

has16ByteAlignedElements
hasUniqueElements
hasContiguousMemory
isMonotonic
isUnimodal
...
?
February 28, 2013
Re: Proposal for SentinelInputRange
On 2013-02-28 11:37, Walter Bright wrote:

> No, it doesn't. I suggest examining line 481 of lexer.c. Then examine
> line 558 and 559.
>
> https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c

You know that you can link to a specific line by clicking on it in the 
margin.

-- 
/Jacob Carlborg
1 2 3 4 5 6 7 8 9
Top | Discussion index | About this forum | D home