February 28, 2013
28-Feb-2013 05:56, Walter Bright пишет:
> On 2/27/2013 5:53 PM, Zach the Mystic wrote:
>> What if more than one
>> value can end the range, EOF, '\0'?
>
> I have never seen a need for that.

Inside of the D lexer:

EOF is 0 or 0x1a :)

-- 
Dmitry Olshansky
February 28, 2013
On 2/27/2013 10:02 PM, deadalnix wrote:
> On Thursday, 28 February 2013 at 05:28:47 UTC, Walter Bright wrote:
>> On 2/27/2013 9:20 PM, deadalnix wrote:
>>> If the range define empty with something like front == sentinel, the inliner
>>> should kick in a reduce the whole stuff to only one read, no ?
>>
>>     auto c = front;
>>     if (c == sentinel || c == XX)
>>
>> is two reads. This may not seem important, but when you want high speed, it
>> can halve it.
>
> Don't you have to check for both all the time ? You have to check for the
> sentinel anyway.

I suggest again taking a look at the dmd source lexer.c for how to do it. There is no extra check.
February 28, 2013
On 2/27/2013 11:02 PM, Dmitry Olshansky wrote:
> Inside of the D lexer:
>
> EOF is 0 or 0x1a :)

There is always a terminating 0, even if the file ends in 0x1a.

(The 0x1A comes from some old editors that end a file with a control Z.)
February 28, 2013
On 2/27/2013 9:28 PM, Jonathan M Davis wrote:
> But you have to deal with D strings, not C strings if you're dealing with
> ranges. char* isn't a range. So, unless you're talking about wrapping a char*
> in a range, char* isn't going to work. And simply appending 0 to the end of a
> D string isn't enough, because isSentinelnputRange would fail, because
> std.array.empty doesn't match it. So, you need a wrapper even if it's only to
> pass the template constraint. That being the case, regardless of whether
> you're dealing with char* or string, you need a wrapper.

Again, please see how lexer.c works. I assure you, there is no double copying going on, nor is there a double test for the terminating 0.


> And what ranges other than strings can take advantage of anything like this?

I've mentioned a couple already in this thread, and I'll add another - an interpreter bytecode, you can see an example in the (former) std.regexp. Search for "REend".


> What are we gaining here

High speed processing.


>So, why not just special case strings or arrays in the few situations
> where something like this is needed, especially when it would be so easy to
> do?

Sentinels structure the code differently.

February 28, 2013
On Thursday, 28 February 2013 at 07:22:53 UTC, Walter Bright wrote:
> On 2/27/2013 10:02 PM, deadalnix wrote:
>> On Thursday, 28 February 2013 at 05:28:47 UTC, Walter Bright wrote:
>>> On 2/27/2013 9:20 PM, deadalnix wrote:
>>>> If the range define empty with something like front == sentinel, the inliner
>>>> should kick in a reduce the whole stuff to only one read, no ?
>>>
>>>    auto c = front;
>>>    if (c == sentinel || c == XX)
>>>
>>> is two reads. This may not seem important, but when you want high speed, it
>>> can halve it.
>>
>> Don't you have to check for both all the time ? You have to check for the
>> sentinel anyway.
>
> I suggest again taking a look at the dmd source lexer.c for how to do it. There is no extra check.

I did, in fact I know it for quite a while.

It does check for sentinel in many places (and it has to). How is it better than a empty property defined as front == sentinel ?

For instance,

if(!r.empty) {
   switch(r.front) {
       // cases
   }

   // somecode
} else {
    // error code
}

Will be optimized by adding a case for the sentinel in the switch (if dmd don't, then it has to be fixed, LLVM is definitively able to do stuff like that, and I didn't checked, but I'd bet that GCC can do it as well) in the given way :

switch(r.front) {
    case sentinel :
       // error code
       goto Label;

    // cases
}

// somecode

Label:

I don't see how defining a specific sentinel range here helps.
February 28, 2013
On Wednesday, February 27, 2013 23:33:09 Walter Bright wrote:
> On 2/27/2013 9:28 PM, Jonathan M Davis wrote:
> > But you have to deal with D strings, not C strings if you're dealing with ranges. char* isn't a range. So, unless you're talking about wrapping a char* in a range, char* isn't going to work. And simply appending 0 to the end of a D string isn't enough, because isSentinelnputRange would fail, because std.array.empty doesn't match it. So, you need a wrapper even if it's only to pass the template constraint. That being the case, regardless of whether you're dealing with char* or string, you need a wrapper.
> 
> Again, please see how lexer.c works. I assure you, there is no double copying going on, nor is there a double test for the terminating 0.

I know what the lexer does, and remember that it _doesn't_ operate on ranges, and there are subtle differences between being able to just use char* and trying to handle generic ranges.

And no, the lexer doesn't have a double test. The place you're going to be stuck with a double test is most any range which isn't a string, because such ranges won't have sentinel values, and there will be no way to add them (as you really can't append to ranges), and so they'll end up being wrapped in a SentinelRange which will have to check on each popFront whether the wrapped range is now empty making it so that front needs to be the sentinel value. And most any range which _was_ designed to have a sentinel value would have to be managing its own contents (because otherwise, it would just be back to wrapping a range and having to check empty), which likely means that it'll just be a thin wrapper around a string or array anyway.

Strings will still need to be wrapped, because they won't pass isSentinelRange otherwise, but they won't get any extra checks, because the wrapper can just check for 0 on the end and append it if it's not there.

> >So, why not just special case strings or arrays in the few situations
> >
> > where something like this is needed, especially when it would be so easy
> > to
> > do?
> 
> Sentinels structure the code differently.

Given how a lexer works (and I have been working on a lexer off and on recently), the only real difference is that you'd just use a couple of static ifs like

static if(!isSomeString!R)
{
    if(range.empty)
        break; //or whatever you do at the end
}

static if(isSomeString!R)
{
    case 0:
        break; //or whatever you do at the end
}

So, in the case of a lexer, I don't see sentinel ranges as buying us much. You end up having to wrap most any range that you pass to the lexer or whatever (including strings so that they'll pass isSentinelRange), you lose out on any optimizations of any functions that you call which special-case strings (though there probably wouldn't be many of those in a lexer), and all you avoid is a couple of static ifs.

The idea of sentinels certainly isn't useless, but anything caring about that sort of speed is likely to just use strings or arrays, and those can trivially be special cased to avoid unnecessary empty checks and to add the check for the sentinel, making the whole sentinel range idea an unnecessary complication IMHO.

- Jonathan M Davis
February 28, 2013
On 2013-02-28 06:13, Walter Bright wrote:

> No. Look at how lexer.c is fed data.

C strings have a sentinel from the beginning, '\0'.

-- 
/Jacob Carlborg
February 28, 2013
On 2013-02-28 08:23, Walter Bright wrote:

> There is always a terminating 0, even if the file ends in 0x1a.
>
> (The 0x1A comes from some old editors that end a file with a control Z.)

http://dlang.org/lex.html, "End of File" shows this:

EndOfFile:
    physical end of the file
    \u0000
    \u001A

Doesn't that mean that the input is empty if the file is empty, \u0000 or \u001A is encountered?

-- 
/Jacob Carlborg
February 28, 2013
On Thursday, 28 February 2013 at 01:12:19 UTC, Walter Bright wrote:
> Motivation:
> 1. easy conversion of C strings to ranges

This is already easy.

struct CString
{
    @property char front() { return *p; }
    void popFront() { ++p; }
    @property bool empty() const { return !*p; }
    char* p;
}

(Yep, rubbish implementation, missing save, constructors etc., but you get the idea).

> 2. necessary for a fast implementation of a lexer

This is very domain specific. Is a whole new range concept in the standard library necessary?

My worry here is that we are setting a precedent for adding new range concepts. If the only justification needed is that is saves a single operation in some niche area of computing then we are opening the door to a LOT of different range concepts.

Are there any other REAL uses for this other than in one line of a lexer implementation? I think inclusions into the standard library should require at least several distinct and realistic use cases.
February 28, 2013
On 2/27/2013 11:37 PM, deadalnix wrote:
> It does check for sentinel in many places (and it has to).

No, it doesn't. I suggest examining line 481 of lexer.c. Then examine line 558 and 559.

https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c