February 28, 2013
On 2013-02-28 15:47, Artur Skawina wrote:

> Define "fast". Lexing std.datetime takes at most ~10-20ms (possibly a single-digit
> ms number, but I'd need to write some code to check the actual number). Smaller
> objects take proportionally less. Meaning you'll be I/O bound; even /one/ (disk)
> cache miss will have more impact than these kinds of optimizations.
> Lexing a hundred small files or one 100x as big file is basically the same operation;
> the difference will be in I/O + setup/teardown costs, which will be /outside/ the
> lexer, so aren't affected by how it accesses input.

You'll have to convince Walter.

-- 
/Jacob Carlborg
February 28, 2013
On 2013-02-28 15:53, Steven Schveighoffer wrote:

> You are missing the point.  Empty is DEFINED as r.front == 0.  Adding a
> check for 0, would essentially lead to dead code (and technically, the
> compiler could trim it out).

Ah, right.

-- 
/Jacob Carlborg
February 28, 2013
On 2/28/13 4:19 AM, Peter Alexander wrote:
> Are there any other REAL uses for this other than in one line of a lexer
> implementation? I think inclusions into the standard library should
> require at least several distinct and realistic use cases.

Singly-linked lists.

Andrei
February 28, 2013
On 2/28/13 5:54 AM, Walter Bright wrote:
> On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
>>> Again, please see how lexer.c works. I assure you, there is no double
>>> copying going on, nor is there a double test for the terminating 0.
>>
>> I know what the lexer does, and remember that it _doesn't_ operate on
>> ranges,
>> and there are subtle differences between being able to just use char* and
>> trying to handle generic ranges.
>
> Hence the need to invent SentinelInputRange.

I don't think the sentinel input range is a blocker for redoing the parser (with ranges) in D. This discussion has probably run its course. The right thing to do at this point is port the lexer and figure out what primitives are necessary from its input.

Andrei


February 28, 2013
On Thursday, 28 February 2013 at 14:31:08 UTC, Steven Schveighoffer wrote:
> On Thu, 28 Feb 2013 06:23:45 -0500, deadalnix
>
> I have to say I agree with deadalnix.
>
> you have essentially in lexer.c:
>
> outer:
> while(1)
> {
>    switch(r.front)
>    {
>       case 0:
>         break outer;
>       ...
>    }
> }
>
> whereas a range where empty is defined as r.front == 0:
>
> while(!r.empty)  // inlined to r.front != 0
> {
>    switch(r.front) // why would another load occur here?
>    {
>       // no need to check for 0, already done
>       ...
>    }
> }
>
> If this doesn't translate to the same code, I don't know why not[SNIP].
>
> -Steve

The difference is that by doing "!r.empty" first, you are actually executing "r.front == 0" each and every time, before doing the rest of the checks proper.

Doing it the other way around, however, the only time you actually check for the sentinel value is once you've reached the actual sentinel, or you've run out of values "of interest": i.e. the check does not run on every element.

You are basically re-ordering the tests to speed up the usual case.

//----

This is kind of the same logic as in C++'s "operator=": it is recommended to always check for self-assignment. However, most quality implementations will *not* execute this check, instead being extra careful about instruction ordering.

While the rare self-assignment case becomes more expensive, the common assignment becomes cheaper. That's where the win is.
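To illustrate, here is a sketch with a made-up `Buffer` class (the ordering trick is the point, not the class itself): the copy is made *before* the old storage is freed, so self-assignment just works, with no branch in the common path.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical Buffer class: operator= made self-assignment-safe purely
// by ordering (allocate + copy first, free afterwards), with no
// `this == &other` branch.
class Buffer {
    std::size_t len_;
    char* data_;
public:
    explicit Buffer(const char* s)
        : len_(std::strlen(s)), data_(new char[len_ + 1]) {
        std::memcpy(data_, s, len_ + 1);
    }
    Buffer(const Buffer& o)
        : len_(o.len_), data_(new char[o.len_ + 1]) {
        std::memcpy(data_, o.data_, o.len_ + 1);
    }
    ~Buffer() { delete[] data_; }

    Buffer& operator=(const Buffer& other) {
        // Copy first: this is still valid even when &other == this...
        char* fresh = new char[other.len_ + 1];
        std::memcpy(fresh, other.data_, other.len_ + 1);
        // ...and only now release the old storage.
        delete[] data_;
        data_ = fresh;
        len_ = other.len_;
        return *this;
    }
    const char* c_str() const { return data_; }
};
```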
February 28, 2013
On Thursday, 28 February 2013 at 15:31:21 UTC, Andrei Alexandrescu wrote:
> On 2/28/13 5:54 AM, Walter Bright wrote:
>> On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
>>>> Again, please see how lexer.c works. I assure you, there is no double
>>>> copying going on, nor is there a double test for the terminating 0.
>>>
>>> I know what the lexer does, and remember that it _doesn't_ operate on
>>> ranges,
>>> and there are subtle differences between being able to just use char* and
>>> trying to handle generic ranges.
>>
>> Hence the need to invent SentinelInputRange.
>
> I don't think the sentinel input range is a blocker for redoing the parser (with ranges) in D. This discussion has probably run its course. The right thing to do at this point is port the lexer and figure out what primitives are necessary from its input.
>
> Andrei

An actual sentinel range is trivial to implement, and the algorithms that can *actually* exploit it are rare.

While I'm not against having such ranges in phobos, I'd just be wary of providing too many traits for them, or of trying to have phobos exploit them. Ranges have enough primitives as it is.

Just provide it with a disclaimer that it will only ever be useful for a select class of algorithms (parsers), and that the optimization opportunity will be ignored by the rest of phobos.
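For concreteness, a minimal sketch of such a range (name invented; C++ used for easy comparison with lexer.c):

```cpp
// Minimal sketch of a sentinel input range over a 0-terminated buffer.
// empty() is *defined* as front() == 0, so a consumer that already
// switches on front() never needs a separate emptiness test.
struct SentinelCharRange {
    const char* p;                     // must point into a 0-terminated buffer

    bool empty() const { return *p == '\0'; }
    char front() const { return *p; }  // legal even at the end: it just
                                       // yields the sentinel, 0
    void popFront() { ++p; }
};
```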
February 28, 2013
On 2/28/13 6:29 AM, Jonathan M Davis wrote:
> There's tons of stuff that's fine with forward ranges (though I seriously
> question the viability of pure input ranges given how insanely limiting it is to
> be unable to save their state), but there's also plenty of stuff that needs
> random access, and if you _really_ care about speed, there's a decent chance
> that you need random access (if nothing else, to be able to pop off multiple
> elements at a time). That doesn't necessarily mean that you're using an array,
> but odds are whatever you're using wraps an array if it isn't one. I don't
> think that there's much of anything which is random access without having an
> array or pointer underneath the hood. Reading from a file, I'd be very inclined
> to use something like MmFile or reading the whole file into an array rather
> than trying to wrap in something that's a pure input range. They're just too
> limiting.

I think this (and the long, long, long text above) is missing a point. The necessity here is defining new range primitives (such as lookahead), not forcing the existing range notions onto the needed functionality.

Again:

1. Port the blessed lexer as is to D already

2. Figure out what abstraction is needed for lexer's input

3. Reify that abstraction as an existing range type, a variation on an existing range type, or a new range type

Talking abstraction without concrete is a poor approach. Concrete first, abstraction extracted from it.


Andrei
February 28, 2013
28-Feb-2013 14:44, Walter Bright wrote:
> On 2/28/2013 12:25 AM, Jacob Carlborg wrote:
>> On 2013-02-28 08:23, Walter Bright wrote:
>>
>>> There is always a terminating 0, even if the file ends in 0x1a.
>>>
>>> (The 0x1A comes from some old editors that end a file with a control Z.)
>>
>> http://dlang.org/lex.html, "End of File" shows this:
>>
>> EndOfFile:
>>      physical end of the file
>>      \u0000
>>      \u001A
>>
>> Doesn't that mean that the input is empty if the file is empty, \u0000
>> or \u001A
>> is encountered?
>>
>


> Please follow the source code from file to input to the lexer.

Source as spec is no good. Either change the spec or admit that having a tuple of sentinels as a manifest constant is fine. In any case, constant tuples are easily unrolled into case statements.
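E.g. (a sketch; the helper name is made up), both EndOfFile sentinels from the spec end up as plain case labels, and a constant tuple of sentinels unrolls the same way:

```cpp
// Sketch: the spec's two EndOfFile sentinels (\u0000 and \u001A)
// handled as adjacent case labels in one switch.
bool atEndOfFile(unsigned char c) {
    switch (c) {
    case 0x00:
    case 0x1A:      // old ^Z end-of-file marker
        return true;
    default:
        return false;
    }
}
```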

> https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c
>
> line 1012 and 1038


-- 
Dmitry Olshansky
February 28, 2013
On Thu, 28 Feb 2013 10:31:42 -0500, monarch_dodra <monarchdodra@gmail.com> wrote:

> On Thursday, 28 February 2013 at 14:31:08 UTC, Steven Schveighoffer wrote:
>> On Thu, 28 Feb 2013 06:23:45 -0500, deadalnix
>>
>> I have to say I agree with deadalnix.
>>
>> you have essentially in lexer.c:
>>
>> outer:
>> while(1)
>> {
>>    switch(r.front)
>>    {
>>       case 0:
>>         break outer;
>>       ...
>>    }
>> }
>>
>> whereas a range where empty is defined as r.front == 0:
>>
>> while(!r.empty)  // inlined to r.front != 0
>> {
>>    switch(r.front) // why would another load occur here?
>>    {
>>       // no need to check for 0, already done
>>       ...
>>    }
>> }
>>
>> If this doesn't translate to the same code, I don't know why not[SNIP].
>>
>
> The difference is that by doing "!r.empty" first, you are actually executing "r.front == 0" each and every time, before doing the rest of the checks proper.
>
> Doing it the other way around, however, the only time you actually check for the sentinel value is once you've reached the actual sentinel, or you've run out of values "of interest": i.e. the check does not run on every element.
>
> You are basically re-ordering the tests to speed up the usual case.

If you look at lexer.c, case 0 is the first test.  How is it that the compiler knows that's the least likely case to check?  Even if it does, sure, the compiler could reorder it.  But then it could fold the while loop test into the switch statement, as others have suggested, and we get the same result.

Besides, I don't think a jnz instruction is that expensive.

My impression from Walter's other posts is that the point is to avoid the double load/double check, not to change the ordering.
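In code, the merged form would look something like this (a sketch only, assuming a 0-terminated buffer; `skipSpaces` is a made-up helper):

```cpp
// Sketch: loop condition folded into the switch -- one load and one
// dispatch per character; the 0 sentinel is just another case label.
const char* skipSpaces(const char* p) {
    for (;;) {
        switch (*p) {        // single load per iteration
        case 0:              // sentinel doubles as the loop exit
            return p;
        case ' ':
        case '\t':
            ++p;
            continue;
        default:
            return p;        // first character "of interest"
        }
    }
}
```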

-Steve
February 28, 2013
On 02/28/13 16:00, Jacob Carlborg wrote:
> On 2013-02-28 15:47, Artur Skawina wrote:
> 
>> Define "fast". Lexing std.datetime takes at most ~10-20ms (possibly a single-digit
>> ms number, but i'd need to write some code to check the actual number). Smaller
>> objects take proportionally less. Meaning you'll be I/O bound, even /one/ (disk)
>> cache miss will have more impact then these kind of optimizations.
>> Lexing a hundred small files or one 100x as big file is basically the same operation;
>> the difference will be in I/O + setup/teardown costs, which will be /outside/ the
>> lexer, so aren't affected by how it accesses input.
> 
> You'll have to convince Walter.

Actually, no -- like I said: /I'd like to have such std ranges/ - there are real gains to be had. I'm just saying that for the "fast lexer" case the /absolute/ improvement isn't necessarily very large. Modern compilers can do wonders.

The advantage of defining /std/ range types is, well, that they are "std": nobody has to reinvent them, often with inadequate docs and subtle differences in behavior. In this case the interesting properties are:

a) terminating sentinel
b) limited lookahead ("a" is a special case of "b")
c) laziness

It's better to have a common interface than having everybody invent a private one every time some of the above features are needed.
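For (b), a hypothetical sketch of what a limited-lookahead interface could look like (all names invented; C++ again for comparison with lexer.c):

```cpp
#include <cstddef>

// Hypothetical fixed-lookahead wrapper over a 0-terminated buffer:
// peek(n) exposes the n-th character ahead without consuming it, and
// yields 0 (the sentinel) for any position at or past the end.
template <std::size_t N>
struct Lookahead {
    const char* p;                     // 0-terminated input

    char peek(std::size_t n) const {   // caller promises n < N
        const char* q = p;
        // walk at most n characters, stopping at the sentinel
        for (std::size_t i = 0; i < n && *q; ++i) ++q;
        return *q;
    }
    bool empty() const { return *p == 0; }
    void popFront() { if (*p) ++p; }
};
```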


artur