February 28, 2013
Re: Proposal for SentinelInputRange
On 2013-02-28 15:47, Artur Skawina wrote:

> Define "fast". Lexing std.datetime takes at most ~10-20ms (possibly a single-digit
> ms number, but i'd need to write some code to check the actual number). Smaller
> objects take proportionally less. Meaning you'll be I/O bound; even /one/ (disk)
> cache miss will have more impact than these kinds of optimizations.
> Lexing a hundred small files or one file 100x as big is basically the same operation;
> the difference will be in I/O + setup/teardown costs, which will be /outside/ the
> lexer, so aren't affected by how it accesses input.

You'll have to convince Walter.

-- 
/Jacob Carlborg
February 28, 2013
Re: Proposal for SentinelInputRange
On 2013-02-28 15:53, Steven Schveighoffer wrote:

> You are missing the point.  Empty is DEFINED as r.front == 0.  Adding a
> check for 0, would essentially lead to dead code (and technically, the
> compiler could trim it out).

Ah, right.

-- 
/Jacob Carlborg
February 28, 2013
Re: Proposal for SentinelInputRange
On 2/28/13 4:19 AM, Peter Alexander wrote:
> Are there any other REAL uses for this other than in one line of a lexer
> implementation? I think inclusions into the standard library should
> require at least several distinct and realistic use cases.

Singly-linked lists.

Andrei
February 28, 2013
Re: Proposal for SentinelInputRange
On 2/28/13 5:54 AM, Walter Bright wrote:
> On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
>>> Again, please see how lexer.c works. I assure you, there is no double
>>> copying going on, nor is there a double test for the terminating 0.
>>
>> I know what the lexer does, and remember that it _doesn't_ operate on
>> ranges,
>> and there are subtle differences between being able to just use char* and
>> trying to handle generic ranges.
>
> Hence the need to invent SentinelInputRange.

I don't think the sentinel input range is a blocker for redoing the 
parser (with ranges) in D. This discussion has probably run its course. 
The right thing to do at this point is port the lexer and figure what 
primitives are necessary from its input.

Andrei
February 28, 2013
Re: Proposal for SentinelInputRange
On Thursday, 28 February 2013 at 14:31:08 UTC, Steven 
Schveighoffer wrote:
> On Thu, 28 Feb 2013 06:23:45 -0500, deadalnix
>
> I have to say I agree with deadalnix.
>
> you have essentially in lexer.c:
>
> outer:
> while(1)
> {
>    switch(r.front)
>    {
>       case 0:
>         break outer;
>       ...
>    }
> }
>
> whereas a range where empty is defined as r.front == 0:
>
> while(!r.empty)  // inlined to r.front != 0
> {
>    switch(r.front) // why would another load occur here?
>    {
>       // no need to check for 0, already done
>       ...
>    }
> }
>
> If this doesn't translate to the same code, I don't know why 
> not[SNIP].
>
> -Steve

The difference is that by doing "!r.empty" first, you are 
actually executing "r.front == 0" each and every time, before 
doing the rest of the checks proper.

Doing it the other way around, however, you only actually check the 
sentinel value once you've reached it, or you've run out of values "of 
interest": i.e., not on every element.

You are basically re-ordering the tests to speed up the usual 
case.

//----

This is kind of the same logic as in C++'s "operator=": it is 
recommended to always check for self-assignment. However, most 
quality implementations will *not* execute this check, instead 
being extra careful about instruction ordering.

While the rare self-assignment case becomes more expensive, the 
common assignment becomes cheaper. That's where the win is.
February 28, 2013
Re: Proposal for SentinelInputRange
On Thursday, 28 February 2013 at 15:31:21 UTC, Andrei 
Alexandrescu wrote:
> On 2/28/13 5:54 AM, Walter Bright wrote:
>> On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
>>>> Again, please see how lexer.c works. I assure you, there is 
>>>> no double
>>>> copying going on, nor is there a double test for the 
>>>> terminating 0.
>>>
>>> I know what the lexer does, and remember that it _doesn't_ 
>>> operate on
>>> ranges,
>>> and there are subtle differences between being able to just 
>>> use char* and
>>> trying to handle generic ranges.
>>
>> Hence the need to invent SentinelInputRange.
>
> I don't think the sentinel input range is a blocker for redoing 
> the parser (with ranges) in D. This discussion has probably run 
> its course. The right thing to do at this point is port the 
> lexer and figure what primitives are necessary from its input.
>
> Andrei

An actual sentinel range is trivial to implement, and the 
algorithms that can *actually* truly exploit it are rare.

While I'm not against having such ranges in Phobos, I'd just be 
wary of providing too many traits for them, or of trying to have 
Phobos exploit them. Ranges have enough primitives as it is.

Just provide it with a disclaimer that it will only ever be 
useful for a select class of algorithms (parsers), and that the 
optimization opportunity will be ignored by the rest of Phobos.
February 28, 2013
Re: Proposal for SentinelInputRange
On 2/28/13 6:29 AM, Jonathan M Davis wrote:
> There's tons of stuff that's fine with forward ranges (though I seriously
> question the viability of pure input ranges given how insanely limiting it is to
> be unable to save their state), but there's also plenty of stuff that needs
> random access, and if you _really_ care about speed, there's a decent chance
> that you need random access (if nothing else, to be able to pop off multiple
> elements at a time). That doesn't necessarily mean that you're using an array,
> but odds are whatever you're using wraps an array if it isn't one. I don't
> think that there's much of anything which is random access without having an
> array or pointer underneath the hood. Reading from a file, I'd be very inclined
> to use something like MmFile or reading the whole file into an array rather
> than trying to wrap in something that's a pure input range. They're just too
> limiting.

I think this (and the long, long, long text above) is missing a point. 
The necessity here is defining new range primitives (such as lookahead), 
not forcing the existing range notions onto the needed functionality.

Again:

1. Port the blessed lexer as is to D already

2. Figure out what abstraction is needed for lexer's input

3. Reify that abstraction as an existing range type, a variation on an 
existing range type, or a new range type

Talking abstraction without concrete is a poor approach. Concrete first, 
abstraction extracted from it.


Andrei
February 28, 2013
Re: Proposal for SentinelInputRange
28-Feb-2013 14:44, Walter Bright wrote:
> On 2/28/2013 12:25 AM, Jacob Carlborg wrote:
>> On 2013-02-28 08:23, Walter Bright wrote:
>>
>>> There is always a terminating 0, even if the file ends in 0x1a.
>>>
>>> (The 0x1A comes from some old editors that end a file with a control Z.)
>>
>> http://dlang.org/lex.html, "End of File" shows this:
>>
>> EndOfFile:
>>      physical end of the file
>>      \u0000
>>      \u001A
>>
>> Doesn't that mean that the input is empty if the file is empty, \u0000
>> or \u001A
>> is encountered?
>>
>

> Please follow the source code from file to input to the lexer.

Source as spec is no good. Either change the spec or admit that having 
a tuple of sentinels as a manifest constant is fine. In any case, constant 
tuples are easily unrolled into case statements.

> https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c
>
> line 1012 and 1038


-- 
Dmitry Olshansky
February 28, 2013
Re: Proposal for SentinelInputRange
On Thu, 28 Feb 2013 10:31:42 -0500, monarch_dodra <monarchdodra@gmail.com>  
wrote:

> On Thursday, 28 February 2013 at 14:31:08 UTC, Steven Schveighoffer  
> wrote:
>> On Thu, 28 Feb 2013 06:23:45 -0500, deadalnix
>>
>> I have to say I agree with deadalnix.
>>
>> you have essentially in lexer.c:
>>
>> outer:
>> while(1)
>> {
>>    switch(r.front)
>>    {
>>       case 0:
>>         break outer;
>>       ...
>>    }
>> }
>>
>> whereas a range where empty is defined as r.front == 0:
>>
>> while(!r.empty)  // inlined to r.front != 0
>> {
>>    switch(r.front) // why would another load occur here?
>>    {
>>       // no need to check for 0, already done
>>       ...
>>    }
>> }
>>
>> If this doesn't translate to the same code, I don't know why not[SNIP].
>>
>
> The difference is that by doing "!r.empty" first, you are actually  
> executing "r.front == 0" each and every time, before doing the rest of  
> the checks proper.
>
> Doing it the other way around, however, you only actually check the  
> sentinel value once you've reached it, or you've run out of values "of  
> interest": i.e., not on every element.
>
> You are basically re-ordering the tests to speed up the usual case.

If you look at lexer.c, case 0 is the first test.  How is it that the  
compiler knows that's the least likely case to check?  Even if it does,  
sure, the compiler could reorder it.  But then it could include the while  
loop test into the switch statement, as others have suggested, and we get  
the same result.

Besides, I don't think a jnz instruction is that expensive.

My impression from Walter's other posts is that the goal is to avoid the  
double load/double check, not to change the ordering.

-Steve
February 28, 2013
Re: Proposal for SentinelInputRange
On 02/28/13 16:00, Jacob Carlborg wrote:
> On 2013-02-28 15:47, Artur Skawina wrote:
> 
>> Define "fast". Lexing std.datetime takes at most ~10-20ms (possibly a single-digit
>> ms number, but i'd need to write some code to check the actual number). Smaller
>> objects take proportionally less. Meaning you'll be I/O bound; even /one/ (disk)
>> cache miss will have more impact than these kinds of optimizations.
>> Lexing a hundred small files or one file 100x as big is basically the same operation;
>> the difference will be in I/O + setup/teardown costs, which will be /outside/ the
>> lexer, so aren't affected by how it accesses input.
> 
> You'll have to convince Walter.

Actually, no -- like I said: /I'd like to have such std ranges/ - there are
real gains to be had. I'm just saying that for the "fast lexer" case the
/absolute/ improvement isn't necessarily very large. Modern compilers can do
wonders.

The advantage of defining /std/ range types is, well, that they are "std";
everybody doesn't have to reinvent them, often with inadequate docs and
subtle differences in behavior. In this case the interesting properties are:

a) terminating sentinel
b) limited lookahead ("a" is a special case of "b")
c) laziness

It's better to have a common interface than to have everybody invent a
private one every time some of these features are needed.


artur