February 08, 2013
On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
> I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte.

Another big issue is the fact that in some ways, using a pointer like dmd's lexer does is actually superior to using a range. In particular, it's trivial to determine where in the text a token is, because you can simply subtract the pointer in the token from the initial pointer. Strings would be okay too, because you can subtract their ptr properties. But the closest that you'll get with ranges is to subtract their lengths, and the only ranges that are likely to define length are random-access ranges. And to do that, you'd either have to keep calculating the index for each token as it's generated or save the range with every token (rather than just having a pointer) so that you could determine the index later if you needed to. And depending on the range, all of that saving could be expensive.
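
To make that concrete, the pointer approach boils down to roughly this (a sketch only; the Token type and its ptr field here are hypothetical, not dmd's actual lexer):

// Sketch of pointer-based token positions. Token is illustrative.
struct Token
{
    immutable(char)* ptr; // start of the token's text in the source
}

size_t tokenOffset(string source, Token tok)
{
    // One subtraction recovers the token's position in the source:
    return tok.ptr - source.ptr;
}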

And for any other type of range, you'd literally have to count the code units as you iterated in order to figure out what the index is (and you'd have to keep saving the range as you went along if you wanted to slice it at all, since it wouldn't actually be sliceable, and so getting to a particular index in the range would be very expensive even if you kept track of it). And for syntax highlighting and some error reporting and a variety of other uses, you need to be able to determine where in the text a token was (not just its column and line number). And that's simply a lot easier with a pointer.

So, dealing with generic ranges is a bit problematic - far more so than any issues with character types. If the lexer is well-written, the extra overhead can be obviated by having the lexer function do things a bit differently depending on the type of the range, but regardless, restricting it to strings or pointers would be cleaner in that regard. It's not quite a use case where ranges shine - especially when efficiency is a top priority.
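
In rough terms, that specialization could look something like this (purely illustrative; none of these names come from an actual implementation):

import std.range : isRandomAccessRange;

// Sketch: a lexer can specialize its position tracking on the range type.
struct Lexer(R)
{
    R input;
    static if (isRandomAccessRange!R)
        size_t pos;      // cheap: an offset into the input is enough
    else
        size_t consumed; // must be bumped manually on every popFront
}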

- Jonathan M Davis
February 08, 2013
On 08-Feb-2013 12:01, Jonathan M Davis wrote:
> On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
>> I think it would be reasonable for a lexer to require a range of ubyte
>> as input, and carry its own decoding. In the first approximation it may
>> even require a random-access range of ubyte.
>
> Another big issue is the fact that in some ways, using a pointer like dmd's
> lexer does is actually superior to using a range. In particular, it's trivial
> to determine where in the text a token is, because you can simply subtract the
> pointer in the token from the initial pointer. Strings would be okay too,
> because you can subtract their ptr properties. But the closest that you'll get
> with ranges is to subtract their lengths, and the only ranges that are likely
> to define length are random-access ranges.

Not true: certain ranges know their length but can't be random access, as indexing is O(lg N) or worse. That includes a stripe of chunks as taken from a file.
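
A rough sketch of what I mean (names are illustrative, and the chunks are assumed non-empty):

// A range over chunks read from a file: it knows its total length
// up front, but has no O(1) indexing across chunk boundaries.
struct ChunkedBytes
{
    ubyte[][] chunks; // e.g., blocks read from a file
    size_t length;    // total bytes, computed once at construction

    @property bool empty() { return chunks.length == 0; }
    @property ubyte front() { return chunks[0][0]; }
    void popFront()
    {
        chunks[0] = chunks[0][1 .. $];
        if (chunks[0].length == 0)
            chunks = chunks[1 .. $];
        --length;
    }
    // deliberately no opIndex: finding the right chunk is not O(1)
}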


-- 
Dmitry Olshansky
February 08, 2013
On Friday, February 08, 2013 12:12:30 Dmitry Olshansky wrote:
> On 08-Feb-2013 12:01, Jonathan M Davis wrote:
> > On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
> >> I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte.
> > 
> > Another big issue is the fact that in some ways, using a pointer like
> > dmd's
> > lexer does is actually superior to using a range. In particular, it's
> > trivial to determine where in the text a token is, because you can simply
> > subtract the pointer in the token from the initial pointer. Strings would
> > be okay too, because you can subtract their ptr properties. But the
> > closest that you'll get with ranges is to subtract their lengths, and the
> > only ranges that are likely to define length are random-access ranges.
> 
> Not true: certain ranges know their length but can't be random access, as indexing is O(lg N) or worse. That includes a stripe of chunks as taken from a file.

I said that the only ones which are "likely" to define length are random-access ranges. There _are_ other ranges which can, but in most cases, if you can know the length, you can do random access as well. Regardless, the main issue still stands in that dealing with keeping track of the index of the code unit of a token is more complicated and generally more expensive with ranges than it is with a pointer. Some range types will do better than others, but short of using a string's ptr property, there's always going to be some additional overhead in comparison to pointers to keep track of the indices or to keep a range or slice of one as part of a token. The pointer's just more lightweight. That doesn't make ranges unacceptable by any means. It just means that they're going to take at least a slight performance hit in comparison to pointers.

- Jonathan M Davis
February 08, 2013
On 08-Feb-2013 13:40, Jonathan M Davis wrote:
> On Friday, February 08, 2013 12:12:30 Dmitry Olshansky wrote:
>> On 08-Feb-2013 12:01, Jonathan M Davis wrote:
>>> On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
>>>> I think it would be reasonable for a lexer to require a range of ubyte
>>>> as input, and carry its own decoding. In the first approximation it may
>>>> even require a random-access range of ubyte.
>>>
>>> Another big issue is the fact that in some ways, using a pointer like
>>> dmd's
>>> lexer does is actually superior to using a range. In particular, it's
>>> trivial to determine where in the text a token is, because you can simply
>>> subtract the pointer in the token from the initial pointer. Strings would
>>> be okay too, because you can subtract their ptr properties. But the
>>> closest that you'll get with ranges is to subtract their lengths, and the
>>> only ranges that are likely to define length are random-access ranges.
>>
>> Not true: certain ranges know their length but can't be random access,
>> as indexing is O(lg N) or worse. That includes a stripe of chunks as
>> taken from a file.
>
> I said that the only ones which are "likely" to define length are
> random-access ranges. There _are_ other ranges which can, but in most cases,
> if you can know the length, you can do random access as well.

Well, I honestly disagree that knowing the length promises being able to index. "Most ranges" here means arrays and wrappers on top of them.
Given the current realities of D and Phobos, I'm afraid you are right, though.

> Regardless, the main issue still
> stands in that dealing with keeping track of the index of the code unit of a
> token is more complicated and generally more expensive with ranges than it is
> with a pointer.

If the target is a random-access range, just use an offset throughout. It basically becomes base + offset vs. base + pointer, i.e. a non-issue.

If not, then the pointer argument no longer applies, and you can just as well bump a separate counter on each popFront. It's not that costly in this case, and flexibility trumps other concerns with forward ranges in any case.
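
Something like this, roughly (a sketch only, not code from the proposed lexer):

// Wrap a forward range and count code units as they are consumed.
struct Counted(R)
{
    R range;
    size_t offset; // code units consumed so far

    @property bool empty() { return range.empty; }
    @property auto front() { return range.front; }
    void popFront() { range.popFront(); ++offset; }
    @property auto save() { return Counted!R(range.save, offset); }
}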

> Some range types will do better than others, but short of
> using a string's ptr property, there's always going to be some additional
> overhead in comparison to pointers to keep track of the indices or to keep a
> range or slice of one as part of a token. The pointer's just more lightweight.
> That doesn't make ranges unacceptable by any means. It just means that they're
> going to take at least a slight performance hit in comparison to pointers.

See above. A pointer to something inside a buffer == an index into that buffer; typically, even with a pointer, you can't drop the 'buffer' reference itself.
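
In code, the equivalence is just this (a trivial sketch):

void main()
{
    string buf = "int x;";
    immutable(char)* p = buf.ptr + 4; // pointer into the buffer
    size_t idx = p - buf.ptr;         // the equivalent offset
    // Either way, recovering the slice still needs buf itself:
    assert(buf[idx .. idx + 1] == "x");
}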

-- 
Dmitry Olshansky
February 08, 2013
On Tuesday, 5 February 2013 at 19:00:42 UTC, Dmitry Olshansky wrote:
> Thanks.
> I've made a pass through the code and found some places to improve.
> Sadly I've been rather busy at work today. Anyway, get ready for pull requests should my ideas prove worthwhile performance-wise.

http://hackerpilot.github.com/experimental/std_lexer/images/times3.png

Your suggestions seem to have worked. We're below 20ms on my machine for the datetime module.
February 08, 2013
On 2013-02-08 16:08, Brian Schott wrote:

> http://hackerpilot.github.com/experimental/std_lexer/images/times3.png
>
> Your suggestions seem to have worked. We're below 20ms on my machine for
> the datetime module.

DMD is still consistently faster, :(

-- 
/Jacob Carlborg
February 08, 2013
On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:
> DMD is still consistently faster, :(

We might be getting to the part where we say that code generated by gcc is still consistently faster than code generated by ldc.
February 08, 2013
On 08-Feb-2013 19:12, Jacob Carlborg wrote:
> On 2013-02-08 16:08, Brian Schott wrote:
>
>> http://hackerpilot.github.com/experimental/std_lexer/images/times3.png
>>
>> Your suggestions seem to have worked. We're below 20ms on my machine for
>> the datetime module.
>
> DMD is still consistently faster, :(
>

Keep in mind the D runtime start-up cost. In the end, on small files DMD always wins because of the slim C runtime.

To estimate the runtime start-up lag, I've compared run times of

int main(){ return 0; } in C (gcc):

------------------------
Total time (ms): 31637.7
Repetitions    : 35000
Sample mode    : 0.8 (23649 ocurrences)
Median time    : 0.873
Avg time       : 0.903935
Std dev.       : 0.204107
Minimum        : 0.552
Maximum        : 9.057
95% conf.int.  : [0.503892, 1.30398]  e = 0.400043
99% conf.int.  : [0.378189, 1.42968]  e = 0.525746
EstimatedAvg95%: [0.901796, 0.906073]  e = 0.00213832
EstimatedAvg99%: [0.901125, 0.906745]  e = 0.00281023

void main(){} in D (ldc):
------------------------
Total time (ms): 15429.4
Repetitions    : 7000
Sample mode    : 2.1 (1785 ocurrences)
Median time    : 2.128
Avg time       : 2.2042
Std dev.       : 0.466834
Minimum        : 1.286
Maximum        : 19.933
95% conf.int.  : [1.28922, 3.11918]  e = 0.914978
99% conf.int.  : [1.00171, 3.40668]  e = 1.20248
EstimatedAvg95%: [2.19326, 2.21514]  e = 0.0109361
EstimatedAvg99%: [2.18983, 2.21857]  e = 0.0143724

I think that the mode could serve as an indicator of the most probable outcome.
For me it's around 0.8 vs 2.1 ms. The D runtime rides on top of the C one, plus the same OS cost to launch a process, so from now on let's keep in mind these extra ~1.3 ms of bias in favor of the C-based lexer.


-- 
Dmitry Olshansky
February 08, 2013
On 2013-02-08 16:28, Dmitry Olshansky wrote:
> On 08-Feb-2013 19:12, Jacob Carlborg wrote:
>> On 2013-02-08 16:08, Brian Schott wrote:
>>
>>> http://hackerpilot.github.com/experimental/std_lexer/images/times3.png
>>>
>>> Your suggestions seem to have worked. We're below 20ms on my machine for
>>> the datetime module.
>>
>> DMD is still consistently faster, :(
>>
>
> Keep in mind the D runtime start-up cost. In the end, on small files DMD
> always wins because of the slim C runtime.
>
> [benchmark numbers snipped]
>
> I think that the mode could serve as an indicator of the most probable
> outcome. For me it's around 0.8 vs 2.1 ms. The D runtime rides on top of
> the C one, plus the same OS cost to launch a process, so from now on let's
> keep in mind these extra ~1.3 ms of bias in favor of the C-based lexer.

Ok, I see. But wouldn't it be fairer to compare either GCC to GDC or Clang to LDC?

-- 
/Jacob Carlborg
February 08, 2013
On Friday, 8 February 2013 at 15:23:00 UTC, Brian Schott wrote:
> On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:
>> DMD is still consistently faster, :(
>
> We might be getting to the part where we say that code generated by gcc is still consistently faster than code generated by ldc.

For the lulz I compiled with "dmd -release -inline -noboundscheck -O -m64 -property":

http://hackerpilot.github.com/experimental/std_lexer/images/times3-dmd.png

Yes, dmd vs ldc is actually a matter of 32 vs 18 ms for datetime.