February 08, 2013 Re: Request for comments: std.d.lexer
Posted in reply to Andrei Alexandrescu

On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
> I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte.
Another big issue is the fact that in some ways, using a pointer like dmd's lexer does is actually superior to using a range. In particular, it's trivial to determine where in the text a token is, because you can simply subtract the pointer in the token from the initial pointer. Strings would be okay too, because you can subtract their ptr properties. But the closest that you'll get with ranges is to subtract their lengths, and the only ranges that are likely to define length are random-access ranges. And to do that, you'd either have to keep calculating the index for each token as it's generated or save the range with every token (rather than just having a pointer) so that you could determine the index later if you needed to. And depending on the range, all of that saving could be expensive.
And for any other type of range, you'd literally have to count the code units as you iterated in order to figure out what the index is (and you'd have to keep saving the range as you went along if you wanted to slice it at all, since it wouldn't actually be sliceable, and so getting to a particular index in the range would be very expensive even if you kept track of it). And for syntax highlighting and some error reporting and a variety of other uses, you need to be able to determine where in the text a token was (not just its column and line number). And that's simply a lot easier with a pointer.
So, dealing with generic ranges is a bit problematic - far more so than any issues with character types. If the lexer is well-written, the extra overhead can be mitigated by having the lexer do things a bit differently depending on the type of the range, but regardless, restricting it to strings or pointers would be cleaner in that regard. It's not quite a use case where ranges shine - especially when efficiency is a top priority.
- Jonathan M Davis
February 08, 2013 Re: Request for comments: std.d.lexer
Posted in reply to Jonathan M Davis

08-Feb-2013 12:01, Jonathan M Davis wrote:
> On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
>> I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte.
>
> Another big issue is the fact that in some ways, using a pointer like dmd's lexer does is actually superior to using a range. In particular, it's trivial to determine where in the text a token is, because you can simply subtract the pointer in the token from the initial pointer. Strings would be okay too, because you can subtract their ptr properties. But the closest that you'll get with ranges is to subtract their lengths, and the only ranges that are likely to define length are random-access ranges.

Not true; certain ranges know their length but can't be random access because indexing is O(lg N) or worse. Including a stripe of chunks as taken from a file.

-- 
Dmitry Olshansky
February 08, 2013 Re: Request for comments: std.d.lexer
Posted in reply to Dmitry Olshansky

On Friday, February 08, 2013 12:12:30 Dmitry Olshansky wrote:
> 08-Feb-2013 12:01, Jonathan M Davis wrote:
> > On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
> >> I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte.
> >
> > Another big issue is the fact that in some ways, using a pointer like
> > dmd's
> > lexer does is actually superior to using a range. In particular, it's
> > trivial to determine where in the text a token is, because you can simply
> > subtract the pointer in the token from the initial pointer. Strings would
> > be okay too, because you can subtract their ptr properties. But the
> > closest that you'll get with ranges is to subtract their lengths, and the
> > only ranges that are likely to define length are random-access ranges.
>
> Not true, certain ranges know length but can't be random access as indexing is O(lgN) or worse. Including a stripe of chunks as taken from file.
I said that the only ones which are "likely" to define length are random-access ranges. There _are_ other ranges which can, but in most cases, if you can know the length, you can do random access as well. Regardless, the main issue still stands: dealing with keeping track of the index of the code unit of a token is more complicated and generally more expensive with ranges than it is with a pointer. Some range types will do better than others, but short of using a string's ptr property, there's always going to be some additional overhead in comparison to pointers, whether to keep track of the indices or to keep a range or slice of one as part of a token. The pointer's just more lightweight. That doesn't make ranges unacceptable by any means. It just means that they're going to take at least a slight performance hit in comparison to pointers.
- Jonathan M Davis
February 08, 2013 Re: Request for comments: std.d.lexer
Posted in reply to Jonathan M Davis

08-Feb-2013 13:40, Jonathan M Davis wrote:
> On Friday, February 08, 2013 12:12:30 Dmitry Olshansky wrote:
>> 08-Feb-2013 12:01, Jonathan M Davis wrote:
>>> On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
>>>> I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte.
>>>
>>> Another big issue is the fact that in some ways, using a pointer like dmd's lexer does is actually superior to using a range. In particular, it's trivial to determine where in the text a token is, because you can simply subtract the pointer in the token from the initial pointer. Strings would be okay too, because you can subtract their ptr properties. But the closest that you'll get with ranges is to subtract their lengths, and the only ranges that are likely to define length are random-access ranges.
>>
>> Not true; certain ranges know their length but can't be random access because indexing is O(lg N) or worse. Including a stripe of chunks as taken from a file.
>
> I said that the only ones which are "likely" to define length are random-access ranges. There _are_ other ranges which can, but in most cases, if you can know the length, you can do random access as well.

Well, I honestly disagree that knowing the length promises being able to index. "Most ranges" here means arrays and wrappers on top of them. Given the current realities of D and Phobos, I'm afraid you are right, though.

> Regardless, the main issue still stands in that dealing with keeping track of the index of the code unit of a token is more complicated and generally more expensive with ranges than it is with a pointer.

If the target is a random-access range, just use an offset throughout. It basically becomes base + offset vs base + pointer, i.e. a non-issue. If not, then the pointer argument no longer applies, and you can just as well bump a separate counter on each popFront. It's not that costly in this case, and flexibility trumps other concerns with forward ranges in any case.

> Some range types will do better than others, but short of using a string's ptr property, there's always going to be some additional overhead in comparison to pointers to keep track of the indices or to keep a range or slice of one as part of a token. The pointer's just more lightweight. That doesn't make ranges unacceptable by any means. It just means that they're going to take at least a slight performance hit in comparison to pointers.

See above. A pointer to something inside a buffer == an index into that buffer; typically, even with a pointer, you can't drop the 'buffer' reference itself.

-- 
Dmitry Olshansky
February 08, 2013 Re: Request for comments: std.d.lexer
Posted in reply to Dmitry Olshansky

On Tuesday, 5 February 2013 at 19:00:42 UTC, Dmitry Olshansky wrote:
> Thanks.
> I've made a pass through the code and found some places to improve.
> Sadly I've been rather busy at work today. Anyway, get ready for pull requests should my ideas prove to be worthy performance-wise.

http://hackerpilot.github.com/experimental/std_lexer/images/times3.png

Your suggestions seem to have worked. We're below 20ms on my machine for the datetime module.
February 08, 2013 Re: Request for comments: std.d.lexer
Posted in reply to Brian Schott

On 2013-02-08 16:08, Brian Schott wrote:
> http://hackerpilot.github.com/experimental/std_lexer/images/times3.png
>
> Your suggestions seem to have worked. We're below 20ms on my machine for the datetime module.

DMD is still consistently faster, :(

-- 
/Jacob Carlborg
February 08, 2013 Re: Request for comments: std.d.lexer
Posted in reply to Jacob Carlborg

On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:
> DMD is still consistently faster, :(
We might be getting to the part where we say that code generated by gcc is still consistently faster than code generated by ldc.
February 08, 2013 Re: Request for comments: std.d.lexer
Posted in reply to Jacob Carlborg

08-Feb-2013 19:12, Jacob Carlborg wrote:
> On 2013-02-08 16:08, Brian Schott wrote:
>
>> http://hackerpilot.github.com/experimental/std_lexer/images/times3.png
>>
>> Your suggestions seem to have worked. We're below 20ms on my machine for the datetime module.
>
> DMD is still consistently faster, :(

Keep in mind the D runtime start-up cost. In the end, on small files DMD always wins because of the slim C runtime.

To estimate the runtime startup lag, I've compared the run times of

int main(){ return 0; } in C (gcc):
------------------------
Total time (ms): 31637.7
Repetitions    : 35000
Sample mode    : 0.8 (23649 ocurrences)
Median time    : 0.873
Avg time       : 0.903935
Std dev.       : 0.204107
Minimum        : 0.552
Maximum        : 9.057
95% conf.int.  : [0.503892, 1.30398] e = 0.400043
99% conf.int.  : [0.378189, 1.42968] e = 0.525746
EstimatedAvg95%: [0.901796, 0.906073] e = 0.00213832
EstimatedAvg99%: [0.901125, 0.906745] e = 0.00281023

void main(){} in D (ldc):
------------------------
Total time (ms): 15429.4
Repetitions    : 7000
Sample mode    : 2.1 (1785 ocurrences)
Median time    : 2.128
Avg time       : 2.2042
Std dev.       : 0.466834
Minimum        : 1.286
Maximum        : 19.933
95% conf.int.  : [1.28922, 3.11918] e = 0.914978
99% conf.int.  : [1.00171, 3.40668] e = 1.20248
EstimatedAvg95%: [2.19326, 2.21514] e = 0.0109361
EstimatedAvg99%: [2.18983, 2.21857] e = 0.0143724

I think that the mode could serve as an indicator of the most probable outcome. For me it's around 0.8 vs 2.1 ms. The D runtime rides on top of the C one, plus the same OS costs to launch a process, so from now on let's keep in mind these extra ~1.3 ms of bias in favor of a C-based lexer.

-- 
Dmitry Olshansky
February 08, 2013 Re: Request for comments: std.d.lexer
Posted in reply to Dmitry Olshansky

On 2013-02-08 16:28, Dmitry Olshansky wrote:
> Keep in mind the D runtime start-up cost. In the end, on small files DMD always wins because of the slim C runtime.
>
> [benchmark figures snipped]
>
> I think that the mode could serve as an indicator of the most probable outcome. For me it's around 0.8 vs 2.1 ms. The D runtime rides on top of the C one, plus the same OS costs to launch a process, so from now on let's keep in mind these extra ~1.3 ms of bias in favor of a C-based lexer.

Ok, I see. But wouldn't it be fairer to compare either GCC to GDC or Clang to LDC?

-- 
/Jacob Carlborg
February 08, 2013 Re: Request for comments: std.d.lexer
Posted in reply to Brian Schott

On Friday, 8 February 2013 at 15:23:00 UTC, Brian Schott wrote:
> On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:
>> DMD is still consistently faster, :(
>
> We might be getting to the part where we say that code generated by gcc is still consistently faster than code generated by ldc.

For the lulz I compiled with "dmd -release -inline -noboundscheck -O -m64 -property":

http://hackerpilot.github.com/experimental/std_lexer/images/times3-dmd.png

Yes, dmd vs ldc actually is a matter of 32 vs 18 ms for datetime.
Copyright © 1999-2021 by the D Language Foundation