February 09, 2013
On Saturday, 9 February 2013 at 07:15:57 UTC, deadalnix wrote:
> On Friday, 8 February 2013 at 20:54:32 UTC, Jonathan M Davis wrote:
>> On Friday, February 08, 2013 23:50:13 Dmitry Olshansky wrote:
>>> Complication - yes, slight performance cost is what I doubt it in RA
>>> case. Seems like a compiler/optimizer issue.
>>
>> The main problem isn't RA ranges. It's quite probable that the optimizer takes
>> care of any minor additional overhead that's incurred there. The issue is
>> really pure forward ranges, because you're forced to count the code units as
>> they're processed (or even save the range as you go along, given how expensive
>> it would be to go find the index even if you have it). But we really have no
>> way of knowing how bad it is without hard data. It would certainly be trivial
>> to end up doing other stuff in the implementation which cost far more.
>>
>
> For pure InputRanges, that is pretty bad, as the decoding of UTF chars basically has to be done twice, with each code point popped individually, or the lexer has to embed its own homebrew version of std.utf.

Wow, that is complete bullshit xD I completely messed up what I wanted to say.

So, std.utf.decodeFront may or may not pop the UTF character. And in case it does not, you end up having to do extra hanky-panky (and duplicate logic from std.utf to know whether it pops or not, which is likely to be very bug-prone).
February 09, 2013
On 2/8/2013 12:01 AM, Jonathan M Davis wrote:
> It's not quite a use case where
> ranges shine - especially when efficiency is a top priority.

A more problematic case is that dmd's lexer relies on a 0 byte at the end to be a "sentinel" for the end of file. Without such a sentinel, every consumption of a source character requires two operations rather than one.

February 09, 2013
On 2/4/2013 7:19 PM, Brian Schott wrote:
> Still only half speed. I'm becoming more and more convinced that Walter is
> actually a wizard.

I worship the Dark Arts.
February 09, 2013
On Saturday, February 09, 2013 08:19:33 deadalnix wrote:
> So, std.utf.decodeFront may or may not pop the UTF character. And in case it does not, you end up having to do extra hanky-panky (and duplicate logic from std.utf to know whether it pops or not, which is likely to be very bug-prone).

Well, with the pull request that I have, it'll always pop for input ranges and never for anything else. I'm disinclined to have it pop off elements at all, because I'd like it to function as much like decode as possible, but there's no choice with pure input ranges, and it may be better to just make it always pop the elements off. I'll have to think about it. I hate pure input ranges in general. They're just so frustrating to deal with, and I believe that you can always have a forward range if you're willing to put a bit more effort into it.

The lexer that I'm writing doesn't even support pure input ranges, because it has to be able to save to do what it does, and the only reason that decodeFront exists is because I wrote it to support that use case. And needing to operate on ranges of code units is likely to be quite rare, and the function is quite new, so I don't expect that much code is affected by any bugs in decodeFront or that much would be broken were it to be changed.

- Jonathan M Davis
February 09, 2013
On Friday, February 08, 2013 23:48:50 Walter Bright wrote:
> On 2/8/2013 12:01 AM, Jonathan M Davis wrote:
> > It's not quite a use case where
> > ranges shine - especially when efficiency is a top priority.
> 
> A more problematic case is dmd's lexer relies on a 0 byte at the end to be a "sentinel" for the end of file. Without such a sentinel, every consumption of a source character requires two operations rather than one.

I didn't know that. That's a cute trick. But yeah, without controlling the input, it's not possible, and that won't work with a general implementation in a library. It would be trivial enough to put a wrapper around the input to add the 0 byte at the end, but the wrapper would still have to keep checking for empty, so you wouldn't gain anything.

- Jonathan M Davis
February 09, 2013
On Feb 9, 2013 8:01 AM, "Walter Bright" <newshound2@digitalmars.com> wrote:
>
> On 2/4/2013 7:19 PM, Brian Schott wrote:
>>
>> Still only half speed. I'm becoming more and more convinced that Walter is
>> actually a wizard.
>
>
> I worship the Dark Arts.

More like cursed the gods when you wrote it. :-)


February 09, 2013
On 2/9/13 3:07 AM, Jonathan M Davis wrote:
> On Friday, February 08, 2013 23:48:50 Walter Bright wrote:
>> On 2/8/2013 12:01 AM, Jonathan M Davis wrote:
>>> It's not quite a use case where
>>> ranges shine - especially when efficiency is a top priority.
>>
>> A more problematic case is dmd's lexer relies on a 0 byte at the end to be a
>> "sentinel" for the end of file. Without such a sentinel, every consumption
>> of a source character requires two operations rather than one.
>
> I didn't know that. That's a cute trick. But yeah, without controlling the
> input, it's not possible, and that won't work with a general implementation in
> a library. It would be trivial enough to put a wrapper around the input to add
> the 0 byte at the end, but the wrapper would still have to keep checking for
> empty, so you wouldn't gain anything.

That's not a problem. You may require that the input has a terminating zero.

Andrei


February 09, 2013
On 2013-02-09 09:05, Jonathan M Davis wrote:

> The lexer that I'm writing doesn't even support pure input ranges, because it
> has to be able to save to do what it does, and the only reason that
> decodeFront exists is because I wrote it to support that use case. And needing
> to operate on ranges of code units is likely to be quite rare, and the
> function is quite new, so I don't expect that much code is affected by any bugs
> in decodeFront or that much would be broken were it to be changed.

Following this discussion, it seems complicated to support ranges in a lexer and still maintain speed. Is it really worth the trouble to support ranges?

-- 
/Jacob Carlborg
February 09, 2013
On 2013-02-08 17:35, Brian Schott wrote:
> On Friday, 8 February 2013 at 15:35:55 UTC, Dmitry Olshansky wrote:
>> Would be interesting to try gdc, as dmd on linux uses gcc.
>
> GDC decided to randomly not fail to build on my system, so it's time for
> MOAR CHARTS.
>
> dmd -inline -noboundscheck -O -w -wi -m64 -property
> ldc2 -O2 -release -vectorize -m64
> gdc -O3 -fno-bounds-check -frelease -m64
> http://hackerpilot.github.com/experimental/std_lexer/images/times3-all.png

Do you have the exact code you test available somewhere?

-- 
/Jacob Carlborg
February 09, 2013
On 2/9/13 10:05 AM, Jacob Carlborg wrote:
> On 2013-02-09 09:05, Jonathan M Davis wrote:
>
>> The lexer that I'm writing doesn't even support pure input ranges,
>> because it
>> has to be able to save to do what it does, and the only reason that
>> decodeFront exists is because I wrote it to support that use case. And
>> needing
>> to operate on ranges of code units is likely to be quite rare, and the
>> function is quite new, so I don't expect that much code is affected by
>> any bugs
>> in decodeFront or that much would be broken were it to be changed.
>
> Following this discussion it seems it's complicated to support ranges in
> a lexer and still maintain the speed. Is it really worth the trouble to
> support ranges?

Requiring a random-access range of ubyte with a terminating zero may be the most general approach to a fast lexer - and that's fine.

Andrei