std.d.lexer requirements (page 9)

On Thursday, August 02, 2012 10:58:21 Walter Bright wrote:
> On 8/2/2012 4:49 AM, deadalnix wrote:
> > How is that different than a manually done range of dchar ?
>
> The decoding is rarely necessary, even if non-ascii data is there. However,
> the range cannot decide if decoding is necessary - the receiver has to, hence
> the receiver does the decoding.

Hence why special-casing must be used to deal with variable-length encodings like UTF-8. If it's not something that the range can handle through front and popFront, then it's not going to work as a range. Having it be _possible_ to special-case but not require it allows for the type to work as a range but still be efficient if the function operating on the range codes for it. But if you require the function to code for it, then it's not really a range.

- Jonathan M Davis

"Jonathan M Davis", in message (digitalmars.D:174059), wrote:
> In either case, because the consumer must do something other than simply
> operate on front, popFront, empty, etc., you're _not_ dealing with the range
> API but rather working around it.

In some cases a range of dchar is useful. In some cases a range of char is sufficient, and much more efficient. And for the UTF-aware programmer, it makes much sense. The fact that you sometimes have to buffer some information because the meaning of one element is affected by the previous element is a normal issue in many algorithms; it's not working around anything. Your lexer uses the range API: would you say it is just working around ranges because you have to take into account several characters (even if they are dchars) at the same time to know what they mean?

On 2012-08-02 22:26, Andrei Alexandrescu wrote:
> On 8/2/12 2:17 PM, Michel Fortin wrote:
>> I wonder how your call with Walter will turn out.
>
> What call?

The skype call you suggested:

"First, after having read the large back-and-forth Jonathan/Walter in one sitting, it's becoming obvious to me you'll never understand each other on this nontrivial matter through this medium. I suggest you set up a skype/phone call. Once you get past the first 30 seconds of social awkwardness of hearing each other's voice, you'll make fantastic progress in communicatin"

-- 
/Jacob Carlborg

Andrei Alexandrescu, in message (digitalmars.D:174060), wrote:
> I agree frontUnit and popFrontUnit are more generic because they allow other
> ranges to define them.

Any range of dchar could have a representation (or you may want to call it something else) that returns a range of char (or ubyte). And I think such representations are more generic because they use a generic API (i.e. range), which is very powerful: the representation can provide length, slicing, etc. that differ from the dchar length or whatever. You don't want to duplicate all range methods by postfixing Unit...

>> I wonder how your call with Walter will turn out.
>
> What call?

You proposed that Jonathan call Walter in an earlier post. I believe there is a misunderstanding.
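
To illustrate the "representation" idea above, here is a minimal sketch using `std.string.representation` from Phobos, which exposes a string's code units as an ordinary array with length and slicing. The helper name `unitAndPointCounts` is made up for this example and is not part of any proposal in the thread.

```d
import std.range.primitives : walkLength;
import std.string : representation;

/// Hypothetical helper: returns [code unit count, code point count]
/// for a UTF-8 string.
size_t[2] unitAndPointCounts(string s)
{
    // The representation is a plain immutable(ubyte)[]: it has length
    // and slicing, and reading it never decodes.
    immutable(ubyte)[] units = s.representation;

    // walkLength on a string goes through the decoding range interface,
    // so it counts code points (dchar), not code units.
    return [units.length, s.walkLength];
}
```

For a string containing non-ASCII text the two counts differ, which is the distinction the post draws between the representation's length and the dchar length.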

Jacob Carlborg, in message (digitalmars.D:174069), wrote:
> On 2012-08-02 10:15, Walter Bright wrote:
>
>> Worst case use an adapter range.
>
> And that is better than a plain string?

Because its front method does not do any decoding.
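
A rough sketch of the adapter-range idea, assuming a current Phobos: as far as I know, `std.utf.byCodeUnit` and `std.utf.byChar` were added after this thread, so they are not what the posters had available. `countLines` is a hypothetical consumer that only ever sees UTF-8 code units.

```d
import std.utf : byChar, byCodeUnit;

/// Hypothetical consumer: counts newline code units in any range
/// whose elements are UTF-8 code units.
size_t countLines(R)(R utf8Units)
{
    size_t n = 0;
    foreach (char c; utf8Units)   // front yields char, no decoding to dchar
        if (c == '\n')
            ++n;
    return n;
}
```

A UTF-8 string can be passed as `text.byCodeUnit`, whose front does no decoding, and UTF-16 input can be passed as `wtext.byChar`, which transcodes to UTF-8 code units on the fly, so the consumer itself does not need to be templated on the encoding.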

August 02, 2012

Re: std.d.lexer requirements

Posted by Dmitry Olshansky
in reply to Walter Bright

On 02-Aug-12 08:30, Walter Bright wrote:
> On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
>> On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
>>> 1. It should accept as input an input range of UTF8. I feel it is a
>>> mistake
>>> to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
>>> should use an 'adapter' range to convert the input to UTF8. (This is
>>> what
>>> component programming is all about.)
>>
>> But that's not how ranges of characters work. They're ranges of dchar.
>> Ranges
>> don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have
>> to create
>> special wrappers around string or wstring to have ranges of UTF-8. The
>> way
>> that it's normally done is to have ranges of dchar and then special-case
>> range-based functions for strings. Then the function can operate on
>> any range
>> of dchar but still operates on strings efficiently.
>
> I have argued against making ranges of characters dchars, because of
> performance reasons. This will especially adversely affect the
> performance of the lexer.
>
> The fact is, files are normally in UTF8 and just about everything else
> is in UTF8. Prematurely converting to UTF-32 is a performance disaster.
> Note that the single largest thing holding back regex performance is
> that premature conversion to dchar and back to char.

Well, it doesn't convert back to UTF-8, as it just slices the input :)

Otherwise very true, especially with ctRegex, which used to receive quite some hype even in its present state. 33% of the time is spent doing and redoing UTF-8 decoding.
(Note that regex does quite some extra work on top of what a lexer does, e.g. a lexer is largely deterministic but regex does some try-rollback.)

> If lexer is required to accept dchar ranges, its performance will drop
> at least in half, and people are going to go reinvent their own lexers.
>

Yes, it slows things down. Decoding (if any) should kick in only where it's absolutely necessary and be an integral part of lexer automation.


-- 
Dmitry Olshansky
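
A minimal sketch of "decode only where necessary", assuming the lexer works directly on UTF-8 code units. `identifierLength` is a hypothetical helper, not code from std.d.lexer or std.regex, and it simplifies the identifier rules (for instance, it does not reject a leading digit).

```d
import std.ascii : isAlphaNum;
import std.uni : isAlpha;
import std.utf : decode;

/// Hypothetical helper: length, in code units, of the identifier-like
/// run at the start of src.
size_t identifierLength(string src)
{
    size_t i = 0;
    while (i < src.length)
    {
        immutable char c = src[i];
        if (c < 0x80)
        {
            // ASCII fast path: a single code unit, no decoding at all.
            if (!(isAlphaNum(c) || c == '_'))
                break;
            ++i;
        }
        else
        {
            // Non-ASCII lead byte: decode just this one code point.
            size_t next = i;
            immutable dchar d = decode(src, next); // advances next
            if (!isAlpha(d))
                break;
            i = next;
        }
    }
    return i;
}
```

The result is an index into the original buffer, so the token text can be taken as `src[0 .. n]` without converting anything back to UTF-8.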

On 8/2/12, Jacob Carlborg <doob@me.com> wrote:
> It still needs to update the editor view with the correct syntax
> highlighting which needs to be done in the same thread as the rest of the GUI.

It can do that immediately for the text that's visible in the window because ~100 lines of text can be lexed pretty damn instantly. As soon as that's done the GUI should be responsive and the rest of the text buffer should be lexed in the background.

On 8/2/12 4:44 PM, Jacob Carlborg wrote:
> On 2012-08-02 22:26, Andrei Alexandrescu wrote:
>> On 8/2/12 2:17 PM, Michel Fortin wrote:
>
>>> I wonder how your call with Walter will turn out.
>>
>> What call?
>
> The skype call you suggested:
>
> "First, after having read the large back-and-forth Jonathan/Walter in
> one sitting, it's becoming obvious to me you'll never understand each
> other on this nontrivial matter through this medium. I suggest you set
> up a skype/phone call. Once you get past the first 30 seconds of social
> awkwardness of hearing each other's voice, you'll make fantastic
> progress in communicatin"

Oh, ok, I thought I'd be on the call. That would be between Jonathan and Walter.

Andrei

Michel Fortin wrote:
> The next issue, which I haven't seen discussed here is that for a parser
> to be efficient it should operate on buffers. You can make it work with
> arbitrary ranges, but if you don't have a buffer you can slice when you
> need to preserve a string, you're going to have to build the string
> character by character, which is not efficient at all. But then you can
> only really return slices if the underlying representation is the same
> as the output representation, and unless your API has a templated output
> type, you're going to special case a lot of things.

Instead of returning whole strings, you may provide another sub-range for the string itself. I wrote a JSON parser using this approach (https://github.com/pszturmaj/json-streaming-parser) and thanks to that it is possible to parse JSON without a single heap allocation. This could also be used in an XML parser.
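
As a small sketch of the buffer case described in the quoted text, a tokenizer can hand out slices of its input instead of building strings. `nextToken` is a hypothetical helper and does not reflect the API of json-streaming-parser; the `@nogc` attribute makes the compiler check that the function performs no GC allocation.

```d
/// Hypothetical helper: splits off the next space-delimited token as a
/// slice of the input and advances the input past it.
string nextToken(ref string input) @nogc nothrow pure @safe
{
    size_t start = 0;
    while (start < input.length && input[start] == ' ')
        ++start;                      // skip leading spaces

    size_t end = start;
    while (end < input.length && input[end] != ' ')
        ++end;                        // scan to the end of the token

    string tok = input[start .. end]; // a view into the buffer, no copy
    input = input[end .. $];
    return tok;
}
```

This only works when the input is a sliceable buffer whose element type matches the output type, which is the limitation the quoted post raises; the sub-range approach in the reply is aimed at inputs where that does not hold.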
