August 02, 2012
Re: std.d.lexer requirements
On Thursday, August 02, 2012 10:58:21 Walter Bright wrote:
> On 8/2/2012 4:49 AM, deadalnix wrote:
> > How is that different than a manually done range of dchar ?
> 
> The decoding is rarely necessary, even if non-ascii data is there. However,
> the range cannot decide if decoding is necessary - the receiver has to,
> hence the receiver does the decoding.

Hence why special-casing must be used to deal with variable-length encodings 
like UTF-8. If it's not something that the range can handle through front and 
popFront, then it's not going to work as a range. Having it be _possible_ to 
special-case but not requiring it allows the type to work as a range but still 
be efficient if the function operating on the range codes for it. But if you 
require the function to code for it, then it's not really a range.
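
A minimal sketch of the special-casing described above, as an illustration only. The function name countSpaces is hypothetical and not from Phobos or any lexer proposal; it assumes a Phobos where std.range provides the range primitives for arrays. The string branch is optional: removing it leaves a function that still works on any input range.

```d
import std.range;                  // isInputRange, empty, front, popFront
import std.traits : isSomeString;

/// Counts ASCII spaces in any input range of characters.
/// (Hypothetical helper, for illustration only.)
size_t countSpaces(R)(R input)
    if (isInputRange!R)
{
    size_t n;
    static if (isSomeString!R)
    {
        // Special case: index the code units directly. A space is never
        // part of a multi-unit sequence, so no decoding is needed.
        foreach (i; 0 .. input.length)
            if (input[i] == ' ')
                ++n;
    }
    else
    {
        // Generic path: uses only empty, front and popFront.
        for (; !input.empty; input.popFront())
            if (input.front == ' ')
                ++n;
    }
    return n;
}
```

Both paths give the same answer; only the string path avoids decoding every element.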

- Jonathan M Davis
August 02, 2012
Re: std.d.lexer requirements
On 2012-08-02 10:15, Walter Bright wrote:

> Worst case use an adapter range.

And that is better than a plain string?

-- 
/Jacob Carlborg
August 02, 2012
Re: std.d.lexer requirements
"Jonathan M Davis", in message (digitalmars.D:174059), wrote:
> In either case, because the consumer must do something other than simply 
> operate on front, popFront, empty, etc., you're _not_ dealing with the range 
> API but rather working around it.

In some cases a range of dchar is useful. In some cases a range of char is 
sufficient, and much more efficient. And for the UTF-aware programmer, it 
makes a lot of sense.

The fact that you sometimes have to buffer some information because the 
meaning of one element is affected by the previous element is a normal 
issue in many algorithms; it's not working around anything. Your lexer 
uses the range API. Would you say it is just working around ranges because 
you have to take into account several characters (even if they are dchars) at 
the same time to know what they mean?
August 02, 2012
Re: std.d.lexer requirements
On 2012-08-02 22:26, Andrei Alexandrescu wrote:
> On 8/2/12 2:17 PM, Michel Fortin wrote:

>> I wonder how your call with Walter will turn out.
>
> What call?

The skype call you suggested:

"First, after having read the large back-and-forth Jonathan/Walter in 
one sitting, it's becoming obvious to me you'll never understand each 
other on this nontrivial matter through this medium. I suggest you set 
up a skype/phone call. Once you get past the first 30 seconds of social 
awkwardness of hearing each other's voice, you'll make fantastic 
progress in communicatin"

-- 
/Jacob Carlborg
August 02, 2012
Re: std.d.lexer requirements
Andrei Alexandrescu, in message (digitalmars.D:174060), wrote:
> I agree frontUnit and popFrontUnit are more generic because they allow 
> other ranges to define them.

Any range of dchar could have a representation (or you may want to call 
it something else) that returns a range of char (or ubyte). And I think 
that is more generic because it uses a generic API (i.e. ranges), which is 
very powerful: the representation can provide length, slicing, etc. 
that differ from the dchar length or whatever. You don't want to 
duplicate all the range methods by suffixing them with Unit...

>> I wonder how your call with Walter will turn out.
> 
> What call?

You suggested in an earlier post that Jonathan call Walter. I believe there 
is a misunderstanding.
August 02, 2012
Re: std.d.lexer requirements
Jacob Carlborg, in message (digitalmars.D:174069), wrote:
> On 2012-08-02 10:15, Walter Bright wrote:
> 
>> Worst case use an adapter range.
> 
> And that is better than a plain string?
> 
Because its front method does not do any decoding.
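
To illustrate the difference, here is a small sketch that assumes std.string.representation serves as the "adapter" (it is only one possible choice, not necessarily what Walter has in mind). The function name compareViews is hypothetical.

```d
import std.range;                       // front, popFront, walkLength
import std.string : representation;

/// Contrasts the decoding view of a string with its raw code units.
/// (Hypothetical demo function, for illustration only.)
void compareViews()
{
    string s = "h\u00E9llo";            // U+00E9 takes two UTF-8 code units

    // Range primitives on a plain string decode: 5 code points.
    assert(s.walkLength == 5);
    auto decoded = s;
    decoded.popFront();
    assert(decoded.front == '\u00E9');  // front produced a whole dchar

    // The same data viewed as immutable(ubyte)[]: 6 units, no decoding.
    auto units = s.representation;
    assert(units.length == 6);
    units.popFront();
    assert(units.front == 0xC3);        // first byte of the two-byte sequence
}
```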
August 02, 2012
Re: std.d.lexer requirements
On 02-Aug-12 08:30, Walter Bright wrote:
> On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
>> On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
>>> 1. It should accept as input an input range of UTF8. I feel it is a
>>> mistake
>>> to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
>>> should use an 'adapter' range to convert the input to UTF8. (This is
>>> what
>>> component programming is all about.)
>>
>> But that's not how ranges of characters work. They're ranges of dchar.
>> Ranges
>> don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have
>> to create
>> special wrappers around string or wstring to have ranges of UTF-8. The
>> way
>> that it's normally done is to have ranges of dchar and then special-case
>> range-based functions for strings. Then the function can operate on
>> any range
>> of dchar but still operates on strings efficiently.
>
> I have argued against making ranges of characters dchars, because of
> performance reasons. This will especially adversely affect the
> performance of the lexer.
>
> The fact is, files are normally in UTF8 and just about everything else
> is in UTF8. Prematurely converting to UTF-32 is a performance disaster.
> Note that the single largest thing holding back regex performance is
> that premature conversion to dchar and back to char.

Well, it doesn't convert back to UTF-8, as it just takes slices of the input :)

Otherwise very true, especially with ctRegex, which used to receive quite 
some hype even in its present state. 33% of the time is spent doing and 
redoing UTF-8 decoding.
(Note that regex does quite some extra work on top of what a lexer does, e.g. 
a lexer is largely deterministic but regex has some try-and-rollback.)

> If lexer is required to accept dchar ranges, its performance will drop
> at least in half, and people are going to go reinvent their own lexers.
>

Yes, it slows things down. Decoding (if any) should kick in only where 
it's absolutely necessary and be an integral part of the lexer automaton.
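
A rough sketch of what "decode only where necessary" could look like when scanning an identifier. The function identifierLength is hypothetical, and std.uni.isAlpha is only a stand-in for whatever character classes the real lexer would accept outside ASCII.

```d
import std.uni : isAlpha;
import std.utf : decode;

/// Returns the length, in UTF-8 code units, of the identifier at the
/// start of `src`, or 0 if there is none.
/// (Hypothetical helper, for illustration only.)
size_t identifierLength(string src)
{
    size_t i = 0;
    while (i < src.length)
    {
        immutable char c = src[i];
        if (c < 0x80)
        {
            // ASCII fast path: classify the code unit without decoding.
            immutable bool letter = (c >= 'a' && c <= 'z')
                                 || (c >= 'A' && c <= 'Z') || c == '_';
            immutable bool digit = c >= '0' && c <= '9';
            if (letter || (digit && i > 0))
                ++i;
            else
                break;
        }
        else
        {
            // Non-ASCII lead byte: only now pay for decoding.
            size_t next = i;
            immutable dchar d = decode(src, next);  // advances `next`
            if (!isAlpha(d))   // assumption: stand-in for the real rule
                break;
            i = next;
        }
    }
    return i;
}
```

For pure ASCII source, the decode branch is never taken.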


-- 
Dmitry Olshansky
August 02, 2012
Re: std.d.lexer requirements
On 8/2/12, Jacob Carlborg <doob@me.com> wrote:
> It still needs to update the editor view with the correct syntax
> highlighting which needs to be done in the same thread as the rest of
> the GUI.

It can do that immediately for the text that's visible in the window
because ~100 lines of text can be lexed pretty damn instantly. As soon
as that's done the GUI should be responsive and the rest of the text
buffer should be lexed in the background.
August 02, 2012
Re: std.d.lexer requirements
On 8/2/12 4:44 PM, Jacob Carlborg wrote:
> On 2012-08-02 22:26, Andrei Alexandrescu wrote:
>> On 8/2/12 2:17 PM, Michel Fortin wrote:
>
>>> I wonder how your call with Walter will turn out.
>>
>> What call?
>
> The skype call you suggested:
>
> "First, after having read the large back-and-forth Jonathan/Walter in
> one sitting, it's becoming obvious to me you'll never understand each
> other on this nontrivial matter through this medium. I suggest you set
> up a skype/phone call. Once you get past the first 30 seconds of social
> awkwardness of hearing each other's voice, you'll make fantastic
> progress in communicatin"

Oh, ok, I thought I'd be on the call. That would be between Jonathan and 
Walter.

Andrei
August 02, 2012
Re: std.d.lexer requirements
Michel Fortin wrote:
> The next issue, which I haven't seen discussed here is that for a parser
> to be efficient it should operate on buffers. You can make it work with
> arbitrary ranges, but if you don't have a buffer you can slice when you
> need to preserve a string, you're going to have to build the string
> character by character, which is not efficient at all. But then you can
> only really return slices if the underlying representation is the same
> as the output representation, and unless your API has a templated output
> type, you're going to special case a lot of things.

Instead of returning whole strings, you may provide another sub-range 
for the string itself. I wrote a JSON parser using this approach 
(https://github.com/pszturmaj/json-streaming-parser) and thanks to that 
it is possible to parse JSON without a single heap allocation. This 
could also be used in an XML parser.
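
A minimal sketch of the sub-range idea, not the actual API of the parser linked above. The names QuotedBody and quotedBody are hypothetical, and escape sequences are deliberately ignored. The sub-range holds a pointer to the parent range, so consuming the string contents advances the parent and nothing is copied or allocated.

```d
import std.range;   // isInputRange, empty, front, popFront

/// Lazily yields the characters of a quoted string up to, but not
/// including, the closing quote. (Hypothetical type; no escape handling.)
struct QuotedBody(R)
    if (isInputRange!R)
{
    private R* src;   // parent range, advanced as the body is consumed

    @property bool empty() { return (*src).empty || (*src).front == '"'; }
    @property auto front() { return (*src).front; }
    void popFront() { (*src).popFront(); }
}

/// Skips the opening quote and returns a sub-range over the contents.
/// The caller must keep `*input` alive while the sub-range is in use.
auto quotedBody(R)(R* input)
{
    assert(!(*input).empty && (*input).front == '"');
    (*input).popFront();
    return QuotedBody!R(input);
}
```

Once the sub-range is exhausted, the parent range is left positioned on the closing quote, so the parser can pop it and continue with the next token.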