August 03, 2012
On 8/3/2012 2:01 AM, Jacob Carlborg wrote:
> On 2012-08-03 09:35, Walter Bright wrote:
>
>> Look in doc.c at highlightCode2() for how to call the lexer by itself.
>
> So basically:
>
> Token tok;
>
> //start timer
>
> while (tok.value != TOKeof)
>      lex.scan(&tok);
>
> //end timer
>
> Something like that?


Pretty much. Or just call lex.nextToken() until you get a TOKeof.
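
For reference, a complete timing harness along those lines might look like the following. This is only a sketch: Lexer, Token, TOKeof and scan() are the names used above (from dmd's lexer), not a settled D API, and the StopWatch usage assumes std.datetime's StopWatch.

import std.datetime : AutoStart, StopWatch;
import std.stdio : writeln;

void timeLexer(Lexer lex)
{
    Token tok;
    auto sw = StopWatch(AutoStart.yes);   // start timer
    do
        lex.scan(&tok);                   // fills in the next token
    while (tok.value != TOKeof);
    sw.stop();                            // end timer
    writeln("lexed in ", sw.peek().msecs, " ms");
}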

August 03, 2012
On 8/3/2012 1:21 AM, Christophe Travert wrote:
> This range does not have to be a string; it can be something over a
> file, stream, or socket. It can also be the result of an algorithm, because
> you *can* use algorithms on ranges of char, and it makes sense if you
> know what you are doing.

Correct, that's the whole point of using a range - it can come from anything.

> If Walter discovers the lexer does not work with a socket, a
> "file.byChunk.join", and has to do expensive utf-8 decoding for the
> lexer to work because it can only use range of dchar, and not range of
> char (except that it special-cased strings), he may not be happy.

No may about it, I wouldn't be happy :-)
August 03, 2012
On 8/3/2012 3:07 AM, Walter Bright wrote:
> On 8/3/2012 1:21 AM, Christophe Travert wrote:
>> This range does not have to be a string; it can be something over a
>> file, stream, or socket. It can also be the result of an algorithm, because
>> you *can* use algorithms on ranges of char, and it makes sense if you
>> know what you are doing.
>
> Correct, that's the whole point of using a range - it can come from anything.

For example, let's suppose we want to do D syntax highlighting in our IDE. It is highly unlikely that the text editor's data structure is a simple string. It's likely to be an array of lines, or something like that. It's simple to provide a range interface to that, where it presents the data structure as a sequence of chars.

But that wouldn't work with Jonathan's string specialization.

Nor would the string specialization work if the input was a 1024 byte file buffer.
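
To make that concrete, here is a hypothetical sketch of such an adapter (all names invented): an editor that stores its text as an array of lines can present it as a forward range of char, with an implicit '\n' between lines.

struct LineBufferRange
{
    string[] lines;   // the editor's own storage, one entry per line
    size_t line, col;

    @property bool empty() const { return line >= lines.length; }

    @property char front() const
    {
        // at the end of a line, present the implicit newline
        return col < lines[line].length ? lines[line][col] : '\n';
    }

    void popFront()
    {
        if (col < lines[line].length)
            ++col;
        else { ++line; col = 0; }   // consumed the implicit '\n'
    }

    @property LineBufferRange save() { return this; }
}

A lexer written against a generic range of char consumes this directly; no contiguous copy of the buffer is ever made.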
August 03, 2012
>> Correct, that's the whole point of using a range - it can come from anything.
>
> For example, let's suppose we want to do D syntax highlighting in our IDE. It is highly unlikely that the text editor's data structure is a simple string. It's likely to be an array of lines, or something like that. It's simple to provide a range interface to that, where it presents the data structure as a sequence of chars.
>

Would this be an argument for putting the computation of source locations (i.e. line + offset or similar) into the range, or into a template argument / policy, so that it's done in the most effective way for the client?

Kate, for example, has a "range" type that marks a span in the text buffer. This way the lexer can return tokens with the correct "textrange" attached, and you don't need to recompute the text ranges from line/column numbers.
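
A hypothetical sketch of that idea (all names invented): make the position type, and the function that advances it, template parameters of the lexer, so an editor like Kate could plug in its own text-range type instead of paying for line/column bookkeeping it would only convert away again.

enum TokenType { identifier, number, stringLiteral /* ... */ }

struct Token(Pos)
{
    TokenType type;
    Pos pos;                  // line/column, byte offset, or an editor range
    const(char)[] text;
}

// `advance` is the policy: called once per consumed char to update the position.
struct Lexer(Range, Pos, alias advance)
{
    Range input;
    Pos pos;

    // The scanning loop would call advance(pos, c) for each char consumed,
    // and stamp each emitted Token!Pos with the position at the token's start.
}

A plain byte-offset client would instantiate it with Pos = size_t and an advance that just increments; an editor would supply its own span type.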



August 03, 2012
On 02/08/2012 20:08, Walter Bright wrote:
> On 8/2/2012 4:52 AM, deadalnix wrote:
>> On 02/08/2012 09:30, Walter Bright wrote:
>>> On 8/1/2012 11:49 PM, Jacob Carlborg wrote:
>>>> On 2012-08-02 02:10, Walter Bright wrote:
>>>>
>>>>> 1. It should accept as input an input range of UTF8. I feel it is a
>>>>> mistake to templatize it for UTF16 and UTF32. Anyone desiring to
>>>>> feed it
>>>>> UTF16 should use an 'adapter' range to convert the input to UTF8.
>>>>> (This
>>>>> is what component programming is all about.)
>>>>
>>>> I'm no expert on ranges but won't that prevent slicing? Slicing is one
>>>> of the
>>>> main reasons for why the Tango XML parser is so amazingly fast.
>>>>
>>>
>>> You don't want to use slicing on the lexer. The reason is that your
>>> slices will be spread all over memory, as source files can be huge, and
>>> all that memory will be retained and never released. What you want is a
>>> compact representation after lexing. Compactness also helps a lot with
>>> memory caching.
>>>
>>
>> Tokens are not kept in memory. You usually consume them for other
>> processing and
>> throw them away.
>>
>> It isn't an issue.
>
> The tokens are not kept, correct. But the identifier strings, and the
> string literals, are kept, and if they are slices into the input buffer,
> then everything I said applies.
>

Ok, what do you think of this:

The lexer can have a parameter that tells it whether to build a table of tokens or to slice the input. The latter matters, for instance, for an IDE: lexing will occur often, and you'd prefer slicing there because you already have the source file in memory anyway.

The token always contains a slice as a member. The slice comes either from the source or from a memory chunk allocated by the lexer.

If the lexer allocates chunks, it will reuse the same memory location for the same string. With the following mechanism for comparing slices, that means only two comparisons for identifiers lexed this way:

bool sameText(const(char)[] a, const(char)[] b)
{
    if (a.length != b.length) return false;
    if (a.ptr is b.ptr) return true;  // interned: same chunk, same string
    return a == b;                    // regular char-by-char comparison
}
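
A sketch of the chunk-reuse part, with invented names: the lexer keeps an interning table so that two identifiers with the same spelling always share the same memory, which is what makes the pointer check above conclusive.

struct InternTable
{
    string[const(char)[]] table;

    string intern(const(char)[] s)
    {
        if (auto canonical = s in table)
            return *canonical;   // same spelling: reuse the same chunk
        auto copy = s.idup;      // copy out of the lexer's transient buffer
        table[copy] = copy;
        return copy;
    }
}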

Is that a suitable option?
August 03, 2012
On 03/08/2012 05:41, Andrei Alexandrescu wrote:
> On 8/2/12 11:08 PM, Jonathan M Davis wrote:
>> You're not going to get as fast a lexer if it's not written
>> specifically for D.
>> Writing a generic lexer is a different problem. It's also one that
>> needs to be
>> solved, but I think that it's a mistake to think that a generic lexer
>> is going
>> to be able to be as fast as one specifically optimized for D.
>
> Do you have any evidence to back that up? I mean you're just saying it.
>
> Andrei
>

I think it can be as fast if the lexer is generated at compile time. I'd also argue that it is better to have a D lexer ASAP, even if it isn't as configurable as it could be.

Better something now than something better later. We can make it better later anyway.
August 03, 2012
On 02/08/2012 20:14, Marco Leise wrote:
> On Thu, 02 Aug 2012 14:26:58 +0200, "Adam D. Ruppe" <destructionator@gmail.com> wrote:
>
>> On Thursday, 2 August 2012 at 11:47:20 UTC, deadalnix wrote:
>>> lexer really isn't the performance bottleneck of dmd (or any
>>> compiler of a non trivial language).
>>
>> What if we're just using this lexer in something like a
>> syntax highlighting text editor? I'd be annoyed if it
>> stopped typing for a little bit cuz of processing.
>
> My pet peeve for editors is how to make the lexer work on only what I just edited. It is really not trivial, but it is eased, for example, by editors that insert the closing } automatically, so the scopes don't get messed up. A missing curly bracket results in the parser reinterpreting the whole file. The same goes for string terminators:
>
> If I type `` at the top of std.datetime in Mono-D right after "module std.datetime;", I have a 12 (!) second wait until the IDE becomes responsive again.
> But if I use ' or ", then the editor inserts the second terminator and is done with reparsing in 1-2 seconds.
>
> Can we really have a good lexer that is equally well suited for compilers as for editors ?
>

A common solution to that problem is to cancel the computation when the user types something again.
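
A hedged sketch of that scheme, with invented names: bump a shared generation counter on every keystroke, and have the background lexer abandon any run whose generation has gone stale.

import core.atomic : atomicLoad, atomicOp;
import std.concurrency : spawn;

shared uint generation;

void lexWorker(uint myGen, string source)
{
    enum chunkSize = 4096;
    for (size_t i = 0; i < source.length; i += chunkSize)
    {
        if (atomicLoad(generation) != myGen)
            return;               // user typed again: abandon this run
        // ... lex the chunk starting at source[i] ...
    }
    // ... hand the finished tokens back to the GUI thread ...
}

// On every edit:
//     auto gen = atomicOp!"+="(generation, 1);
//     spawn(&lexWorker, gen, currentSource);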
August 03, 2012
deadalnix, in message (digitalmars.D:174155), wrote:
>> The tokens are not kept, correct. But the identifier strings, and the string literals, are kept, and if they are slices into the input buffer, then everything I said applies.
>>
> 
> Ok, what do you think of that :
> 
> lexer can have a parameter that tell if it should build a table of token or slice the input. The second is important, for instance for an IDE : lexing will occur often, and you prefer slicing here because you already have the source file in memory anyway.
> 
> The token always contains as a member a slice. The slice come either from the source or from a memory chunk allocated by the lexer.

If I may add, there are several possibilities here:
 1- a real slice of the input range
 2- a slice of the input range created with .save and takeExactly
 3- a slice allocated in GC memory by the lexer
 4- a slice of memory owned by the lexer, which is reused for the next
token (thus, the next call to popFront invalidates the token).
 5- a slice of memory from a lookup table.

All are useful in certain situations.
#1 is usable for sliceable ranges, and is definitely efficient when you
don't have a huge amount of code to parse.
#2 is usable for forward ranges.
#3 is usable for any range, but I would not recommend it...
#4 is usable for any range.
#5 is best if you perform complicated operations with the tokens.

#1/#2 should not be very hard to code: when you start to lex a new token, you save the range, and when you find the end of the token, you just use takeExactly on the saved range.
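
Something like this, perhaps (a sketch only; the token-boundary test is a placeholder):

import std.range;

auto sliceToken(R)(ref R input)   // R: a forward range of char
{
    auto start = input.save;      // remember where the token begins
    size_t len;
    while (!input.empty && input.front != ' ')   // placeholder boundary test
    {
        input.popFront();
        ++len;
    }
    return takeExactly(start, len);   // the token's text, lazily
}

Note that takeExactly is specialized for sliceable ranges, where it returns an actual slice, so #1 falls out of the same code.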

#4 requires an internal buffer. That's more code, but you have to do it in a second step anyway if you want to be able to use input ranges (which you have to). Actually, the buffer may be external, if you use a buffered-range adapter to make a forward range out of an input range. Having an internal buffer may be more efficient; that is something that has to be profiled.
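
A sketch of #4 with an internal buffer (again hypothetical, with a placeholder boundary test):

struct BufferedLexer(R)   // R: any input range of char
{
    R input;
    private char[] buf;          // reused for every token

    const(char)[] next()         // invalidates the previous result
    {
        buf.length = 0;
        buf.assumeSafeAppend();  // per the contract, no one keeps the old slice
        while (!input.empty && input.front != ' ')   // placeholder boundary
        {
            buf ~= input.front;
            input.popFront();
        }
        return buf;
    }
}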

#3 can be obtained from #4 by map!(x => x.dup).

#5 requires one of the previous options to be implemented: you need to have a slice saved somewhere before you can look it up in the table. Therefore, I think #5 can be obtained without much loss of efficiency by an algorithm external to the lexer. That would open up many ways to use the lexer. For example, you can filter out the tokens that you don't want before building the table, which avoids an overfull lookup table if you are only interested in a subset of tokens.

#1/#2 with adapter ranges might be the only thing that is required to code, although the API should allow defining #4 and #5, so that the user can use the adapters blindly, or in case an internal implementation proves significantly more efficient.

-- 
Christophe
August 03, 2012
On Friday, 3 August 2012 at 14:49:55 UTC, travert@phare.normalesup.org wrote:
>
> If I may add, there are several possitilities here:
>  1- a real slice of the input range
>  2- a slice of the input range created with .save and takeExactly
>  3- a slice allocated in GC memory by the lexer
>  4- a slice of memory owned by the lexer, which is reused for the next
> token (thus, the next call to popFront invalidates the token).
>  5- a slice of memory from a lookup table.

#6 None. Just provide information about the token type and location (e.g. for syntax highlighting).
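
In code, that could be as little as this (a sketch, names invented):

enum TokenKind { identifier, keyword, comment, stringLiteral /* ... */ }

struct SlimToken
{
    TokenKind kind;
    size_t offset;   // start index into the source, in code units
    size_t length;   // token length, in code units
}

A highlighter can colour the source directly from these, and anyone who needs the text can slice it out of the source they already hold.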
August 03, 2012
On 03-Aug-12 10:35, Jacob Carlborg wrote:
> On 2012-08-03 00:25, Dmitry Olshansky wrote:
>
>> OT:
>> It never ceases to amaze me how people miss this very simple point:
>> the GUI runs on its own thread and shouldn't ever block on anything (save
>> for the message pump itself, of course). Everything else (including possibly
>> slow rendering) is done on the side, and the result (once ready) is swiftly
>> reflected in the GUI.
>
> I'm not entirely sure what you mean by "rendering", but most GUI systems
> are not thread-safe, and all GUI updates need to happen in the same thread.
>
Draw things to an off-screen bitmap, then blit it to the window (aye, pass a reference to the pixel buffer back to the UI thread). This technique has been in use for decades. Imagine drawing some large, intricate fractal; it could easily take a few seconds.

-- 
Dmitry Olshansky