August 03, 2012
On 2012-08-03 18:49, Dmitry Olshansky wrote:

> Draw the thing to an off-screen bitmap, then blit it to the window (aye, pass
> a reference to the pixel buffer back to the UI thread). This technique has
> been in use for decades. Imagine drawing some large, intricate fractal: it
> could easily take a few seconds.

Ok, now I see.

-- 
/Jacob Carlborg
August 03, 2012
On 8/3/2012 4:40 AM, Tobias Pankrath wrote:
> Would this be an argument for putting the computation of source locations (i.e.
> line + offset or similar) into the range / into a template argument / policy,
> so that it's done in the most effective way for the client?
>
> Kate, for example, has a "range" type that marks a span in the text buffer. This
> way the lexer can return tokens with the correct "textrange" attached and you
> don't need to recompute the text ranges from line/col numbers.

Worth thinking about.
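
One way to read the suggestion above, as a rough sketch only: tokens carry a span of offsets into the buffer, and line/column numbers are computed on demand by whoever needs them. The names `TextRange`, `Location` and `LineIndex` are made up for illustration and do not come from Kate or from any proposed lexer API.

```d
/// Hypothetical span of byte offsets into the source buffer: [start, end).
struct TextRange { size_t start; size_t end; }

/// 1-based line and column, computed only when a client asks for it.
struct Location { size_t line; size_t col; }

/// Hypothetical helper that maps offsets back to line/column numbers.
struct LineIndex
{
    private size_t[] lineStarts; // offset at which each line begins

    this(const(char)[] source)
    {
        lineStarts = [size_t(0)];
        foreach (i, c; source)
            if (c == '\n')
                lineStarts ~= i + 1;
    }

    Location locate(size_t offset) const
    {
        // Binary search for the last line start that is <= offset.
        size_t lo = 0, hi = lineStarts.length;
        while (hi - lo > 1)
        {
            immutable mid = lo + (hi - lo) / 2;
            if (lineStarts[mid] <= offset)
                lo = mid;
            else
                hi = mid;
        }
        return Location(lo + 1, offset - lineStarts[lo] + 1);
    }
}

unittest
{
    auto index = LineIndex("ab\ncd\nef");
    auto token = TextRange(4, 5);            // the 'd' on the second line
    auto where = index.locate(token.start);
    assert(where.line == 2 && where.col == 2);
}
```

A client that only needs spans (an editor, say) never pays for the line/column lookup; a compiler printing diagnostics calls `locate` when it actually reports an error.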


August 03, 2012
On 8/3/2012 6:18 AM, deadalnix wrote:
> The lexer can have a parameter that tells whether it should build a table of tokens or
> slice the input. The second is important, for instance for an IDE: lexing will
> occur often, and you prefer slicing there because you already have the source
> file in memory anyway.

A string may span multiple lines - IDEs do not store the text as one string.

> If the lexer allocates chunks, it will reuse the same memory location for the
> same string. Given the following mechanism for comparing slices, this will
> require 2 comparisons for an identifier lexed with that method:
>
> if(a.length != b.length) return false;
> if(a.ptr == b.ptr) return true;
> // Regular char by char comparison.
>
> Is that a suitable option?

You're talking about doing for strings what is done for identifiers - returning a unique handle for each. I don't think this works very well for string literals, as there seem to be few duplicates.
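
For reference, the comparison quoted above could be written out as a complete D function along these lines. This is only a sketch; `sameText` is a hypothetical name, not something from DMD or Phobos.

```d
/// Hypothetical helper: compares two slices, trying the two cheap checks
/// before reading any characters.
bool sameText(const(char)[] a, const(char)[] b)
{
    if (a.length != b.length) return false; // lengths differ: cannot be equal
    if (a.ptr == b.ptr) return true;        // same address and length: equal
    return a == b;                          // regular char-by-char comparison
}

unittest
{
    string source = "foo + foo";
    auto first  = source[0 .. 3];   // "foo"
    auto again  = source[0 .. 3];   // same memory as `first`
    auto second = source[6 .. 9];   // "foo" at a different address

    assert(sameText(first, again));           // decided by the pointer check
    assert(sameText(first, second));          // falls back to comparing characters
    assert(!sameText(first, source[0 .. 2])); // decided by the length check
}
```

The pointer shortcut only pays off when equal strings are likely to share storage, which is the case for interned identifiers and much less so for string literals.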
August 04, 2012
On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:
> The tokens are not kept, correct. But the identifier strings, and the string literals, are kept, and if they are slices into the input buffer, then everything I said applies.

String literals often _can't_ be slices unless you leave them in their original state rather than giving the version that they translate to (e.g. leaving \&copy; in the string rather than replacing it with its actual Unicode value). And since you're not going to be able to create the literal using whatever type the range is unless it's a string of some variety, that means that the literals often can't be slices, which - depending on the implementation - would make it so that they can't _ever_ be slices.

Identifiers are a different story, since they don't have to be translated at all, but regardless of whether keeping a slice would be better than creating a new string, the identifier table will be far superior, since then you only need one copy of each identifier. So, it ultimately doesn't make sense to use slices in either case even without considering issues like them being spread across memory.
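
As a rough illustration of the identifier table idea (a sketch under my own assumptions, not how DMD implements it): each distinct spelling is copied once and mapped to a small handle, so later comparisons are integer comparisons. `IdentifierTable`, `intern` and `spelling` are hypothetical names.

```d
/// Hypothetical identifier table: one stored copy per distinct identifier.
struct IdentifierTable
{
    private uint[string] handles; // spelling -> handle
    private string[] spellings;   // handle -> spelling

    /// Returns the handle for `text`, copying it only the first time it is seen.
    uint intern(const(char)[] text)
    {
        if (auto existing = text in handles)
            return *existing;
        string owned = text.idup; // the single copy, independent of the input buffer
        auto handle = cast(uint) spellings.length;
        spellings ~= owned;
        handles[owned] = handle;
        return handle;
    }

    string spelling(uint handle) const
    {
        return spellings[handle];
    }
}

unittest
{
    IdentifierTable table;
    char[] buffer = "count + count".dup; // stands in for a reusable input buffer
    auto a = table.intern(buffer[0 .. 5]);
    auto b = table.intern(buffer[8 .. 13]);
    assert(a == b);                       // same identifier, same handle
    buffer[] = ' ';                       // overwriting the buffer is harmless
    assert(table.spelling(a) == "count");
}
```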

The only place that I'd expect a slice in a token is in the string which represents the text which was lexed, and that won't normally be kept around.

- Jonathan M Davis
August 04, 2012
On 03/08/2012 21:59, Walter Bright wrote:
> On 8/3/2012 6:18 AM, deadalnix wrote:
>> The lexer can have a parameter that tells whether it should build a
>> table of tokens or slice the input. The second is important, for
>> instance for an IDE: lexing will occur often, and you prefer slicing
>> there because you already have the source file in memory anyway.
>
> A string may span multiple lines - IDEs do not store the text as one
> string.
>
>> If the lexer allocates chunks, it will reuse the same memory location
>> for the same string. Given the following mechanism for comparing
>> slices, this will require 2 comparisons for an identifier lexed with
>> that method:
>>
>> if(a.length != b.length) return false;
>> if(a.ptr == b.ptr) return true;
>> // Regular char by char comparison.
>>
>> Is that a suitable option?
>
> You're talking about doing for strings what is done for identifiers -
> returning a unique handle for each. I don't think this works very well
> for string literals, as there seem to be few duplicates.

That option has the benefit of allowing very fast identifier comparison (as DMD does) without imposing it. For instance, you could use that trick within a single thread, and use a separate identifier table in another thread.

It completely avoids the multithreading problem you mention, while keeping most identifier comparisons really fast.

It also allows for several allocation schemes for the slices, to fit different needs, as shown by Christophe Travert.
August 04, 2012
Jonathan M Davis, in message (digitalmars.D:174191), wrote:
> On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:
>> The tokens are not kept, correct. But the identifier strings, and the string literals, are kept, and if they are slices into the input buffer, then everything I said applies.
> 
> String literals often _can't_ be slices unless you leave them in their original state rather than giving the version that they translate to (e.g. leaving \&copy; in the string rather than replacing it with its actual Unicode value). And since you're not going to be able to create the literal using whatever type the range is unless it's a string of some variety, that means that the literals often can't be slices, which - depending on the implementation - would make it so that they can't _ever_ be slices.
> 
> Identifiers are a different story, since they don't have to be translated at all, but regardless of whether keeping a slice would be better than creating a new string, the identifier table will be far superior, since then you only need one copy of each identifier. So, it ultimately doesn't make sense to use slices in either case even without considering issues like them being spread across memory.
> 
> The only place that I'd expect a slice in a token is in the string which represents the text which was lexed, and that won't normally be kept around.
> 
> - Jonathan M Davis

I thought it was not the lexer's job to process literals. It should just split the input into tokens and provide minimal info: TokenType, line and col, along with the representation from the input. That's enough for a syntax highlighting tool, for example. Otherwise you'll end up doing complex interpretation and the lexer will not be that efficient. Literal interpretation can be done in a second step. Do you think doing literal interpretation separately, when you need it, would be less efficient?
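
To make the "minimal info" idea concrete, here is a small sketch of such a token and a toy lexer that never decodes anything. Everything in it (`TokenType` members, `Token`, `lex`) is a simplified assumption for illustration; it ignores comments, multi-character operators and literals that span lines.

```d
import std.ascii : isAlpha, isAlphaNum, isDigit, isWhite;

enum TokenType { identifier, number, stringLiteral, symbol }

/// Minimal token: type, position, and the raw text exactly as written.
struct Token
{
    TokenType type;
    uint line;          // 1-based
    uint col;           // 1-based
    const(char)[] text; // slice of the input, escapes left undecoded
}

/// Toy lexer: splits the input into tokens without interpreting literals.
Token[] lex(const(char)[] input)
{
    Token[] tokens;
    uint line = 1, col = 1;
    size_t i = 0;
    while (i < input.length)
    {
        immutable c = input[i];
        if (c == '\n') { ++line; col = 1; ++i; continue; }
        if (isWhite(c)) { ++col; ++i; continue; }

        immutable start = i;
        TokenType type;
        if (isAlpha(c) || c == '_')
        {
            type = TokenType.identifier;
            while (i < input.length && (isAlphaNum(input[i]) || input[i] == '_'))
                ++i;
        }
        else if (isDigit(c))
        {
            type = TokenType.number;
            while (i < input.length && isDigit(input[i]))
                ++i;
        }
        else if (c == '"')
        {
            type = TokenType.stringLiteral;
            ++i; // opening quote
            while (i < input.length && input[i] != '"')
                i += (input[i] == '\\' && i + 1 < input.length) ? 2 : 1; // skip, don't decode
            if (i < input.length)
                ++i; // closing quote
        }
        else
        {
            type = TokenType.symbol;
            ++i;
        }
        tokens ~= Token(type, line, col, input[start .. i]);
        col += cast(uint)(i - start);
    }
    return tokens;
}

unittest
{
    auto tokens = lex("x = \"a\\n\"\ny");
    assert(tokens.length == 4);
    assert(tokens[2].type == TokenType.stringLiteral);
    assert(tokens[2].text == `"a\n"`); // backslash and 'n' kept as written
    assert(tokens[3].line == 2 && tokens[3].col == 1);
}
```

A syntax highlighter could stop here; a compiler would run a separate decoding pass over the `stringLiteral` tokens it actually needs.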

-- 
Christophe
August 04, 2012
On 04-Aug-12 14:02, Christophe Travert wrote:
> Jonathan M Davis, in message (digitalmars.D:174191), wrote:
>> On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:
>>> The tokens are not kept, correct. But the identifier strings, and the string
>>> literals, are kept, and if they are slices into the input buffer, then
>>> everything I said applies.
>>
>> String literals often _can't_ be slices unless you leave them in their
>> original state rather than giving the version that they translate to (e.g.
>> leaving \&copy; in the string rather than replacing it with its actual
>> Unicode value). And since you're not going to be able to create the literal
>> using whatever type the range is unless it's a string of some variety, that
>> means that the literals often can't be slices, which - depending on the
>> implementation - would make it so that they can't _ever_ be slices.
>>
>> Identifiers are a different story, since they don't have to be translated at
>> all, but regardless of whether keeping a slice would be better than creating a
>> new string, the identifier table will be far superior, since then you only need
>> one copy of each identifier. So, it ultimately doesn't make sense to use slices
>> in either case even without considering issues like them being spread across
>> memory.
>>
>> The only place that I'd expect a slice in a token is in the string which
>> represents the text which was lexed, and that won't normally be kept around.
>>
>> - Jonathan M Davis
>
> I thought it was not the lexer's job to process literals. It should just
> split the input into tokens and provide minimal info: TokenType, line and
> col, along with the representation from the input. That's enough for a
> syntax highlighting tool, for example. Otherwise you'll end up doing complex
> interpretation and the lexer will not be that efficient. Literal
> interpretation can be done in a second step. Do you think doing literal
> interpretation separately, when you need it, would be less efficient?
>
Most likely - since you re-read the same memory twice to do it.

-- 
Dmitry Olshansky
August 04, 2012
Dmitry Olshansky, in message (digitalmars.D:174214), wrote:
> Most likely - since you re-read the same memory twice to do it.

You're probably right, but if you do this right after the token is generated, the memory should still be near the processor. And the operation on the first read should be very basic: just check that nothing illegal appears, and check for the end of the token. The cost is not negligible, but what you do with literal tokens can vary a lot, and what the lexer offers may not be what the user wants, so the user may pay the cost of literal decoding (including allocation of the decoded string, copying of the characters, etc.) that he doesn't want, or will have to redo it his own way...

-- 
Christophe
August 04, 2012
On 04-Aug-12 14:55, Christophe Travert wrote:
> Dmitry Olshansky, in message (digitalmars.D:174214), wrote:
>> Most likely - since you re-read the same memory twice to do it.
>
> You're probably right, but if you do this right after the token is
> generated, the memory should still be near the processor. And the
> operation on the first read should be very basic: just check nothing
> illegal appears, and check for the end of the token.

q{ .. }
"\x13\x27 ...\u1212"

In most cases it takes about the same time to check correctness and output the decoded value as it does to simply pass over it. (See also re-decoding Unicode in identifiers, though it's rare to see Unicode characters in an identifier.)


> The cost is not
> negligible, but what you do with literal tokens can vary a lot, and what
> the lexer offers may not be what the user wants, so the user may
> pay the cost of literal decoding (including allocation of the
> decoded string, copying of the characters, etc.) that he doesn't want,
> or will have to redo it his own way...
>
I see it as a compile-time policy that will fit nicely and solve both issues. Just provide a template with a few hooks, and add a Noop policy that does nothing.
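
Roughly what I have in mind, as a sketch only. The hook name `onStringLiteral`, the two policy structs and the `Lexer` template are all invented here for illustration, and the decoding policy handles just a couple of escapes.

```d
/// Noop policy: the literal is handed back as the raw slice, nothing is decoded.
struct NoopLiteralPolicy
{
    alias Value = const(char)[];

    static Value onStringLiteral(const(char)[] raw)
    {
        return raw;
    }
}

/// Decoding policy: builds a new string with a few escapes translated.
/// Assumes `raw` includes the surrounding double quotes.
struct DecodingLiteralPolicy
{
    alias Value = string;

    static Value onStringLiteral(const(char)[] raw)
    {
        char[] result;
        auto inner = raw[1 .. $ - 1]; // strip the quotes
        for (size_t i = 0; i < inner.length; ++i)
        {
            char c = inner[i];
            if (c == '\\' && i + 1 < inner.length)
            {
                ++i;
                switch (inner[i])
                {
                    case 'n': c = '\n'; break;
                    case 't': c = '\t'; break;
                    default:  c = inner[i]; break; // \\ and \" map to themselves
                }
            }
            result ~= c;
        }
        return result.idup;
    }
}

/// Hypothetical lexer skeleton: the policy is chosen at compile time.
struct Lexer(LiteralPolicy = NoopLiteralPolicy)
{
    /// Called with the raw text of a string literal token.
    LiteralPolicy.Value literal(const(char)[] raw)
    {
        return LiteralPolicy.onStringLiteral(raw);
    }
}

unittest
{
    auto raw = `"a\tb"`; // backslash and 't' as written in the source
    Lexer!() plain;
    Lexer!DecodingLiteralPolicy decoding;

    assert(plain.literal(raw).ptr == raw.ptr); // same memory, no allocation
    assert(decoding.literal(raw) == "a\tb");   // escape decoded into a new string
}
```

With the Noop policy the hook should cost nothing after inlining, so a highlighter does not pay for decoding it never uses.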

-- 
Dmitry Olshansky
August 04, 2012
On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
> I see it as a compile-time policy that will fit nicely and solve both issues. Just provide a template with a few hooks, and add a Noop policy that does nothing.

It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer...

- Jonathan M Davis