August 02, 2012
On Wednesday, August 01, 2012 22:09:05 Walter Bright wrote:
> On 8/1/2012 9:54 PM, Jonathan M Davis wrote:
> > Then just pass the same identifier table to the function which creates the token range. That doesn't require another type.
> 
> You're still going to require another type, otherwise you'll have to duplicate the state in every token allocation, with resultant heavy memory and initialization costs.

Why would you be duplicating any state in the token? The range would have the state, not the token. The token would only have the data specific to the token, not the range as a whole.

- Jonathan M Davis
August 02, 2012
On 8/1/2012 10:34 PM, Jonathan M Davis wrote:
> On Wednesday, August 01, 2012 22:09:05 Walter Bright wrote:
>> On 8/1/2012 9:54 PM, Jonathan M Davis wrote:
>>> Then just pass the same identifier table to the function which creates the
>>> token range. That doesn't require another type.
>>
>> You're still going to require another type, otherwise you'll have to
>> duplicate the state in every token allocation, with resultant heavy memory
>> and initialization costs.
>
> Why would you be duplicating any state in the token? The range would have the
> state, not the token. The token would only have the data specific to the token,
> not the range as a whole.

I meant two types: the Lexer and the Token. The Lexer can present a range interface.
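
For concreteness, a minimal sketch of that two-type split. The names, fields, and toy scanning logic here are illustrative only, not dmd's actual code:

enum TokenType { identifier, number, unknown, eof }

struct Token
{
    TokenType type;
    string    text;   // a slice of the original source, not a copy
}

struct Lexer
{
    private string src;   // remaining input
    private Token  tok;   // current token
    // Per-lexer shared state (such as an identifier table) would live
    // here, paid for once, rather than duplicated into every Token.

    this(string source) { src = source; popFront(); }

    // The input-range interface over tokens:
    @property bool empty() { return tok.type == TokenType.eof; }
    @property Token front() { return tok; }

    void popFront()
    {
        import std.ascii : isAlpha, isAlphaNum, isDigit, isWhite;

        while (src.length && isWhite(src[0]))
            src = src[1 .. $];
        if (!src.length)
        {
            tok = Token(TokenType.eof, null);
            return;
        }

        size_t i = 1;
        if (isAlpha(src[0]))
        {
            while (i < src.length && isAlphaNum(src[i])) ++i;
            tok = Token(TokenType.identifier, src[0 .. i]);
        }
        else if (isDigit(src[0]))
        {
            while (i < src.length && isDigit(src[i])) ++i;
            tok = Token(TokenType.number, src[0 .. i]);
        }
        else
        {
            // A real lexer would handle operators, literals, etc. here.
            tok = Token(TokenType.unknown, src[0 .. 1]);
        }
        src = src[i .. $];
    }
}

The point is that the shared state lives in the Lexer, once, while each Token stays a small value type.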


August 02, 2012
On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:
> The lexer must use char or it will not be acceptable as anything but a toy for performance reasons.

Avoiding decoding can be done with strings even while operating on ranges of dchar, so you'd be operating almost entirely on ASCII. Are you saying that there's a performance issue aside from decoding?

> Somebody has to convert the input files into dchars, and then back into chars. That blows for performance. Think billions and billions of characters going through, not just a few random strings.

Why is there any converting to dchar going on here? I don't see why any would be necessary. If you read in a file as a string or char[] (as would be typical), then you're operating on a string, and then the only time that any decoding will be necessary is when you actually need to operate on a unicode character, which is very rare in D's grammar. It's only when operating on something _other_ than a string that you'd have to actually deal with dchars.
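
For illustration, here's a sketch of that decode-only-when-needed idea, assuming the source is held as a string. frontChar is an invented helper, not a Phobos API:

import std.utf : decode;

// Invented helper: return the code point at s[i] and advance i,
// decoding only when a non-ASCII lead byte actually appears.
dchar frontChar(string s, ref size_t i)
{
    immutable char c = s[i];
    if (c < 0x80)        // ASCII fast path: the common case in D source
    {
        ++i;
        return c;
    }
    return decode(s, i); // slow path: a multi-byte UTF-8 sequence
}

void main()
{
    auto src = "if (é) {}";
    size_t i;
    while (i < src.length)
        frontChar(src, i);   // touches each char once, decodes only the 'é'
}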

> > Hmmm. Well, I'd still argue that that's a parser thing. Pretty much
> > nothing
> > else will care about it. At most, it should be an optional feature of the
> > lexer. But it certainly could be added that way.
> 
> I hate to say "trust me on this", but if you don't, have a look at dmd's lexer and how it handles identifiers, then look at dmd's symbol table.

My point is that it's the sort of thing that _only_ a parser would care about. So, unless it _needs_ to be in the lexer for some reason, it shouldn't be.

- Jonathan M Davis
August 02, 2012
On Wednesday, August 01, 2012 22:41:01 Walter Bright wrote:
> On 8/1/2012 10:34 PM, Jonathan M Davis wrote:
> > On Wednesday, August 01, 2012 22:09:05 Walter Bright wrote:
> >> On 8/1/2012 9:54 PM, Jonathan M Davis wrote:
> >>> Then just pass the same identifier table to the function which creates
> >>> the
> >>> token range. That doesn't require another type.
> >> 
> >> You're still going to require another type, otherwise you'll have to
> >> duplicate the state in every token allocation, with resultant heavy
> >> memory
> >> and initialization costs.
> > 
> > Why would you be duplicating any state in the token? The range would have the state, not the token. The token would only have the data specific to the token, not the range as a whole.
> 
> I meant two types: the Lexer and the Token. The Lexer can present a range interface.

What I was expecting there to be was a type which was a range of tokens. You passed the source string to a function which returned that range, and you iterated over it to process each token. What you appear to have been arguing for is another type from which you get the range and which holds additional state (e.g. the identifier table). So, are you saying that you only meant that the token type needs to be separate from the range type (which I would have thought would be the only way to do it), or are you indeed saying that there should be another type that the range comes from and which holds additional state?
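
To be concrete, the usage I was expecting looks like this (reusing the toy Lexer sketched earlier in the thread):

void main()
{
    import std.stdio : writeln;

    // A function/constructor hands back the range, and iteration
    // drives the lexing lazily, one token at a time.
    auto tokens = Lexer("int x = 42;");
    foreach (tok; tokens)
        writeln(tok.type, " '", tok.text, "'");
}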

- Jonathan M Davis
August 02, 2012
On 8/1/2012 10:53 PM, Jonathan M Davis wrote:
> What I was expecting there to be was a type which was a range of tokens.
> You passed the source string to a function which returned that range, and
> you iterated over it to process each token. What you appear to have been
> arguing for is another type from which you get the range and which holds
> additional state (e.g. the identifier table). So, are you saying that you
> only meant that the token type needs to be separate from the range type
> (which I would have thought would be the only way to do it), or are you
> indeed saying that there should be another type that the range comes from
> and which holds additional state?

You could do it either way.

August 02, 2012
On 8/1/2012 10:44 PM, Jonathan M Davis wrote:
> On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:
>> The lexer must use char or it will not be acceptable as anything but a toy
>> for performance reasons.
>
> Avoiding decoding can be done with strings even while operating on ranges
> of dchar, so you'd be operating almost entirely on ASCII. Are you saying
> that there's a performance issue aside from decoding?

1. Decoding it into a dchar is a performance problem. Source code sits in files that are nearly always in UTF8, so your input range MUST check every single char and convert it to UTF32 as necessary. Plus, it puts an additional step between the file input buffer and the lexer, instead of sticking the buffer directly into the lexer's input.

2. Storing strings as dchars is a performance and memory problem (4x as much data and hence time).

Remember, nearly ALL of D source will be ASCII. All performance considerations must be tilted towards the usual case.
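
To put a rough number on point 2, a toy illustration assuming pure-ASCII input:

import std.conv : to;

void main()
{
    string  s8  = "import std.stdio;";   // 17 code units, 1 byte each
    dstring s32 = s8.to!dstring;         // the same 17 code points, 4 bytes each

    assert(s8.length  * char.sizeof  == 17);   // 17 bytes as UTF-8
    assert(s32.length * dchar.sizeof == 68);   // 68 bytes as UTF-32: 4x
}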


>> Somebody has to convert the input files into dchars, and then back into
>> chars. That blows for performance. Think billions and billions of
>> characters going through, not just a few random strings.
>
> Why is there any converting to dchar going on here?

Because your input range is a range of dchar?


> I don't see why any would
> be necessary. If you read in a file as a string or char[] (as would be
> typical), then you're operating on a string, and then the only time that any
> decoding will be necessary is when you actually need to operate on a unicode
> character, which is very rare in D's grammar. It's only when operating on
> something _other_ than a string that you'd have to actually deal with dchars.

That's what I've been saying. So why have an input range of dchars, which must be decoded in advance, otherwise it wouldn't be a range of dchars?


>>> Hmmm. Well, I'd still argue that that's a parser thing. Pretty much
>>> nothing
>>> else will care about it. At most, it should be an optional feature of the
>>> lexer. But it certainly could be added that way.
>>
>> I hate to say "trust me on this", but if you don't, have a look at dmd's
>> lexer and how it handles identifiers, then look at dmd's symbol table.
>
> My point is that it's the sort of thing that _only_ a parser would care about.
> So, unless it _needs_ to be in the lexer for some reason, it shouldn't be.

I think you are misunderstanding. The lexer doesn't have a *symbol* table in it. It has a mapping from identifiers to unique handles. It needs to be there, otherwise the semantic analysis has to scan identifier strings a second time.
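
A sketch of what such an identifier-to-handle mapping looks like. The names here are invented for illustration, not dmd's:

struct Id { size_t handle; }   // unique, cheap-to-compare identifier handle

struct IdentifierTable
{
    private Id[string] table;

    // Return the existing handle for name, or assign a fresh one.
    Id intern(string name)
    {
        if (auto p = name in table)
            return *p;
        auto id = Id(table.length);
        table[name] = id;
        return id;
    }
}

unittest
{
    IdentifierTable t;
    assert(t.intern("foo") == t.intern("foo"));  // same string, same handle
    assert(t.intern("foo") != t.intern("bar"));
    // Later compiler passes compare handles; the identifier's characters
    // are never scanned a second time.
}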

August 02, 2012
On 2012-08-02 07:44, Jonathan M Davis wrote:
> On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:
>> The lexer must use char or it will not be acceptable as anything but a toy
>> for performance reasons.
>
> Avoiding decoding can be done with strings even while operating on ranges
> of dchar, so you'd be operating almost entirely on ASCII. Are you saying
> that there's a performance issue aside from decoding?
>
>> Somebody has to convert the input files into dchars, and then back into
>> chars. That blows for performance. Think billions and billions of
>> characters going through, not just a few random strings.
>
> Why is there any converting to dchar going on here? I don't see why any would
> be necessary. If you read in a file as a string or char[] (as would be
> typical), then you're operating on a string, and then the only time that any
> decoding will be necessary is when you actually need to operate on a unicode
> character, which is very rare in D's grammar. It's only when operating on
> something _other_ than a string that you'd have to actually deal with dchars.

I think you two are saying the same things.

-- 
/Jacob Carlborg
August 02, 2012
On 2012-08-02 02:10, Walter Bright wrote:

> 1. It should accept as input an input range of UTF8. I feel it is a
> mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
> UTF16 should use an 'adapter' range to convert the input to UTF8. (This
> is what component programming is all about.)

I'm no expert on ranges, but won't that prevent slicing? Slicing is one of the main reasons why the Tango XML parser is so amazingly fast.
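
For reference, this is what slicing buys: a token's text can be a view into the original buffer, with no per-token allocation or copying. A small sketch:

void main()
{
    string source = "foo = bar;";
    string tokenText = source[0 .. 3];   // "foo": a view, not a copy

    assert(tokenText == "foo");
    assert(tokenText.ptr == source.ptr); // same memory, zero allocation
    // A plain input range of UTF8 can't offer this; arrays (and ranges
    // that explicitly support slicing) can.
}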

-- 
/Jacob Carlborg
August 02, 2012
On Wednesday, August 01, 2012 21:52:15 Jonathan M Davis wrote:
> And as much as there are potential performance issues with Phobos' choice of treating strings as ranges of dchar, if it were to treat them as ranges of code units instead, it's pretty much a guarantee that there would be a _ton_ of bugs caused by it. Most programmers have absolutely no clue about how unicode works and don't really want to know. They want strings to just work. Phobos' approach of defaulting to correct but making it possible to make the code faster through specializations is very much in line with D's typical approach of making things safe by default but allowing the programmer to do unsafe things for optimizations when they know what they're doing.

Another thing that I should point out is that a range of UTF-8 or UTF-16 wouldn't work with many range-based functions at all. Most of std.algorithm and its ilk would be completely useless. Range-based functions operate on a range's elements, so operating on a range of code units would mean operating on code units, which is going to be _wrong_ almost all the time. Think about what would happen if you used a function like map or filter on a range of code units. The resultant range would be completely invalid as far as unicode goes. Range-based functions need to be operating on _characters_. Technically, not even code points get us there, so it's _still_ buggy. It's just a _lot_ closer to being correct and works 99.99+% of the time.

If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to add a concept of variably-length encoded ranges so that it's possible to treat them as both their encoding and whatever they represent (e.g. code point or grapheme in the case of ranges of code units).
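
A small demonstration of the corruption: filtering raw UTF-8 code units so that a multi-byte sequence gets split (the byte values are the actual UTF-8 encoding of 'é'):

import std.algorithm : filter;
import std.array : array;
import std.exception : assertThrown;
import std.utf : validate;

void main()
{
    // "café" is five UTF-8 code units: 'c', 'a', 'f', 0xC3, 0xA9.
    auto units = cast(immutable(ubyte)[]) "café";

    // Filtering *code units* can split a multi-byte sequence: dropping
    // the 0xC3 lead byte leaves an orphaned 0xA9 continuation byte.
    auto mangled = cast(string) units.filter!(b => b != 0xC3).array;

    assertThrown(validate(mangled));   // no longer valid UTF-8
}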

- Jonathan M Davis
August 02, 2012
On 2012-08-02 08:39, Walter Bright wrote:

> That's what I've been saying. So why have an input range of dchars,
> which must be decoded in advance, otherwise it wouldn't be a range of
> dchars?

Your first requirement is a bit strange and confusing:

"1. It should accept as input an input range of UTF8."

To my understanding, ranges in Phobos are designed to operate on dchars, not chars, regardless of the original input type. So if you create a range that operates on UTF8, I don't know whether it will be usable with the rest of the Phobos functions, which expect ranges of dchars. That would basically make this type of range useless, since it couldn't use any other functionality in Phobos.

The current state of Phobos: you either have a range of dchars, or you don't have a range at all; you have an array or something else.
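
Concretely, a quick check of what the range primitives report for a string:

import std.range : ElementType, front;

void main()
{
    // Phobos' range primitives auto-decode strings: range-based code
    // sees elements of type dchar...
    static assert(is(ElementType!string == dchar));
    assert("é".front == 'é');                  // one decoded code point

    // ...even though the string physically stores UTF-8 code units:
    static assert(is(typeof("é"[0]) == immutable(char)));
    assert("é".length == 2);                   // two code units
}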

-- 
/Jacob Carlborg