August 02, 2012
On Thursday, August 02, 2012 19:06:32 Andrei Alexandrescu wrote:
> > Sure, you could have a function which specifically operates on ranges of code units and understands how unicode works and is written accordingly, but then that function is specific to ranges of code units and is only generic with regards to various ranges of code units. It can't operate on generic ranges like functions such as map and filter can.
> 
> Yes, and I think that's exactly what the doctor prescribed here.

It may be the best approach for the lexer (though I'm not convinced; I'll have to think about it more), but Walter seems to be arguing that strings should be treated as ranges of code units in general, which I think is completely wrong.

- Jonathan M Davis
August 02, 2012
On 8/2/12 7:18 PM, Jonathan M Davis wrote:
> On Thursday, August 02, 2012 19:06:32 Andrei Alexandrescu wrote:
>>> Sure, you could have a function which specifically operates on ranges of
>>> code units and understands how unicode works and is written accordingly,
>>> but then that function is specific to ranges of code units and is only
>>> generic with regards to various ranges of code units. It can't operate on
>>> generic ranges like functions such as map and filter can.
>>
>> Yes, and I think that's exactly what the doctor prescribed here.
>
> It may be the best approach for the lexer (though I'm not convinced; I'll have
> to think about it more),

Your insights are always appreciated; even their Cliff notes :o).

> but Walter seems to be arguing that strings should be treated as ranges of
> code units in general, which I think is completely wrong.

I think Walter has very often emphasized the need for the lexer to be faster than the usual client software. My perception is that he's discussing lexer design with the understanding that a less comfortable approach is needed, namely doing the decoding in the client.

Andrei
August 02, 2012
On Thursday, August 02, 2012 19:30:47 Andrei Alexandrescu wrote:
> On 8/2/12 7:18 PM, Jonathan M Davis wrote:
> Your insights are always appreciated; even their Cliff notes :o).

LOL. Well, I'm not about to decide on the best approach to this without thinking through it more. What I've been doing manages to deal quite nicely with avoiding unnecessary decoding and still allows for the lexing of ranges of dchar which aren't strings (though there's obviously an efficiency hit there), and it really isn't complicated or messy thanks to some basic mixins that I've been using. Switching to operating specifically on code units and not accepting ranges of dchar at all has some serious ramifications, and I have to think through them all before I take a position on that.

> > but Walter seems to be arguing that strings should be treated as ranges of
> > code units in general, which I think is completely wrong.
> 
> I think Walter has very often emphasized the need for the lexer to be faster than the usual client software. My perception is that he's discussing lexer design with the understanding that a less comfortable approach is needed, namely doing the decoding in the client.

That may be, but if he's arguing that strings should _always_ be treated as ranges of code units - as in all D programs, most of which don't have anything to do with lexers (other than when they're compiled) - then I'm definitely going to object to that, and it's my understanding that that's what he's arguing. But maybe I've misunderstood.

I've been arguing that strings should still be treated as ranges of code points and that this does not preclude making the lexer operate efficiently on code units when it's given strings, even if it nominally operates on ranges of dchar. Whether it's better to have the lexer operate on ranges of dchar but specialize on strings, or to have it operate specifically on ranges of code units, depends on what we want it to be usable with. It should be just as fast with strings in either case, so it becomes a question of how we want to handle ranges which _aren't_ strings.

I suppose that we could make it operate on code units and just let ranges of dchar have UTF-32 as their code unit (since dchar is both a code unit and a code point), then ranges of dchar will still work but ranges of char and wchar will _also_ work. Hmmm. As I said, I'll have to think this through a bit.
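
A minimal, self-contained sketch of the code unit vs. code point distinction in play here (illustrative only; it assumes the example file itself is saved as UTF-8):

void main()
{
    string  s = "é";
    wstring w = "é";
    dstring d = "é";

    assert(s.length == 2); // one code point, but two UTF-8 code units (char)
    assert(w.length == 1); // one UTF-16 code unit (wchar)
    assert(d.length == 1); // one UTF-32 code unit (dchar): code unit == code point
}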

- Jonathan M Davis
August 03, 2012
On Thursday, August 02, 2012 19:52:35 Jonathan M Davis wrote:
> I suppose that we could make it operate on code units and just let ranges of dchar have UTF-32 as their code unit (since dchar is both a code unit and a code point), then ranges of dchar will still work but ranges of char and wchar will _also_ work. Hmmm. As I said, I'll have to think this through a bit.

LOL. It looks like taking this approach results in almost identical code to what I've been doing. The main difference is that if you're dealing with a range other than a string, you need to use decode instead of front, which means that decode is going to need to work with more than just strings (probably stride too). I'll have to create a pull request for that.
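
For reference, a small sketch of the front-vs-decode difference being described, on a plain string (std.utf.decode already works here; extending it to non-string ranges is the part the pull request would add):

import std.array : front;
import std.utf : decode;

void main()
{
    string s = "été";

    // front auto-decodes and yields a dchar code point
    assert(s.front == 'é');

    // decode does the same, but at the code-unit level through an explicit index
    size_t i = 0;
    dchar c = decode(s, i);
    assert(c == 'é' && i == 2); // 'é' occupies two UTF-8 code units
}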

But unless you restrict it to strings and ranges of code units which are random access, you still have to worry about stuff like using range[0] vs range.front depending on the type, so my mixin approach is still applicable, and it makes it very easy to switch what I'm doing, since there are very few lines that need to be tweaked.

So, I guess that I'll be taking the approach of taking ranges of char, wchar, and dchar and treating them all as ranges of code units. It'll work with everything that it worked with before but will now also work with ranges of char and wchar. There's still a performance hit if you do something like passing it filter!"true"(source), but there's no way to fix that without disallowing dchar ranges entirely, which would be unnecessarily restrictive.

Once you allow arbitrary ranges of char rather than requiring strings, the extra code required to allow ranges of wchar and dchar is trivial. It's stuff like worrying about range[0] vs range.front which complicates things (even if front is a code unit rather than a code point), and using string mixins makes it so that the code with the logic is just as simple as it would be with strings. So, I think that I can continue almost exactly as I have been and still achieve what Walter wants.

The main issue that I have (beyond finishing what I haven't gotten to yet) is changing how I handle errors and comments, since I currently have them as tokens, but that shouldn't be terribly hard to fix.
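
Something along these lines is the kind of string-mixin approach being described (a rough sketch only; the helper name and the sample function are made up for illustration, not actual lexer code):

import std.array, std.range, std.traits;

// Pick the expression used to peek at the next code unit: raw indexing for
// narrow strings (avoiding auto-decoding), front for other ranges of code units.
template peekCodeUnit(R)
{
    static if (isNarrowString!R)
        enum peekCodeUnit = "range[0]";
    else
        enum peekCodeUnit = "range.front";
}

// The lexing logic itself is then written once against the mixed-in expression.
bool startsWithSlash(R)(R range)
    if (isInputRange!R && isSomeChar!(Unqual!(ElementEncodingType!R)))
{
    return !range.empty && mixin(peekCodeUnit!R) == '/';
}

unittest
{
    assert( startsWithSlash("// a comment"));
    assert(!startsWithSlash("é"d));
}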

- Jonathan M Davis
August 03, 2012
On 8/2/2012 3:38 PM, Jonathan M Davis wrote:
> On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
>> Remember, it's the consumer doing the decoding, not the input range.
>
> But that's the problem. The consumer has to treat character ranges specially
> to make this work. It's not generic. If it were generic, then it would simply
> be using front, popFront, etc. It's going to have to special case strings to
> do the buffering that you're suggesting. And if you have to special case
> strings, then how is that any different from what we have now?

No, the consumer can do its own buffering. It only needs a 4-character buffer, worst case.
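
A rough sketch of what that consumer-side buffering might look like (the function name is made up, and error handling for truncated or invalid sequences is elided):

import std.range, std.traits, std.utf;

// Pull at most four UTF-8 code units from a non-string input range into a
// small stack buffer, then decode one code point from it.
dchar decodeNext(R)(ref R src)
    if (isInputRange!R && is(Unqual!(ElementType!R) == char))
{
    char[4] buf;
    buf[0] = src.front;
    src.popFront();

    // stride inspects the first code unit to tell how many units the sequence has (1-4)
    immutable len = stride(buf[0 .. 1], 0);
    foreach (i; 1 .. len)
    {
        buf[i] = src.front;
        src.popFront();
    }

    size_t index = 0;
    return decode(buf[0 .. len], index);
}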


> If you're arguing that strings should be treated as ranges of code units, then
> pretty much _every_ range-based function will have to special case strings to
> even work correctly - otherwise it'll be operating on individual code units
> rather than code points (e.g. filtering code units rather than code points,
> which would generate an invalid string). This makes the default behavior
> incorrect, forcing _everyone_ to special case strings _everywhere_ if they
> want correct behavior with ranges which are strings. And efficiency means
> nothing if the result is wrong.

No, I'm arguing that the LEXER should accept a UTF8 input range for its input. I am not making a general argument about ranges, characters, or Phobos.
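
Concretely, that means something with roughly this shape for the input constraint (an illustrative sketch only; the name isUtf8Source is made up here and is not the actual std.d.lexer interface):

import std.range, std.traits;

// Accept any input range whose encoding element is a UTF-8 code unit (char).
template isUtf8Source(R)
{
    enum isUtf8Source = isInputRange!R
        && is(Unqual!(ElementEncodingType!R) == char);
}

static assert( isUtf8Source!string);    // a string's code units are char
static assert( isUtf8Source!(char[]));
static assert(!isUtf8Source!(dchar[])); // a range of dchar is not UTF-8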


> As it is now, the default behavior of strings with range-based functions is
> correct but inefficient, so at least we get correct code. And if someone wants
> their string processing to be efficient, then they special case strings and do
> things like the buffering that you're suggesting. So, we have correct by
> default with efficiency as an option. The alternative that you seem to be
> suggesting (making strings be treated as ranges of code units) means that it
> would be fast by default but correct as an option, which is completely
> backwards IMHO. Efficiency is important, but it's pointless how efficient
> something is if it's wrong, and expecting that your average programmer is
> going to write unicode-aware code which functions correctly is completely
> unrealistic.

Efficiency for the *lexer* is of *paramount* importance. I don't anticipate that std.d.lexer will be implemented by some random newbie; I expect it to be carefully implemented and to do Unicode correctly, regardless of how difficult or easy that may be.

I seem to utterly fail at making this point.

The same point applies to std.regex - efficiency is terribly, terribly important for it. Everyone judges regexes by their speed, and nobody cares how hard they are to implement to get that speed.

To reiterate another point, since we are in the compiler business, people will expect std.d.lexer to be of top quality, not some bag on the side. It needs to be usable as a base for writing a professional quality compiler. It's the reason why I'm pushing much harder on this than I do for other modules.
August 03, 2012
On 8/2/2012 4:30 PM, Andrei Alexandrescu wrote:
> I think Walter has very often emphasized the need for the lexer to be faster
> than the usual client software. My perception is that he's discussing lexer
> design with the understanding that a less comfortable approach is needed,
> namely doing the decoding in the client.

+1


August 03, 2012
On Thursday, August 02, 2012 19:40:13 Walter Bright wrote:
> No, I'm arguing that the LEXER should accept a UTF8 input range for its input. I am not making a general argument about ranges, characters, or Phobos.

I think that this is the main point of misunderstanding then. From your comments, you seemed to me to be arguing for strings being treated as ranges of code units all the time, which really doesn't make sense.

- Jonathan M Davis
August 03, 2012
On 8/2/12 10:40 PM, Walter Bright wrote:
> To reiterate another point, since we are in the compiler business,
> people will expect std.d.lexer to be of top quality, not some bag on the
> side. It needs to be usable as a base for writing a professional quality
> compiler. It's the reason why I'm pushing much harder on this than I do
> for other modules.

The lexer must be configurable enough to tokenize languages other than D. I confess I'm very unhappy that there seem to be no fewer than three people determined to write lexers for D. We're wasting precious talent and resources doubly. First, several people are working in parallel on the same product. Second, none of them is actually solving the problem that should be solved.

Andrei
August 03, 2012
On Thursday, August 02, 2012 23:00:41 Andrei Alexandrescu wrote:
> On 8/2/12 10:40 PM, Walter Bright wrote:
> > To reiterate another point, since we are in the compiler business, people will expect std.d.lexer to be of top quality, not some bag on the side. It needs to be usable as a base for writing a professional quality compiler. It's the reason why I'm pushing much harder on this than I do for other modules.
> 
> The lexer must be configurable enough to tokenize languages other than D. I confess I'm very unhappy that there seem to be no fewer than three people determined to write lexers for D. We're wasting precious talent and resources doubly. First, several people are working in parallel on the same product. Second, none of them is actually solving the problem that should be solved.

You're not going to get as fast a lexer if it's not written specifically for D. Writing a generic lexer is a different problem. It's also one that needs to be solved, but I think that it's a mistake to assume that a generic lexer will be as fast as one specifically optimized for D. And there are several people already working on the generic stuff (like Phillipe).

- Jonathan M Davis
August 03, 2012
On Friday, 3 August 2012 at 03:00:42 UTC, Andrei Alexandrescu wrote:
> The lexer must be configurable enough to tokenize languages other than D.

You're going to have to defend that one.