August 02, 2012
Re: std.d.lexer requirements
On Thursday, August 02, 2012 19:06:32 Andrei Alexandrescu wrote:
> > Sure, you could have a function which specifically operates on ranges of
> > code units and understands how unicode works and is written accordingly,
> > but then that function is specific to ranges of code units and is only
> > generic with regards to various ranges of code units. It can't operate on
> > generic ranges like functions such as map and filter can.
> 
> Yes, and I think that's exactly what the doctor prescribed here.

It may be the best approach for the lexer (though I'm not convinced; I'll have 
to think about it more), but Walter seems to be arguing that strings 
should be treated as ranges of code units in general, which I think is 
completely wrong.

- Jonathan M Davis
August 02, 2012
Re: std.d.lexer requirements
On 8/2/12 7:18 PM, Jonathan M Davis wrote:
> On Thursday, August 02, 2012 19:06:32 Andrei Alexandrescu wrote:
>>> Sure, you could have a function which specifically operates on ranges of
>>> code units and understands how unicode works and is written accordingly,
>>> but then that function is specific to ranges of code units and is only
>>> generic with regards to various ranges of code units. It can't operate on
>>> generic ranges like functions such as map and filter can.
>>
>> Yes, and I think that's exactly what the doctor prescribed here.
>
> It may be the best approach for the lexer (though I'm not convinced; I'll have
> to think about it more),

Your insights are always appreciated; even their Cliff notes :o).

> but Walter seems to be arguing that strings
> should be treated as ranges of code units in general, which I think is
> completely wrong.

I think Walter has very often emphasized the need for the lexer to be 
faster than the usual client software. My perception is that he's 
discussing lexer design with the understanding that a less comfortable 
approach is needed, namely doing the decoding in the client.

Andrei
August 02, 2012
Re: std.d.lexer requirements
On Thursday, August 02, 2012 19:30:47 Andrei Alexandrescu wrote:
> On 8/2/12 7:18 PM, Jonathan M Davis wrote:
> Your insights are always appreciated; even their Cliff notes :o).

LOL. Well, I'm not about to decide on the best approach to this without 
thinking through it more. What I've been doing manages to deal quite nicely 
with avoiding unnecessary decoding and still allows for the lexing of ranges 
of dchar which aren't strings (though there's obviously an efficiency hit 
there), and it really isn't complicated or messy thanks to some basic mixins 
that I've been using. Switching to operating specifically on code units and not 
accepting ranges of dchar at all has some serious ramifications, and I have to 
think through them all before I take a position on that.

> > but Walter seems to be arguing that strings
> > should be treated as ranges of code units in general, which I think is
> > completely wrong.
> 
> I think Walter has very often emphasized the need for the lexer to be
> faster than the usual client software. My perception is that he's
> discussing lexer design with the understanding that a less comfortable
> approach is needed, namely doing the decoding in the client.

That may be, but if he's arguing that strings should _always_ be treated as 
ranges of code units - as in all D programs, most of which don't have anything 
to do with lexers (other than when they're compiled) - then I'm definitely 
going to object to that, and it's my understanding that that's what he's 
arguing. But maybe I've misunderstood.

I've been arguing that strings should still be treated as ranges of code 
points and that that does not preclude making the lexer efficiently operate on 
code units when operating on strings even if it operates on ranges of dchar. I 
think that whether making the lexer operate on ranges of dchar but specialize 
on strings is a better approach or making it operate specifically on ranges of 
code units is a better approach depends on what we want it to be usable with. 
It should be just as fast with strings in either case, so it becomes a 
question of how we want to handle ranges which _aren't_ strings.
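Roughly, the specialization I'm describing looks like this (a sketch only; `currentChar` is a made-up name, not anything in the actual lexer):

```d
import std.range.primitives : front;
import std.traits : isSomeString;
import std.utf : decode;

// Sketch (illustrative name): the lexer logic asks for the current code
// point; strings get walked by code unit and decoded in place, while a
// generic range of dchar just hands back its front.
dchar currentChar(R)(R range, ref size_t index)
{
    static if (isSomeString!R)
    {
        // Strings: index directly into the code units; decode advances
        // index past however many units the code point occupies.
        return decode(range, index);
    }
    else
    {
        // Generic dchar range: front is already a full code point.
        ++index;
        return range.front;
    }
}
```

The point is that the surrounding lexing logic is written once; only this one access point differs per range type.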

I suppose that we could make it operate on code units and just let ranges of 
dchar have UTF-32 as their code unit (since dchar is both a code unit and a 
code point), then ranges of dchar will still work but ranges of char and wchar 
will _also_ work. Hmmm. As I said, I'll have to think this through a bit.
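As a sketch of that constraint (`isCodeUnitRange` is a name I'm inventing here, purely to show the shape):

```d
import std.range.primitives : ElementEncodingType, isInputRange;
import std.traits : Unqual;

// Illustrative only: accept any input range whose encoding element is a
// UTF code unit -- char (UTF-8), wchar (UTF-16), or dchar (UTF-32, where
// every code unit is also a code point).
enum bool isCodeUnitRange(R) =
    isInputRange!R &&
    (is(Unqual!(ElementEncodingType!R) == char) ||
     is(Unqual!(ElementEncodingType!R) == wchar) ||
     is(Unqual!(ElementEncodingType!R) == dchar));

static assert(isCodeUnitRange!string);    // ElementEncodingType!string is char
static assert(isCodeUnitRange!(dchar[])); // dchar ranges still accepted
static assert(!isCodeUnitRange!(int[]));
```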

- Jonathan M Davis
August 03, 2012
Re: std.d.lexer requirements
On Thursday, August 02, 2012 19:52:35 Jonathan M Davis wrote:
> I suppose that we could make it operate on code units and just let ranges of
> dchar have UTF-32 as their code unit (since dchar is both a code unit and a
> code point), then ranges of dchar will still work but ranges of char and
> wchar will _also_ work. Hmmm. As I said, I'll have to think this through a
> bit.

LOL. It looks like taking this approach results in almost identical code to 
what I've been doing. The main difference is that if you're dealing with a 
range other than a string, you need to use decode instead of front, which 
means that decode is going to need to work with more than just strings 
(probably stride too). I'll have to create a pull request for that.

But unless you restrict it to strings and ranges of code units which are 
random access, you still have to worry about stuff like using range[0] vs 
range.front depending on the type, so my mixin approach is still applicable, 
and it makes it very easy to switch what I'm doing, since there are very few 
lines that need to be tweaked.
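A stripped-down illustration of the mixin trick (names invented for the example, not the real lexer code):

```d
import std.range.primitives : empty, front, popFront;
import std.traits : isSomeString;

// Illustrative only: pick the right expression per range type once, then
// write the lexing logic a single time against the mixed-in strings.
enum string firstUnit(R) = isSomeString!R ? "range[0]" : "range.front";
enum string skipUnit(R)  = isSomeString!R ? "range = range[1 .. $];"
                                          : "range.popFront();";

void skipWhitespace(R)(ref R range)
{
    while (!range.empty && (mixin(firstUnit!R) == ' '
                            || mixin(firstUnit!R) == '\t'))
        mixin(skipUnit!R);
}
```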

So, I guess that I'll take the approach of accepting ranges of char, wchar, 
and dchar and treating them all as ranges of code units. So, it'll work with 
everything that it worked with before but will now also work with ranges of 
char and wchar. There's still a performance hit if you do something like 
passing it filter!"true"(source), but there's no way to fix that without 
disallowing dchar ranges entirely, which would be unnecessarily restrictive. 
Once you allow arbitrary ranges of char rather than requiring strings, the 
extra code required to allow ranges of wchar and dchar is trivial. It's stuff 
like worrying about range[0] vs range.front which complicates things (even if 
front is a code unit rather than a code point), and using string mixins makes 
it so that the code with the logic is just as simple as it would be with 
strings. So, I think that I can continue almost exactly as I have been and 
still achieve what Walter wants. The main issue that I have (beyond finishing 
what I haven't gotten to yet) is changing how I handle errors and comments, 
since I currently have them as tokens, but that shouldn't be terribly hard to 
fix.

- Jonathan M Davis
August 03, 2012
Re: std.d.lexer requirements
On 8/2/2012 3:38 PM, Jonathan M Davis wrote:
> On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
>> Remember, it's the consumer doing the decoding, not the input range.
>
> But that's the problem. The consumer has to treat character ranges specially
> to make this work. It's not generic. If it were generic, then it would simply
> be using front, popFront, etc. It's going to have to special case strings to
> do the buffering that you're suggesting. And if you have to special case
> strings, then how is that any different from what we have now?

No, the consumer can do its own buffering. It only needs a 4 character buffer, 
worst case.
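Sketched out, assuming a non-array input range of UTF-8 code units (`nextCodePoint` is an illustrative name, not a proposal for the API):

```d
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;
import std.utf : decode;

// Sketch only: the consumer pulls at most 4 code units -- the longest
// UTF-8 sequence -- into a fixed buffer and decodes from there, so the
// input range itself stays a plain range of code units.
dchar nextCodePoint(R)(ref R input)
    if (isInputRange!R && is(ElementType!R : char))
{
    char[4] buf;
    size_t filled;
    while (filled < buf.length && !input.empty)
    {
        buf[filled++] = input.front;
        input.popFront();
    }
    size_t index;
    immutable dchar c = decode(buf[0 .. filled], index);
    // A real consumer would carry the filled - index leftover units over
    // to the next call instead of refilling from scratch.
    return c;
}
```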


> If you're arguing that strings should be treated as ranges of code units, then
> pretty much _every_ range-based function will have to special case strings to
> even work correctly - otherwise it'll be operating on individual code units
> rather than code points (e.g. filtering code units rather than code points,
> which would generate an invalid string). This makes the default behavior
> incorrect, forcing _everyone_ to special case strings _everywhere_ if they
> want correct behavior with ranges which are strings. And efficiency means
> nothing if the result is wrong.

No, I'm arguing that the LEXER should accept a UTF8 input range for its input. I 
am not making a general argument about ranges, characters, or Phobos.


> As it is now, the default behavior of strings with range-based functions is
> correct but inefficient, so at least we get correct code. And if someone wants
> their string processing to be efficient, then they special case strings and do
> things like the buffering that you're suggesting. So, we have correct by
> default with efficiency as an option. The alternative that you seem to be
> suggesting (making strings be treated as ranges of code units) means that it
> would be fast by default but correct as an option, which is completely
> backwards IMHO. Efficiency is important, but it's pointless how efficient
> something is if it's wrong, and expecting that your average programmer is
> going to write unicode-aware code which functions correctly is completely
> unrealistic.

Efficiency for the *lexer* is of *paramount* importance. I don't anticipate 
std.d.lexer will be implemented by some random newbie, I expect it to be 
carefully implemented and to do Unicode correctly, regardless of how difficult 
or easy that may be.

I seem to utterly fail at making this point.

The same point applies to std.regex - efficiency is terribly, terribly important 
for it. Everyone judges regexes by their speed, and nobody cares how hard they 
are to implement to get that speed.

To reiterate another point, since we are in the compiler business, people will 
expect std.d.lexer to be of top quality, not some bag on the side. It needs to 
be usable as a base for writing a professional quality compiler. It's the reason 
why I'm pushing much harder on this than I do for other modules.
August 03, 2012
Re: std.d.lexer requirements
On 8/2/2012 4:30 PM, Andrei Alexandrescu wrote:
> I think Walter has very often emphasized the need for the lexer to be faster
> than the usual client software. My perception is that he's discussing lexer
> design with the understanding that a less comfortable approach is needed,
> namely doing the decoding in the client.

+1
August 03, 2012
Re: std.d.lexer requirements
On Thursday, August 02, 2012 19:40:13 Walter Bright wrote:
> No, I'm arguing that the LEXER should accept a UTF8 input range for its
> input. I am not making a general argument about ranges, characters, or
> Phobos.

I think that this is the main point of misunderstanding then. From your 
comments, you seemed to me to be arguing for strings being treated as ranges 
of code units all the time, which really doesn't make sense.
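A concrete illustration of why (a toy example, nothing from the lexer itself): filter a string by code unit and you can tear a multi-unit character apart.

```d
import std.algorithm.iteration : filter;
import std.array : array;
import std.string : representation;

void main()
{
    string s = "héllo";  // 'é' is two UTF-8 code units, 0xC3 0xA9

    // Treating the string as a range of code points (today's default)
    // keeps 'é' intact:
    auto points = s.filter!(c => c != 'l').array;  // dchar[] holding "héo"

    // Treating it as raw code units instead can drop half a character:
    auto units = s.representation.filter!(u => u != 0xA9).array;
    // units now contains a lone 0xC3 -- no longer valid UTF-8.
}
```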

- Jonathan M Davis
August 03, 2012
Re: std.d.lexer requirements
On 8/2/12 10:40 PM, Walter Bright wrote:
> To reiterate another point, since we are in the compiler business,
> people will expect std.d.lexer to be of top quality, not some bag on the
> side. It needs to be usable as a base for writing a professional quality
> compiler. It's the reason why I'm pushing much harder on this than I do
> for other modules.

The lexer must be configurable enough to tokenize languages other than 
D. I confess I'm very unhappy that there seem to be no less than three 
people determined to write lexers for D. We're wasting precious talent 
and resources doubly: first, several people are working in parallel on 
the same product; second, none of them is actually solving the problem 
that should be solved.

Andrei
August 03, 2012
Re: std.d.lexer requirements
On Thursday, August 02, 2012 23:00:41 Andrei Alexandrescu wrote:
> On 8/2/12 10:40 PM, Walter Bright wrote:
> > To reiterate another point, since we are in the compiler business,
> > people will expect std.d.lexer to be of top quality, not some bag on the
> > side. It needs to be usable as a base for writing a professional quality
> > compiler. It's the reason why I'm pushing much harder on this than I do
> > for other modules.
> 
> The lexer must be configurable enough to tokenize languages other than
> D. I confess I'm very unhappy that there seem to be no less than three
> people determined to write lexers for D. We're wasting precious talent
> and resources doubly: first, several people are working in parallel on
> the same product; second, none of them is actually solving the problem
> that should be solved.

You're not going to get as fast a lexer if it's not written specifically for D. 
Writing a generic lexer is a different problem. It's also one that needs to be 
solved, but I think it's a mistake to expect a generic lexer to be as fast as 
one specifically optimized for D. And there are several people already working 
on the generic stuff (like Philippe).

- Jonathan M Davis
August 03, 2012
Re: std.d.lexer requirements
On Friday, 3 August 2012 at 03:00:42 UTC, Andrei Alexandrescu 
wrote:
> The lexer must be configurable enough to tokenize 
> languages other than D.

You're going to have to defend that one.