std.d.lexer requirements (page 10) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » std.d.lexer requirements (page 10)

August 02, 2012

Re: std.d.lexer requirements

Posted by José Armando García Sancio

José Armando García Sancio

On Thu, Aug 2, 2012 at 1:26 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
>> On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
>> > On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
>> >> On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
>> >>> It is for ranges in general. In the general case, a range of UTF-8 or
>> >>> UTF-16 makes no sense whatsoever. Having range-based functions which
>> >>> understand the encodings and optimize accordingly can be very beneficial
>> >>> (which happens with strings but can't happen with general ranges without
>> >>> the concept of a variably-length encoded range like we have with forward
>> >>> range or random access range), but to actually have a range of UTF-8 or
>> >>> UTF-16 just wouldn't work. Range-based functions operate on elements,
>> >>> and
>> >>> doing stuff like filter or map or reduce on code units doesn't make any
>> >>> sense at all.
>> >>
>> >> Yes, it can work.
>> >
>> > How?
>>
>> Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.
>
> And how on earth is that going to work as a range? Range-based functions operate on elements. They use empty, front, popFront, etc. If front doesn't return an element that a range-based function can operate on without caring what it is, then that type isn't going to work as a range. If you need the consumer to be doing something special, then that means you need to special case it for that range type. And that's what you're doing when you special- case range-base functions for strings.

A little bit off topic but...

People have been composing/decorating Streams/Ranges for probably 30 years now. Examples: input stream, output stream, byte stream, char stream, buffered stream, cipher stream, base64 stream, etc.

If you need more example. Consider an HTTPS request. At the lowest level you have a byte stream/range. No sane developer wants to deal with HTTPS request at this level so you decorate it with an SSL stream/range. That is still too low level so you decorate this with a char stream/range. Still too low level? Decorate it with a modal line buffered stream/range. We are getting closer but it still not the correct range abstraction so then you need a modal http stream/range. You need the modal part if you want to support http streaming.

August 02, 2012

Re: std.d.lexer requirements

Posted by Walter Bright
in reply to Dmitry Olshansky

Walter Bright

Posted in reply to Dmitry Olshansky

On 8/2/2012 1:20 PM, Dmitry Olshansky wrote:
> +1 Another point is that it's crystal clear *how* to optimize lexer, and gain
> some precious time. It is well trodden path with well known destination.

I'm less sure how well known it is. DMC++ was known for being *several times* faster than other C++ compilers, not just a few percent faster.

But we do have the DMD lexer which is useful as a benchmark and a guide. I won't say it couldn't be made faster, but it does set a minimum bar for performance.

August 02, 2012

Re: std.d.lexer requirements

Posted by Walter Bright
in reply to Jacob Carlborg

Walter Bright

Posted in reply to Jacob Carlborg

On 8/2/2012 1:41 PM, Jacob Carlborg wrote:
> On 2012-08-02 21:35, Walter Bright wrote:
>
>> A good IDE should do its parsing in a separate thread, so the main user
>> input thread remains crisp and responsive.
>>
>> If the user edits the text while the parsing is in progress, the
>> background parsing thread simply abandons the current parse and starts
>> over.
>
> It still needs to update the editor view with the correct syntax highlighting
> which needs to be done in the same thread as the rest of the GUI.
>

The rendering code should be in yet a third thread.

An editor I wrote years ago had the rendering code in a separate thread from user input. You never had to wait to type in commands, the rendering would catch up when it could. What was also effective was the rendering would abandon a render midstream and restart it if it detected that the underlying data had changed in the meantime. This meant that the display was never more than one render out of date.

Although the code itself wasn't any faster, it certainly *felt* faster with this approach. It made for crisp editing even on a pig slow machine.

August 02, 2012

Re: std.d.lexer requirements

Posted by Walter Bright
in reply to Jonathan M Davis

Walter Bright

Posted in reply to Jonathan M Davis

On 8/2/2012 1:26 PM, Jonathan M Davis wrote:
> On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
>> On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
>>> On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
>>>> On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
>>>>> It is for ranges in general. In the general case, a range of UTF-8 or
>>>>> UTF-16 makes no sense whatsoever. Having range-based functions which
>>>>> understand the encodings and optimize accordingly can be very beneficial
>>>>> (which happens with strings but can't happen with general ranges without
>>>>> the concept of a variably-length encoded range like we have with forward
>>>>> range or random access range), but to actually have a range of UTF-8 or
>>>>> UTF-16 just wouldn't work. Range-based functions operate on elements,
>>>>> and
>>>>> doing stuff like filter or map or reduce on code units doesn't make any
>>>>> sense at all.
>>>>
>>>> Yes, it can work.
>>>
>>> How?
>>
>> Keep a 6 character buffer in your consumer. If you read a char with the high
>> bit set, start filling that buffer and then decode it.
>
> And how on earth is that going to work as a range?

1. read a character from the range
2. if the character is the start of a multibyte character, put the character in the buffer
3. keep reading from the range until you've got the whole of the multibyte character
4. convert that 6 (or 4) character buffer into a dchar

Remember, its the consumer doing the decoding, not the input range.

> I agree that we should be making string operations more efficient by taking code
> units into account, but I completely disagree that we can do that generically.

The requirement I listed was that the input range present UTF8 characters. Not any random character type.

August 02, 2012

Re: std.d.lexer requirements

Posted by Dmitry Olshansky
in reply to Walter Bright

Dmitry Olshansky

Posted in reply to Walter Bright

On 03-Aug-12 02:10, Walter Bright wrote:
> On 8/2/2012 1:41 PM, Jacob Carlborg wrote:
>> On 2012-08-02 21:35, Walter Bright wrote:
>>
>>> A good IDE should do its parsing in a separate thread, so the main user
>>> input thread remains crisp and responsive.
>>>
>>> If the user edits the text while the parsing is in progress, the
>>> background parsing thread simply abandons the current parse and starts
>>> over.
>>
>> It still needs to update the editor view with the correct syntax
>> highlighting
>> which needs to be done in the same thread as the rest of the GUI.
>>
>
> The rendering code should be in yet a third thread.
>
OT:
It never ceases to amaze me how people miss this very simple point:
GUI runs on its own thread and shouldn't ever block on something (save for message pump itself, of course). Everything else (including possibly slow rendering) done on the side and then result (once ready) swiftly indicated on GUI.

Recent Windows 8 talks were in fact nothing new in this regard, but now even API is made so that it's harder to do something blocking in GUI thread.


> An editor I wrote years ago had the rendering code in a separate thread
> from user input. You never had to wait to type in commands, the
> rendering would catch up when it could. What was also effective was the
> rendering would abandon a render midstream and restart it if it detected
> that the underlying data had changed in the meantime. This meant that
> the display was never more than one render out of date.
>
> Although the code itself wasn't any faster, it certainly *felt* faster
> with this approach. It made for crisp editing even on a pig slow machine.
>


-- 
Dmitry Olshansky

August 02, 2012

Re: std.d.lexer requirements

Posted by Jonathan M Davis
in reply to Walter Bright

Jonathan M Davis

Posted in reply to Walter Bright

On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
> Remember, its the consumer doing the decoding, not the input range.

But that's the problem. The consumer has to treat character ranges specially to make this work. It's not generic. If it were generic, then it would simply be using front, popFront, etc. It's going to have to special case strings to do the buffering that you're suggesting. And if you have to special case strings, then how is that any different from what we have now?

If you're arguing that strings should be treated as ranges of code units, then pretty much _every_ range-based function will have to special case strings to even work correctly - otherwise it'll be operating on individual code points rather than code points (e.g. filtering code units rather than code points, which would generate an invalid string). This makes the default behavior incorrect, forcing _everyone_ to special case strings _everywhere_ if they want correct behavior with ranges which are strings. And efficiency means nothing if the result is wrong.

As it is now, the default behavior of strings with range-based functions is correct but inefficient, so at least we get correct code. And if someone wants their string processing to be efficient, then they special case strings and do things like the buffering that you're suggesting. So, we have correct by default with efficiency as an option. The alternative that you seem to be suggesting (making strings be treated as ranges of code units) means that it would be fast by default but correct as an option, which is completely backwards IMHO. Efficiency is important, but it's pointless how efficient something is if it's wrong, and expecting that your average programmer is going to write unicode-aware code which functions correctly is completely unrealistic.

- Jonathan M Davis

August 02, 2012

Re: std.d.lexer requirements

Posted by Piotr Szturmaj
in reply to Walter Bright

Piotr Szturmaj

Posted in reply to Walter Bright

Walter Bright wrote:
> On 8/2/2012 1:26 PM, Jonathan M Davis wrote:
>> On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
>>> Keep a 6 character buffer in your consumer. If you read a char with
>>> the high
>>> bit set, start filling that buffer and then decode it.
>>
>> And how on earth is that going to work as a range?
>
> 1. read a character from the range
> 2. if the character is the start of a multibyte character, put the
> character in the buffer
> 3. keep reading from the range until you've got the whole of the
> multibyte character
> 4. convert that 6 (or 4) character buffer into a dchar

Working example: https://github.com/pszturmaj/json-streaming-parser/blob/master/json.d#L18 :-)

August 02, 2012

Re: std.d.lexer requirements

Posted by Andrei Alexandrescu
in reply to Jonathan M Davis

Andrei Alexandrescu

Posted in reply to Jonathan M Davis

On 8/2/12 6:38 PM, Jonathan M Davis wrote:
> On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
>> Remember, its the consumer doing the decoding, not the input range.
>
> But that's the problem. The consumer has to treat character ranges specially
> to make this work. It's not generic.

It is generic! It's just in another dimension: it operates on any range of _bytes_.

Andrei

August 02, 2012

Re: std.d.lexer requirements

Posted by Jonathan M Davis
in reply to Andrei Alexandrescu

Jonathan M Davis

Posted in reply to Andrei Alexandrescu

On Thursday, August 02, 2012 18:41:23 Andrei Alexandrescu wrote:
> On 8/2/12 6:38 PM, Jonathan M Davis wrote:
> > On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
> >> Remember, its the consumer doing the decoding, not the input range.
> > 
> > But that's the problem. The consumer has to treat character ranges specially to make this work. It's not generic.
> 
> It is generic! It's just in another dimension: it operates on any range of _bytes_.

So, a function which does the buffering of code units like Walter suggests is generic? It's doing something that makes no sense outside of strings. And if it's doing that with strings and something else with everything else (which it _has_ to do if the same function is going to work with both unicode as well as range types that have nothing to do with unicode), then it's special casing strings and therefore is _not_ generic.

Sure, you could have a function which specifically operates on ranges of code units and understands how unicode works and is written accordingly, but then that function is specific to ranges of code units and is only generic with regards to various ranges of code units. It can't operate on generic ranges like functions such as map and filter can.

- Jonathan M Davis

August 02, 2012

Re: std.d.lexer requirements

Posted by Andrei Alexandrescu
in reply to Jonathan M Davis

Andrei Alexandrescu

Posted in reply to Jonathan M Davis

On 8/2/12 6:54 PM, Jonathan M Davis wrote:
> So, a function which does the buffering of code units like Walter suggests is
> generic?

Of course, because it operates on bytes read from memory, files, or sockets etc.

> It's doing something that makes no sense outside of strings.

Right. The bytes represent UTF-8 encoded strings, except their type is ubyte so there's no processing in the library.

> And if
> it's doing that with strings and something else with everything else (which it
> _has_ to do if the same function is going to work with both unicode as well as
> range types that have nothing to do with unicode), then it's special casing
> strings and therefore is _not_ generic.

This is automatically destroyed because its assumption was destroyed.

> Sure, you could have a function which specifically operates on ranges of code
> units and understands how unicode works and is written accordingly, but then
> that function is specific to ranges of code units and is only generic with
> regards to various ranges of code units. It can't operate on generic ranges
> like functions such as map and filter can.

Yes, and I think that's exactly what the doctor prescribed here.


Andrei

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation