August 02, 2012
On 02/08/2012 10:44, Walter Bright wrote:
> On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
>> On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
>>> On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
>>>> It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no
>>>> sense whatsoever. Having range-based functions which understand the encodings and
>>>> optimize accordingly can be very beneficial (which happens with strings but can't happen
>>>> with general ranges without the concept of a variable-length encoded range like we have
>>>> with forward range or random access range), but to actually have a range of UTF-8 or
>>>> UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff
>>>> like filter or map or reduce on code units doesn't make any sense at all.
>>>
>>> Yes, it can work.
>>
>> How?
>
> Keep a 6 character buffer in your consumer. If you read a char with the
> high bit set, start filling that buffer and then decode it.
>

How is that different from a manually implemented range of dchar?
August 02, 2012
On 02/08/2012 09:30, Walter Bright wrote:
> On 8/1/2012 11:49 PM, Jacob Carlborg wrote:
>> On 2012-08-02 02:10, Walter Bright wrote:
>>
>>> 1. It should accept as input an input range of UTF8. I feel it is a
>>> mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
>>> UTF16 should use an 'adapter' range to convert the input to UTF8. (This
>>> is what component programming is all about.)
>>
>> I'm no expert on ranges, but won't that prevent slicing? Slicing is one of the main
>> reasons why the Tango XML parser is so amazingly fast.
>>
>
> You don't want to use slicing on the lexer. The reason is that your
> slices will be spread all over memory, as source files can be huge, and
> all that memory will be retained and never released. What you want is a
> compact representation after lexing. Compactness also helps a lot with
> memory caching.
>

Tokens are not kept in memory. You usually consume them for other processing and throw them away.

It isn't an issue.
August 02, 2012
On Thursday, 2 August 2012 at 11:47:20 UTC, deadalnix wrote:
> The lexer really isn't the performance bottleneck of dmd (or of any compiler of a non-trivial language).

What if we're just using this lexer in something like a syntax-highlighting text editor? I'd be annoyed if typing stalled for a moment because of the processing.
August 02, 2012
On 8/2/12 6:07 AM, Walter Bright wrote:
> Why? I've never seen any UTF16 or UTF32 D source in the wild.

Here's a crazy idea that I'll hang off this one remark. No, two crazy ideas.

First, after having read the long Jonathan/Walter back-and-forth in one sitting, it's becoming obvious to me that you'll never understand each other on this nontrivial matter through this medium. I suggest you set up a skype/phone call. Once you get past the first 30 seconds of social awkwardness of hearing each other's voice, you'll make fantastic progress in communicating.

Regarding the problem at hand, it's becoming painfully obvious to me that the lexer MUST do its own decoding internally. Hence, a very simple thing to do is have the entire lexer only deal with ranges of ubyte. If someone passes a char[], the lexer's front end can simply call s.representation and obtain the underlying ubyte[].

If someone passes some range of char, the lexer uses an adapter (e.g. map()) that casts every char to ubyte, which is a zero-cost operation. Then it uses the same core operating on ranges of ubyte.
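A minimal sketch of that front end, assuming a hypothetical core lexImpl that only ever sees a range of ubyte (the names below are illustrative, not an actual API):

    import std.algorithm : map;
    import std.range : ElementType, isInputRange;
    import std.string : representation;

    // Arrays of char (string, char[]): reuse the same memory as ubyte[] - zero cost.
    auto lex(S)(S source)
        if (is(S : const(char)[]))
    {
        return lexImpl(source.representation);
    }

    // Any other range of char: cast each element to ubyte, still zero cost per element.
    auto lex(R)(R source)
        if (!is(R : const(char)[]) && isInputRange!R && is(ElementType!R : char))
    {
        return lexImpl(source.map!(c => cast(ubyte) c));
    }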

In the first implementation, the lexer may actually refuse any range of 16-bit or 32-bit elements (wchar[], range of wchar, dchar[], range of dchar). Later on, the core may be evolved to handle ranges of ushort and ranges of dchar. The front end would again use representation() for wchar[], a casting adapter for ranges of wchar, and would just pass dchar[] and ranges of dchar through.

This makes the core simple and efficient (I think Jonathan's use of static if and mixins everywhere, while well-intended, complicates matters without necessity).

And with that we have a lexer! It operates with ranges and just has a simple front end, which makes clear that the lexer must do its own decoding.

Works?


Andrei
August 02, 2012
On 02-Aug-12 12:44, Walter Bright wrote:
> On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
>> On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
>>> On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
>>>> It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no
>>>> sense whatsoever. Having range-based functions which understand the encodings and
>>>> optimize accordingly can be very beneficial (which happens with strings but can't happen
>>>> with general ranges without the concept of a variable-length encoded range like we have
>>>> with forward range or random access range), but to actually have a range of UTF-8 or
>>>> UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff
>>>> like filter or map or reduce on code units doesn't make any sense at all.
>>>
>>> Yes, it can work.
>>
>> How?
>
> Keep a 6 character buffer in your consumer. If you read a char with the
> high bit set, start filling that buffer and then decode it.
>
4 bytes is enough.

Since Unicode 5(?) the range of code points has been defined as 0 .. 0x10FFFF, specifically so that any code point can be encoded in at most 4 bytes of UTF-8.
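
A quick standalone check of that claim, just a sanity-check snippet rather than part of any proposed API:

    import std.utf : encode;

    void main()
    {
        char[4] buf;
        immutable len = encode(buf, cast(dchar) 0x10FFFF); // largest code point
        assert(len == 4);                    // exactly four UTF-8 code units
        assert(buf[] == "\xF4\x8F\xBF\xBF"); // F4 8F BF BF
    }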


P.S. Looks like I'm too late for this party ;)


-- 
Dmitry Olshansky
August 02, 2012
On 8/2/2012 4:49 AM, deadalnix wrote:
> How is that different from a manually implemented range of dchar?

Decoding is rarely necessary, even if non-ASCII data is present. However, the range cannot decide whether decoding is necessary - only the receiver can, hence the receiver does the decoding.
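
A minimal sketch of that receiver-side approach, with an illustrative helper name rather than anything from an existing lexer:

    import std.utf : decode;

    // Pop one character off a ubyte slice, decoding UTF-8 only when the
    // high bit forces it.
    dchar nextChar(ref const(ubyte)[] input)
    {
        immutable b = input[0];
        if (b < 0x80)                  // ASCII fast path: no decoding at all
        {
            input = input[1 .. $];
            return b;
        }
        // Non-ASCII: decode just this one multi-byte sequence.
        auto s = cast(const(char)[]) input;
        size_t i = 0;
        immutable dchar c = decode(s, i);
        input = input[i .. $];
        return c;
    }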


August 02, 2012
On 8/2/2012 8:46 AM, Dmitry Olshansky wrote:
>> Keep a 6 character buffer in your consumer. If you read a char with the
>> high bit set, start filling that buffer and then decode it.
>>
> 4 bytes is enough.
>
> Since Unicode 5(?) the range of code points has been defined as 0 .. 0x10FFFF,
> specifically so that any code point can be encoded in at most 4 bytes of UTF-8.

Yeah, but I thought 6 bytes would future-proof it! (Inevitably, the Unicode committee will add more.)

>
> P.S. Looks like I'm too late for this party ;)
>
>

It affects you strongly, too, so I'm glad to see you join in.

August 02, 2012
On 8/2/2012 4:27 AM, deadalnix wrote:
> Really nice idea. It is still easy to wrap the Range in another Range that
> processes errors in a custom way.

It's suboptimal for a high-speed lexer/parser, as already explained.

Speed is of paramount importance for a lexer, and it overrides things like elegance and simplicity.

August 02, 2012
On 8/2/2012 4:47 AM, deadalnix wrote:
> On 02/08/2012 10:13, Walter Bright wrote:
>> As fast as the dmd one would be best.
>>
>
> That'd be great, but...
>
> The lexer really isn't the performance bottleneck of dmd (or of any compiler of a
> non-trivial language). Additionally, anybody who has touched the dmd source code can
> agree that its usability/maintainability isn't at its best.
>
> Sacrificing some performance in a non-bottleneck area to increase ease of use makes
> perfect sense.

A lexer is like a device driver, or a memory allocator. Making it as fast as possible pays dividends endlessly for whoever uses it.

And in my profiling, lexing speed is a big factor in the debug build times.

The Digital Mars C++ compiler (and its antecedents) made a lot of hay by being the fastest compiler available by a wide margin. Compile speed is also a big advantage for D.

Please do not underestimate its importance.
August 02, 2012
On 8/2/2012 4:52 AM, deadalnix wrote:
> On 02/08/2012 09:30, Walter Bright wrote:
>> On 8/1/2012 11:49 PM, Jacob Carlborg wrote:
>>> On 2012-08-02 02:10, Walter Bright wrote:
>>>
>>>> 1. It should accept as input an input range of UTF8. I feel it is a
>>>> mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
>>>> UTF16 should use an 'adapter' range to convert the input to UTF8. (This
>>>> is what component programming is all about.)
>>>
>>> I'm no expert on ranges, but won't that prevent slicing? Slicing is one of the main
>>> reasons why the Tango XML parser is so amazingly fast.
>>>
>>
>> You don't want to use slicing on the lexer. The reason is that your
>> slices will be spread all over memory, as source files can be huge, and
>> all that memory will be retained and never released. What you want is a
>> compact representation after lexing. Compactness also helps a lot with
>> memory caching.
>>
>
> Tokens are not kept in memory. You usually consume them for other processing and
> throw them away.
>
> It isn't an issue.

The tokens are not kept, correct. But the identifier strings and the string literals are kept, and if they are slices into the input buffer, then everything I said applies.
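
A minimal sketch of that compact-representation point, with a purely illustrative interning table: copy identifier and literal text out of the source buffer so the whole file can be released after lexing.

    // Store one compact copy of each identifier instead of a slice that
    // pins the entire source buffer in memory.
    string intern(const(ubyte)[] slice, ref string[string] table)
    {
        auto key = cast(string) slice;   // temporary view, used only for the lookup
        if (auto existing = key in table)
            return *existing;            // reuse the single stored copy
        auto copy = key.idup;            // compact copy, independent of the source buffer
        table[copy] = copy;
        return copy;
    }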