UTF8 and unary encoding
September 12, 2016
While looking at https://en.wikipedia.org/wiki/Unary_coding I found that UTF8 uses unary encoding for the length of multibyte sequences. Investigating further at https://en.wikipedia.org/wiki/UTF-8 reveals that indeed "The number of high-order 1s in the leading byte of a multi-byte sequence indicates the number of bytes in the sequence. When reading from a stream, a reader can process all fully received sequences without first having to wait for either the leading byte of a next sequence or an end-of-stream indication."
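
For illustration only, here is a minimal D sketch of the rule quoted above: the sequence length is read off the leading byte by counting its high-order 1 bits. The name `sequenceLength` is hypothetical and is not a Phobos function. The check is purely structural, so it does not reject lead bytes such as 0xC0 that have a valid shape but are not allowed in well-formed UTF-8.

```d
import core.bitop : bsr;

/// Hypothetical helper (not part of Phobos): length in bytes of the UTF-8
/// sequence introduced by `lead`, or 0 if `lead` cannot start a sequence.
uint sequenceLength(ubyte lead)
{
    if (lead < 0x80)
        return 1; // 0xxxxxxx: ASCII, no unary prefix at all

    // Count the leading 1 bits: invert the byte, then find the highest
    // set bit, which marks the 0 that terminates the unary prefix.
    immutable uint inverted = ~uint(lead) & 0xFF;
    if (inverted == 0)
        return 0; // 0xFF has no terminating 0 bit
    immutable uint ones = 7 - bsr(inverted);

    // A single leading 1 (10xxxxxx) marks a continuation byte, and more
    // than four leading 1s is not used by UTF-8 as currently defined.
    return (ones >= 2 && ones <= 4) ? ones : 0;
}

unittest
{
    assert(sequenceLength(0x41) == 1); // 'A'
    assert(sequenceLength(0xC3) == 2); // lead byte of U+00E9
    assert(sequenceLength(0xE2) == 3); // lead byte of U+20AC
    assert(sequenceLength(0xF0) == 4); // lead byte of U+1F600
    assert(sequenceLength(0x80) == 0); // continuation byte
    assert(sequenceLength(0xFF) == 0); // never valid in UTF-8
}
```

The point of the sketch is that only the first byte has to be inspected to know where the next sequence begins.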

We don't use that explicitly; instead, we load each byte of multibyte sequences. Who'd be interested in looking into whether Phobos' primitives can be made faster on multibyte-rich text?


Andrei
September 12, 2016
On Monday, September 12, 2016 07:37:05 Andrei Alexandrescu via Digitalmars-d wrote:
> While looking at https://en.wikipedia.org/wiki/Unary_coding I found that UTF8 uses unary encoding for the length of multibyte sequences. Investigating further at https://en.wikipedia.org/wiki/UTF-8 reveals that indeed "The number of high-order 1s in the leading byte of a multi-byte sequence indicates the number of bytes in the sequence. When reading from a stream, a reader can process all fully received sequences without first having to wait for either the leading byte of a next sequence or an end-of-stream indication."
>
> We don't use that explicitly; instead, we load each byte of multibyte sequences. Who'd be interested in looking into whether Phobos' primitives can be made faster on multibyte-rich text?

Aren't we already doing that with stride? It reads the number of bytes in a code point from the first code unit and then if we're dealing with a random access range of char or an array of char, then we skip that many code units without reading them. Auto-decoding does mean that in many cases all of the bytes get read even though they wouldn't need to be if we were operating on ranges of char. Where we aren't auto-decoding, though, we should already be taking advantage of this via stride in general (though obviously, there could be specific places where the code is not skipping bytes like it should).
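
As a rough sketch of what skipping via stride looks like in user code (not a copy of anything inside Phobos), the loop below counts code points in an array of char by hopping from one leading code unit to the next with `std.utf.stride`. The helper name `countCodePoints` is made up for this example, and the input is assumed to be well-formed UTF-8, since `stride` may throw a `UTFException` when the indexed code unit cannot start a sequence.

```d
import std.utf : stride;

/// Hypothetical helper: counts code points without decoding them.
/// Only the first code unit of each sequence is read; the continuation
/// bytes are skipped by advancing the index by the stride.
size_t countCodePoints(const(char)[] s)
{
    size_t count = 0;
    for (size_t i = 0; i < s.length; i += stride(s, i))
        ++count;
    return count;
}

unittest
{
    assert(countCodePoints("abc") == 3);       // 3 code units, 3 code points
    assert(countCodePoints("h\u00E9llo") == 5); // 6 code units, 5 code points
    assert(countCodePoints("\u20ACuro") == 4);  // 6 code units, 4 code points
}
```

Compiling with `-unittest` runs the checks. A loop over the same string as a range of dchar would instead go through auto-decoding and read every byte.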

Or am I misunderstanding what you're talking about doing here?

- Jonathan M Davis

September 12, 2016
On 9/12/16 11:59 AM, Jonathan M Davis via Digitalmars-d wrote:
> Aren't we already doing that with stride? It reads the number of bytes in a
> code point from the first code unit and then if we're dealing with a random
> access range of char or an array of char, then we skip that many code units
> without reading them.

Oh, ok. I'd either forgotten or the code has been improved since I last looked at it. Thanks! -- Andrei