On Friday, 7 May 2021 at 15:24:42 UTC, Andrei Alexandrescu wrote:
>- We put a String type in the standard library. It uses UTF8 inside and supports iteration by either bytes, UTF8, UTF16, or UTF32. It manages its own memory so no need for the GC. It disallows remote coupling across callers/callees. Case closed.
This is a bit orthogonal, but... An important characteristic of utf-8 arrays is that they are simultaneously a random access range of bytes and an input range of utf-8 characters. For efficiency it's often important to switch back and forth between these two interpretations.
byLine is one example: a byte-oriented search is done (e.g. with memchr), but afterward the representation array is accessed as a utf-8 input range.
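A minimal sketch of that pattern, for illustration only. The helper name firstLine is hypothetical, and countUntil stands in for memchr here; the byte view comes from std.string.representation and the character view from std.utf.byDchar.

```d
import std.algorithm.searching : countUntil;
import std.range : walkLength;
import std.string : representation;
import std.utf : byDchar;

/// Hypothetical helper: returns the first line of `text`, without the '\n'.
const(char)[] firstLine(const(char)[] text)
{
    // Byte interpretation: a plain scan over ubyte, no decoding involved.
    auto bytes = text.representation;
    immutable idx = bytes.countUntil(cast(ubyte) '\n');

    // The byte offset is then used to slice the original char array.
    return idx < 0 ? text : text[0 .. cast(size_t) idx];
}

void main()
{
    import std.stdio : writeln;

    auto line = firstLine("héllo\nwörld");
    writeln(line.length);             // length in bytes (code units)
    writeln(line.byDchar.walkLength); // length in decoded code points
}
```

The search happens on bytes, while the result is consumed as a utf-8 input range; the two views refer to the same memory.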
byLine implementations will usually work by iterating forward, but there are random access use cases as well. For example, it is perfectly reasonable to divide a utf-8 array roughly in half using byte offsets, then search for the nearest utf-8 character boundary. After this, both halves are treated as utf-8 input ranges, not random access.
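The split can be sketched roughly as follows. The helper name splitPoint is hypothetical, and as a simplifying assumption it scans forward from the midpoint to the next boundary rather than picking the nearest one in either direction. It also assumes the input is valid utf-8.

```d
/// Hypothetical helper: byte offset near the middle of `s` that falls on a
/// utf-8 code point boundary (scanning forward from the midpoint).
size_t splitPoint(const(char)[] s) @safe pure nothrow @nogc
{
    size_t i = s.length / 2;

    // Continuation bytes have the bit pattern 10xxxxxx; skip past them.
    while (i < s.length && (s[i] & 0xC0) == 0x80)
        ++i;
    return i;
}

void main()
{
    import std.stdio : writeln;

    string text = "αβγδε"; // 5 code points, 2 bytes each
    immutable cut = splitPoint(text);

    // Random access by byte offset, then each half is used as utf-8 text.
    writeln(text[0 .. cut]);
    writeln(text[cut .. $]);
}
```

Finding the midpoint relies on random access to bytes; everything after the slice treats the halves as ordinary utf-8 input ranges.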
This switching between interpretations doesn't fit well with the current distinction between char[] and byte[]. A number of algorithms in Phobos operate on one or the other, but not both.
It'd be very useful to have an approach to utf-8 strings that enabled switching interpretations easily, without casting.
--Jon