On Saturday, 8 May 2021 at 21:47:21 UTC, guai wrote:
> On Saturday, 8 May 2021 at 20:22:28 UTC, Jon Degenhardt wrote:
>> If all an algorithm needs to do is split a string roughly in half, then use the byte offsets to find the halfway point and then look for a utf-8 character boundary. If the algorithm is based on some other boundary, say, token boundaries, then find one of those boundaries.
> The algorithms you are talking about either don't need strings at all, only byte/char arrays, or would produce garbage for any input other than ascii.
I don't understand the point you are trying to make. Perhaps you could rephrase.
I've implemented any number of these types of algorithms. It's very common to mix interpretation as unicode strings with interpretation as utf-8 bytes. For example, it may be necessary to do case conversion at some stage of processing. This has to be done on unicode characters, not bytes. But needing to do such processing at some point does not exclude treating the data as utf-8 bytes for other purposes.
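A rough sketch of that kind of mixing, in D. The function name `firstFieldLower` and the tab-separated input are made up for illustration. The delimiter search works on raw bytes, which is safe because an ASCII byte such as a tab never occurs inside a multi-byte utf-8 sequence. The case conversion then works on unicode characters via `std.uni.toLower`.

```d
import std.algorithm.searching : countUntil;
import std.stdio : writeln;
import std.string : representation;
import std.uni : toLower;

/// Hypothetical helper: returns the first tab-separated field, lower-cased.
string firstFieldLower(string line)
{
    // Byte-level step: view the string as utf-8 bytes and find the first tab.
    immutable(ubyte)[] bytes = line.representation;
    auto tab = bytes.countUntil(cast(ubyte) '\t');

    // The tab offset is a valid character boundary, so slicing is safe.
    string field = tab < 0 ? line : line[0 .. tab];

    // Character-level step: case conversion operates on code points.
    return toLower(field);
}

void main()
{
    writeln(firstFieldLower("ÉCOLE\tParis"));
}
```

Both views are of the same memory. No conversion to another encoding is needed to switch between them.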
Also, a `char[]` in D is defined to be utf-8, and a `string` is an `immutable(char)[]`. So why would utf-8 data, including non-ascii characters, read into a `char[]` produce garbage? The answer is that it wouldn't. No, you cannot simply start on an arbitrary byte boundary, but nobody has suggested this.
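To make the "split roughly in half" case concrete, here is a minimal sketch. The helper name `boundaryAtOrBefore` is an assumption for illustration, not something from Phobos, and it assumes the input is valid utf-8. It starts at the halfway byte offset and backs up past any continuation bytes (bit pattern `10xxxxxx`) until it reaches the start of a character.

```d
import std.stdio : writeln;
import std.utf : validate;

/// Hypothetical helper: largest index <= pos that starts a utf-8 sequence.
/// Assumes s holds valid utf-8.
size_t boundaryAtOrBefore(const(char)[] s, size_t pos)
{
    if (pos >= s.length)
        return s.length;
    // Continuation bytes look like 10xxxxxx; step back until we leave them.
    while (pos > 0 && (s[pos] & 0xC0) == 0x80)
        --pos;
    return pos;
}

void main()
{
    string text = "naïve café, 日本語";
    immutable mid = boundaryAtOrBefore(text, text.length / 2);

    string left = text[0 .. mid];
    string right = text[mid .. $];

    // Both halves are still valid utf-8; validate throws otherwise.
    validate(left);
    validate(right);
    writeln(left, " | ", right);
}
```

The scan backs up at most three bytes, so the cost is constant regardless of string length. Splitting on a grapheme or token boundary instead would need a different search from the same starting offset.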
> Your example with log files mixes binary data with text. A properly done logger will escape delimiters inside text chunks, so it isn't even a string per se, it's some binary data from which you need to extract a string first.
Again, I'm not following the logic. Log files may or may not include binary data. But I'm not sure why that matters. I'm talking about log files where the text portions are encoded as utf-8.
> A lot of bugs are caused by this mixing of text with binary. And I think it is better to distinguish them properly on a type level.
Perhaps it would help if you described what you mean by "binary". I tend to think of "binary" as things like image data, binary serialization formats, base-64 coding, compressed or encrypted text. These are quite different than utf-8 encoded unicode text.