May 08, 2021

On Friday, 7 May 2021 at 22:34:19 UTC, Jon Degenhardt wrote:

> byLine implementations will usually work by iterating forward, but there are random access use cases as well. For example, it is perfectly reasonable to divide a utf-8 array roughly in half using byte offsets, then search for the nearest utf-8 character boundary. After this, both halves are treated as utf-8 input ranges, not random access.

In my experience, treating a string as a byte array is almost never a good thing. Anyone doing it must be very careful and truly understand what they are doing.
What are the use cases, other than byLine, where this is useful?
Dividing a utf-8 array and searching for the nearest char may split inside a combining character, which isn't something you usually want, especially when a human will read the text.
Conceptually, a string is a sequence of characters: a range of dchar in D's terms.
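
A minimal illustration of that dchar view in D: iterating a string with a dchar loop variable makes the compiler decode the utf-8 on the fly.

```d
void main()
{
    string s = "naïve";          // 6 bytes of utf-8
    assert(s.length == 6);       // .length counts code units (bytes)
    foreach (dchar c; s)         // decoded to code points while iterating
    {
        // c is a full code point, e.g. 'ï', not a raw utf-8 byte
    }
}
```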

May 08, 2021

On Saturday, 8 May 2021 at 16:04:24 UTC, guai wrote:

> Dividing a utf-8 array and searching for the nearest char may split inside a combining character, which isn't something you usually want.

It is not difficult to recognize this case and go back 1 to 3 bytes to reach a correct splitting place. UTF-8 was designed with this in mind.

  • I can imagine that this can be useful in divide-and-conquer algorithms, like binary search.
  • Or when, for whatever reason, you can make larger jumps while scanning a string: e.g. when you know that the next 50 letters do not contain the token you are looking for, you can safely jump 50 bytes ahead, go back to the nearest splitting point, and continue the linear search there.
  • Or you want to cut a string into pieces of a certain length (again 50?), where the exact length is not that important. So you just jump ahead 50 bytes, go back again, and split at this point. If there are a lot of non-ascii characters in between, the pieces are of course shorter, but that may be OK, because speed is more important.
  • Or you want to process pieces of a string in parallel: cut it into 16 pieces and let your 16 cores work on each of them.
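
A minimal sketch of that back-up step in D (the helper name is made up; it just skips backwards over utf-8 continuation bytes, which all have the bit pattern 10xxxxxx):

```d
// Hypothetical helper: move a byte offset back (at most 3 bytes) until it
// points at the start of a code point rather than at a continuation byte.
// Assumes i < s.length and that s is well-formed utf-8.
size_t toCodePointBoundary(const(char)[] s, size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)  // 10xxxxxx marks a continuation byte
        --i;
    return i;
}

unittest
{
    auto s = "añz";                          // 'ñ' occupies bytes 1 and 2
    assert(toCodePointBoundary(s, 2) == 1);  // offset 2 lands inside 'ñ'
    assert(toCodePointBoundary(s, 3) == 3);  // offset 3 already starts 'z'
}
```
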
May 08, 2021

On Saturday, 8 May 2021 at 16:04:24 UTC, guai wrote:

> On Friday, 7 May 2021 at 22:34:19 UTC, Jon Degenhardt wrote:
>
> > byLine implementations will usually work by iterating forward, but there are random access use cases as well. For example, it is perfectly reasonable to divide a utf-8 array roughly in half using byte offsets, then search for the nearest utf-8 character boundary. After this, both halves are treated as utf-8 input ranges, not random access.
>
> In my experience, treating a string as a byte array is almost never a good thing. Anyone doing it must be very careful and truly understand what they are doing.
> What are the use cases, other than byLine, where this is useful?
> Dividing a utf-8 array and searching for the nearest char may split inside a combining character, which isn't something you usually want, especially when a human will read the text.
> Conceptually, a string is a sequence of characters: a range of dchar in D's terms.

Data and log file processing are common cases. Single-byte ascii characters are normally used to delimit structure in such files: record delimiters, field delimiters, name-value pair delimiters, escape syntax, etc. A common way to operate on such files is to identify structural boundaries by finding the requisite single-byte ascii characters and treating the contained data as opaque (uninterpreted) sequences of utf-8 bytes.

The details depend on the file format. But the key part is that single-byte ascii characters can be unambiguously identified without interpreting other characters in a utf-8 data stream. Of course, when it comes time to interpret the data inside these data streams, it is necessary to operate on cohesive blocks. Yes, graphemes, but also things like numbers. It's not useful to split a number in the middle and then call std.conv.to!double on it.

Operating on the single-byte structural elements allows deferring interpretation of multi-byte unicode content until it is needed. This is why it's useful to switch back and forth between a byte-oriented view and a UTF character view. Operating on bytes is faster (e.g. memchr, no utf-8 decoding), enables parallelization (depending on the type of file), and can be used with fixed-size buffer reads and writes.
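
A rough sketch of that style in D, assuming a made-up tab-separated log line; only the single-byte ascii delimiter is searched for, and the field contents stay undecoded utf-8 bytes:

```d
import std.algorithm.iteration : splitter;
import std.string : representation;

// Walk the tab-separated fields of one log line without decoding utf-8.
void processLine(const(char)[] line)
{
    // representation() reinterprets the chars as ubytes, so no auto-decoding
    // happens while splitter() scans for the delimiter.
    foreach (field; line.representation.splitter(cast(ubyte) '\t'))
    {
        auto text = cast(const(char)[]) field;  // interpret/decode only when needed
        // ... operate on `text` here ...
    }
}
```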

--Jon

May 08, 2021

On Saturday, 8 May 2021 at 16:25:31 UTC, Berni44 wrote:

> On Saturday, 8 May 2021 at 16:04:24 UTC, guai wrote:
>
> > Dividing a utf-8 array and searching for the nearest char may split inside a combining character, which isn't something you usually want.
>
> It is not difficult to recognize this case and go back 1 to 3 bytes to reach a correct splitting place. UTF-8 was designed with this in mind.
>
>   • I can imagine that this can be useful in divide-and-conquer algorithms, like binary search.
>     ... (more examples) ...
>   • Or you want to process pieces of a string in parallel: cut it into 16 pieces and let your 16 cores work on each of them.

Exactly. All the ideas you listed apply. Parallelization is very often useful.
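
A rough sketch of what that could look like in D, assuming the text is much longer than the number of chunks; the chunk cuts are nudged forward past utf-8 continuation bytes so no code point is split:

```d
import std.algorithm.iteration : sum;
import std.algorithm.searching : count;
import std.parallelism : parallel, totalCPUs;
import std.string : representation;

// Count occurrences of a single-byte ascii character, one chunk per core.
size_t parallelCount(const(char)[] text, char needle)
{
    immutable n = totalCPUs;
    auto bounds = new size_t[n + 1];
    bounds[n] = text.length;
    foreach (i; 1 .. n)
    {
        auto cut = text.length * i / n;
        while (cut < text.length && (text[cut] & 0xC0) == 0x80)
            ++cut;                        // skip continuation bytes (10xxxxxx)
        bounds[i] = cut;
    }

    auto counts = new size_t[n];
    foreach (i, ref c; parallel(counts))  // each core works on its own chunk
        c = text[bounds[i] .. bounds[i + 1]].representation.count(cast(ubyte) needle);
    return counts.sum;
}
```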

May 08, 2021

On Saturday, 8 May 2021 at 16:25:31 UTC, Berni44 wrote:

> On Saturday, 8 May 2021 at 16:04:24 UTC, guai wrote:
>
> > Dividing a utf-8 array and searching for the nearest char may split inside a combining character, which isn't something you usually want.
>
> It is not difficult to recognize this case and go back 1 to 3 bytes to reach a correct splitting place. UTF-8 was designed with this in mind.

I meant these combining characters. They are language-specific, but most of the time the string does not contain any clue about which language it is.

>   • I can imagine that this can be useful in divide-and-conquer algorithms, like binary search.

They must be applied with great care to non-ascii texts. What about RTL, for example? You cannot split inside an RTL block.

>   • Or you want to cut a string into pieces of a certain length (again 50?), where the exact length is not that important.

For what business task would I do that? I may want to split a string on some char subsequence for lexing, but one cannot assume the lengths of those chunks.

> So you just jump ahead 50 bytes, go back again, and split at this point. If there are a lot of non-ascii characters in between, the pieces are of course shorter, but that may be OK, because speed is more important.

Not sure if speed is more important than correctness.

>   • Or you want to process pieces of a string in parallel: cut it into 16 pieces and let your 16 cores work on each of them.

I'm not sure this is possible with all the quirks of unicode. I've never even heard of parallel processors for structured text like xml.

May 08, 2021

On Saturday, 8 May 2021 at 19:06:48 UTC, guai wrote:

> I meant these combining characters. They are language-specific, but most of the time the string does not contain any clue about which language it is.

The thing is, making the range be of dchars doesn't help with this.

This kind of thinking is why Phobos does the autodecoding thing it does now, converting utf-8 to a range of dchar as it sees it... but those combining characters are still (or rather can be) two separate dchars!

So right now Phobos does something that seems useful... but actually isn't. All of the bad, none of the good.
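
A small example of what that means in practice, using std.uni.byGrapheme to get the user-perceived characters:

```d
import std.range.primitives : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301";                  // 'e' followed by a combining acute accent
    assert(s.walkLength == 2);             // the auto-decoded view yields two dchars
    assert(s.byGrapheme.walkLength == 1);  // but it is a single user-perceived character
}
```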

BTW, I'd also like to point out that ascii actually has a lot of the same mysteries we ascribe to unicode. Variable-width chars: \t is an ascii char. Zero-width chars: ascii has \0 and \a. Negative-width chars? Is \b one? idk.

But there are still a lot of times when you can treat it as bytes and get away with it.

This is why I'm not sold on Andrei's new String idea myself. I totally agree making char[] a range of dchars is a bad idea. But I think the only right thing to do is to expose what it actually is and then both educate and empower the user to do what they need themselves.

May 08, 2021

On Saturday, 8 May 2021 at 18:44:00 UTC, Jon Degenhardt wrote:

> On Saturday, 8 May 2021 at 16:04:24 UTC, guai wrote:
>
> > On Friday, 7 May 2021 at 22:34:19 UTC, Jon Degenhardt wrote:
> >
> > > byLine implementations will usually work by iterating forward, but there are random access use cases as well. For example, it is perfectly reasonable to divide a utf-8 array roughly in half using byte offsets, then search for the nearest utf-8 character boundary. After this, both halves are treated as utf-8 input ranges, not random access.
> >
> > In my experience, treating a string as a byte array is almost never a good thing. Anyone doing it must be very careful and truly understand what they are doing.
> > What are the use cases, other than byLine, where this is useful?
> > Dividing a utf-8 array and searching for the nearest char may split inside a combining character, which isn't something you usually want, especially when a human will read the text.
> > Conceptually, a string is a sequence of characters: a range of dchar in D's terms.
>
> Data and log file processing are common cases. Single-byte ascii characters are normally used to delimit structure in such files: record delimiters, field delimiters, name-value pair delimiters, escape syntax, etc. A common way to operate on such files is to identify structural boundaries by finding the requisite single-byte ascii characters and treating the contained data as opaque (uninterpreted) sequences of utf-8 bytes.
>
> The details depend on the file format. But the key part is that single-byte ascii characters can be unambiguously identified without interpreting other characters in a utf-8 data stream. Of course, when it comes time to interpret the data inside these data streams, it is necessary to operate on cohesive blocks. Yes, graphemes, but also things like numbers. It's not useful to split a number in the middle and then call std.conv.to!double on it.
>
> Operating on the single-byte structural elements allows deferring interpretation of multi-byte unicode content until it is needed. This is why it's useful to switch back and forth between a byte-oriented view and a UTF character view. Operating on bytes is faster (e.g. memchr, no utf-8 decoding), enables parallelization (depending on the type of file), and can be used with fixed-size buffer reads and writes.
>
> --Jon

When you work with log files, first you pull the data in as a byte stream and split it into chunks. Then you make a string out of each of them. Once you've done that, you process it like a string, with all the rules of unicode; for example, you split it into words. And then you may want to convert a word back to bytes again.
But you cannot split a string wherever you want by treating it as bytes. It most certainly wouldn't work with all the languages out there.
With a string you cannot get a char by index; you must read the characters sequentially. You can search, you can tokenize, rewind and reinterpret maybe.
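
In D that sequential reading can be done with std.utf.decode, for instance; a sketch, where each call reads one code point and advances the byte index by 1 to 4:

```d
import std.utf : decode;

// Walk a utf-8 string one code point at a time.
void walk(string s)
{
    size_t i = 0;
    while (i < s.length)
    {
        dchar c = decode(s, i);  // reads the code point at byte offset i and advances i
        // ... use c ...
    }
}
```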

May 08, 2021

On Saturday, 8 May 2021 at 19:30:03 UTC, Adam D. Ruppe wrote:

> On Saturday, 8 May 2021 at 19:06:48 UTC, guai wrote:
>
> > I meant these combining characters. They are language-specific, but most of the time the string does not contain any clue about which language it is.
>
> The thing is, making the range be of dchars doesn't help with this.

At least it won't introduce more problems.

May 08, 2021

On Saturday, 8 May 2021 at 19:06:48 UTC, guai wrote:

> I meant these combining characters. They are language-specific, but most of the time the string does not contain any clue about which language it is.

You are talking about generic algorithms that work for every script. But unicode allows for algorithms that only support subsets. If your subset doesn't contain combining characters, you don't need to care about them. Otherwise you may need to go back to the next base character. It depends on the use case.

> >   • I can imagine that this can be useful in divide-and-conquer algorithms, like binary search.
>
> They must be applied with great care to non-ascii texts. What about RTL, for example? You cannot split inside an RTL block.

Oh, yes, you can! Think of an algorithm doing cryptographic analysis that counts consecutive pairs of ascii characters. For that it doesn't matter if RTL text gets cut into pieces.

> >   • Or you want to cut a string into pieces of a certain length (again 50?), where the exact length is not that important.
>
> For what business task would I do that?

Simple wrapping to avoid losing text when printing, or to avoid having to scroll vertically. It's probably not useful for a high-quality program...

> I may want to split a string on some char subsequence for lexing, but one cannot assume the lengths of those chunks.

Depending on the use case, you may know them ahead of time.

> > So you just jump ahead 50 bytes, go back again, and split at this point. If there are a lot of non-ascii characters in between, the pieces are of course shorter, but that may be OK, because speed is more important.
>
> Not sure if speed is more important than correctness.

Of course, this again depends on the use case. You can't say that in general.

> >   • Or you want to process pieces of a string in parallel: cut it into 16 pieces and let your 16 cores work on each of them.
>
> I'm not sure if this is possible with all the quirks of unicode.

Think again of the cryptographic analysis above, for example. (Or automatically checking wikipedia entries for whatever.)

Keep in mind that we do not always have to support all of unicode. If we know ahead of time that our text contains mainly ascii and, aside from this, only a few base characters, but never combining characters and so on, we can use different algorithms which might be simpler or faster or both. Making sure that this constraint holds is then something that has to be done outside of the algorithm.

> I've never even heard of parallel processors for structured text like xml.

I would judge it much more difficult to process xml in parallel than to do the same with unicode.

May 08, 2021

On Saturday, 8 May 2021 at 19:33:45 UTC, guai wrote:

> ...
> But you cannot split a string wherever you want by treating it as bytes. It most certainly wouldn't work with all the languages out there.

Sure you can. It's necessary to take advantage of the properties of the utf-8 encoding to do it. That is, it's necessary to find a nearby utf-8 character boundary, but utf-8 is defined in a manner that enables this. Take a look at section 2.5, Encoding Forms, in the Unicode Standard doc. It describes exactly this.

> With a string you cannot get a char by index; you must read the characters sequentially.

Correct, you cannot find a unicode character using a character-based index without processing sequentially. But for large classes of algorithms this is not necessary. That is, there is often no need to find, for example, the 100th character. If all an algorithm needs to do is split a string roughly in half, then use byte offsets to find the halfway point and then look for a utf-8 character boundary. If the algorithm is based on some other boundary, say token boundaries, then find one of those boundaries.
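
A sketch of that "split roughly in half" step in D (the helper name is made up; the midpoint is just the byte length divided by two, adjusted forward to the next code-point boundary):

```d
import std.typecons : Tuple, tuple;

// Hypothetical helper: split a utf-8 string near its byte midpoint without
// cutting a code point in two.
Tuple!(string, string) splitRoughlyInHalf(string s)
{
    auto mid = s.length / 2;
    while (mid < s.length && (s[mid] & 0xC0) == 0x80)
        ++mid;                     // step past continuation bytes (10xxxxxx)
    return tuple(s[0 .. mid], s[mid .. $]);
}

unittest
{
    auto halves = splitRoughlyInHalf("grüße dich");
    assert(halves[0] ~ halves[1] == "grüße dich");  // nothing lost, nothing cut in two
}
```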