On Saturday, 8 May 2021 at 19:06:48 UTC, guai wrote:
> I meant these combining characters. They are language-specific, but most of the time the string does not contain any clue as to which language it is.
You are talking about generic algorithms that work for every script. But Unicode also allows for algorithms that support only a subset. If your subset doesn't contain combining characters, you don't need to care about them. Otherwise you may need to step back to the preceding base character. It depends on the use case.
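Stepping back to a base character could look roughly like this. It is only a sketch: the helper name `step_back_to_base` is made up for illustration, it treats every code point in a Unicode "Mark" category as combining, and it is not full grapheme cluster segmentation.

```python
import unicodedata

def step_back_to_base(text: str, i: int) -> int:
    """Move index i to the left while text[i] is a combining mark."""
    # Categories Mn, Mc and Me are the Unicode "Mark" categories.
    while 0 < i < len(text) and unicodedata.category(text[i]).startswith("M"):
        i -= 1
    return i

s = "cafe\u0301 ole"             # 'e' followed by a combining acute accent
cut = step_back_to_base(s, 4)    # index 4 is the combining accent
head, tail = s[:cut], s[cut:]    # the accent stays attached to its 'e' in tail
```

If the text is known to contain no combining marks, the loop never runs and the cut position is used as given.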
> > - I can imagine that this can be useful in divide-and-conquer algorithms, like binary search.
> They must be applied with great care to non-ASCII texts. What about RTL, for example? You cannot split inside an RTL block.
Oh, yes, you can! Think of an algorithm that does cryptographic analysis and counts consecutive pairs of ASCII characters. For that it doesn't matter if RTL text is cut into pieces.
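A minimal sketch of such a pair count, working directly on UTF-8 bytes. The function name `ascii_pair_counts` is hypothetical, and the sample text is just an illustration with an RTL run in the middle.

```python
from collections import Counter

def ascii_pair_counts(data: bytes) -> Counter:
    """Count adjacent byte pairs in which both bytes are ASCII (< 0x80)."""
    counts = Counter()
    for a, b in zip(data, data[1:]):
        if a < 0x80 and b < 0x80:
            counts[bytes((a, b))] += 1
    return counts

data = "ab \u05e9\u05dc\u05d5\u05dd ab".encode("utf-8")  # ASCII around a Hebrew word
whole = ascii_pair_counts(data)

# Cut at an arbitrary byte offset, even inside the RTL run. With one byte of
# overlap, every adjacent pair lands in exactly one of the two pieces.
k = 5
pieces = ascii_pair_counts(data[:k + 1]) + ascii_pair_counts(data[k:])
```

Bytes of multi-byte UTF-8 sequences are all 0x80 or above, so they are skipped no matter where the cut falls.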
> > - Or you want to cut a string into pieces of a certain length (again 50?), where the exact length is not so important.
> For what business task would I do that?
Simple wrapping to avoid losing text when printing, or to avoid having to scroll horizontally. It is probably not useful for a high-quality program...
> I may want to split a string on some char subsequence for lexing. But one cannot assume lengths of those chunks.
Depending on the use case, you may know them ahead of time.
> > So you just jump ahead 50, go back again and split at this point. If there are a lot of non-ASCII characters in between, the piece is of course shorter, but that may be OK, because speed is more important.
> Not sure if speed is more important than correctness.
Of course, this again depends on the use case. You can't say that in general.
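The "jump ahead 50, go back, split" idea can be sketched on UTF-8 bytes as follows. This assumes valid UTF-8 input; the name `split_near` and the default of 50 are only illustrative. Continuation bytes in UTF-8 always have the bit pattern `10xxxxxx`, so backing up takes at most three steps.

```python
def split_near(data: bytes, limit: int = 50):
    """Split UTF-8 data at or just before byte offset `limit`, never inside a code point."""
    cut = min(limit, len(data))
    # Back up while the byte at the cut is a continuation byte (0b10xxxxxx).
    while 0 < cut < len(data) and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut], data[cut:]

text = "a" * 49 + "\u00e9" + "b"          # the two-byte 'é' straddles offset 50
head, tail = split_near(text.encode("utf-8"))
# Both halves are valid UTF-8 on their own; head is one byte shorter than 50.
```

The piece length is counted in bytes, not characters, which is exactly the trade-off mentioned above: text with many non-ASCII characters yields pieces with fewer characters.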
> > - You want to process pieces of a string in parallel: cut it into 16 pieces and let your 16 cores work on one each.
> I'm not sure if this is possible with all the quirks of Unicode.
Think again of the cryptographic analysis above, for example. (Or of automatically checking Wikipedia entries for whatever.)
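A rough sketch of the 16-piece split, assuming valid UTF-8 input. The names `utf8_chunks`, `count_ascii_letters` and `parallel_count` are made up, and the per-chunk work is a stand-in for whatever analysis is wanted. A thread pool keeps the example short; for CPU-bound work in CPython a process pool would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor

def utf8_chunks(data: bytes, parts: int = 16) -> list:
    """Cut UTF-8 data into `parts` pieces of roughly equal byte length,
    moving each cut back so it never lands inside a code point."""
    bounds = [0]
    for k in range(1, parts):
        cut = len(data) * k // parts
        while 0 < cut < len(data) and (data[cut] & 0xC0) == 0x80:
            cut -= 1
        bounds.append(max(cut, bounds[-1]))   # keep the boundaries monotonic
    bounds.append(len(data))
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]

def count_ascii_letters(chunk: bytes) -> int:
    """Placeholder per-chunk work: count ASCII letters."""
    return sum(1 for b in chunk if b < 0x80 and chr(b).isalpha())

def parallel_count(data: bytes, parts: int = 16) -> int:
    with ThreadPoolExecutor(max_workers=parts) as pool:
        return sum(pool.map(count_ascii_letters, utf8_chunks(data, parts)))
```

This works because the per-chunk result does not depend on neighbouring chunks. An analysis that looks at adjacent pairs would additionally need a small overlap between pieces.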
Keep in mind that we do not always have to support all of Unicode. If we know ahead of time that our text is mainly ASCII plus a few other base characters, and never contains combining characters and the like, we can use different algorithms that may be simpler, faster, or both. Ensuring that this constraint holds is then something that has to be done outside of the algorithm.
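Such an outside check could be as simple as the following sketch. The function name is hypothetical, and "no combining characters" is interpreted here as "no code point in a Unicode Mark category".

```python
import unicodedata

def has_no_combining_marks(text: str) -> bool:
    """True if no code point in text belongs to a Mark category (Mn, Mc, Me)."""
    return not any(unicodedata.category(ch).startswith("M") for ch in text)

ok = has_no_combining_marks("Gr\u00fc\u00dfe")   # precomposed letters only
bad = has_no_combining_marks("e\u0301")          # 'e' plus combining acute accent
```

Run once at the input boundary, a check like this lets the faster subset algorithm assume the constraint instead of re-checking it at every step.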
> Never even heard of parallel processors of structured texts like XML.
I would judge it much more difficult to process XML in parallel than to do the same with Unicode text.