August 14
On 8/13/2019 12:08 AM, Walter Bright wrote:
> If people want impactful things to work on, fixing each failure is worthwhile (each in separate PRs).

First fix: https://github.com/dlang/phobos/pull/7133
6 days ago
On Wednesday, 14 August 2019 at 09:29:30 UTC, Gregor Mückl wrote:

> There is no single universally correct way to segment a string. Grapheme segmentation requires a correct assumption of the text encoding in the string and also the assumption that the encoding is flawless. Neither may be guaranteed in general. There are a lot of ways to corrupt UTF-8 strings, for example.

Do you mean that there's no way to verify those assumptions?
Sorting algorithms in Phobos return a SortedRange.

> And then there is a question of the length of a grapheme: IIRC they can consist of up to 6 or 7 code points with each of them encoded in a varying number of bytes in UTF-8, UTF-16 or UCS-2. So what data type do you use for representing graphemes then that is both not wasteful and doesn't require dynamic memory management?

Is performance the rationale for not using dynamic memory management, even if it is unavoidable for correct behaviour?

> Then there are other nasty quirks around graphemes: their encoding is not unique. This Unicode TR gives a good impression of how complex this single aspect is: https://unicode.org/reports/tr15/
> So if you want to use graphemes, do you want to keep the original encoding or do you implicitly convert them to NFC or NFD? NFC tends to be better for language processing, NFD tends to be better for text rendering (with exceptions). If you don't normalize, semantically equivalent graphemes may not be equal under comparison.

Is performance the rationale for not using normalisation, which solves all the problems you have mentioned above?

> At this point you're probably approaching the complexity of libraries like ICU. You can take a look at it if you want a good scare. ;)

The original question is still not answered: can you provide examples of algorithms and use cases that don't need grapheme segmentation?

6 days ago
On Wednesday, 14 August 2019 at 17:12:00 UTC, H. S. Teoh wrote:

> - Taking substrings: does not need grapheme segmentation; you just slice the string.

What is the use case for slicing a multi-code-unit encoded grapheme in the middle?

> - Copying one string to another: does not need grapheme segmentation, you just use memcpy (or equivalent).
> - Concatenating n strings: does not need grapheme segmentation, you just use memcpy (or equivalent). In D, you just use array append, or std.array.appender if you get fancy.

That use case is not string processing, but general memory handling of an opaque type.

> - Comparing one string to another: does not need grapheme segmentation;
>   you either use strcmp/memcmp

That use case is not string processing, but general memory comparison of an opaque type.

>, or if you need more delicate semantics,
> call one of the standard Unicode string collation algorithms (std.uni, meaning, your code does not need to worry about grapheme segmentation, and besides, Unicode collation algorithms operate at the code point  level, not at the grapheme level).

So this use case needs proper handling of encoded code units, and can't be satisfied simply by removing autodecoding.

> - Matching a substring: does not need grapheme segmentation; most
>   applications just need subarray matching, i.e., treat the substring as
>   an opaque blob of bytes, and match it against the target.

That use case is not string processing, but general memory comparison of an opaque type.

> If you need more delicate semantics, there are standard Unicode algorithms for
> substring matching (i.e., user code does not need to worry about the low-level details -- the inputs are basically opaque Unicode strings whose internal structure is unimportant).

Again, removing autodecoding does not change anything for that.

> You really only need grapheme segmentation when:
> - Implementing a text layout algorithm where you need to render glyphs
> to some canvas.
> - Measuring the size of some piece of text for output alignment
>   purposes: in this case, grapheme segmentation isn't enough; you need font size information and other such details (like kerning, spacing parameters, etc.).

What about all the examples earlier in the thread showing how autodecoding currently behaves incorrectly?

Retro, correct substring slicing, correct indexing, et cetera.

> Ultimately, the whole point behind removing autodecoding is to put the onus on the user code to decide what kind of iteration it wants: code units, code points, or graphemes. (Or just use one of the standard algorithms and don't reinvent the square wheel.)

There will always be a default way to iterate; see below.

>> Are they really SO common that the correct default is go for code points?
>
> The whole point behind removing autodecoding is so that we do NOT default to code points, which is currently the default.  We want to put the choice in the user's hand, not silently default to iteration by code point under the illusion of correctness, which is actually incorrect for non-trivial inputs.

The illusion of correctness should be turned into correctness, then.
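
For reference, the three iteration levels named in the quote above can already be selected explicitly in current Phobos. The following is only a minimal sketch, assuming `std.utf.byCodeUnit`, `std.utf.byDchar` and `std.uni.byGrapheme` as they exist today; the sample string is an arbitrary choice.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // 'o' followed by U+0308 COMBINING DIAERESIS, shown as a single "ö"
    string s = "o\u0308";

    assert(s.byCodeUnit.walkLength == 3); // UTF-8 code units (bytes)
    assert(s.byDchar.walkLength == 2);    // code points
    assert(s.byGrapheme.walkLength == 1); // user-perceived characters
}
```

Each level gives a different length for the same text, which is why the choice of default matters.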

>> Is it not better to have as a default the grapheme segmentation, the correct way of handling a string, instead?
>
> Grapheme segmentation is very complex, and therefore, very slow.  Most string processing doesn't actually need grapheme segmentation.

Can you provide string processing that doesn't need grapheme segmentation?
The examples listed above are not string processing examples.

> Setting that as the default would mean D string processing will be excruciatingly slow by default, and furthermore all that extra work will be mostly for nothing because most of the time we don't need it anyway.

From the examples above, most of the time you simply need opaque memory management, decaying the string/wstring/dstring to a binary blob, but that's not string processing.

My (refined) point still stands: can you provide examples of (text processing) algorithms and use cases that don't need grapheme segmentation?

6 days ago
On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
> https://github.com/dlang/phobos/pull/7130

Thank you for working on this!

Surprisingly, the amount of breakage this causes seems rather small. I sent a few PRs for the modules that I am listed as a code owner of.

However, I noticed that one kind of breakage is silent (the code compiles and runs, but behaves differently). This makes me uneasy, as it would be difficult to ensure that programs are fully and correctly updated for a (hypothetical) transition to no-autodecode.

I found two cases of such silent breakage. One was in std.stdio:

https://github.com/dlang/phobos/pull/7140

If there was a warning or it was an error to implicitly convert char to dchar, the breakage would have been detected during compilation. I'm sure we discussed this before. (Allowing a char, which might have a value >=0x80, to implicitly convert to dchar, which would be nonsense, is problematic, etc.)
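
A minimal sketch of that kind of silent change, under the assumption that `front` on a string without autodecoding would behave like today's `std.utf.byCodeUnit` front (the variable names are illustrative only):

```d
import std.range.primitives : front;
import std.utf : byCodeUnit;

void main()
{
    string s = "\u00f6"; // encoded in UTF-8 as the two bytes 0xC3 0xB6

    // With autodecoding, front decodes a whole code point.
    dchar decoded = s.front;
    assert(decoded == 0xF6);

    // Code-unit view: the first byte is implicitly widened from char to
    // dchar, so the same declaration still compiles but holds 0xC3.
    dchar widened = s.byCodeUnit.front;
    assert(widened == 0xC3);
}
```

Both declarations compile without any diagnostic, so only the runtime value reveals the difference.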

The other instance of silent breakage was in std.regex. This unittest assert started failing:

https://github.com/dlang/phobos/blob/5cb4d927e56725a38b0b1ea1548d9954083d3290/std/regex/package.d#L629

I haven't looked into that, perhaps someone more familiar with std.regex and std.uni could have a look.

6 days ago
On Thursday, 15 August 2019 at 11:02:54 UTC, Argolis wrote:

> My (refined) point still stands: can you provide examples of (text processing) algorithms and use cases that don't need grapheme segmentation?

Parsing XML, HTML and other such things is what people usually have in mind. In general, all sorts of text where human-readable parts are interleaved with (easier to handle) machine instructions.
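
As a rough illustration of why such parsing can stay at the code-unit level: the delimiters are ASCII, and an ASCII byte never occurs inside a multi-byte UTF-8 sequence. The helper `leadingTagName` below is hypothetical and far from a real parser.

```d
import std.algorithm.searching : countUntil;
import std.utf : byCodeUnit;

// Hypothetical helper: returns the name of a leading "<name>" tag, or null.
string leadingTagName(string s)
{
    if (s.length == 0 || s[0] != '<')
        return null;
    // Search code units only; no decoding or grapheme segmentation needed.
    auto end = s.byCodeUnit.countUntil('>');
    return end < 0 ? null : s[1 .. end];
}

void main()
{
    // The human-readable payload contains a combining sequence.
    assert(leadingTagName("<p>o\u0308</p>") == "p");
    assert(leadingTagName("no tag") is null);
}
```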

6 days ago
On Thursday, 15 August 2019 at 12:09:02 UTC, Vladimir Panteleev wrote:
> I haven't looked into that, perhaps someone more familiar with std.regex and std.uni could have a look.

In std.uni, there is genericDecodeGrapheme, which needs to:

1. Work with strings of any width
2. Work with input ranges of dchars
3. Advance the given range by ref

With autodecoding on, the first case is handled by .front / .popFront.

With autodecoding off, there is no direct equivalent any more. The problem is that the function needs to peek ahead (which can be multiple range elements for ranges of narrow char types, which is not possible for input ranges).

- Replacing .front / .popFront with std.utf.decodeFront does not work because the function does not do these operations in the same place, so we need to save the range before decodeFront advances it, but we can't .save() input ranges from case 2 above.

- Using byDchar does not work because .byDchar does not take its range by ref, so advancing the byDchar range will not advance the range passed by ref to genericDecodeGrapheme. I tried to use std.range.refRange for this but hit a compiler ICE ("precedence not defined for token 'cantexp'").

Perhaps there is already a construct in Phobos that can solve this?
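
To make the by-ref distinction in the two bullets concrete, here is a small sketch using a plain string. It only illustrates the problem (it is not a fix for genericDecodeGrapheme) and assumes `std.utf.byDchar` and `std.utf.decodeFront` as currently documented.

```d
import std.utf : byDchar, decodeFront;

void main()
{
    // byDchar works on its own copy of the slice:
    // advancing the view does not advance the original.
    string a = "\u00f6x";
    auto view = a.byDchar;
    view.popFront();
    assert(view.front == 'x');
    assert(a == "\u00f6x"); // unchanged

    // decodeFront takes its argument by ref and advances it,
    // so there is no way to peek without consuming.
    string b = "\u00f6x";
    dchar c = b.decodeFront;
    assert(c == 0xF6);
    assert(b == "x");
}
```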

6 days ago
On Thursday, 15 August 2019 at 12:09:02 UTC, Vladimir Panteleev wrote:
> I haven't looked into that, perhaps someone more familiar with std.regex and std.uni could have a look.

I should add that the std.uni "silent" breakage also was due to `dchar c = str.front`, and would have been found by disallowing char->dchar implicit conversion.
6 days ago
On Thursday, 15 August 2019 at 12:09:02 UTC, Vladimir Panteleev wrote:
> On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
>> https://github.com/dlang/phobos/pull/7130
>
> [...]
>
> However, I noticed that one kind of breakage is silent (the code compiles and runs, but behaves differently). This makes me uneasy, as it would be difficult to ensure that programs are fully and correctly updated for a (hypothetical) transition to no-autodecode.

I remembered this article from the wiki where you pointed this out back
in 2014:

https://wiki.dlang.org/Element_type_of_string_ranges

See also the forum thread that it links to.

6 days ago
On Thursday, 15 August 2019 at 15:01:22 UTC, Les De Ridder wrote:
> I remembered this article from the wiki where you pointed this out back
> in 2014:
>
> https://wiki.dlang.org/Element_type_of_string_ranges

I completely forgot about that. Thanks for bringing it up, looks like it's still relevant :)
6 days ago
On Thursday, 15 August 2019 at 11:02:54 UTC, Argolis wrote:
> From the examples above, most of the time you simply need opaque memory management, decaying the string/wstring/dstring to a binary blob, but that's not string processing.

This is the point we're trying to get across to you: this isn't sufficient. Depending on the context and the script/language, you need access to the string at various levels. E.g. a font renderer needs to sometimes iterate code points, not graphemes in order to compose the correct glyphs.

Binary blob comparisons for comparing strings are *also* not sufficient, again depending on both script/language of the text in the string and the context in which the comparison is performed. If the comparison is to be purely semantic, the following strings should be equal: "\u00f6" and "\u006f\u0308". They both represent the same "Latin Small Letter O with Diaeresis". Their in-memory representations clearly aren't equal, so a memcmp won't yield the correct result. The same applies to sorting.
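
A minimal sketch of this comparison, assuming `std.uni.normalize` and picking NFC arbitrarily as the common form (`canonicallyEqual` is a hypothetical helper name):

```d
import std.uni : normalize, NFC;

// Hypothetical helper: compares two strings after normalizing both to NFC.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed   = "\u00f6";       // single precomposed code point
    string decomposed = "\u006f\u0308"; // 'o' + combining diaeresis

    assert(composed != decomposed);                 // byte-wise: different
    assert(canonicallyEqual(composed, decomposed)); // normalized: equal
}
```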

If you decide to force a specific string normalization internally, you put the burden on the user to explicitly select a different normalization when they require it. Plus, there is no way to perfectly reconstruct the input binary representation of a string, e.g. when it was given in a non-normalized form (e.g. a mix of NFC and NFD). Once such a string is through a normalization algorithm, the exact input is unrecoverable. This makes interfacing with other code that has idiosyncrasies around all of this hard to impossible to achieve.

One such system that I worked on in the past was a small embedded microcontroller driven HCI module with very limited capabilities, but with the requirement to be multilingual. I carefully worked out that for the languages that were required, a UTF-8 encoding with a very specific normalization would just about work. This choice was viable because the user interface was created in a custom tool where I could control the code and data generation just enough to make it work.

Another case where normalization is troublesome is ligatures. Ligatures that are purely stylistic like "ff", "ffi", "fft", "st", "ct" etc... have their own code points. Yet, it is a purely stylistic choice whether to use them. So in terms of the contained text, the ligature \ufb00 is equal to the string "ff", but it is not the same grapheme. Whether you can normalize this depends on the context. The user may have selected the ligature representation deliberately to have it appear as such on screen. If you want to do spell checking on the other hand, you would need to resolve the ligature to its individual letters.
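
The context dependence can be sketched with `std.uni.normalize`: the canonical form NFC keeps the ligature, while the compatibility form NFKC resolves it to plain letters. This is only an illustration of the two choices, not a recommendation for either.

```d
import std.uni : normalize, NFC, NFKC;

void main()
{
    string ligature = "\ufb00"; // LATIN SMALL LIGATURE FF

    // Canonical normalization preserves the stylistic choice...
    assert(normalize!NFC(ligature) == ligature);

    // ...while compatibility normalization folds it to "ff",
    // which is what e.g. a spell checker would want.
    assert(normalize!NFKC(ligature) == "ff");
}
```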

And then there is Hangul: this is a prime example of a writing system that is "weird" to westerners. It is based on 40 symbols (19 consonants, 21 vowels) which aren't written individually, but merged syllable by syllable into rectangular blocks of two or three such symbols. These symbols get arranged in different layouts depending on which symbols there are in a syllable. As far as I understand, this follows a clear algorithm. This results in approximately 6500 individual graphemes that are actually written. Yet each of these is a group of two or three letters and parsed as such. So depending on whether you're interested in individual letters or syllables, you need to use a different string representation for processing that language.
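
As a small illustration of the two views, `std.uni.decomposeHangul` splits a precomposed syllable into its letters (jamo), while `std.uni.byGrapheme` treats the syllable as one unit. The syllable used here is an arbitrary example.

```d
import std.algorithm.comparison : equal;
import std.range : walkLength;
import std.uni : byGrapheme, decomposeHangul;

void main()
{
    // U+D55C HANGUL SYLLABLE HAN: one written block, three letters.
    auto jamo = decomposeHangul('\uD55C');
    assert(jamo[].equal("\u1112\u1161\u11AB"));

    // Viewed as graphemes, the precomposed syllable is a single element.
    assert("\uD55C".byGrapheme.walkLength == 1);
}
```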

OK, these are all just examples that come to my mind while brainstorming the question a little bit. However, none of us are experts in language processing, so whatever examples we can come up with are very likely just the very tip of the iceberg. There is a reason why libraries like ICU give the user a lot of control over string handling and expose a lot of variants of functions depending on the user intent and context. This design rests on a lot of expert knowledge that we don't have, but we know that it is sound. Going against that wisdom is inviting trouble. Autodecoding is an example of doing just that.