June 02, 2016
On Thursday, 2 June 2016 at 21:56:10 UTC, Walter Bright wrote:
> Yes, you have a good point. But we do allow things like:
>
>    byte b;
>    if (b == 10000) ...

Why allowing char/wchar/dchar comparisons is wrong:

void main()
{
    string s = "Привет";
    foreach (c; s)
        assert(c != 'Ñ');
}

From my post from 2014:

http://forum.dlang.org/post/knrwiqxhlvqwxqshyqpy@forum.dlang.org

June 02, 2016
On Thursday, June 02, 2016 18:23:19 Andrei Alexandrescu via Digitalmars-d wrote:
> On 06/02/2016 05:58 PM, Walter Bright wrote:
> > On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:
> >> The lambda returns bool. -- Andrei
> >
> > Yes, I was wrong about that. But the point still stands with:
> >  > * s.balancedParens('〈', '〉') works only with autodecoding.
> >  > * s.canFind('ö') works only with autodecoding. It returns always
> >
> > false without.
> >
> > Can be made to work without autodecoding.
>
> By special casing? Perhaps. I seem to recall though that one major issue with autodecoding was that it special-cases certain algorithms. So you'd need to go through all of std.algorithm and make sure you can special-case your way out of situations that work today.

Yeah, I believe that you do have to do some special casing, though it would be special casing on ranges of code units in general and not strings specifically, and a lot of those functions are already special-cased on strings in an attempt to be efficient. In particular, with a function like find or canFind, you'd take the needle and encode it to match the haystack it was passed so that you can do the comparisons via code units. So, you incur the encoding cost once, on the needle, rather than incurring a decoding cost for each code point or grapheme as you iterate over the haystack. You end up with something that's both correct and efficient, and it's also much friendlier to code that only operates on ASCII.
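
A rough sketch of what I mean for the UTF-8 case (canFindEncoded is just an illustrative name, not an actual Phobos function):

import std.algorithm.searching : canFind;
import std.utf : byCodeUnit, encode;

// Encode the needle once, then search by code units - no decoding of the haystack.
bool canFindEncoded(string haystack, dchar needle)
{
    char[4] buf;                         // a code point is at most 4 UTF-8 code units
    immutable len = encode(buf, needle); // std.utf.encode returns the number written
    return haystack.byCodeUnit.canFind(buf[0 .. len].byCodeUnit);
}

void main()
{
    assert(canFindEncoded("blöd", 'ö'));
    assert(!canFindEncoded("blod", 'ö'));
}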

The one issue that I'm not quite sure how we'd handle in that case is normalization (which auto-decoding doesn't handle either), since you'd need to normalize the needle to match the haystack (which also assumes that the haystack was already normalized). Certainly, it's the sort of thing that makes it so that you kind of wish you were dealing with a string type that had the normalization built into it rather than either an array of code units or an arbitrary range of code units. But maybe we could assume the NFC normalization like std.uni.normalize does and provide an optional template argument for the normalization scheme.
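
For example, a search that's correct at the code point level can still miss when the two sides use different normalization forms (std.uni.normalize defaults to NFC):

import std.algorithm.searching : canFind;
import std.uni : normalize;

void main()
{
    string haystack = "ba\u0308r"; // 'a' followed by U+0308 (combining diaeresis)
    string needle   = "\u00E4";    // precomposed 'ä'

    // The same text to a human, but different code point sequences:
    assert(!haystack.canFind(needle));

    // normalize defaults to NFC; normalizing both sides makes the search meaningful:
    assert(haystack.normalize.canFind(needle.normalize));
}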

In any case, while it's not entirely straightforward, it is quite possible to write some algorithms in a way which works on arbitrary ranges of code units and deals with Unicode correctly without auto-decoding or requiring that the user convert it to a range of code points or graphemes in order to properly handle the full range of Unicode. And even if we keep auto-decoding, we pretty much need to fix it so that std.algorithm and friends are Unicode-aware in this manner so that ranges of code units work in general without requiring that you use byGrapheme. So, this sort of thing could have a large impact on RCStr, even if we keep auto-decoding for narrow strings.

Other algorithms, however, can't be made to work automatically with Unicode - at least not with the current range paradigm. filter, for instance, really needs to operate on graphemes to filter on characters, but with a range of code units, that would mean operating on groups of code units as a single element, which you can't do with something like a range of char, since that essentially becomes a range of ranges. It has to be wrapped in a range that's going to provide graphemes - and of course, if you know that you're operating only on ASCII, then you wouldn't want to deal with graphemes anyway, so automatically converting to graphemes would be undesirable. So, for a function like filter, it really does have to be up to the programmer to indicate what level of Unicode they want to operate at.
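
A quick sketch of what that looks like when the wrapping is done explicitly (the string and the filtered-out grapheme are just examples):

import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter;
import std.conv : text;
import std.uni : byCodePoint, byGrapheme;

void main()
{
    string s = "cafe\u0301"; // "café" with 'é' as 'e' + U+0301 (combining acute)

    // Filtering at the grapheme level keeps the accent attached to its base
    // character; a code unit or code point filter would see 'e' and U+0301 as
    // separate elements and could leave a stray combining mark behind.
    auto result = s.byGrapheme
                   .filter!(g => !g[].equal("e\u0301"))
                   .byCodePoint
                   .text;
    assert(result == "caf");
}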

But if we don't make functions Unicode-aware where possible, then we're going to take a performance hit by essentially forcing everyone to use explicit ranges of code points or graphemes even when they should be unnecessary. So, I think that we're stuck with some level of special casing, but it would then be for ranges of code units and code points and not strings. So, it would work efficiently for stuff like RCStr, which the current scheme does not.

I think that the reality of the matter is that regardless of whether we keep auto-decoding for narrow strings, we need to make Phobos operate on arbitrary ranges of code units and code points. Even stuff like RCStr won't work efficiently otherwise, and byCodeUnit won't be usable in as many cases, because if a generic function isn't Unicode-aware, then in many cases byCodeUnit will be very wrong, just like byCodePoint would be wrong. So, as far as Phobos goes, I'm not sure that the question of auto-decoding matters much for what we need to do at this point. If we do what we need to do, then Phobos will work whether we have auto-decoding or not (working in a Unicode-aware manner where possible and forcing the user to decide the correct level of Unicode to work at where not), and then it just becomes a question of whether we can or should deprecate auto-decoding once all that's done.
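
For example (assuming a precomposed 'ö' in the literal):

import std.algorithm.searching : canFind;
import std.utf : byCodeUnit;

void main()
{
    string s = "blöd";

    // With auto-decoding, the comparison happens at the code point level:
    assert(s.canFind('ö'));

    // Over raw code units, 'ö' (0xF6) never matches the UTF-8 bytes 0xC3 0xB6,
    // so the same call silently returns false:
    assert(!s.byCodeUnit.canFind('ö'));
}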

- Jonathan M Davis


June 02, 2016
On Thursday, June 02, 2016 22:27:16 John Colvin via Digitalmars-d wrote:
> On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
> > I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.
>
> There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character? It's an interesting idea, but it's not how things are.

Yeah. I'm inclined to think that the fact that there are multiple normalizations was a huge mistake on the part of the Unicode folks, but we're stuck dealing with it. And as horrible as it is for most cases, maybe it _does_ ultimately make sense because of certain use cases; I don't know. But bad idea or not, we're stuck. :(

- Jonathan M Davis

June 02, 2016
On Thursday, June 02, 2016 15:48:03 Walter Bright via Digitalmars-d wrote:
> On 6/2/2016 3:23 PM, Andrei Alexandrescu wrote:
> > On 06/02/2016 05:58 PM, Walter Bright wrote:
> >>  > * s.balancedParens('〈', '〉') works only with autodecoding.
> >>  > * s.canFind('ö') works only with autodecoding. It returns always
> >>
> >> false without.
> >>
> >> Can be made to work without autodecoding.
> >
> > By special casing? Perhaps.
>
> The argument to canFind() can be detected as not being a char, then decoded into a sequence of char's, then forwarded to a substring search.

How do you suggest that we handle the normalization issue? Should we just assume NFC like std.uni.normalize does and provide an optional template argument to indicate a different normalization (like normalize does)? Since without providing a way to deal with the normalization, we're not actually making the code fully correct, just faster.
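
Something along these lines, maybe (purely illustrative, not an actual Phobos declaration):

import std.algorithm.searching : canFind;
import std.uni : NFC, NormalizationForm, normalize;

// The normalization form is an optional template argument defaulting to NFC,
// mirroring std.uni.normalize:
bool canFindNormalized(NormalizationForm norm = NFC)(string haystack, string needle)
{
    return haystack.normalize!norm.canFind(needle.normalize!norm);
}

void main()
{
    assert(canFindNormalized("ba\u0308r", "\u00E4"));                    // NFC by default
    assert(canFindNormalized!(NormalizationForm.NFD)("bär", "a\u0308"));
}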

- Jonathan M Davis


June 02, 2016
On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:
> How do you suggest that we handle the normalization issue?

Started a new thread for that one.

June 02, 2016
On 6/2/2016 2:25 PM, deadalnix wrote:
> On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
>> I wonder what rationale there is for Unicode to have two different sequences
>> of codepoints be treated as the same. It's madness.
> To be able to convert back and forth from/to unicode in a lossless manner.


Sorry, that makes no sense, as it is saying "they're the same, only different."
June 02, 2016
On 6/2/2016 3:27 PM, John Colvin wrote:
>> I wonder what rationale there is for Unicode to have two different sequences
>> of codepoints be treated as the same. It's madness.
>
> There are languages that make heavy use of diacritics, often several on a single
> "character". Hebrew is a good example. Should there be only one valid ordering
> of any given set of diacritics on any given character?

I didn't say ordering, I said there should be no such thing as "normalization" in Unicode, where two codepoints are considered to be identical to some other codepoint.

June 03, 2016
On Thursday, 2 June 2016 at 21:00:17 UTC, tsbockman wrote:
> However, this document is very old - from Unicode 3.0 and the year 2000:
>
>> While there are no surrogate characters in Unicode 3.0 (outside of private use characters), future versions of Unicode will contain them...
>
> Perhaps level 1 has since been redefined?

I found the latest (unofficial) draft version:
    http://www.unicode.org/reports/tr18/tr18-18.html

Relevant changes:

* Level 1 is to be redefined as working on code points, not code units:

> A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units.

* Level 2 (graphemes) is explicitly described as a "default level":

> This is still a default level—independent of country or language—but provides much better support for end-user expectations than the raw level 1...

* All mention of level 2 being slow has been removed. The only reason given for making it toggle-able is for compatibility with level 1 algorithms:

> Level 2 support matches much more what user expectations are for sequences of Unicode characters. It is still locale-independent and easily implementable. However, for compatibility with Level 1, it is useful to have some sort of syntax that will turn Level 2 support on and off.

June 02, 2016
On Thu, Jun 02, 2016 at 04:38:28PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 06/02/2016 04:36 PM, tsbockman wrote:
> > Your examples will pass or fail depending on how (and whether) the
> > 'ö' grapheme is normalized.
> 
> And that's fine. Want graphemes, .byGrapheme wags its tail in that corner.  Otherwise, you work on code points which is a completely meaningful way to go about things. What's not meaningful is the random results you get from operating on code units.
> 
> > They only ever succeeds because 'ö' happens to be one of the privileged graphemes that *can* be (but often isn't!) represented as a single code point. Many other graphemes have no such representation.
> 
> Then there's no dchar for them so no problem to start with.
> 
> s.find(c) ----> "Find code unit c in string s"
[...]

This is a ridiculous argument.  We might as well say "there's no single UTF-8 byte that can represent Ш, so that's no problem to start with" -- since we can just define the problem away by saying s.find(c) means "find byte c in string s", and thereby justify using ASCII as our standard string representation.

The point is that dchar is NOT ENOUGH TO REPRESENT A SINGLE CHARACTER in the general case.  It is adequate for a subset of characters -- just like ASCII is also adequate for a subset of characters.  If you only need to work with ASCII, it suffices to work with ubyte[]. Similarly, if your work is restricted to languages without combining diacritics, then a range of dchar suffices. But a range of dchar is NOT good enough in the general case, and arguing that it is only makes you look like a fool.

Appealing to normalization doesn't change anything either, since only a subset of base character + diacritic combinations will normalize to a single code point. If the string has a base character + diacritic combination that doesn't have a precomposed code point, it will NOT fit in a dchar. (And keep in mind that the notion of diacritic is still very Euro-centric. In Korean, for example, a single character is composed of multiple parts, each of which occupies 1 code point. While some precomposed combinations do exist, they don't cover all of the possibilities, so normalization won't help you there.)
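
To put it concretely (this relies on the fact that Unicode has no precomposed form for a Cyrillic vowel with an acute stress mark):

import std.range.primitives : walkLength;
import std.uni : byGrapheme, normalize;

void main()
{
    // 'a' + combining diaeresis has a precomposed form, so NFC folds it into
    // a single code point:
    assert("a\u0308".normalize.walkLength == 1);

    // Cyrillic 'а' + combining acute has no precomposed form, so even after
    // NFC it stays two code points -- one grapheme that no single dchar can hold:
    string stressed = "\u0430\u0301".normalize;
    assert(stressed.walkLength == 2);
    assert(stressed.byGrapheme.walkLength == 1);
}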


T

-- 
Frank disagreement binds closer than feigned agreement.
June 02, 2016
On Thu, Jun 02, 2016 at 04:28:45PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 06/02/2016 04:17 PM, Timon Gehr wrote:
> > I.e. you are saying that 'works' means 'operates on code points'.
> 
> Affirmative. -- Andrei

Again, a ridiculous position.  I can use exactly the same line of argument for why we should just standardize on ASCII. All I have to do is to define "work" to mean "operates on an ASCII character", and then every ASCII algorithm "works" by definition, so nobody can argue with me.

Unfortunately, everybody else's definition of "work" is different from mine, so the argument doesn't hold water.

Similarly, you are the only one whose definition of "work" means "operates on code points". Basically nobody else here uses that definition, so while you may be right according to your own made-up tautological arguments, none of your conclusions actually have any bearing on the real world of Unicode handling.

Give it up. It is beyond reasonable doubt that autodecoding is a liability. D should be moving away from autodecoding instead of clinging to historical mistakes in the face of overwhelming evidence. (And note, I said *auto*-decoding; decoding by itself obviously is very relevant. But it needs to be opt-in because of its performance and correctness implications. The user needs to be able to choose whether to decode, and how to decode.)


T


-- 
Freedom: (n.) Man's self-given right to be enslaved by his own depravity.