March 07, 2014
On Fri, Mar 07, 2014 at 09:58:39PM +0000, Vladimir Panteleev wrote:
> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
> >On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
> >>No, it doesn't.
> >>
> >>import std.algorithm;
> >>
> >>void main()
> >>{
> >>   auto s = "cassé";
> >>   assert(s.canFind('é'));
> >>}
> >>
> >
> >Hm, I'm not following? Works perfectly fine on my system?

Probably because your browser is normalizing the unicode string when you copy-n-paste Vladimir's message? See below:


> Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

I downloaded the file and looked at it through `od -ctx1`: the first é is encoded as the byte sequence 65 cc 81, that is, [U+0065, U+0301] (small letter e + combining acute accent), whereas the second é is encoded as c3 a9, that is, U+00E9 (precomposed small letter e with acute accent).
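
To make the difference concrete, here's a minimal sketch (using \u escapes so that no editor or browser can renormalize it behind your back; I'm assuming std.uni.normalize is available, which it is in recent Phobos):

import std.uni : normalize, NFC;

void main()
{
    string decomposed  = "e\u0301"; // U+0065 + U+0301: 'e' plus combining acute
    string precomposed = "\u00E9";  // U+00E9: precomposed 'é'

    assert(decomposed != precomposed);                // raw comparison: not equal
    assert(normalize!NFC(decomposed) == precomposed); // equal after NFC normalization
}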

This illustrates one of my objections to Andrei's post: by auto-decoding behind the user's back and hiding the intricacies of unicode from him, it has masked the fact that codepoint-for-codepoint comparison of a unicode string is not guaranteed to always return the correct results, due to the possibility of non-normalized strings.

Basically, to have correct behaviour in all cases, the user must be aware of, and use, the Unicode collation / normalization algorithms prescribed by the Unicode standard. What we have in std.algorithm right now is an incomplete implementation with non-working edge cases (like Vladimir's example) that has poor performance to start with. Its only redeeming factor is that the auto-decoding hack has given it the illusion of being correct, when actually it's not correct according to the Unicode standard. I don't see how this is necessarily superior to Walter's proposal.


T

-- 
Just because you survived after you did it, doesn't mean it wasn't stupid!
March 07, 2014
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
> On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu wrote:
>> Allow me to enumerate the functions of std.algorithm and how they work today and how they'd work with the proposed change. Let s be a variable of some string type.
>
>> s.canFind('é') currently works as expected.
>
> No, it doesn't.
>
> import std.algorithm;
>
> void main()
> {
>     auto s = "cassé";
>     assert(s.canFind('é'));
> }
>
> That's the whole problem - all this hot steam and it still does not work properly. Because it can't - not without pulling in all of the Unicode algorithms implicitly, and that would be much worse.
>
>> I went down std.algorithm in the order listed in its documentation and found pernicious issues with almost every single algorithm.
>
> All of your examples are variations of one and the same case: searching for a non-ASCII dchar or dchar literal.
>
> How often does this pattern occur in real programs? I think the only real metric is to try the change and find out.
>
>> Clearly one might argue that their app has no business dealing with diacriticals or Asian characters. But that's the typical provincial view that marred many languages' approach to UTF and internationalization.
>
> So is yours, if you think that making everything magically a dchar is going to solve all problems.
>
> The TDPL example only showcases the problem. Yes, it works with Swedish. Now try it again with Sanskrit.

+1
In Indian languages, a character consists of one or more Unicode code points. For example, in Sanskrit "ddhrya" http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg consists of 7 Unicode code points. So to search for this character I have to use a string search.
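
For example, here is a sketch (I'm assuming the conjunct is spelled with the seven code points below; treat the exact sequence as illustrative):

import std.algorithm : canFind;

void main()
{
    // "ddhrya" as 7 code points: DDA, VIRAMA, DDHA, VIRAMA, RA, VIRAMA, YA
    string ddhrya = "\u0921\u094D\u0922\u094D\u0930\u094D\u092F";
    string text = "... " ~ ddhrya ~ " ...";

    assert(text.canFind(ddhrya)); // substring search finds the whole cluster;
                                  // a single dchar cannot even represent it
}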

- Sarath
March 07, 2014
On Friday, 7 March 2014 at 22:27:35 UTC, H. S. Teoh wrote:
> This illustrates one of my objections to Andrei's post: by auto-decoding
> behind the user's back and hiding the intricacies of unicode from him,
> it has masked the fact that codepoint-for-codepoint comparison of a
> unicode string is not guaranteed to always return the correct results,
> due to the possibility of non-normalized strings.
>
> Basically, to have correct behaviour in all cases, the user must be
> aware of, and use, the Unicode collation / normalization algorithms
> prescribed by the Unicode standard. What we have in std.algorithm right
> now is an incomplete implementation with non-working edge cases (like
> Vladimir's example) that has poor performance to start with. Its only
> redeeming factor is that the auto-decoding hack has given it the
> illusion of being correct, when actually it's not correct according to
> the Unicode standard. I don't see how this is necessarily superior to
> Walter's proposal.
>
>
> T

Yes, I realised too late.

Would it not be beneficial to have different types of literals, one type which is implicitly normalized and one which is "raw" (like today)? Since you'd typically want most string literals normalized at compile time, you'd then only have to normalize external input at run time.
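
The run-time half of that could look like the sketch below (whether normalize can also run at compile time for literals, I haven't verified):

import std.algorithm : canFind;
import std.uni : normalize, NFC;

void main()
{
    // normalize external input once, up front...
    auto s = normalize!NFC("casse\u0301"); // "cassé" with a decomposed é

    // ...after which the dchar search succeeds, since é is now U+00E9
    assert(s.canFind('\u00E9'));
}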

March 07, 2014
>
> Probably because your browser is normalizing the unicode string when you
> copy-n-paste Vladimir's message? See below:
>

Just out of curiosity, I tried it in C# to see how it's handled there; it works like this:

using System;
using System.Diagnostics;

namespace Test
{
    class Program
    {
        static void Main()
        {
            var s = "cassé";                   // é stored as e + combining acute (two code points)
            Debug.Assert(s.IndexOf('é') < 0);  // the precomposed 'é' is not found
            s = s.Normalize();                 // defaults to NFC, composing e + accent into U+00E9
            Debug.Assert(s.IndexOf('é') == 4); // now it is found, at index 4
        }
    }
}

So it doesn't work by default there either; Normalize has to be used.
March 07, 2014
On Fri, Mar 07, 2014 at 10:35:46PM +0000, Sarath Kodali wrote:
> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
> >On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu wrote:
[...]
> >>Clearly one might argue that their app has no business dealing with diacriticals or Asian characters. But that's the typical provincial view that marred many languages' approach to UTF and internationalization.
> >
> >So is yours, if you think that making everything magically a dchar is going to solve all problems.
> >
> >The TDPL example only showcases the problem. Yes, it works with Swedish. Now try it again with Sanskrit.
> 
> +1
> In Indian languages, a character consists of one or more Unicode
> code points. For example, in Sanskrit "ddhrya"
> http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
> consists of 7 Unicode code points. So to search for this character I
> have to use a string search.
[...]

That's what I've been arguing for. The most general form of character searching in Unicode requires substring searching, and similarly many character-based operations on Unicode strings are effectively substring-based operations, because said "character" may be a multibyte code point, or, in your case, multiple code points. Since that's the case, we might as well just forget about the distinction between "character" and "string", and treat all such operations as substring operations (even if the operand is supposedly "just 1 character long").

This would allow us to get rid of the hackish auto-decoding of narrow strings, and thus eliminate the needless overhead of always decoding.
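
As a sketch of what that buys us (std.string.representation exposes the raw code units; both sides must of course be in the same normalization form):

import std.algorithm : find;
import std.string : representation;

void main()
{
    string haystack = "cass\u00E9";
    string needle   = "\u00E9"; // "one character", but two code units in UTF-8

    // substring search over raw code units: no per-element decoding at all
    auto tail = haystack.representation.find(needle.representation);
    assert(tail.length > 0);
}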


T

-- 
All men are mortal. Socrates is mortal. Therefore all men are Socrates.
March 07, 2014
On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:
>
> +1
> In Indian languages, a character consists of one or more Unicode code points. For example, in Sanskrit "ddhrya" http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg consists of 7 Unicode code points. So to search for this character I have to use a string search.
>
> - Sarath

Oops, incomplete reply ...

Since a single "alphabet" in Indian languages can consist of multiple code points, iterating over single code points is like iterating over char[] for non-English European languages. So decoding is of no use other than decreasing performance. A raw char[] comparison is much faster.

And then there is "Unicode normalization", which makes string searches and comparisons very difficult.

- Sarath
March 07, 2014
On Fri, Mar 07, 2014 at 11:13:50PM +0000, Sarath Kodali wrote:
> On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:
> >
> >+1
> >In Indian languages, a character consists of one or more Unicode
> >code points. For example, in Sanskrit "ddhrya"
> >http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
> >consists of 7 Unicode code points. So to search for this character
> >I have to use a string search.
> >
> >- Sarath
> 
> Oops, incomplete reply ...
> 
> Since a single "alphabet" in Indian languages can consist of multiple code points, iterating over single code points is like iterating over char[] for non-English European languages. So decoding is of no use other than decreasing performance. A raw char[] comparison is much faster.

Yes. The more I think about it, the more auto-decoding sounds like a wrong decision. The question, though, is whether it's worth the massive code breakage needed to undo it. :-(


> And then there is "Unicode normalization", which makes string searches and comparisons very difficult.
[...]

I believe the convention is to always normalize strings before performing operations on them, in order to prevent these sorts of problems. I think many of the unicode prescribed algorithms have normalization as a prerequisite, since otherwise there's no guarantee that the algorithm will produce the correct results.
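
Something along these lines (a sketch; sameText is a hypothetical helper, not a Phobos function):

import std.uni : normalize, NFD;

// bring both operands into one normalization form before comparing
bool sameText(string a, string b)
{
    return normalize!NFD(a) == normalize!NFD(b);
}

void main()
{
    assert(sameText("casse\u0301", "cass\u00E9")); // decomposed vs. precomposed é
}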


T

-- 
"I'm not childish; I'm just in touch with the child within!" - RL
March 07, 2014
On 03/07/2014 03:37 AM, Walter Bright wrote:
> In "Lots of low hanging fruit in Phobos" the issue came up about the
> automatic encoding and decoding of char ranges.
> ...

I think this is among the most annoying aspects of Phobos.
March 07, 2014
On Friday, 7 March 2014 at 22:27:35 UTC, H. S. Teoh wrote:
> On Fri, Mar 07, 2014 at 09:58:39PM +0000, Vladimir Panteleev wrote:
>> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>> >On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>> >>No, it doesn't.
>> >>
>> >>import std.algorithm;
>> >>
>> >>void main()
>> >>{
>> >>   auto s = "cassé";
>> >>   assert(s.canFind('é'));
>> >>}
>> >>
>> >
>> >Hm, I'm not following? Works perfectly fine on my system?
>
> Probably because your browser is normalizing the unicode string when you
> copy-n-paste Vladimir's message? See below:
>
>
>> Something's messing with your Unicode. Try downloading and compiling
>> this file:
>> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
>
> I downloaded the file and looked at it through `od -ctx1`: the first é
> is encoded as the byte sequence 65 cc 81, that is, [U+0065, U+0301]
> (small letter e + combining acute accent), whereas the second é is
> encoded as c3 a9, that is, U+00E9 (precomposed small letter e with
> acute accent).
>
> This illustrates one of my objections to Andrei's post: by auto-decoding
> behind the user's back and hiding the intricacies of unicode from him,
> it has masked the fact that codepoint-for-codepoint comparison of a
> unicode string is not guaranteed to always return the correct results,
> due to the possibility of non-normalized strings.
>
> Basically, to have correct behaviour in all cases, the user must be
> aware of, and use, the Unicode collation / normalization algorithms
> prescribed by the Unicode standard. What we have in std.algorithm right
> now is an incomplete implementation with non-working edge cases (like
> Vladimir's example) that has poor performance to start with. Its only
> redeeming factor is that the auto-decoding hack has given it the
> illusion of being correct, when actually it's not correct according to
> the Unicode standard. I don't see how this is necessarily superior to
> Walter's proposal.
>
>
> T

To me, the status quo feels like an OK compromise between performance and correctness. Everyone is pointing out that working at the code point level is bad because it's not correct, but working at the code unit level, as Walter proposes, is correct even less often, so that's not really an argument for moving to it. It is, however, an argument for forcing the user to decide what level of correctness and performance they need.

Walter's idea (code unit level) would be fastest but least correct.
The current is somewhat fast and is somewhat correct.
The next level, graphemes, would be slowest of all but most correct.

It seems like there is just no way to avoid the trade-off between speed and correctness, so we shouldn't try; we should only force the user to make a decision.

Maybe some more string types are in order (hrm). In order of performance to correctness:

 string, wstring (code units)
 dstring         (code points)
+gstring         (graphemes)

(do graphemes completely normalize? If not, we'd probably need another level, say, nstring)

Then if a user needs correctness over performance, they just work with gstrings. If they need performance over correctness, they work with strings (assuming some form of Walter's idea happens; otherwise they'd work with string.representation).
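
To illustrate the three levels on a single string, here is a sketch using std.uni.byGrapheme (assuming the decomposed spelling below):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "casse\u0301"; // "cassé" with a decomposed é

    assert(s.length == 7);                // code units: 5 ASCII + 2 for U+0301
    assert(s.walkLength == 6);            // code points: auto-decoded dchars
    assert(s.byGrapheme.walkLength == 5); // graphemes: the é counts as one
}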
March 08, 2014
On 3/7/2014 11:59 AM, Andrei Alexandrescu wrote:
> On 3/6/14, 7:55 PM, Walter Bright wrote:
>> On 3/6/2014 7:22 PM, bearophile wrote:
>>> One advantage of your change is that this code will work:
>>>
>>> auto s = "hello".dup;
>>> s.sort();
>>
>> Yes, I hadn't thought of that.
>>
>> The auto-decoding front() introduces all kinds of asymmetry in how
>> ranges work, and asymmetry is bad as it negatively impacts composability.
>
> There's no asymmetry, and decoding helps composability as I demonstrated.

Here's one asymmetry:
-----------------------------
alias int T;     // compiles
//alias char T;  // fails to compile

struct Input(T) { T front(); bool empty(); void popFront(); }
struct Output(T) { void put(T); }

import std.array;

void copy(F,T)(F f, T t) {
    while (!f.empty) {
        t.put(f.front);   // with T == char, std.array's front auto-decodes
                          // to dchar, which put(char) won't accept
        f.popFront();
    }
}

void main() {
    immutable(T)[] from;
    Output!T to;
    from.copy(to);
}
-------------------------------