June 02, 2016
On 06/02/2016 10:50 PM, Andrei Alexandrescu wrote:
> It does not fall apart for code points.

Yes it does. You've been given plenty of examples where it falls apart. Your answer to that was that it operates on code points, not graphemes. Well, duh. Comparing UTF-8 code units against each other works, too. That's not an argument for doing that by default.
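
For instance, a minimal sketch (using the decomposed spelling of 'ö', i.e. 'o' followed by the combining diaeresis U+0308):

import std.algorithm.searching : canFind;

void main()
{
    string precomposed = "sch\u00F6n";  // "schön", 'ö' as one code point (U+00F6)
    string decomposed  = "scho\u0308n"; // "schön", 'o' + combining diaeresis

    assert( precomposed.canFind('\u00F6')); // found
    assert(!decomposed.canFind('\u00F6'));  // not found: the code point level misses the grapheme
}

Both strings render identically, yet the code point search only finds one of them.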
June 02, 2016
On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:
> What is supposed to be done with "do not merge" PRs other than close them?

Occasionally people need to try something on the auto tester (not sure if that's relevant to that particular PR, though). Presumably, if someone marks their own PR as "do not merge", it means they're planning either to close it themselves after it has served its purpose, or to fix/finish it and then remove the "do not merge" label.

Either way, they shouldn't be closed just because they say "do not merge" (unless they're abandoned or something, obviously).
June 02, 2016
On 6/2/16 5:01 PM, ag0aep6g wrote:
> On 06/02/2016 10:50 PM, Andrei Alexandrescu wrote:
>> It does not fall apart for code points.
>
> Yes it does. You've been given plenty of examples where it falls apart.

There weren't any.

> Your answer to that was that it operates on code points, not graphemes.

That is correct.

> Well, duh. Comparing UTF-8 code units against each other works, too.
> That's not an argument for doing that by default.

Nope, that's a radically different matter. As the examples show, those operations would be entirely meaningless at the code unit level.
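
A rough sketch of the difference, using std.utf.byCodeUnit to bypass autodecoding:

import std.algorithm.searching : all;
import std.utf : byCodeUnit;

void main()
{
    string s = "\u00F6\u00F6"; // "öö"

    // At the code point level the question is at least well formed:
    assert(s.all!(c => c == '\u00F6'));

    // At the code unit level the same predicate compares raw UTF-8 bytes
    // (0xC3, 0xB6) against 0xF6 and can never hold:
    assert(!s.byCodeUnit.all!(c => c == '\u00F6'));
}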


Andrei

June 02, 2016
On Thursday, 2 June 2016 at 20:52:29 UTC, ag0aep6g wrote:
> On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:
>> By whom? The "support level 1" folks yonder at the Unicode standard? :o)
>> -- Andrei
>
> Do they say that level 1 should be the default, and do they give a rationale for that? Would you kindly link or quote that?

The level 2 support description noted that it should be opt-in because it's slow.
Arguably it should be easier to operate on code units if you know it's safe to do so, but making either code units or graphemes the default for everything seems to be either broken too often or too slow too often.

Now one can argue either consistency for code units (because then we can treat char[] and friends as a slice) or correctness for graphemes, but really, the more I think about it the more I think there is no good default and you need to learn Unicode anyway. The only sad parts here are that 1) we hijacked an array type for strings, which sucks, and 2) we don't have an API that is actually good at teaching the user what it does and doesn't do.

The consequence of 1 is that generic code that also wants to deal with strings will want to special-case to get rid of auto-decoding; the consequence of 2 is that we will have tons of not-actually-correct string handling.
I would assume that almost all string handling code out in the wild is broken anyway (in code I have encountered, I have never seen attempts to normalize or do other things before or after comparisons, searching, etc.), unless of course YOU or one of your colleagues wrote it (consider that checking the length of a string in Java or C# to validate that it is no longer than X characters is commonly done and wrong, because .Length is the number of UTF-16 code units in those languages) :o)
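
To put the same trap in D terms, a quick sketch (walkLength and byGrapheme give the other two answers):

import std.range.primitives : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "noe\u0308l"; // "noël" with a decomposed 'ë'

    assert(s.length == 6);                // UTF-8 code units
    assert(s.walkLength == 5);            // code points (what autodecoding iterates)
    assert(s.byGrapheme.walkLength == 4); // user-perceived characters
}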

So really, as bad and alarming as "incorrect string handling" by default seems, in practice it has not prevented people from writing working (internationalized!) applications in other languages that get used far more than D.
One could say we should do it better than them, and I would be inclined to believe that RCStr provides our opportunity to do so. Having char[] be what it is is an annoying wart, and maybe at some point we can deprecate/remove that behaviour, but for now I'd rather see if RCStr is viable than attempt to change the semantics of all string handling code in D.
June 02, 2016
On 6/2/16 5:05 PM, tsbockman wrote:
> On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:
>> What is supposed to be done with "do not merge" PRs other than close
>> them?
>
> Occasionally people need to try something on the auto tester (not sure
> if that's relevant to that particular PR, though). Presumably, if someone
> marks their own PR as "do not merge", it means they're planning either
> to close it themselves after it has served its purpose, or to fix/finish
> it and then remove the "do not merge" label.

Feel free to reopen if it helps; it wasn't closed in anger. -- Andrei

June 02, 2016
On 02.06.2016 23:06, Andrei Alexandrescu wrote:
> As the examples show, those operations would be entirely meaningless at
> the code unit level.

So far, I needed to count the number of characters 'ö' inside some string exactly zero times, but I wanted to chain or join strings relatively often.
June 02, 2016
On 02.06.2016 23:16, Timon Gehr wrote:
> On 02.06.2016 23:06, Andrei Alexandrescu wrote:
>> As the examples show, those operations would be entirely meaningless at
>> the code unit level.
>
> So far, I needed to count the number of characters 'ö' inside some
> string exactly zero times,

(Obviously this isn't even what the example would do. I predict I will never need to count the number of code points 'ö' by calling some function from std.algorithm directly.)

> but I wanted to chain or join strings
> relatively often.

June 02, 2016
On Thursday, 2 June 2016 at 20:13:52 UTC, Andrei Alexandrescu wrote:
> On 06/02/2016 03:34 PM, deadalnix wrote:
>> On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
>>> Pretty much everything. Consider s and s1 string variables with
>>> possibly different encodings (UTF8/UTF16).
>>>
>>> * s.all!(c => c == 'ö') works only with autodecoding. It returns
>>> always false without.
>>>
>>
>> False.
>
> True. "Are all code points equal to this one?" -- Andrei

The good thing, when you define "works" as whatever it does right now, is that everything always works and there is literally never any bug. The bad thing is that this is a completely useless definition of "works".

The sample code won't count every instance of the grapheme 'ö', as some of its encodings won't be counted, which definitely counts as "doesn't work".
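
A small sketch of that; std.uni.normalize is one way to reconcile the two spellings:

import std.algorithm.searching : count;
import std.uni;

void main()
{
    string s = "o\u0308 und \u00F6"; // one decomposed and one precomposed 'ö'

    // Code point counting only sees the precomposed spelling:
    assert(s.count('\u00F6') == 1);

    // Only after normalization are both spellings counted:
    assert(s.normalize!NFC.count('\u00F6') == 2);
}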

When your point needs to redefine words in ways that nobody agrees with, it is time to admit the point is bogus.

June 02, 2016
On 6/2/16 5:19 PM, Timon Gehr wrote:
> On 02.06.2016 23:16, Timon Gehr wrote:
>> On 02.06.2016 23:06, Andrei Alexandrescu wrote:
>>> As the examples show, those operations would be entirely meaningless at
>>> the code unit level.
>>
>> So far, I needed to count the number of characters 'ö' inside some
>> string exactly zero times,
>
> (Obviously this isn't even what the example would do. I predict I will
> never need to count the number of code points 'ö' by calling some
> function from std.algorithm directly.)

You may look for a specific dchar, and it'll work. How about findAmong("...") with a bunch of ASCII and Unicode punctuation symbols? -- Andrei
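
For the record, a rough sketch of that:

import std.algorithm.searching : findAmong;

void main()
{
    string s = "Hello\u2014world"; // em dash (U+2014) between the words

    // Decoding to code points lets findAmong match Unicode punctuation
    // right alongside ASCII punctuation:
    auto rest = s.findAmong("-\u2013\u2014!?.");
    assert(rest == "\u2014world");
}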


June 02, 2016
On 02.06.2016 22:51, Andrei Alexandrescu wrote:
> On 06/02/2016 04:50 PM, Timon Gehr wrote:
>> On 02.06.2016 22:28, Andrei Alexandrescu wrote:
>>> On 06/02/2016 04:12 PM, Timon Gehr wrote:
>>>> It is not meaningful to compare utf-8 and utf-16 code units directly.
>>>
>>> But it is meaningful to compare Unicode code points. -- Andrei
>>>
>>
>> It is also meaningful to compare two utf-8 code units or two utf-16 code
>> units.
>
> By decoding them of course. -- Andrei
>

That makes no sense; I cannot decode single code units.

BTW, I guess the reason why char converts to wchar, which converts to dchar, is that the lower half of the code units in char and the lower half of the code units in wchar are code points. Maybe code units and code points with low numerical values should have distinct types.
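
A quick illustration of why the conversion is only meaningful in that lower half:

void main()
{
    string s = "\u00F6";   // 'ö', two UTF-8 code units: 0xC3 0xB6
    char c = s[0];         // 0xC3: a code unit, not a code point
    dchar d = c;           // the implicit conversion compiles...
    assert(d == '\u00C3'); // ...but it now "means" 'Ã', which is nonsense here
}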