March 07, 2014
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu wrote:
> Allow me to enumerate the functions of std.algorithm and how they work today and how they'd work with the proposed change. Let s be a variable of some string type.

> s.canFind('é') currently works as expected.

No, it doesn't.

import std.algorithm;

void main()
{
    auto s = "cassé";
    assert(s.canFind('é'));
}

That's the whole problem - all this hot steam and it still does not work properly. Because it can't - not without pulling in all of the Unicode algorithms implicitly, and that would be much worse.

> I went down std.algorithm in the order listed in its documentation and found pernicious issues with almost every single algorithm.

All of your examples are variations of one and the same case: searching for a non-ASCII dchar or dchar literal.

How often does this pattern occur in real programs? I think the only real metric is to try the change and find out.

> Clearly one might argue that their app has no business dealing with diacriticals or Asian characters. But that's the typical provincial view that marred many languages' approach to UTF and internationalization.

So is yours, if you think that making everything magically a dchar is going to solve all problems.

The TDPL example only showcases the problem. Yes, it works with Swedish. Now try it again with Sanskrit.
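A minimal sketch of why decoding to dchar is not the same as iterating over "characters". It assumes the word is spelled with a combining accent; the helper names are hypothetical and exist only for this illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, named only for this illustration.
size_t codeUnits(string s)  { return s.length; }                 // UTF-8 code units
size_t codePoints(string s) { return s.walkLength; }             // auto-decoded dchars
size_t graphemes(string s)  { return s.byGrapheme.walkLength; }  // user-perceived characters

void main()
{
    // "cassé" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string s = "casse\u0301";

    assert(codeUnits(s) == 7);   // the accent alone takes two bytes
    assert(codePoints(s) == 6);  // what range algorithms see after decoding
    assert(graphemes(s) == 5);   // what a reader would count
}
```

The three counts disagree for the same string, so an algorithm working on dchars can still split or miss what a reader sees as one letter.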
March 07, 2014
07-Mar-2014 23:57, Andrei Alexandrescu wrote:
> On 3/6/14, 6:37 PM, Walter Bright wrote:
>> In "Lots of low hanging fruit in Phobos" the issue came up about the
>> automatic encoding and decoding of char ranges.
> [snip]
>> Is there any hope of fixing this?
>
> There's nothing to fix.

There is, all right. ElementEncodingType for starters.

>
> Allow me to enumerate the functions of std.algorithm and how they work
> today and how they'd work with the proposed change. Let s be a variable
> of some string type.

The special case was wrong, though: special-casing arrays of char and throwing all other ranges of char out the window. The amount of code needed to support this schizophrenia is enormous.

>
> Making strings bidirectional ranges has been a very good choice within
> the constraints. There was already a string type, and that was
> immutable(char)[], and a bunch of code depended on that definition.

Trying to make it work by blowing a hole in the generic range concept now seems like it wasn't worth it.
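A small sketch of the "hole" being described, using only range traits from std.range. It assumes a Phobos where strings are auto-decoded, as discussed in this thread.

```d
import std.range; // ElementType, ElementEncodingType, hasLength, range predicates

// As a range, a string yields decoded code points...
static assert(is(ElementType!string == dchar));
// ...while the array actually stores UTF-8 code units.
static assert(is(ElementEncodingType!string == immutable(char)));

// The range view is bidirectional only: no random access, no length.
static assert(isBidirectionalRange!string);
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);

void main()
{
    string s = "abc";
    // The underlying array still offers both, outside the range interface.
    assert(s.length == 3);
    assert(s[0] == 'a');
}
```

Generic code therefore has to decide, per algorithm, whether it is looking at the range view or the array view of the same value.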

-- 
Dmitry Olshansky
March 07, 2014
On 3/7/14, 11:28 AM, Dmitry Olshansky wrote:
> 07-Mar-2014 23:11, Andrei Alexandrescu wrote:
>> On 3/7/14, 9:24 AM, Vladimir Panteleev wrote:
>>>> 5. Implement new std.array.front for strings that doesn't decode.
>>>
>>> Until then, how will people use strings with algorithms when they mean
>>> to use them per-byte? A .raw property which casts to ubyte[]?
>>
>> There's no "until then".
>>
>> A current ".representation" property already exists that casts all
>> string types appropriately.
>
> There is however a big glaring failure: std.algorithm specialized for
> char[], wchar[] but not for any RandomAccessRange!char or
> RandomAccessRange!wchar.

I agree that's an issue. Back in the day when this was a choice I decided to consider only char[] and friends "UTF strings". There was room for more generality but I didn't know of any use cases that would ask for it. It's possible I was wrong, but the option to generalize is still open today.
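For reference, a short sketch of the ".representation" route mentioned in the quoted text. The helper hasByte is hypothetical; std.string.representation itself is the existing Phobos function.

```d
import std.algorithm : canFind;
import std.range : walkLength;
import std.string : representation;

// Hypothetical helper: search a string for a raw UTF-8 code unit.
bool hasByte(string s, ubyte b)
{
    return s.representation.canFind(b);
}

void main()
{
    string s = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units

    // Same memory, viewed as bytes; algorithms no longer decode it.
    auto bytes = s.representation;
    static assert(is(typeof(bytes) == immutable(ubyte)[]));

    assert(s.walkLength == 5);  // decoded code points
    assert(bytes.length == 6);  // raw code units
    assert(hasByte(s, 0xC3));   // lead byte of the encoded U+00E9
    assert(!hasByte("hello", 0xC3));
}
```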

Andrei


March 07, 2014
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
> No, it doesn't.
>
> import std.algorithm;
>
> void main()
> {
>     auto s = "cassé";
>     assert(s.canFind('é'));
> }
>

Hm, I'm not following? Works perfectly fine on my system?
March 07, 2014
On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>> No, it doesn't.
>>
>> import std.algorithm;
>>
>> void main()
>> {
>>    auto s = "cassé";
>>    assert(s.canFind('é'));
>> }
>>
>
> Hm, I'm not following? Works perfectly fine on my system?

Something's messing with your Unicode. Try downloading and compiling this file:
http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
March 07, 2014
>> Hm, I'm not following? Works perfectly fine on my system?
>
> Something's messing with your Unicode. Try downloading and compiling this file:
> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

Used hex view on referenced file and it does not seem to be the same symbol.

It works for me when both are the same symbol.
March 07, 2014
On Friday, 7 March 2014 at 22:16:58 UTC, TC wrote:
>>> Hm, I'm not following? Works perfectly fine on my system?
>>
>> Something's messing with your Unicode. Try downloading and compiling this file:
>> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
>
> Used hex view on referenced file and it does not seem to be the same symbol.

Define "symbol". :)
March 07, 2014
On Friday, 7 March 2014 at 21:58:40 UTC, Vladimir Panteleev wrote:
> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>>> No, it doesn't.
>>>
>>> import std.algorithm;
>>>
>>> void main()
>>> {
>>>   auto s = "cassé";
>>>   assert(s.canFind('é'));
>>> }
>>>
>>
>> Hm, I'm not following? Works perfectly fine on my system?
>
> Something's messing with your Unicode. Try downloading and compiling this file:
> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

ah right, missing normalization, I get your point, thanks.
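A sketch of the normalization point, with the two spellings written as explicit escapes so that an editor or browser cannot silently change them. The helper canFindNormalized is hypothetical; normalize and NFC are from std.uni.

```d
import std.algorithm : canFind;
import std.uni : normalize, NFC;

// Hypothetical helper: bring the haystack to NFC before searching.
bool canFindNormalized(string haystack, dchar needle)
{
    return normalize!NFC(haystack).canFind(needle);
}

void main()
{
    string decomposed = "casse\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT
    dchar precomposed = '\u00E9';      // LATIN SMALL LETTER E WITH ACUTE

    // Decoding to dchar does not help: U+00E9 never occurs in the string.
    assert(!decomposed.canFind(precomposed));

    // After NFC normalization the pair is composed into U+00E9.
    assert(canFindNormalized(decomposed, precomposed));
}
```

Normalizing first only helps when a precomposed code point exists for the sequence; it is not a general answer to searching by user-perceived character.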
March 07, 2014
On Friday, 7 March 2014 at 22:18:17 UTC, Vladimir Panteleev wrote:
> On Friday, 7 March 2014 at 22:16:58 UTC, TC wrote:
>>>> Hm, I'm not following? Works perfectly fine on my system?
>>>
>>> Something's messing with your Unicode. Try downloading and compiling this file:
>>> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
>>
>> Used hex view on referenced file and it does not seem to be the same symbol.
>
> Define "symbol". :)

"cassé" - 22 63 61 73 73 65 cc 81 22

vs

'é' - 27 c3 a9 27
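The two dumps can be reproduced with a few lines, assuming the string uses the combining accent and the literal is the precomposed code point (the 22 and 27 bytes in the dumps are just the surrounding quote marks). The helper utf8Bytes is hypothetical.

```d
import std.string : representation;
import std.utf : encode;

// Hypothetical helper: the UTF-8 code units of a single code point.
immutable(ubyte)[] utf8Bytes(dchar c)
{
    char[4] buf;
    immutable n = encode(buf, c);
    return buf[0 .. n].representation.idup;
}

void main()
{
    // The string from the test file: 'e' (65) followed by U+0301 (cc 81).
    immutable ubyte[] expectedString = [0x63, 0x61, 0x73, 0x73, 0x65, 0xCC, 0x81];
    assert("casse\u0301".representation == expectedString);

    // The character literal: U+00E9 encodes as c3 a9.
    immutable ubyte[] expectedChar = [0xC3, 0xA9];
    assert(utf8Bytes('\u00E9') == expectedChar);
}
```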
March 07, 2014
>
> ah right, missing normalization, I get your point, thanks.

Oops :)