March 08, 2014
Andrei suggests that this change would destroy D by breaking too much existing code. He might be right. Can we afford the risk that he is right?

We should think about a way to have our cake and eat it, too.

Keep in mind that this issue is a Phobos one, not a core language issue.
March 08, 2014
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu wrote:
> s.all!(x => x == 'é')
> s.any!(x => x == 'é')
> s.canFind!(x => x == 'é')

These are a variation of the following:

ubyte b = ...;
if (b == 1000) { ... }

The compiler could emit a warning here, and indeed some languages/compilers do. It might not fit well with D metaprogramming, though: generic code legitimately produces such always-false conditions, which is also why the compiler does not warn about "if (false) { ... }".
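For illustration, here is a minimal sketch of the silent-failure mode under discussion (my own example, assuming the proposed behaviour where iterating a string yields UTF-8 code units):

void main()
{
    string s = "cassé";       // UTF-8: 'é' is the two code units 0xC3, 0xA9
    bool found = false;
    foreach (char c; s)       // iterate code units, as the proposal would
        if (c == 'é')         // 'é' is U+00E9; no single code unit of s equals it
            found = true;
    assert(!found);           // the character is present, yet nothing is found
}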

> s.canFind('é')
> s.endsWith('é')
> s.find('é')
> s.count('é')
> s.countUntil('é')

These should not compile post-change, because the sought element (a dchar) would no longer match the string's element type (char). So they will not fail silently.

> s.count()
> s.count!((a, b) => std.uni.toLower(a) == std.uni.toLower(b))("é")
> s.countUntil('é')

As has already been mentioned, counting code points is borderline useless.

> s.count!((a, b) => std.uni.toLower(a) == std.uni.toLower(b))("é")

And this is just wrong on many levels. I hope you know better than to actually use this for case-insensitive comparisons in production software.
March 08, 2014
On 3/7/14, 12:26 PM, H. S. Teoh wrote:
> On Fri, Mar 07, 2014 at 11:57:23AM -0800, Andrei Alexandrescu wrote:
>> s.canFind('é') currently works as expected. Proposed: fails silently.
>
> The problem is that the current implementation of this correct behaviour
> leaves a lot to be desired in terms of performance. Ideally, you should
> not need to decode every single character in s just to see if it happens
> to contain é. Rather, canFind, et al should convert the dchar literal
> 'é' into a UTF-8 (resp. UTF-16) sequence and do a substring search
> instead. Decoding every character in s, while correct, is also
> needlessly inefficient.

That's an optimization that fits the current design and goes in the library transparently, i.e. the good stuff.
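For concreteness, a minimal sketch of that optimization (a hypothetical helper, not the actual Phobos code): encode the sought character to UTF-8 once, then do a plain code-unit substring search instead of decoding the haystack.

import std.algorithm : canFind;
import std.string : representation;
import std.utf : encode;

bool canFindFast(string haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle);   // UTF-8-encode the needle once
    return haystack.representation         // raw code units, no decoding
        .canFind(buf[0 .. len].representation);
}

unittest
{
    assert(canFindFast("cassé", 'é'));     // assumes both literals are precomposed (NFC)
    assert(!canFindFast("casse", 'é'));
}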

>> 5.
>>
>> s.count() currently works as expected. Proposed: fails silently.
>
> Wrong. The current behaviour of s.count() does not work as expected, it
> only gives an illusion that it does.

Depends on what one expects :o).

> Its return value is misleading when
> combining diacritics and other such Unicode "niceness" are involved.
> Arguably, such things should be prohibited altogether, and more
> semantically transparent algorithms used, namely s.countCodePoints,
> s.countGraphemes, etc..

I think s.byGrapheme.count is the right way instead of specializing a bunch of algorithms to work with graphemes.
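A small example of that route (the 'é' below is written in decomposed form on purpose, so the code-point and grapheme counts differ):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    auto s = "casse\u0301";               // 'é' spelled as 'e' + combining acute accent
    assert(s.walkLength == 6);            // code points: the current range view
    assert(s.byGrapheme.walkLength == 5); // graphemes: user-perceived characters
}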

>> s.endsWith('é') currently works as expected. Proposed: fails silently.
>
> Arguable, because it imposes a performance hit by needless decoding.
> Ideally, you should have 3 overloads:
>
> 	bool endsWith(string s, char asciiChar);
> 	bool endsWith(string s, wchar wideChar);
> 	bool endsWith(string s, dchar codepoint);

Nice idea. Fits current design. Then interesting complications arise with things like bool endsWith(string, wstring) etc.
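As a rough sketch of what the dchar overload could do internally (a hypothetical helper, not the proposed signatures verbatim): encode the needle and compare the trailing code units, so no decoding of s is needed.

import std.string : representation;
import std.utf : encode;

bool endsWithChar(string s, dchar c)
{
    char[4] buf;
    immutable len = encode(buf, c);        // UTF-8 length of the needle
    return s.length >= len
        && s.representation[$ - len .. $] == buf[0 .. len].representation;
}

unittest
{
    assert(endsWithChar("cassé", 'é'));
    assert(!endsWithChar("casse", 'é'));
}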

> [...]
>> I designed the range behavior of strings after much thinking and
>> consideration back in the day when I designed std.algorithm. It was
>> painfully obvious (but it seems to have been forgotten now that it's
>> working so well) that approaching strings as arrays of char[] would
>> break almost every single algorithm leaving us essentially in the
>> pre-UTF C++aveman era.
>
> I agree, but it is also painfully obvious that the current
> implementation is lackluster in terms of performance.

It's not painfully obvious to me at all. What is obvious to me is that people are happy campers with the way D's strings work, including UTF support and performance. I don't remember people bringing this up in the forums or here at Facebook along the lines of "yeah, just look at the crappy way they handle strings...". Silent approval is easy to forget about.

Walter has been working on an application in which anything slower than 2x baseline would have been a failure. In that app (which I know very well) the right option from day 1 would have been ubyte[], which he discovered the hard way. His incomplete understanding of how D strings work is the single largest problem there, and indicates an issue with the documentation.

He discovered that, was surprised, and overreacted. No need to amplify that into mass hysteria. There are improvements that can be made, in the form of additions rather than breaking changes that would inflict massive disruption on the community. This is the way in which this discussion can have a positive outcome. (In fact, I've shared a few ideas with Walter.)

>> Clearly one might argue that their app has no business dealing with
>> diacriticals or Asian characters. But that's the typical provincial
>> view that marred many languages' approach to UTF and
>> internationalization. If you know your string is ASCII, the remedy
>> is simple - don't use char[] and friends. From day 1, the type
>> "char" was meant to mean "code unit of UTF characters".
>
> Yes, but currently Phobos support for non-UTF strings is rather poor,
> and requires many explicit casts to/from ubyte[].

Non-UTF strings are currently modeled as ubyte[], so I don't see what you'd be casting to and fro. You have absolutely no business representing anything non-UTF with char and char[] etc.

>> So please ponder the above before going to do surgery on the patient
>> that's going to kill him.
> [...]
>
> Yeah I was surprised Walter was actually seriously going to pursue this.
> It's a change of a far vaster magnitude than many of the other DIPs and
> other proposals that have been rejected because they were deemed to
> cause too much breakage of existing code.

Compared with what's going on now with D at Facebook, this agitation is but a little side show. We have way bigger fish to fry.


Andrei

March 08, 2014
On 3/7/14, 12:43 PM, Vladimir Panteleev wrote:
> On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu wrote:
>> Allow me to enumerate the functions of std.algorithm and how they work
>> today and how they'd work with the proposed change. Let s be a
>> variable of some string type.
>
>> s.canFind('é') currently works as expected.
>
> No, it doesn't.
>
> import std.algorithm;
>
> void main()
> {
>      auto s = "cassé";
>      assert(s.canFind('é'));
> }

worksforme

March 08, 2014
On Saturday, 8 March 2014 at 00:44:53 UTC, Andrei Alexandrescu wrote:
> worksforme

http://forum.dlang.org/post/fhqradggtvwnpqpuehgg@forum.dlang.org
March 08, 2014
On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
> Andrei suggests that this change would destroy D by breaking too much existing code. He might be right. Can we afford the risk that he is right?
>
> We should think about a way to have our cake and eat it, too.
>
> Keep in mind that this issue is a Phobos one, not a core language issue.

Before we discuss risk in the change, we need to agree that it is even a desirable change. I don't think we have reached that point.

It's worth pointing out that all the performance issues can be resolved in Phobos through specialisation with no disruption to the users.
March 08, 2014
On Sat, Mar 08, 2014 at 12:46:21AM +0000, Peter Alexander wrote:
> On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
> >Andrei suggests that this change would destroy D by breaking too much existing code. He might be right. Can we afford the risk that he is right?
> >
> >We should think about a way to have our cake and eat it, too.
> >
> >Keep in mind that this issue is a Phobos one, not a core language issue.
> 
> Before we discuss risk in the change, we need to agree that it is even a desirable change. I don't think we have reached that point.
> 
> It's worth pointing out that all the performance issues can be resolved in Phobos through specialisation with no disruption to the users.

Regardless of which way we decide in the end, I hope that one good thing that comes out of this thread is improved performance of the string algorithms in Phobos. Things like substring search over code units, so that multibyte characters (or multi-code-point "characters") can be handled efficiently, are quite needed, IMO.
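As a concrete illustration (a sketch, not existing Phobos specialisation): a multi-code-point "character" can be located by a plain substring search over code units, with no decoding of the haystack at all.

import std.algorithm : canFind;
import std.string : representation;

void main()
{
    auto s      = "casse\u0301";   // 'é' spelled as 'e' + U+0301 combining acute
    auto needle = "e\u0301";       // the same two-code-point "character"
    assert(s.representation.canFind(needle.representation));
}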


T

-- 
If a person can't communicate, the very least he could do is to shut up. -- Tom Lehrer, on people who bemoan their communication woes with their loved ones.
March 08, 2014
On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
> We should think about a way to have our cake and eat it, too.

I think a good place to start would be to have a draft implementation of the proposal. This will allow people to try it with their projects and see how much code it will really affect. As I mentioned here[1], I suspect that certain valid code that uses the range primitives will continue to work unaffected even after a sudden switch, so perhaps the "deprecation" and "error" stages can be replaced with a longer "warning" stage instead.

This is similar to how git changed the default behavior of the "push" command: it simply nagged users for a long time, and the warning included instructions for either switching to the new behavior early (thus squelching the warning) or permanently keeping the old behavior. (In our case, that would mean adding .representation or .byCodepoint, depending on the intent.)

[1]: http://forum.dlang.org/post/dlpmchtaqzrxxylpmiwh@forum.dlang.org
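For reference, a small sketch of the difference between the two explicit views mentioned above (.representation for code units; the default decoded iteration gives code points):

import std.range : walkLength;
import std.string : representation;

void main()
{
    auto s = "cassé";
    assert(s.representation.length == 6); // code units: 'é' is two bytes in UTF-8
    assert(s.walkLength == 5);            // code points: the decoded view
}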
March 08, 2014
On 3/7/14, 12:48 PM, Dmitry Olshansky wrote:
> 07-Mar-2014 23:57, Andrei Alexandrescu пишет:
>> On 3/6/14, 6:37 PM, Walter Bright wrote:
>>> In "Lots of low hanging fruit in Phobos" the issue came up about the
>>> automatic encoding and decoding of char ranges.
>> [snip]
>>> Is there any hope of fixing this?
>>
>> There's nothing to fix.
>
> There is, all right. ElementEncodingType for starters.
>
>>
>> Allow me to enumerate the functions of std.algorithm and how they work
>> today and how they'd work with the proposed change. Let s be a variable
>> of some string type.
>
> The special case was wrong, though: special-casing char[] arrays and
> throwing all other ranges of char out the window. The amount of code
> needed to support this schizophrenia is enormous.

I think this is a confusion. The code in e.g. std.algorithm is specialized for efficiency of stuff that already works.

>> Making strings bidirectional ranges has been a very good choice within
>> the constraints. There was already a string type, and that was
>> immutable(char)[], and a bunch of code depended on that definition.
>
> Trying to make it work by blowing a hole in the generic range concept
> now seems like it wasn't worth it.

I disagree. Also what hole?


Andrei

March 08, 2014
On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>>> No, it doesn't.
>>>
>>> import std.algorithm;
>>>
>>> void main()
>>> {
>>>    auto s = "cassé";
>>>    assert(s.canFind('é'));
>>> }
>>>
>>
>> Hm, I'm not following? Works perfectly fine on my system?
>
> Something's messing with your Unicode. Try downloading and compiling
> this file:
> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

Yup, the grapheme issue. This should work.

import std.algorithm, std.uni;

void main()
{
    auto s = "cassé";
    assert(s.byGrapheme.canFind('é'));
}

It doesn't compile, seems like a library bug.
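For the record, a variant that does compile (a sketch: it compares each grapheme's code points against the sought grapheme, and assumes both literals use the same normalization form):

import std.algorithm : canFind, equal;
import std.uni : byGrapheme;

void main()
{
    auto s = "cassé";
    // g[] yields the code points of one grapheme; compare them against the
    // sought grapheme written as a dstring, so both sides are ranges of dchar.
    assert(s.byGrapheme.canFind!(g => g[].equal("é"d)));
}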

Graphemes are the next level of Nirvana above code points, but that doesn't mean it's graphemes or nothing.


Andrei
