March 07, 2014
On 3/7/2014 9:24 AM, Vladimir Panteleev wrote:
> I think .decode should be something more explicit (byCodePoint OSLT), just so
> it's clear that it's not magical and does not solve all problems.

Good point. Perhaps "decodeUTF". "decode" is too generic.


> Until then, how will people use strings with algorithms when they mean to use
> them per-byte?

The way they do it now, i.e. they can't. That's the whole problem.

March 07, 2014
On 07-Mar-2014 14:41, Walter Bright wrote:
> On 3/7/2014 2:27 AM, Dmitry Olshansky wrote:
>> Where have you been when it was introduced? :)
>
> It slipped by me. What can I say? I'm not the only committer :-)
>
> But after spending non-trivial time suffering as auto-decode blasted my
> kingdom, I've concluded that it needs to die.
> Working around it is not
> easy.

That seems to be the biggest problem: it's an overriding default that is very hard to "turn off" while retaining a nice, clear generic view of things.

> I know that auto-decode has negatively impacted your regex, too.

No, technically, I knew what I was doing, and that decode call was explicit. It just turned out to set a lower bound on the time budget for doing X with a string, and that bound is too high.

What really got nasty is the repeated re-decoding of the same piece of input as the engine backtracks to try earlier alternatives.

-- 
Dmitry Olshansky
March 07, 2014
On 3/7/2014 7:03 AM, Dicebot wrote:
> 1) It is a huge breakage and you have been refusing to do one even for more
> important problems. What is about this sudden change of mind?

1. Performance Performance Performance

2. The current behavior is surprising (it sure surprised me; I didn't notice it until I looked at the assembler to figure out why the performance sucked)

3. Weirdnesses like ElementEncodingType

4. Strange behavior differences between char[], char*, and InputRange!char types

5. Funky anomalous issues with writing OutputRange!char (the put(T) must take a dchar)


> 2) lack of convenient .raw property which will effectively do cast(ubyte[])

I've done the cast as a workaround, but in generic code the ubyte type turns out to be viral - you have to use it everywhere. So you wind up with ubyte <=> char casts in unexpected places, plus ugly ubyte <=> dchar casts, with the commensurate risk that you goofed and have a truncation bug.

Essentially, the auto-decode makes trivial code look better, but if you're writing a more comprehensive string processing program, and care about performance, it makes a regular ugly mess of things.
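To illustrate the kind of code this leads to - a minimal sketch (not taken from any actual Phobos code) of the cast workaround and how ubyte spreads:

```d
import std.algorithm : count;

void main()
{
    string s = "héllo";

    // Byte-wise processing requires a cast to defeat auto-decoding...
    auto bytes = cast(immutable(ubyte)[]) s;

    // ...and now everything downstream sees ubyte, not char.
    auto n = bytes.count(cast(ubyte) 'l');
    assert(n == 2);

    // Getting a string back needs yet another cast.
    string t = cast(string) bytes;
    assert(t == s);
}
```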
March 07, 2014
On 3/7/2014 9:04 AM, Vladimir Panteleev wrote:
> Ideally, it would be
> clearly visible in the code that you are counting code points.

Yes.

March 07, 2014
On 3/7/2014 11:43 AM, Dmitry Olshansky wrote:
> No, technically, I knew what I was doing and that decode call was explicit.

Ah right, I misremembered. Thanks for the correction.

March 07, 2014
On 07-Mar-2014 21:34, H. S. Teoh wrote:
> On Fri, Mar 07, 2014 at 05:24:59PM +0000, Vladimir Panteleev wrote:
>> On Friday, 7 March 2014 at 03:52:42 UTC, Walter Bright wrote:
>>> Ok, I have a plan. Each step will be separated by at least one
>>> version:
>>>
>>> 1. implement decode() as an algorithm for string types, so one can
>>> write:
>>>
>>>     string s;
>>>     s.decode.algorithm...
>>>
>>> suggest that people start doing that instead of:
>>>
>>>     s.algorithm...
>>
>> I think .decode should be something more explicit (byCodePoint
>> OSLT), just so it's clear that it's not magical and does not solve
>> all problems.
>
> +1. I think "byCodePoint" is far more self-documenting and less
> misleading than "decode".
>
> 	string s;
> 	s.byCodePoint.algorithm...
>
> I'm already starting to like it.
>
And there is precedent, see std.uni.byCodepoint ;)


-- 
Dmitry Olshansky
March 07, 2014
On 3/6/14, 6:37 PM, Walter Bright wrote:
> In "Lots of low hanging fruit in Phobos" the issue came up about the
> automatic encoding and decoding of char ranges.
[snip]
> Is there any hope of fixing this?

There's nothing to fix.

Allow me to enumerate the functions of std.algorithm and how they work today and how they'd work with the proposed change. Let s be a variable of some string type.

1.

s.all!(x => x == 'é') currently works as expected. Proposed: fails silently.

2.

s.any!(x => x == 'é') currently works as expected. Proposed: fails silently.

3.

s.canFind!(x => x == 'é') currently works as expected. Proposed: fails silently.

4.

s.canFind('é') currently works as expected. Proposed: fails silently.

5.

s.count() currently works as expected. Proposed: fails silently.

6.

s.count!((a, b) => std.uni.toLower(a) == std.uni.toLower(b))("é") currently works as expected (with the known issues of lowercase conversion). Proposed: fails silently.

7.

s.count('é') currently works as expected. Proposed: fails silently.

8.

s.countUntil("a") currently works as expected. Proposed: fails silently. This applies to all variations of countUntil.

9.

s.endsWith('é') currently works as expected. Proposed: fails silently.

10.

s.find('é') currently works as expected. Proposed: fails silently. This applies to other variations of find that include custom predicates.

11.

...

I went down std.algorithm in the order listed in its documentation and found pernicious issues with almost every single algorithm.

I designed the range behavior of strings after much thinking and consideration back in the day when I designed std.algorithm. It was painfully obvious (but it seems to have been forgotten now that it's working so well) that approaching strings as arrays of char[] would break almost every single algorithm leaving us essentially in the pre-UTF C++aveman era.

Making strings bidirectional ranges has been a very good choice within the constraints. There was already a string type, and that was immutable(char)[], and a bunch of code depended on that definition.

Clearly one might argue that their app has no business dealing with diacriticals or Asian characters. But that's the typical provincial view that marred many languages' approach to UTF and internationalization. If you know your string is ASCII, the remedy is simple - don't use char[] and friends. From day 1, the type "char" was meant to mean "code unit of UTF characters".
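That last point is easy to demonstrate in two lines (assuming the 'é' literal is stored precomposed, as U+00E9):

```d
import std.range : walkLength;

void main()
{
    string s = "é";            // one code point, U+00E9
    assert(s.length == 2);     // .length counts UTF-8 code units (bytes)
    assert(s.walkLength == 1); // the range interface decodes: one dchar
}
```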

So please ponder the above before going to do surgery on the patient that's going to kill him.


Andrei

March 07, 2014
On 3/6/14, 7:52 PM, Walter Bright wrote:
> On 3/6/2014 7:06 PM, Walter Bright wrote:
>> On 3/6/2014 6:37 PM, Walter Bright wrote:
>>> Is there any hope of fixing this?
>>
>> Is there any way we can provide an upgrade path for this? Silent
>> breakage is
>> terrible. Any ideas?
>
> Ok, I have a plan. Each step will be separated by at least one version:
>
> 1. implement decode() as an algorithm for string types, so one can write:
>
>      string s;
>      s.decode.algorithm...
>
> suggest that people start doing that instead of:
>
>      s.algorithm...
>
> 2. Emit warning when people use std.array.front(s) with strings.
>
> 3. Deprecate std.array.front for strings.
>
> 4. Error for std.array.front for strings.
>
> 5. Implement new std.array.front for strings that doesn't decode.

This would kill D. I am not exaggerating.

Andrei

March 07, 2014
On 3/6/14, 7:55 PM, Walter Bright wrote:
> On 3/6/2014 7:22 PM, bearophile wrote:
>> One advantage of your change is that this code will work:
>>
>> auto s = "hello".dup;
>> s.sort();
>
> Yes, I hadn't thought of that.
>
> The auto-decoding front() introduces all kinds of asymmetry in how
> ranges work, and asymmetry is bad as it negatively impacts composability.

There's no asymmetry, and decoding helps composability as I demonstrated.

Andrei

March 07, 2014
On Fri, Mar 07, 2014 at 11:57:23AM -0800, Andrei Alexandrescu wrote:
> On 3/6/14, 6:37 PM, Walter Bright wrote:
> >In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges.
> [snip]
> >Is there any hope of fixing this?
> 
> There's nothing to fix.

:D  I knew this was going to happen.


> Allow me to enumerate the functions of std.algorithm and how they work today and how they'd work with the proposed change. Let s be a variable of some string type.
> 
> 1.
> 
> s.all!(x => x == 'é') currently works as expected. Proposed: fails silently.
> 
> 2.
> 
> s.any!(x => x == 'é') currently works as expected. Proposed: fails silently.
> 
> 3.
> 
> s.canFind!(x => x == 'é') currently works as expected. Proposed:
> fails silently.
> 
> 4.
> 
> s.canFind('é') currently works as expected. Proposed: fails silently.

The problem is that the current implementation of this correct behaviour leaves a lot to be desired in terms of performance. Ideally, you should not need to decode every single character in s just to see if it happens to contain é. Rather, canFind et al. should convert the dchar literal 'é' into a UTF-8 (resp. UTF-16) sequence and do a substring search instead. Decoding every character in s, while correct, is also needlessly inefficient.
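A sketch of that idea, with a hypothetical canFindFast helper (the name is made up; std.utf.encode and std.algorithm.canFind are real):

```d
import std.algorithm : canFind;
import std.utf : encode;

/// Hypothetical: encode the dchar needle to UTF-8 once, then do a plain
/// byte-wise substring search - no decoding of the haystack at all.
bool canFindFast(string haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle); // number of code units written
    return (cast(immutable(ubyte)[]) haystack)
        .canFind(cast(ubyte[]) buf[0 .. len]);
}

unittest
{
    assert(canFindFast("café", 'é'));
    assert(!canFindFast("cafe", 'é'));
}
```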


> 5.
> 
> s.count() currently works as expected. Proposed: fails silently.

Wrong. The current behaviour of s.count() does not work as expected, it only gives an illusion that it does. Its return value is misleading when combining diacritics and other such Unicode "niceness" are involved. Arguably, such things should be prohibited altogether, and more semantically transparent algorithms used, namely s.countCodePoints, s.countGraphemes, etc..
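To make the distinction concrete - std.uni.byGrapheme already exists, so this is checkable today:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string precomposed = "\u00E9";  // é as one code point
    string combining   = "e\u0301"; // 'e' + combining acute accent

    // Counting code points (what auto-decoding ranges give you):
    assert(precomposed.walkLength == 1);
    assert(combining.walkLength == 2);

    // Counting graphemes - both render as the single character é:
    assert(precomposed.byGrapheme.walkLength == 1);
    assert(combining.byGrapheme.walkLength == 1);
}
```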


> 6.
> 
> s.count!((a, b) => std.uni.toLower(a) == std.uni.toLower(b))("é")
> currently works as expected (with the known issues of lowercase
> conversion). Proposed: fails silently.

Again, I don't like this. It sweeps the issues of comparing Unicode strings under the carpet and gives the programmer a false sense of code correctness. Instead, users should be encouraged to use proper Unicode collation functions that are actually correct, rather than being handed an illusion of correctness.


> 7.
> 
> s.count('é') currently works as expected. Proposed: fails silently.

This is a repetition of #5. :)


> 8.
> 
> s.countUntil("a") currently work as expected. Proposed: fails silently. This applies to all variations of countUntil.

Whether this is correct or not depends on what the intention is. If you're looking to slice a string, this most definitely does NOT work as expected. If you're looking to count graphemes, this doesn't work as expected either. This only works if you just so happen to be counting code points. The correct approach, IMO, is to help the user make a conscious choice between these different semantics:

	s.indexOf("a");			// for slicing
	s.byCodepoint.countUntil("a");	// count code points
	s.byGrapheme.countUntil("a");	// count graphemes

Things like s.countUntil("a") are misleading and lead to subtle Unicode
bugs.


> 9.
> 
> s.endsWith('é') currently works as expected. Proposed: fails silently.

Arguable, because it imposes a performance hit by needless decoding. Ideally, you should have 3 overloads:

	bool endsWith(string s, char asciiChar);
	bool endsWith(string s, wchar wideChar);
	bool endsWith(string s, dchar codepoint);

In the wchar and dchar overloads you'd do substring search. There is no need to decode.
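A sketch of the dchar overload under that scheme (endsWithNoDecode is a made-up name; std.utf.encode is real):

```d
import std.utf : encode;

/// Hypothetical: encode the needle once and compare the tail bytes
/// directly - no decoding of s.
bool endsWithNoDecode(string s, dchar c)
{
    char[4] buf;
    immutable len = encode(buf, c);
    return s.length >= len && s[$ - len .. $] == buf[0 .. len];
}

unittest
{
    assert(endsWithNoDecode("café", 'é'));
    assert(endsWithNoDecode("abc", 'c'));
    assert(!endsWithNoDecode("abc", 'é'));
    assert(!endsWithNoDecode("", 'é'));
}
```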


> 10.
> 
> s.find('é') currently works as expected. Proposed: fails silently. This applies to other variations of find that include custom predicates.

Not necessarily. Arguably we should be overloading on needle type to eliminate needless decoding:

	string find(string s, char c); // ubyte search
	string find(string s, wchar c); // substring search with char[2]
	string find(string s, dchar c); // substring search with char[4]

This makes sense to me because string is immutable(char)[], so from the point of view of being an array, searching for wchar is not something that is obvious (how do you search for a value of type T in an array of elements of type U?), so explicit overloads for handling those cases make sense.

Decoding every single character in s is a lot of needless work.
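The dchar overload could be sketched along the same lines (findNoDecode is a made-up name):

```d
import std.utf : encode;

/// Hypothetical: encode the needle once, then scan for its byte sequence,
/// returning the remainder slice the way std.algorithm.find does.
string findNoDecode(string s, dchar c)
{
    char[4] buf;
    immutable len = encode(buf, c);
    if (s.length >= len)
        foreach (i; 0 .. s.length - len + 1)
            if (s[i .. i + len] == buf[0 .. len])
                return s[i .. $];
    return s[$ .. $]; // empty slice: not found
}

unittest
{
    assert(findNoDecode("café au lait", 'é') == "é au lait");
    assert(findNoDecode("abc", 'x') == "");
}
```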


[...]
> I designed the range behavior of strings after much thinking and consideration back in the day when I designed std.algorithm. It was painfully obvious (but it seems to have been forgotten now that it's working so well) that approaching strings as arrays of char[] would break almost every single algorithm leaving us essentially in the pre-UTF C++aveman era.

I agree, but it is also painfully obvious that the current implementation is lackluster in terms of performance.


> Making strings bidirectional ranges has been a very good choice within the constraints. There was already a string type, and that was immutable(char)[], and a bunch of code depended on that definition.
> 
> Clearly one might argue that their app has no business dealing with diacriticals or Asian characters. But that's the typical provincial view that marred many languages' approach to UTF and internationalization. If you know your string is ASCII, the remedy is simple - don't use char[] and friends. From day 1, the type "char" was meant to mean "code unit of UTF characters".

Yes, but currently Phobos support for non-UTF strings is rather poor, and requires many explicit casts to/from ubyte[].


> So please ponder the above before going to do surgery on the patient that's going to kill him.
[...]

Yeah I was surprised Walter was actually seriously going to pursue this. It's a change of a far vaster magnitude than many of the other DIPs and other proposals that have been rejected because they were deemed to cause too much breakage of existing code.


T

-- 
Having a smoking section in a restaurant is like having a peeing section in a swimming pool. -- Edward Burr