May 31, 2016
On 5/30/16 5:51 PM, Walter Bright wrote:
> On 5/30/2016 8:34 AM, Marc Schütz wrote:
>> In an ideal world, we'd also want to change the way `length` and
>> `opIndex` work,
>
> Why? strings are arrays of code units. All the trouble comes from
> erratically pretending otherwise.

That's not an argument. Objects are arrays of bytes, or tuples of their fields, etc. The whole point of encapsulation is superimposing a more structured view on top of the representation. Operating on open-heart representation is risky, and strings are no exception. -- Andrei
May 31, 2016
On 5/30/16 7:52 PM, Seb wrote:
> On Monday, 30 May 2016 at 21:39:14 UTC, Vladimir Panteleev wrote:
>> On Monday, 30 May 2016 at 16:34:49 UTC, Jack Stouffer wrote:
>>> On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
>>>> D1 -> D2 was a vastly more disruptive change than getting rid of
>>>> auto-decoding would be.
>>>
>>> Don't be so sure. All string handling code would become broken, even
>>> if it appears to work at first.
>>
>> Assuming silent breakage is on the table, what would be broken, really?
>>
>> Code that must intentionally count or otherwise operate on code points,
>> sure. But how much of all string handling code is like that?
>>
>> Perhaps it would be worth trying to silently remove autodecoding and
>> seeing how much of Phobos breaks, as an experiment. Has this been
>> tried before?
>>
>> (Not saying this is a route we should take, but it doesn't seem to me
>> that it will break "all string handling code" either.)
>
> 132 lines in Phobos use auto-decoding - that should be fixable ;-)
>
> See them: http://sprunge.us/hUCL
> More details: https://github.com/dlang/phobos/pull/4384

Thanks for this investigation! Results are about as I'd have speculated. -- Andrei
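
(For context, an illustration of the silent change such an experiment probes for - any code that counts range elements over a string would switch from counting code points to counting code units. A minimal example, assuming today's auto-decoding behaviour:)

import std.range : walkLength;
import std.utf : byCodeUnit;

void main()
{
    string s = "weiß";  // 'ß' is one code point but two UTF-8 code units
    assert(s.walkLength == 4);            // today: auto-decoded code points
    assert(s.byCodeUnit.walkLength == 5); // explicit: code units
    // Removing auto-decoding silently would make the first call
    // return 5 - precisely the kind of quiet behavioural change
    // the proposed experiment would surface.
}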

May 30, 2016
On Tue, May 31, 2016 at 12:13:57AM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 5/30/16 6:00 PM, Walter Bright wrote:
> > On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
> > > I don't agree on changing those. Indexing and slicing a char[] is really useful and actually not hard to do correctly (at least with regard to handling code units).
> > 
> > Yup. It isn't hard at all to use arrays of code units correctly.
> 
> Trouble is, it isn't hard at all to use arrays of code units incorrectly, too. -- Andrei

Neither does autodecoding make code any more correct. It just does a better job of hiding the fact that the code is wrong.
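
For instance (an illustrative snippet, not from the original post): retro reverses the auto-decoded code points, which still mangles combining characters, so the code compiles and looks correct while being wrong at the grapheme level:

import std.algorithm.comparison : equal;
import std.range : retro;

void main()
{
    string s = "noe\u0308l"; // "noël" written with a combining diaeresis
    // The code points come back reversed, so the diaeresis now
    // combines with 'l': we get "l̈eon" instead of the intended "lëon".
    assert(s.retro.equal("l\u0308eon"));
}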


T

-- 
I've been around long enough to have seen an endless parade of magic new techniques du jour, most of which purport to remove the necessity of thought about your programming problem.  In the end they wind up contributing one or two pieces to the collective wisdom, and fade away in the rearview mirror. -- Walter Bright
May 31, 2016
On Tuesday, 31 May 2016 at 06:45:56 UTC, H. S. Teoh wrote:
> On Tue, May 31, 2016 at 12:13:57AM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
>> On 5/30/16 6:00 PM, Walter Bright wrote:
>> > On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
>> > > I don't agree on changing those. Indexing and slicing a char[] is really useful and actually not hard to do correctly (at least with regard to handling code units).
>> > 
>> > Yup. It isn't hard at all to use arrays of code units correctly.
>> 
>> Trouble is, it isn't hard at all to use arrays of code units incorrectly, too. -- Andrei
>
> Neither does autodecoding make code any more correct. It just does a better job of hiding the fact that the code is wrong.
>
>
> T

Thinking about this a bit more - what algorithms are actually correct when implemented at the level of code units?
Off the top of my head I can only really think of copying and hashing, since you want to do those at the byte level anyway.
I would also think that if you know your strings are in the same normalization form (for example because they come from the same normalized source), you can check two strings for equality at the code unit level, but my understanding of Unicode is still quite lacking, so I'm not sure about that.
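
A minimal sketch of that normalize-then-compare idea, using std.uni.normalize (canonicalEquals is a made-up name for illustration):

import std.uni : normalize; // defaults to NFC

bool canonicalEquals(string a, string b)
{
    // Once both sides are in the same normalization form, a plain
    // code-unit (array) comparison is a valid equality test.
    return normalize(a) == normalize(b);
}

void main()
{
    // U+00EB is the precomposed "ë"; "e\u0308" is the decomposed form.
    assert(canonicalEquals("noe\u0308l", "no\u00EBl"));
}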
May 31, 2016
On 5/30/2016 9:16 PM, Andrei Alexandrescu wrote:
> On 5/30/16 5:51 PM, Walter Bright wrote:
>> On 5/30/2016 8:34 AM, Marc Schütz wrote:
>>> In an ideal world, we'd also want to change the way `length` and
>>> `opIndex` work,
>>
>> Why? strings are arrays of code units. All the trouble comes from
>> erratically pretending otherwise.
>
> That's not an argument.

Consistency is a factual argument, and autodecode is not consistent.


> Objects are arrays of bytes, or tuples of their fields,
> etc. The whole point of encapsulation is superimposing a more structured view on
> top of the representation. Operating on open-heart representation is risky, and
> strings are no exception.

If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it. Autodecoding is not it.
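
Concretely, the inconsistency in question (an illustrative snippet, not from the original post): `.length` and indexing operate on code units, while the range primitives decode to `dchar`:

void main()
{
    import std.range.primitives : front;

    string s = "ß";         // one code point, two UTF-8 code units
    assert(s.length == 2);  // .length counts code units
    static assert(is(typeof(s[0]) == immutable(char))); // indexing: code units
    static assert(is(typeof(s.front) == dchar));        // ranges: decoded code points
}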
May 31, 2016
On Monday, 30 May 2016 at 21:39:00 UTC, Walter Bright wrote:
> On 5/30/2016 12:52 PM, H. S. Teoh via Digitalmars-d wrote:
>> If I ever had to write string-heavy code, I'd probably fork Phobos just
>> so I can get decent performance. Just sayin'.
>
> When I wrote Warp, the only point of which was speed, I couldn't use Phobos because of autodecoding. I have since recoded a number of Phobos functions so they don't autodecode, so the situation is better.

Two questions:

1. Given your experience with Warp, how hard would it be to clean Phobos up?
2. After recoding a number of Phobos functions, how much code actually broke (yours or someone else's)?
May 31, 2016
On Tuesday, 31 May 2016 at 07:56:54 UTC, Walter Bright wrote:
> On 5/30/2016 9:16 PM, Andrei Alexandrescu wrote:
>> On 5/30/16 5:51 PM, Walter Bright wrote:
>>> On 5/30/2016 8:34 AM, Marc Schütz wrote:
>>>> In an ideal world, we'd also want to change the way `length` and
>>>> `opIndex` work,
>>>
>>> Why? strings are arrays of code units. All the trouble comes from
>>> erratically pretending otherwise.
>>
>> That's not an argument.
>
> Consistency is a factual argument, and autodecode is not consistent.
>

+1

>> Objects are arrays of bytes, or tuples of their fields,
>> etc. The whole point of encapsulation is superimposing a more structured view on
>> top of the representation. Operating on open-heart representation is risky, and
>> strings are no exception.
>
> If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it. Autodecoding is not it.

Thing is, more info is needed to support Unicode properly - collation, for instance: sorted by code point, "z" (U+007A) comes before "é" (U+00E9), yet virtually every collation puts "é" before "z".

May 31, 2016
On Tue, 31 May 2016 07:17:03 +0000, default0 <Kevin.Labschek@gmx.de> wrote:

> Thinking about this a bit more - what algorithms are actually correct when implemented at the level of code units?

Calculating the buffer size of a string, validation, and
fast versions of general algorithms that can be defined in
terms of ASCII, like skipAsciiWhitespace(), splitByComma(),
splitByLineAscii().
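
For illustration, a sketch of the first kind (skipAsciiWhitespace is the hypothetical name from the list above, not a Phobos function). It is safe on code units because in UTF-8 every code unit of a non-ASCII code point has its high bit set, so it can never compare equal to an ASCII whitespace character:

inout(char)[] skipAsciiWhitespace(inout(char)[] s) pure nothrow @nogc @safe
{
    size_t i = 0;
    while (i < s.length &&
           (s[i] == ' ' || s[i] == '\t' || s[i] == '\r' || s[i] == '\n'))
        ++i;
    return s[i .. $];
}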

> I would also think that if you know your strings are in the same normalization form (for example because they come from the same normalized source), you can check two strings for equality at the code unit level, but my understanding of Unicode is still quite lacking, so I'm not sure about that.

That's correct.

-- 
Marco

May 31, 2016
On Monday, 30 May 2016 at 21:51:36 UTC, Walter Bright wrote:
> On 5/30/2016 8:34 AM, Marc Schütz wrote:
>> In an ideal world, we'd also want to change the way `length` and `opIndex` work,
>
> Why? strings are arrays of code units.

So, strings are _implemented_ as arrays of code units. But indiscriminately treating them as such in all situations leads to wrong results (just like arrays of code points would).

In an ideal world, the programs someone intuitively writes will do the right thing, and if they can't, they at least refuse to compile. If we agree that it's up to the user whether to iterate over a string by code units, code points, or graphemes, and that we shouldn't arbitrarily choose one of those (except when we know that it's what the user wants), then the same applies to indexing, slicing and counting.

On the other hand, changing such low-level things will likely be impractical; that's why I said "In an ideal world".

> All the trouble comes from erratically pretending otherwise.

For me, the trouble comes from pretending otherwise _without being told to_.

To make sure there are no misunderstandings, here is what is suggested as an alternative to the current situation:

* `char[]`, `wchar[]` (and `dchar[]`?) no longer pass `isInputRange`.
* Ranges with element type `char`, `wchar`, and `dchar` do pass `isInputRange`.
* A bunch of rangeifying helpers are added: `byCodePoint`, `byCodeUnit`, `byChar`, `byWchar`, `byDchar`, ... (most already exist, in `std.utf` and `std.uni` rather than `std.string`; see the usage sketch at the end of this post).
* Algorithms like `find`, `join(er)` get overloads that accept char slices directly.
* Built-in operators and `length` of char slices are unchanged.

Advantages:

* Algorithms that can work _correctly_ without any kind of decoding will do so.
* Algorithms that would yield incorrect results won't compile, requiring the user to make a decision regarding the desired element type.
* No auto-decoding.
  => Best performance depending on the actual requirements.
  => No results that look correct when tested with only precomposed characters but are wrong in the general case.
* Behaviour of [] and .length is no worse than today.
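
To make the intended usage concrete, here is a sketch of what call sites look like with the existing helpers from `std.utf`/`std.uni` (illustration only, not part of the proposal itself):

import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "noe\u0308l"; // "noël" with a combining diaeresis

    // The caller states explicitly which level to operate at:
    assert(s.byCodeUnit.walkLength == 6);  // code units
    assert(s.byDchar.walkLength == 5);     // code points
    assert(s.byGrapheme.walkLength == 4);  // graphemes

    // Under the proposal, a bare s.canFind('x') would no longer
    // compile; the explicit version would:
    assert(!s.byCodeUnit.canFind('x'));
}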
May 31, 2016
On Tuesday, 31 May 2016 at 13:33:14 UTC, Marc Schütz wrote:
> On Monday, 30 May 2016 at 21:51:36 UTC, Walter Bright wrote:
>> [...]
>
> So, strings are _implemented_ as arrays of code units. But indiscriminately treating them as such in all situations leads to wrong results (just like arrays of code points would).
>
> [...]

If we follow Adam's proposal to deprecate front, back, popFront and popBack, we don't even need to touch the compiler, and doing so is trivial:
the proof-of-concept change needs only eight lines.

https://github.com/dlang/phobos/pull/4384

Explicitly stating the type of iteration in the 132 places with auto-decoding in Phobos doesn't sound that terrible.
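
For readers unfamiliar with the mechanism: the auto-decoding front/popFront for narrow strings live in std.range.primitives in Phobos, not in the compiler, which is why no compiler change is needed. A rough sketch of what deprecating one of them might look like (illustrative only; this is not the actual diff in the PR above):

import std.traits : isNarrowString;

// Mirrors the existing auto-decoding primitive, but now warns.
deprecated("auto-decoding is deprecated; use .byCodeUnit, .byDchar, etc.")
@property dchar front(T)(const(T)[] s)
if (isNarrowString!(T[]))
{
    import std.utf : decode;
    size_t i = 0;
    return decode(s, i); // still decodes the first code point
}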