May 30, 2016
On Saturday, 28 May 2016 at 12:04:20 UTC, Andrei Alexandrescu wrote:
> On 5/28/16 6:59 AM, Marc Schütz wrote:
>> The fundamental problem is choosing one of those possibilities over the
>> others without knowing what the user actually wants, which is what both
>> BEFORE and AFTER do.
>
> OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string, and furthermore iterating for each constituent of a string should be fairly rare. Strings and substrings yes, but not individual points/units/graphemes unless expressly asked. (Indeed some languages treat strings as first-class entities and individual characters are mere short substrings.)
>
> So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.

I think this is going too far. It's sufficient if they (= char slices, not ranges) can't be iterated over directly, i.e. aren't input ranges (and maybe don't work with foreach). That would force the user to append .byCodeUnit etc. as needed.

By the way, this provides a very nice deprecation path; it's just not clear whether it can be implemented with the way `deprecated` currently works. That is: deprecate/warn every time auto-decoding kicks in, print a nice message to the user, and later remove auto-decoding and make isInputRange!string return false.
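
For reference, a minimal sketch of where things stand today, checked at compile time. It assumes a Phobos build with auto-decoding enabled (the default) and uses only existing names (`isInputRange`, `ElementType`, `std.utf.byCodeUnit`); it does not show the proposed change itself.

```d
import std.range.primitives : ElementType, isInputRange;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// Today a string is an input range, and its range element type is the
// auto-decoded code point, not the stored code unit.
static assert(isInputRange!string);
static assert(is(ElementType!string == dchar));

// Indexing the same string yields a UTF-8 code unit.
static assert(is(typeof("a"[0]) == immutable(char)));

// Opting in explicitly gives a range of code units instead.
static assert(is(Unqual!(ElementType!(typeof("a".byCodeUnit))) == char));

void main() {}
```

Under the proposal, the first assertion would eventually flip to false, while the `byCodeUnit` line would keep working unchanged.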
May 30, 2016
On 05/30/2016 07:58 AM, Marc Schütz wrote:
> On Saturday, 28 May 2016 at 12:04:20 UTC, Andrei Alexandrescu wrote:
>> On 5/28/16 6:59 AM, Marc Schütz wrote:
>>> The fundamental problem is choosing one of those possibilities over the
>>> others without knowing what the user actually wants, which is what both
>>> BEFORE and AFTER do.
>>
>> OK, that's a fair argument, thanks. So it seems there should be no
>> "default" way to iterate a string, and furthermore iterating for each
>> constituent of a string should be fairly rare. Strings and substrings
>> yes, but not individual points/units/graphemes unless expressly asked.
>> (Indeed some languages treat strings as first-class entities and
>> individual characters are mere short substrings.)
>>
>> So it harkens back to the original mistake: strings should NOT be
>> arrays with the respective primitives.
>
> I think this is going too far. It's sufficient if they (= char slices,
> not ranges) can't be iterated over directly, i.e. aren't input ranges
> (and maybe don't work with foreach).

That's... what I said. -- Andrei

May 30, 2016
On Monday, 30 May 2016 at 12:45:27 UTC, Andrei Alexandrescu wrote:
> That's... what I said. -- Andrei

You said "not arrays", he said "not ranges".

So that just means adding a constraint like if(!isSomeString!T) to std.range.primitives.popFront and front.

Language built-ins still work, but the library rejects them.


Indeed, we could add a deprecated overload then that points people to the other range getter methods (byCodeUnit, byCodePoint, byGrapheme, etc.)... this might be our migration path.
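
A rough sketch of what that could look like, written as free functions in a user module rather than as an actual Phobos patch. Both `front` overloads here are hypothetical stand-ins for the ones in std.range.primitives, and the deprecation text is made up for illustration; only `isSomeString` and `std.utf.decode` are existing library names.

```d
import std.traits : isSomeString;
import std.utf : decode;

// Hypothetical replacement for the generic array `front`: strings excluded.
@property ref T front(T)(T[] a)
if (!isSomeString!(T[]))
{
    assert(a.length, "front of an empty array");
    return a[0];
}

// Hypothetical deprecated overload: old string code keeps compiling, but
// the compiler prints a message steering users to an explicit choice.
deprecated("iterating a string directly is ambiguous; use byCodeUnit, byCodePoint or byGrapheme")
@property dchar front(T)(T[] a)
if (isSomeString!(T[]))
{
    size_t i = 0;
    return decode(a, i); // the decoding that happens implicitly today
}

void main()
{
    int[] xs = [1, 2, 3];
    assert(xs.front == 1); // ordinary arrays are unaffected
    // "abc".front would still compile here, with the deprecation message.
}
```

Once the deprecation period ends, dropping the second overload would leave strings rejected by the range primitives while built-in indexing and slicing stay as they are.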
May 30, 2016
On Monday, 30 May 2016 at 12:59:08 UTC, Adam D. Ruppe wrote:
> On Monday, 30 May 2016 at 12:45:27 UTC, Andrei Alexandrescu wrote:
>> That's... what I said. -- Andrei
>
> You said "not arrays", he said "not ranges".
>
> So that just means adding a constraint like if(!isSomeString!T) to std.range.primitives.popFront and front.
>
> Language built-ins still work, but the library rejects them.
>
>
> Indeed, we could add a deprecated overload then that points people to the other range getter methods (byCodeUnit, byCodePoint, byGrapheme, etc.)... this might be our migration path.

That's a great idea - the compiler should also issue deprecation warnings when I try to do things like:

string a  = "你好";

a[1]; // deprecation: direct access to a Unicode string is highly error-prone. Please specify the type of access. More details (shortlink)

a[1] = "b"; // deprecation: direct index assignment to a Unicode string is ...

a.length; // deprecation: a Unicode string has multiple definitions of length. Please specify your iteration (...). More details (shortlink)

...

Btw should a[] be an alias for `byCodeUnit` or also trigger a warning?
May 30, 2016
On 05/30/2016 04:35 PM, Seb wrote:
> That's a great idea - the compiler should also issue deprecation
> warnings when I try to do things like:
>
> string a  = "你好";
>
> a[1]; // deprecation: direct access to a Unicode string is highly
> error-prone. Please specify the type of access. More details (shortlink)
>
> a[1] = "b"; // deprecation: direct index assignment to a Unicode string
> is ...
>
> a.length; // deprecation: a Unicode string has multiple definitions of
> length. Please specify your iteration (...). More details (shortlink)
>
> ...
>
> Btw should a[] be an alias for `byCodeUnit` or also trigger a warning?

All this is only sensible when we move to a dedicated string type that's not just an alias of `immutable(char)[]`.

`immutable(char)[]` explicitly is an array of code units. It would not be acceptable, in my opinion, if the normal array syntax got broken for it.
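
To make the point concrete, a small sketch of what the normal array syntax does on a string today, using the "你好" example from the quoted post. The byte values in the comments are the UTF-8 encoding of '你' (U+4F60).

```d
// `string` is nothing more than an alias for an array of UTF-8 code units.
static assert(is(string == immutable(char)[]));

void main()
{
    string a = "你好";          // two code points, three code units each
    assert(a.length == 6);      // length counts code units
    assert(a[1] == 0xBD);       // second byte of '你' (E4 BD A0), not '好'
    assert(a[0 .. 3] == "你");  // slicing on a code point boundary works
}
```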
May 30, 2016
On Monday, 30 May 2016 at 14:56:36 UTC, ag0aep6g wrote:
> All this is only sensible when we move to a dedicated string type that's not just an alias of `immutable(char)[]`.
>
> `immutable(char)[]` explicitly is an array of code units. It would not be acceptable, in my opinion, if the normal array syntax got broken for it.

I agree; most of the troubles have been with auto-decoding. In an ideal world, we'd also want to change the way `length` and `opIndex` work, but if we only fix the range primitives, we've achieved almost as much with fewer compatibility problems.
May 30, 2016
On 05/29/2016 04:47 PM, H. S. Teoh via Digitalmars-d wrote:
> It depends on what you're trying to accomplish. That's the point we're
> trying to get at.  For some operations, working with code points makes
> the most sense. But for other operations, it does not.  There is no one
> representation that is best for all situations; it needs to be decided
> on a case-by-case basis.  Which is why forcing everything to decode to
> code points eventually leads to problems.

I see. Again, to me this all sounds like "naked arrays of characters are the wrong choice and should have been encapsulated in a dedicated string type". -- Andrei

May 30, 2016
On 05/28/2016 03:04 PM, Walter Bright wrote:
> On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
>> So it harkens back to the original mistake: strings should NOT be
>> arrays with
>> the respective primitives.
>
> An array of code units provides consistency, predictability,
> flexibility, and performance. It's a solid base upon which the
> programmer can build what he needs as required.

Nope. Not buying it.

> A string class does not do that

Buying it. -- Andrei
May 30, 2016
On Mon, 30 May 2016 09:26:09 +0000, Chris <wendlec@tcd.ie> wrote:

> If it's true that auto decode is unnecessary in many cases, then it shouldn't affect the whole code base. But I might be mistaken here. Maybe we should make a list of the functions where auto decode does make a difference, see how common they are, and work out a strategy from there. Destroy.

It makes a difference for every function. But it still isn't necessary in many cases. It's fairly simple:

code unit  == bytes/chars
code point == auto-decode
grapheme*  == .byGrapheme

So if you have used auto-decoding up to now, you have been iterating code points, which works correctly for most scripts in NFC**. And here lies the rub, and the reason people say auto-decoding is unnecessary most of the time: if you are working with XML, CSV, JSON or another structured text format, these all use ASCII characters for their syntax elements. Code units, code points and graphemes all become the same, and auto-decoding just slows you down.
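
A minimal sketch of the three levels, assuming a Phobos build with auto-decoding enabled (the default). The sample strings are made up for illustration; `byCodeUnit`, `byGrapheme`, `walkLength` and `count` are existing Phobos functions.

```d
import std.algorithm.searching : count;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // 'o' followed by U+0308 COMBINING DIAERESIS: the NFD spelling of 'ö'.
    string s = "o\u0308";

    assert(s.byCodeUnit.walkLength == 3); // code units: 1 + 2 UTF-8 bytes
    assert(s.walkLength == 2);            // code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // graphemes

    // ASCII syntax characters can be located without any decoding, because
    // the bytes of a multi-byte UTF-8 sequence never look like ASCII.
    string row = "Törn,Überblick";
    assert(row.byCodeUnit.count(',') == 1);
}
```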

When on the other hand you work with real world international text, you'll want to work with graphemes. One example is putting an ellipsis in long text:

"Alle Segeltörns im Überblick" (in NFD, e.g. OS X file name)
may display as this with auto-decode:
"Alle Segelto…¨berblick"
and this with byGrapheme:
"Alle Segeltö…Überblick"
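
The difference can be sketched with two truncation helpers. `prefixByCodePoint` and `prefixByGrapheme` are hypothetical names made up for this illustration; `std.utf.stride` and `std.uni.graphemeStride` are the existing Phobos functions they rely on, and both measure in code units.

```d
import std.uni : graphemeStride;
import std.utf : stride;

/// Longest prefix of `s` with at most `n` code points (hypothetical helper).
string prefixByCodePoint(string s, size_t n)
{
    size_t i = 0;
    for (size_t k = 0; k < n && i < s.length; ++k)
        i += stride(s, i); // code units in the code point starting at i
    return s[0 .. i];
}

/// Longest prefix of `s` with at most `n` graphemes (hypothetical helper).
string prefixByGrapheme(string s, size_t n)
{
    size_t i = 0;
    for (size_t k = 0; k < n && i < s.length; ++k)
        i += graphemeStride(s, i); // code units in the grapheme starting at i
    return s[0 .. i];
}

void main()
{
    string s = "Segelto\u0308rns"; // NFD: 'o' + combining diaeresis

    // Cutting after 7 code points separates the 'o' from its diaeresis.
    assert(prefixByCodePoint(s, 7) == "Segelto");
    // Cutting after 7 graphemes keeps the 'ö' intact.
    assert(prefixByGrapheme(s, 7) == "Segelto\u0308");
}
```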

But at that point you are likely also in need of localized sorting of strings, a set of algorithms that may change with the rise and fall of nations or reformations. So you'll use the platform's go-to Unicode library instead of what Phobos offers. For Java and Linux that would be ICU***.

That last point makes me think we should not bother much with decoding in Phobos at all. Odds are we lack the other capabilities needed to make good use of it. Users of auto-decoding should review their code to see whether code points are really what they want, and potentially switch to no decoding or .byGrapheme.

* What we typically perceive as one unit in written text.
** A normalization form where e.g. 'ö' is a single code point,
   as opposed to NFD, where 'ö' would be assembled from the
   two code points 'o' and '¨', as in OS X file names.
*** http://site.icu-project.org/home#TOC-What-is-ICU-

-- 
Marco

May 30, 2016
On 05/29/2016 09:58 PM, Jack Stouffer wrote:
>
> The problem is not active users. The problem is companies who have > 10K
> LOC and libraries that are no longer maintained. E.g. It took
> Sociomantic eight years after D2's release to switch only a few parts of
> their projects to D2. With the loss of old libraries/old code (even old
> answers on SO), all of a sudden you lose a lot of the network effect
> that makes programming languages much more useful.
>

D1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.