June 01, 2016
On Wednesday, 1 June 2016 at 19:07:26 UTC, ZombineDev wrote:
> On Wednesday, 1 June 2016 at 17:57:15 UTC, Andrei Alexandrescu wrote:
>> On 06/01/2016 01:35 PM, ZombineDev wrote:
>>> On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu wrote:
>>>> On 05/31/2016 02:46 PM, Timon Gehr wrote:
>>>>> On 31.05.2016 20:30, Andrei Alexandrescu wrote:
>>>>>> D's
>>>>>
>>>>> Phobos'
>>>>
>>>> foreach, too. -- Andrei
>>>
>>> Incorrect. https://dpaste.dzfl.pl/ba7a65d59534
>>
>> Try typing the iteration variable with "dchar". -- Andrei
>
> I think you are not getting my point. This is not autodecoding. There is nothing auto-magic w.r.t. strings in plain foreach. Typing char, wchar or dchar is the same as using byChar, byWchar or byDchar - it is opt-in. The only problems are the front, empty and popFront overloads for narrow strings...

in std.range.primitives.
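
For concreteness, a minimal sketch of that opt-in iteration, using the byChar/byDchar ranges from std.utf:

import std.utf : byChar, byDchar;

void main()
{
    string s = "weiß";                // 4 code points, 5 UTF-8 code units
    size_t units, points;
    foreach (c; s.byChar)  ++units;   // code units - no decoding requested
    foreach (c; s.byDchar) ++points;  // code points - decoding explicitly asked for
    assert(units == 5 && points == 4);
}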


June 01, 2016
On 06/01/2016 03:07 PM, ZombineDev wrote:
> This is not autodecoding. There is nothing auto-magic w.r.t. strings in
> plain foreach.

I understand where you're coming from, but it actually is autodecoding. Consider:

byte[] a;
foreach (byte x; a) {}
foreach (short x; a) {}
foreach (int x; a) {}

Those work by means of ordinary integral conversions (byte->short, byte->int). However:

char[] a;
foreach (char x; a) {}
foreach (wchar x; a) {}
foreach (dchar x; a) {}

The latter two do autodecoding, not conversion as in the rest of the language.
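
To make the difference concrete, here is a minimal sketch; the counts assume "é" is encoded as the two UTF-8 code units 0xC3 0xA9:

void main()
{
    char[] a = "é".dup;        // two UTF-8 code units: 0xC3, 0xA9
    size_t n;
    foreach (char x; a)  ++n;  // 2 iterations: raw code units, plain copying
    assert(n == 2);
    n = 0;
    foreach (dchar x; a) ++n;  // 1 iteration: the two units are decoded to U+00E9
    assert(n == 1);
}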


Andrei

June 01, 2016
On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:
> foreach (dchar x; a) {}
> The latter two do autodecoding, not conversion as in the rest of the language.

This seems to be a miscommunication about semantics. This is not auto-decoding at all; you're decoding, but there is nothing "auto" about it. This code is an explicit choice by the programmer to do something.

On the other hand, using std.range.primitives.front for narrow strings is auto-decoding because the programmer has not made a choice; the choice is made for the programmer.
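
A minimal sketch of that made-for-you choice, using front from std.range.primitives:

import std.range.primitives : front;

void main()
{
    string s = "é";                               // two UTF-8 code units
    static assert(is(typeof(s.front) == dchar));  // element type silently widened
    assert(s.front == 'é');                       // decoded, whether asked for or not
    assert(s.length == 2);                        // the raw code units are still there
}
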
June 01, 2016
On 06/01/2016 05:30 PM, Jack Stouffer wrote:
> On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:
>> foreach (dchar x; a) {}
> >> The latter two do autodecoding, not conversion as in the rest of the
> >> language.
>
> This seems to be a miscommunication with semantics. This is not
> auto-decoding at all; you're decoding, but there is nothing "auto" about
> it. This code is an explicit choice by the programmer to do something.

No, this is autodecoding pure and simple. We can't move the goalposts whenever we don't like where the ball lands. The usual language rules are not applied to strings - they are autodecoded by the foreach statement (i.e. code is generated that magically decodes UTF, surprising beginners, in apparent violation of the language rules, and without any user-visible request). -- Andrei

June 01, 2016
On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:
> On 06/01/2016 03:07 PM, ZombineDev wrote:
>> This is not autodecoding. There is nothing auto-magic w.r.t. strings in
>> plain foreach.
>
> I understand where you're coming from, but it actually is autodecoding. Consider:
>
> byte[] a;
> foreach (byte x; a) {}
> foreach (short x; a) {}
> foreach (int x; a) {}
>
> Those work by means of ordinary integral conversions (byte->short, byte->int). However:
>
> char[] a;
> foreach (char x; a) {}
> foreach (wchar x; a) {}
> foreach (dchar x; a) {}
>
> The latter two do autodecoding, not conversion as in the rest of the language.
>
>
> Andrei

Regardless of what different people may call it, it's not what this thread is about. Deprecating front, popFront and empty for narrow strings is what we are talking about here. This has little to do with explicit string transcoding in foreach. I don't think anyone has a problem with it, because it is **opt-in** and easy to change to get the desired behavior.
On the other hand, trying to prevent Phobos from autodecoding without typesystem defeating hacks like .representation is an uphill battle right now.
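
For reference, a minimal sketch of the .representation workaround in question (representation lives in std.string):

import std.string : representation;

void main()
{
    string s = "hello";
    // Same memory, but the "this is text" information is erased from the type:
    immutable(ubyte)[] bytes = s.representation;
    assert(bytes.length == s.length);
}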

Removing range autodecoding will also be beneficial for library writers. For example, instead of writing find specializations for char, wchar and dchar needles, it would be much more productive to focus on optimising searching for T in T[], specializing on element size and other type properties that generic code should care about. Having to specialize for all the char and string types, instead of for any type of that size that can be compared bitwise, is like programming in a language with no support for generic programming.

And as many others have pointed out, it is also about correctness. Only the users can decide whether searching at code unit, code point or grapheme level (or something else) is right for their needs. A library that pretends that a single interpretation (i.e. code point) is right for every case is a false friend.
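
A minimal sketch of why no single level is universally right; the counts assume "e" followed by the combining diaeresis U+0308 (byCodeUnit and byDchar are from std.utf, byGrapheme from std.uni):

import std.algorithm.searching : count;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "noe\u0308l";           // "noël" with a combining mark
    assert(s.byCodeUnit.count == 6);   // code units
    assert(s.byDchar.count    == 5);   // code points
    assert(s.byGrapheme.count == 4);   // graphemes (user-perceived characters)
}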

June 01, 2016
On 06/01/2016 06:09 PM, ZombineDev wrote:
> Regardless of what different people may call it, it's not what this
> thread is about.

Yes, definitely - but then again we can't, after each invalidated claim, go "yeah, well, but that other point stands".

> Deprecating front, popFront and empty for narrow
> strings is what we are talking about here.

That will not happen. Walter and I consider the cost excessive and the benefit too small.

> This has little to do with
> explicit string transcoding in foreach.

It is implicit, not explicit.

> I don't think anyone has a
> problem with it, because it is **opt-in** and easy to change to get the
> desired behavior.

It's not opt-in. There is no way to tell foreach "iterate this array by converting char to dchar by the usual language rules, no autodecoding". You can do that if you e.g. use uint for the iteration variable. Same deal as with .representation.

> On the other hand, trying to prevent Phobos from autodecoding without
> typesystem defeating hacks like .representation is an uphill battle
> right now.

Characterizing .representation as a typesystem defeating hack is a stretch. What memory safety issues is it introducing?


Andrei

June 02, 2016
On Wednesday, 1 June 2016 at 18:30:25 UTC, Wyatt wrote:
> On Wednesday, 1 June 2016 at 16:45:04 UTC, Joakim wrote:
>> On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
>>> It's not hard.  I think a lot of us remember when a 14.4 modem was cutting-edge.
>>
>> Well, then apparently you're unaware of how bloated web pages are nowadays.  It used to take me minutes to download popular web pages _back then_ at _top speed_, and those pages were a _lot_ smaller.
>
> It's telling that you think the encoding of the text is anything but the tiniest fraction of the problem.  You should look at where the actual weight of a "modern" web page comes from.

I'm well aware that text is a small part of it.  My point is that they're not downloading those web pages, they're using mobile instead, as I explicitly said in a prior post.  My only point in mentioning the web bloat to you is that _your perception_ is off because you seem to think they're downloading _current_ web pages over 2G connections, and comparing it to your downloads of _past_ web pages with modems.  Not only did it take minutes for us back then, it takes _even longer_ now.

I know the text encoding won't help much with that.  Where it will help is the mobile apps they're actually using, not the bloated websites they don't use.

>>> Codepages and incompatible encodings were terrible then, too.
>>>
>>> Never again.
>>
>> This only shows you probably don't know the difference between an encoding and a code page,
>
> "I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_"
>
> Yeah, that?  That's codepages.  And your exact proposal to put encodings in the header was ALSO tried around the time that Unicode was getting hashed out.  It sucked.  A lot.  (Not as bad as storing it in the directory metadata, though.)

You know what's also codepages?  Unicode.  The UCS is a standardized set of code pages for each language, often merely picking the most popular code page at that time.

I don't doubt that everything I'm saying has been tried in some form before. The question is whether that alternate form would be better if designed and implemented properly, not whether a botched design/implementation has ever been attempted.

>>>> Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.
>>>
>>> It _is_ kind of ludicrous, isn't it?  But it really is the least-bad option for the most text.  Sorry, bub.
>>
>> I think we can do a lot better.
>
> Maybe.  But no one's done it yet.

That's what people said about mobile devices for a long time, until about a decade ago.  It's time we got this right.

>> The vast majority of software is written for _one_ language, the local one.  You may think otherwise because the software that sells the most and makes the most money is internationalized software like Windows or iOS, because it can be resold into many markets.  But as a percentage of lines of code written, such international code is almost nothing.
>
> I'm surprised you think this even matters after talking about web pages.  The browser is your most common string processing situation.  Nothing else even comes close.

No, it's certainly popular software, but at the scale we're talking about, i.e. all string processing in all software, it's fairly small. And the vast majority of webapps that handle strings passed from a browser are written to handle only one language, the local one.

>> largely ignoring the possibilities of the header scheme I suggested.
>
> "Possibilities" that were considered and discarded decades ago by people with way better credentials.  The era of single-byte encodings is gone, it won't come back, and good riddance to bad rubbish.

Lol, credentials. :D If you think that matters at all in the face of the blatant stupidity embodied by UTF-8, I don't know what to tell you.

>> I could call that "trolling" by all of you, :) but I'll instead call it what it likely is, reactionary thinking, and move on.
>
> It's not trolling to call you out for clearly not doing your homework.

That's funny, because it's precisely you and others who haven't done your homework.  So are you all trolling me?  By your definition of trolling, which btw is not the standard one, _you_ are the one doing it.

>> I don't think you understand: _you_ are the special case.
>
> Oh, I understand perfectly.  _We_ (whoever "we" are) can handle any sequence of glyphs and combining characters (correctly-formed or not) in any language at any time, so we're the special case...?

And you're doing so by mostly using a single-byte encoding for _your own_ Euro-centric languages, i.e. ASCII, while imposing unnecessary double-byte and triple-byte encodings on everyone else, despite their outnumbering you 10 to 1.  That is the very definition of a special case.

> Yeah, it sounds funny to me, too.

I'm happy to hear you find your privilege "funny," but I'm sorry to tell you, it won't last.

>> The 5 billion people outside the US and EU are _not the special case_.
>
> Fortunately, it works for them too.

At a higher and unnecessary cost, which is why it won't last.

>> The problem is all the rest, and those just below who cannot afford it at all, in part because the tech is not as efficient as it could be yet.  Ditching UTF-8 will be one way to make it more efficient.
>
> All right, now you've found the special case; the case where the generic, unambiguous encoding may need to be lowered to something else: people for whom that encoding is suboptimal because of _current_ network constraints.
>
> I fully acknowledge it's a couple billion people and that's nothing to sneeze at, but I also see that it's a situation that will become less relevant over time.

I continue to marvel at your calling a couple billion people "the special case," presumably thinking ~700 million people in the US and EU primarily using the single-byte encoding of ASCII are the general case.

As for the continued relevance of such constrained use, I suggest you read the link Marco provided above.  The vast majority of the worldwide literate population doesn't have a smartphone or use a cellular data plan, whereas the opposite is true if you include featurephones, largely because they can be used only for voice.  As that article notes, costs for smartphones and 2G data plans will have to come down for them to go wider.  That will take decades to roll out, though the basic tech design is mostly done by now.

The costs will go down by making the tech more efficient, and ditching UTF-8 will be one of the ways the tech will be made more efficient.
June 02, 2016
On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu wrote:
>> Deprecating front, popFront and empty for narrow
>> strings is what we are talking about here.
>
> That will not happen. Walter and I consider the cost excessive and the benefit too small.
>
>> This has little to do with
>> explicit string transcoding in foreach.
>
> It is implicit, not explicit.

Do you mean you agree that the range primitives for strings could keep (auto)decoding to dchar, but require some form of explicit opt-in?
June 02, 2016
On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:
> On 06/01/2016 03:07 PM, ZombineDev wrote:
>> This is not autodecoding. There is nothing auto-magic w.r.t. strings in
>> plain foreach.
>
> I understand where you're coming from, but it actually is autodecoding. Consider:
>
> byte[] a;
> foreach (byte x; a) {}
> foreach (short x; a) {}
> foreach (int x; a) {}
>
> Those work by means of ordinary integral conversions (byte->short, byte->int). However:
>
> char[] a;
> foreach (char x; a) {}
> foreach (wchar x; a) {}
> foreach (dchar x; a) {}
>
> The latter two do autodecoding, not conversion as in the rest of the language.
>
>
> Andrei

This, deep down, points to the fact that conversions to/from char types are ill-defined.

One should be able to convert from char to byte/ubyte but not the other way around.
One should be able to convert from byte to short but not from char to wchar.

Once you disable the naive conversions, then the autodecoding in foreach isn't inconsistent anymore.
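
A minimal sketch of the naive conversions in question; all of these compile today, and under the proposed rules the last two lines would be rejected:

void main()
{
    char c = 0xC3;  // the first unit of a two-unit UTF-8 sequence
    ubyte b = c;    // fine: no claim about textual meaning is made
    wchar w = c;    // compiles today, but silently yields a different character
    dchar d = c;    // likewise: no decoding happened, only a bit copy
}
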
June 02, 2016
On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu wrote:
> On 06/01/2016 06:09 PM, ZombineDev wrote:
>> Regardless of what different people may call it, it's not what this
>> thread is about.
>
> Yes, definitely - but then again we can't after each invalidated claim to go "yeah well but that other point stands".

My claim was not invalidated. I just didn't want to waste time arguing about it, because it is off topic. My point was that foreach is purely a language construct that doesn't know about the std.range.primitives module, therefore doesn't use it, and therefore foreach doesn't perform **auto**decoding. It does perform explicit decoding, because you need to specify a different type of iteration variable to trigger the behavior. If the variable type is not specified, you won't get any decoding (it will instead iterate over the code units).
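
A minimal sketch of that default - no iteration type specified, so no decoding takes place:

void main()
{
    size_t n;
    foreach (c; "weiß")  // type inferred: c is immutable(char), a code unit
    {
        static assert(is(typeof(c) == immutable(char)));
        ++n;
    }
    assert(n == 5);      // 5 code units, although "weiß" has only 4 code points
}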

>> Deprecating front, popFront and empty for narrow
>> strings is what we are talking about here.
>
> That will not happen. Walter and I consider the cost excessive and the benefit too small.

On the other hand, many people think that the cost of using a language (like C++) that has accumulated an excessive number of bad design decisions and pitfalls is too high.
Keeping bad design decisions alienates existing users and repels new ones.

I know you are in a difficult decision making position, but imagine telling people ten years from now:

A) For the last ten years we worked on fixing every bad design and improving all the good ones. That's why we managed to expand our market share/mind share 10x-100x compared to what we had before.

B) This strange feature you need to know about is here because we chose compatibility with old code over building the best language possible. The language managed to continue growing (but not as fast as we hoped) only because of the other good features. You should avoid this feature, and here's a long list of things you need to consider when avoiding it.

The majority of D users ten years from now are not yet D users. That's the target group you need to consider. And given the overwhelming support for fixing this problem by the existing users, you need to reevaluate your cost vs benefit metrics.

This theme (breaking code) has come up many times before, and I think that instead of complaining about the cost, we should focus on lowering it with tooling. The problem I currently see is that there is not enough support for building and improving tools like dfix and leveraging them in the language/std lib design process.

>> This has little to do with
>> explicit string transcoding in foreach.
>
> It is implicit, not explicit.
>
>> I don't think anyone has a
>> problem with it, because it is **opt-in** and easy to change to get the
>> desired behavior.
>
> It's not opt-in.

You need to opt in by specifying the type of the iteration variable, and that type needs to be different from typeof(array[0]). That's opt-in in my book.

> There is no way to tell foreach "iterate this array by converting char to dchar by the usual language rules, no autodecoding". You can if you e.g. use uint for the iteration variable. Same deal as with .representation.

Again, off topic. No sane person wants an automatic conversion (bitcast) from char to dchar, because dchar gives the impression of a fully decoded code point, which the result of such a cast would certainly not provide.
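
For illustration, a minimal sketch of what such a bitcast yields; indexing already exposes it, assuming "é" is stored as the UTF-8 sequence 0xC3 0xA9:

void main()
{
    string s = "é";
    dchar d = s[0];    // the bit pattern 0xC3 reinterpreted as a code point
    assert(d == 0xC3); // that is 'Ã' (U+00C3), not the encoded 'é' (U+00E9)
}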

>> On the other hand, trying to prevent Phobos from autodecoding without
>> typesystem defeating hacks like .representation is an uphill battle
>> right now.
>
> Characterizing .representation as a typesystem defeating hack is a stretch. What memory safety issues is it introducing?

Memory safety is not the only benefit of a type system. It is only a small part of the larger goal of preventing logical errors and allowing greater expressiveness.

You may as well invent a memory-safe subset of D that works only with ubyte, ushort, uint, ulong and arrays of those types, but I don't think anyone would want to use such a language. Using .representation in parts of your code makes those parts like the aforementioned language that no one wants to use.