June 02, 2016
On Wednesday, 1 June 2016 at 14:29:58 UTC, Andrei Alexandrescu wrote:
> On 06/01/2016 06:25 AM, Marc Schütz wrote:
>> On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:
>>> The point is to operate on representation-independent entities
>>> (Unicode code points) instead of low-level representation-specific
>>> artifacts (code units).
>>
>> _Both_ are low-level representation-specific artifacts.
>
> Maybe this is a misunderstanding. Representation = how things are laid out in memory. What does associating numbers with various Unicode symbols have to do with representation? --

Ok, if you define it that way, sure. I was thinking in terms of the actual text: Unicode is a way to represent that text using a variety of low-level representations: UTF8/NFC, UTF8/NFD, unnormalized UTF8, UTF16 big/little endian x normalization, UTF32 x normalization, some other more obscure ones. From that viewpoint, auto decoded char[] (= UTF8) is equivalent to dchar[] (= UTF32). Neither of them is the actual text.

Both writing and the memory representation consist of fundamental units. But there is no 1:1 relationship between the units of char[] (UTF8 code units) or auto decoded strings (Unicode code points) on the one hand, and the units of writing (graphemes) on the other.
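
To make the three levels concrete, here is a minimal sketch that counts code units, code points and graphemes for one string. It assumes a Phobos with auto-decoding for narrow strings; the sample string and the names `Counts` and `countAll` are illustrative only.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Counts for one string at the three levels discussed above.
struct Counts { size_t codeUnits, codePoints, graphemes; }

Counts countAll(string s)
{
    return Counts(
        s.length,                 // UTF-8 code units (plain array length)
        s.walkLength,             // code points (string is auto-decoded as a range)
        s.byGrapheme.walkLength); // graphemes (units of writing)
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    auto c = countAll("e\u0301");
    assert(c.codeUnits == 3);
    assert(c.codePoints == 2);
    assert(c.graphemes == 1);
}
```

None of the first two counts matches the number of characters a reader would see, which is the point about there being no 1:1 relationship.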
June 02, 2016
> ...
>
> B) This strange feature you need to know about is here because we chose comparability with old code, over building the best language possible. The language managed to continue growing (but not as fast as we hoped) only because of the other good features. You should use this feature and here's a long list of things you need to consider when avoiding it.

B) This strange feature is here because we chose compatibility with old code over building the best language possible. The language managed to continue growing (but not as fast as we hoped) only because of the other good features. You shouldn't use this feature because of this or that potential pitfall, and here's a long list of things you need to consider when avoiding it.

> ...



June 02, 2016
On 6/1/16 6:24 PM, Andrei Alexandrescu wrote:
> On 06/01/2016 06:09 PM, ZombineDev wrote:

>> Deprecating front, popFront and empty for narrow
>> strings is what we are talking about here.
>
> That will not happen. Walter and I consider the cost excessive and the
> benefit too small.

If this doesn't happen, then all this push to change anything in Phobos is completely wasted effort. As long as arrays aren't treated like arrays, we will have to deal with auto-decoding.

You can change string literals to be something other than arrays, and then we have a path forward. But as long as char[] is not an array, we have lost the battle of sanity.

-Steve
June 02, 2016
On 06/02/2016 06:42 AM, ZombineDev wrote:
> On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu wrote:
>> On 06/01/2016 06:09 PM, ZombineDev wrote:
>>> Regardless of how different people may call it, it's not what this
>>> thread is about.
>>
>> Yes, definitely - but then again we can't after each invalidated claim
>> to go "yeah well but that other point stands".
>
> My claim was not invalidated. I just didn't want to waste time arguing
> about it, because it is off topic. My point was that foreach is a purely
> language construct that doesn't know about the std.range.primitives
> module, therefore doesn't use it and therefore foreach doesn't perform
> **auto**decoding. It does perform explicit decoding because you need to
> specify a different type of iteration variable to trigger the behavior.
> If the variable type is not specified, you won't get any decoding (it
> will instead iterate over the code units).

Your claim was obliterated, and now you continue arguing it by adjusting term definitions on the fly, while at the same time awesomely claiming to choose the high road by not wasting time to argue it. I should remember the trick :o). Stand with the points that stand, own those that don't.

>>> Deprecating front, popFront and empty for narrow
>>> strings is what we are talking about here.
>>
>> That will not happen. Walter and I consider the cost excessive and the
>> benefit too small.
>
> On the other hand many people think that the cost of using a language
> (like C++) that has accumulated excessive number of bad design decisions
> and pitfalls is too high.
> Keeping bad design decisions alienates existing users and repulses new
> ones.

Definitely. It's a fine line to walk; this particular decision is not that much on the edge at all. We must stay with autodecoding.

> I know you are in a difficult decision making position, but imagine
> telling people ten years from now:
>
> A) For the last ten years we worked on fixing every bad design and
> improving all the good ones. That's why we managed to expand our market
> share/mind share 10x-100x to what we had before.

I think we have underperformed and we need to do radically better. I'm on the lookout for radical new approaches to things all the time. This is for another discussion though.

> B) This strange feature you need to know about is here because we chose
> comparability with old code, over building the best language possible.
> The language managed to continue growing (but not as fast as we hoped)
> only because of the other good features. You should use this feature and
> here's a long list of things you need to consider when avoiding it.

There are many components to the decision, not only compatibility with old code.

> The majority of D users ten years from now are not yet D users. That's
> the target group you need to consider. And given the overwhelming
> support for fixing this problem by the existing users, you need to
> reevaluate your cost vs benefit metrics.

It's funny that evidence for the "overwhelming" support is the vote of 35 voters, which was cast in terms of percentages. Math is great.

ZombineDev, I've been at the top level in the C++ community for many many years, even after I wanted to exit :o). I'm familiar with how the committee that steers C++ works, a perspective that is unique in our community - even Walter lacks it. I see trends and patterns. It is interesting how easily a small but very influential priesthood can alienate itself from the needs of the larger community and get into a frenzy over matters that are simply missing the point.

This is what's happening here. We worked ourselves to a foam because the creator of the language started a thread entitled "The Case Against Autodecode", whilst fully understanding there is no way to actually eliminate autodecode. The very definition of a useless debate, the kind he and I had agreed to not initiate anymore. It was a mistake. I'm still metaphorically angry at him for it. I admit I started it by asking the question, but Walter shouldn't have answered. Following that, there was blood in the water; any of us loves to improve something by 2% by completely rewiring the thing. A proneness to doing that is why we self-select to be in this community and forum.

Meanwhile, I go to conferences. Train and consult at large companies. Dozens every year, cumulatively thousands of people. I talk about D and ask people what it would take for them to use the language. Invariably I hear a surprisingly small number of reasons:

* The garbage collector eliminates probably 60% of potential users right off.

* Tooling is immature and of poorer quality compared to the competition.

* Safety has holes and bugs.

* Hiring people who know D is a problem.

* Documentation and tutorials are weak.

* There's no web services framework (by this time many folks know of D, but of those a shockingly small fraction has even heard of vibe.d). Over a year ago I argued strongly with Sönke, and also in this forum, for bundling vibe.d with dmd. There wasn't enough interest.

* (On Windows) if it doesn't have a compelling Visual Studio plugin, it doesn't exist.

* Let's wait for the "herd effect" (corporate support) to start.

* Not enough advantages over the competition to make up for the weaknesses above.

There is a second echelon of arguments related to language proper issues, but those collectively count as much less than the above. And "inefficient/poor/error-prone string handling" has NEVER come up. Literally NEVER, even among people who had some familiarity with D and would otherwise make very informed comments about it.

Look at reddit and hackernews, too - admittedly other self-selected communities. Language debates often spring up. How often is the point being made that D is wanting because of its string support? Nada.

> This theme (breaking code) has come up many times before and I think
> that instead of complaining about the cost, we should focus on lowering it
> with tooling. The problem I currently see is that there is not enough
> support for building and improving tools like dfix and leveraging them
> for language/std lib design process.

Currently dfix is weak because it doesn't do lookup. So we need to make the front end into a library. Daniel said he wants to be on it, but he has two jobs to worry about so he's short on time. There are only so many hours in the day, and I think the right focus is on attacking the matters above.

>>> This has little to do with
>>> explicit string transcoding in foreach.
>>
>> It is implicit, not explicit.
>>
>>> I don't think anyone has a
>>> problem with it, because it is **opt-in** and easy to change to get the
>>> desired behavior.
>>
>> It's not opt-in.
>
> You need to opt-in by specifying the type of the iteration variable
> and that type needs to be different than the typeof(array[0]). That's
> opt-in in my book.

Taking exception to language rules for iteration with dchar is not opt-in.

>> There is no way to tell foreach "iterate this array by converting char
>> to dchar by the usual language rules, no autodecoding". You can if you
>> e.g. use uint for the iteration variable. Same deal as with
>> .representation.
>
> Again, off topic.

It's very on-topic. It's surprising semantics compared to the rest of the language, for which the user needs to be informed.
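
For readers following along, a small sketch of the foreach behavior under discussion. The helper names `countUnits` and `countPoints` and the sample string are illustrative, not from any library.

```d
import std.string : representation;

// Iterations when foreach infers the element type from the array.
size_t countUnits(string s)
{
    size_t n;
    foreach (c; s) ++n;       // c is immutable(char): one step per code unit
    return n;
}

// Iterations when the loop variable is declared as dchar.
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s) ++n; // foreach decodes UTF-8 into code points
    return n;
}

void main()
{
    string s = "∑";           // U+2211, encoded as E2 88 91 in UTF-8
    assert(countUnits(s) == 3);
    assert(countPoints(s) == 1);

    // .representation exposes the same bytes as immutable(ubyte)[],
    // so the element type can no longer trigger decoding.
    auto bytes = s.representation;
    static assert(is(typeof(bytes) == immutable(ubyte)[]));
    assert(bytes.length == 3 && bytes[0] == 0xE2);
}
```

The only difference between the two loops is the declared type of the iteration variable, which elsewhere in the language would select an ordinary conversion rather than a decoding step.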

> No sane person wants automatic conversion (bitcast)
> from char to dchar, because dchar gives the impression of a fully
> decoded code point, which the result of such cast would certainly not
> provide.

void fun(char c) {
    if (c < 0x80) {
        // Look ma I'm not a sane person
        dchar d = c; // conversion is implicit, too
        ...
    }
}

>>> On the other hand, trying to prevent Phobos from autodecoding without
>>> typesystem defeating hacks like .representation is an uphill battle
>>> right now.
>>
>> Characterizing .representation as a typesystem defeating hack is a
>> stretch. What memory safety issues is it introducing?
>
> Memory safety is not the only benefit of a type system. This goal is
> only a small subset of the larger goal of preventing logical errors and
> allowing greater expressiveness.

This sounds like "no comeback here so let's insert a filler". Care to substantiate?

> You may as well invent a memory safe subset of D that works only ubyte,
> ushort, uint, ulong and arrays of those types, but I don't think anyone
> would want to use such language. Using .representation in parts of your
> code, makes those parts like the aforementioned language that no one
> wants to use.

I disagree.


Andrei

June 02, 2016
On 06/02/2016 09:05 AM, Steven Schveighoffer wrote:
> On 6/1/16 6:24 PM, Andrei Alexandrescu wrote:
>> On 06/01/2016 06:09 PM, ZombineDev wrote:
>
>>> Deprecating front, popFront and empty for narrow
>>> strings is what we are talking about here.
>>
>> That will not happen. Walter and I consider the cost excessive and the
>> benefit too small.
>
> If this doesn't happen, then all this push to change anything in Phobos
> is completely wasted effort.

Really? "Anything"?

> As long as arrays aren't treated like
> arrays, we will have to deal with auto-decoding.
>
> You can change string literals to be something other than arrays, and
> then we have a path forward. But as long as char[] is not an array, we
> have lost the battle of sanity.

Yeah, it's a miracle the language stays glued eh.

Your post is a prime example that this thread has lost the battle of sanity. I'll destroy you in person tonight.


Andrei
June 02, 2016
On 6/1/16 6:31 AM, Marc Schütz wrote:
> On Wednesday, 1 June 2016 at 01:13:17 UTC, Steven Schveighoffer wrote:
>> On 5/31/16 4:38 PM, Timon Gehr wrote:
>>> What about e.g. joiner?
>>
>> Compiler error. Better than what it does now.
>
> I believe everything that does only concatenation will work correctly.
> That's why joiner() is one of those algorithms that should accept
> strings directly without going through any decoding (but it may need to
> recode the joining element itself, of course).

This means that a string is a range. What is it a range of? If you want to make it a range of code units, I think you will lose that battle.

If you want to special-case joiner for strings, that's always possible. Or string could be changed to be a range of dchar struct explicitly. Then at least joiner makes sense, and I can reasonably explain why it behaves the way it does.

-Steve
June 02, 2016
On 02.06.2016 12:38, deadalnix wrote:
>>
>
> This, deep down, point at the fact that conversion from/to char types
> are ill defined.
>
> One should be able to convert from char to byte/ubyte but not the other
> way around.
> One should be able to convert from byte to short but not from char to
> wchar.
>
> Once you disable the naive conversions, then the autodecoding in foreach
> isn't inconsistent anymore.

The current situation is bad:

void main(){
    import std.utf,std.stdio;
    foreach(dchar d;"∑")
        writeln(d); // "∑"
    foreach(dchar d;"∑".byCodeUnit)
        writeln(d); // "â", then U+0088 and U+0091 (the code units E2 88 91, each widened to dchar)
}

Implicit conversion should not happen, and I'd prefer both of them to behave the same. (I.e. make both a compile-time error or decode for both).
June 02, 2016
On 01.06.2016 23:48, Andrei Alexandrescu wrote:
> On 06/01/2016 05:30 PM, Jack Stouffer wrote:
>> On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:
>>> foreach (dchar x; a) {}
>>> The latter two do autodecoding, not conversion as the rest of the
>>> language.
>>
>> This seems to be a miscommunication with semantics. This is not
>> auto-decoding at all; you're decoding, but there is nothing "auto" about
>> it. This code is an explicit choice by the programmer to do something.
>
> No, this is autodecoding pure and simple. We can't move the goals
> whenever we don't like where the ball gets.

It does not share most of the characteristics that make Phobos' autodecoding painful in practice.

> The usual language rules are
> not applied for strings - they are autodecoded (i.e. there's code
> generated that magically decodes UTF surprisingly for beginners, in
> apparent violation of the language rules, and without any user-visible
> request) by the foreach statement. -- Andrei
>

Agreed.

(But implicit conversion from char to dchar is a bad language rule.)
June 02, 2016
On 6/2/16 9:09 AM, Andrei Alexandrescu wrote:
> On 06/02/2016 09:05 AM, Steven Schveighoffer wrote:
>> On 6/1/16 6:24 PM, Andrei Alexandrescu wrote:
>>> On 06/01/2016 06:09 PM, ZombineDev wrote:
>>
>>>> Deprecating front, popFront and empty for narrow
>>>> strings is what we are talking about here.
>>>
>>> That will not happen. Walter and I consider the cost excessive and the
>>> benefit too small.
>>
>> If this doesn't happen, then all this push to change anything in Phobos
>> is completely wasted effort.
>
> Really? "Anything"?

The push to make Phobos only use byDchar (or any other band-aid fixes for this issue) is what I meant by anything. Not "anything" anything :)

>> As long as arrays aren't treated like
>> arrays, we will have to deal with auto-decoding.
>>
>> You can change string literals to be something other than arrays, and
>> then we have a path forward. But as long as char[] is not an array, we
>> have lost the battle of sanity.
>
> Yeah, it's a miracle the language stays glued eh.

I mean as far as narrow strings are concerned. To have the language tell me, yes, char[] is an array with a .length member, but hasLength is false? What, str[4] works, but isRandomAccessRange is false?

Maybe it's more Orwellian than insane: Phobos is saying 2 + 2 = 5 ;)
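
The mismatch can be checked at compile time. A minimal sketch, assuming a Phobos where narrow strings are still auto-decoded:

```d
import std.range.primitives : ElementType, hasLength, isRandomAccessRange;

// The built-in array operations exist on a narrow string...
static assert(is(typeof("abcde".length) == size_t));
static assert(is(typeof("abcde"[4]) == immutable(char)));

// ...but the range traits deny them, because Phobos treats
// string as a range of dchar rather than a range of char.
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);
static assert(is(ElementType!string == dchar));

void main() {}
```

If these assertions compile, the array view and the range view of the same type disagree about both length and indexing.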

> Your post is a prime example that this thread has lost the battle of
> sanity. I'll destroy you in person tonight.

It's the cynicism of talking/debating about this for years and years and not seeing any progress. We can discuss of course, and see who gets destroyed :)

And yes, I'm about to kill this thread from my newsreader, since it's wasting too much of my time...

-Steve
June 02, 2016
On 02.06.2016 15:06, Andrei Alexandrescu wrote:
> On 06/02/2016 06:42 AM, ZombineDev wrote:
>> On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu wrote:
>>> On 06/01/2016 06:09 PM, ZombineDev wrote:
>>>> Regardless of how different people may call it, it's not what this
>>>> thread is about.
>>>
>>> Yes, definitely - but then again we can't after each invalidated claim
>>> to go "yeah well but that other point stands".
>>
>> My claim was not invalidated. I just didn't want to waste time arguing
>> about it, because it is off topic. My point was that foreach is a purely
>> language construct that doesn't know about the std.range.primitives
>> module, therefore doesn't use it and therefore foreach doesn't perform
>> **auto**decoding. It does perform explicit decoding because you need to
>> specify a different type of iteration variable to trigger the behavior.
>> If the variable type is not specified, you won't get any decoding (it
>> will instead iterate over the code units).
>
> Your claim was obliterated, and now you continue arguing it by adjusting
> term definitions on the fly, while at the same time awesomely claiming
> to choose the high road by not wasting time to argue it. I should
> remember the trick :o). Stand with the points that stand, own those that
> don't.

It's not "on the fly". You two were presumably using different definitions of terms all along.