May 12, 2016
On Thu, May 12, 2016 at 09:20:05AM -0700, Walter Bright via Digitalmars-d wrote: [...]
> There are a lot of major efficiency gains by not autodecoding.

Any chance of killing autodecoding in Phobos in the foreseeable future? It's one of the things that I really liked about D in the beginning, but over time, I've come to realize more and more that it was a mistake. It's one of those things that only becomes clear in retrospect.


T

-- 
Ph.D. = Permanent head Damage
May 12, 2016
On 5/12/16 6:55 PM, Walter Bright wrote:
> This reminds me of all the discussions around trying to hide the fact
> that D strings are UTF-8 code units. The ultimate outcome of trying to
> make it "make sense" was the utter disaster of autodecoding.

I am as unclear about the problems of autodecoding as I am about the necessity to remove curl. Whenever I ask I hear some arguments that work well emotionally but are scant on reason and engineering. Maybe it's time to rehash them? I just did so about curl, no solid argument seemed to come together. I'd be curious of a crisp list of grievances about autodecoding. -- Andrei

May 12, 2016
On Thursday, 12 May 2016 at 16:20:05 UTC, Walter Bright wrote:
> On 5/12/2016 9:15 AM, Guillaume Chatelet wrote:
>> Well maybe it was a disaster because the problem was only half solved.
>> It looks like Perl 6 got it right:
>> https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-perl-6-and-you/
>
> Perl isn't a systems programming language. A systems language requires access to code units, invalid encodings, etc. Nor is Perl efficient. There are a lot of major efficiency gains by not autodecoding.

[Sorry for the OT]

I never claimed Perl was a systems programming language nor that it was efficient, just that their design looks more mature than ours.

Also I think you missed this part of the article:

"Of course, that’s all just for the default Str type. If you don’t want to work at a grapheme level, then you have several other string types to choose from: If you’re interested in working within a particular normalization, there’s the self-explanatory types of NFC, NFD, NFKC, and NFKD. If you just want to work with codepoints and not bother with normalization, there’s the Uni string type (which may be most appropriate in cases where you don’t want the NFC normalization that comes with normal Str, and keep text as-is). And if you want to work at the binary level, well, there’s always the Blob family of types :)."

We basically have "Uni" in D: neither a normalized nor a grapheme-level string type.
May 12, 2016
On 05/12/2016 06:29 PM, Andrei Alexandrescu wrote:
> I'd be curious of a crisp list of grievances about
> autodecoding. -- Andrei

It emits code points (dchar), which are an awkward middle ground between code units (char/wchar) and graphemes.

Without any auto-decoding at all, every array T[] would be a random-access range of Ts as well. `.front` would be the same as `[0]`, `.length` would be the same as `.walkLength`, etc. That would make things less confusing for newbies, and more experienced programmers wouldn't accidentally mix the two abstraction levels.

Of course, you'd have to be aware that a (w)char is not a character as perceived by humans, but a code unit. But auto-decoding to code points only shifts that problem: You have to be aware that a dchar is not a character either. Multiple dchars may form one visible character, one grapheme. For example, "\u00E4" and "a\u0308" encode the same grapheme: "ä".

If char[], wchar[], dchar[] (and qualified variants) were ranges of graphemes, things would make the most sense for people who are not aware of delicate details of Unicode. You wouldn't accidentally cut code points or graphemes in half, `.walkLength` makes intuitive sense, etc. You could still accidentally use `.length` or `[0]`, though. So it still has some pitfalls.
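The three abstraction levels can be observed directly; a minimal sketch using std.uni.byGrapheme (walkLength counts elements at whatever level the range iterates):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "a\u0308";   // 'a' + combining diaeresis, renders as "ä"
    assert(s.length == 3);                // UTF-8 code units (char)
    assert(s.walkLength == 2);            // auto-decoded code points (dchar)
    assert(s.byGrapheme.walkLength == 1); // graphemes: one perceived character

    string t = "\u00E4";    // precomposed "ä": same grapheme, different encoding
    assert(t.length == 2);
    assert(t.walkLength == 1);
    assert(t.byGrapheme.walkLength == 1);
}
```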
May 12, 2016
On 5/12/16 12:00 PM, Walter Bright wrote:
> On 5/12/2016 6:03 AM, Steven Schveighoffer wrote:
>> Not taking one side or another on this, but due to D doing everything
>> with
>> reals, this is already the case.
>
> Actually, C allows that behavior if I understood the spec correctly. D
> just makes things more explicit.

This is the thread I was thinking about: https://forum.dlang.org/post/ugxmeiqsbitxxzoyakko@forum.dlang.org

Essentially, copy-pasted code from C results in different behavior in D because of the different floating point decisions made by the compilers.

Your quote was: "[The problem is] Code will do one thing in C, and the same code will do something unexpectedly different in D."

My response is that it already happens, so it may not be a convincing argument. I don't know the requirements or allowances of the C spec. All I know is the testable reality of the behavior on the same platform.

-Steve
May 12, 2016
On 5/12/16 12:29 PM, Andrei Alexandrescu wrote:
> On 5/12/16 6:55 PM, Walter Bright wrote:
>> This reminds me of all the discussions around trying to hide the fact
>> that D strings are UTF-8 code units. The ultimate outcome of trying to
>> make it "make sense" was the utter disaster of autodecoding.
>
> I am as unclear about the problems of autodecoding as I am about the
> necessity to remove curl. Whenever I ask I hear some arguments that work
> well emotionally but are scant on reason and engineering. Maybe it's
> time to rehash them? I just did so about curl, no solid argument seemed
> to come together. I'd be curious of a crisp list of grievances about
> autodecoding. -- Andrei
>

Autodecoding, IMO, is not the problem. The problem is hijacking an array to mean something other than an array.

I ran into this the other day. In iopipe, I treat char[] buffers as actual buffers of code units. This makes sense, as I'm reading/writing text to buffers, and care very little about decoding.

I wanted to test my system's ability to handle random-access ranges, and I'm using isRandomAccessRange || isNarrowString to get around the autodecoding.

As soon as I do chain(a, b) where a and b are narrow strings, this doesn't work, and I can't get it back.

See my exception in the unit test: https://github.com/schveiguy/iopipe/blob/master/source/iopipe/traits.d#L91

If you want to avoid auto-decoding, you have to be very cautious about using Phobos features.
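A minimal sketch of the effect (the variables a and b here are illustrative, not iopipe code): chaining two narrow strings loses random access, because Phobos presents them as forward ranges of dchar, while std.utf.byCodeUnit opts out of autodecoding and gets it back:

```d
import std.range : chain;
import std.range.primitives : isRandomAccessRange;
import std.utf : byCodeUnit;

void main()
{
    string a = "hello", b = "world";

    // char[] arrays are random-access as arrays, but Phobos treats
    // narrow strings as decoding ranges of dchar, so the chain is not:
    static assert(!isRandomAccessRange!(typeof(chain(a, b))));

    // wrapping with byCodeUnit yields random-access ranges of char,
    // and chaining those preserves random access:
    static assert(isRandomAccessRange!(typeof(chain(a.byCodeUnit, b.byCodeUnit))));
}
```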

-Steve
May 12, 2016
On 05/12/2016 07:12 PM, ag0aep6g wrote:
[...]

tl;dr:

This is surprising to newbies, and a possible source of bugs for experienced programmers:

----
writeln("ä".length); /* "2" */
writeln("ä"[0]); /* question mark in a diamond */
----

When people understand the above, they might still not understand this:

----
writeln("length of 'a\u0308': ", "a\u0308".walkLength); /* "length of 'ä': 2" */
writeln("a\u0308".front); /* "a" */
----
May 12, 2016
On 5/12/2016 9:36 AM, Guillaume Chatelet wrote:
> just that their design looks more mature than ours.

I don't think that can be inferred from a brief article.

If you want to access D strings by various means, there's .byChar, .byWchar, .byDchar, .byCodeUnit, etc. foreach can also pick off characters by various schemes.
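For example (a small sketch; walkLength counts the elements each adapter yields):

```d
import std.range : walkLength;
import std.utf : byChar, byCodeUnit, byDchar;

void main()
{
    string s = "ä"; // one code point, two UTF-8 code units

    assert(s.byCodeUnit.walkLength == 2); // no decoding: range of char
    assert(s.byChar.walkLength == 2);     // ditto, explicitly as char
    assert(s.byDchar.walkLength == 1);    // explicit decoding: range of dchar

    // foreach picks the scheme from the loop variable's type:
    foreach (char c; s)  {} // iterates 2 code units
    foreach (dchar c; s) {} // iterates 1 code point
}
```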

May 12, 2016
On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
> I am as unclear about the problems of autodecoding

Started a new thread on that.

May 13, 2016
On Mon, 9 May 2016 04:26:55 -0700,
Walter Bright <newshound2@digitalmars.com> wrote:

> > I wonder what's the difference between 1.30f and cast(float)1.30.
> 
> There isn't one.

Oh yes, there is! Don't you love floating-point...

cast(float)1.30 rounds twice: first from the base-10 representation to a base-2 double value, and then again from double to float. 1.30f converts directly to float. In some cases the two roundings do not yield the same value as converting the base-10 literal directly to a float!

Imagine this example (with mantissa bit count reduced for
illustration):

Original base-10 rational number converted to base-2:
111110|10000000|011010111011...
       ↖        ↖
        float &  double mantissa precision limits

The 1st segment is the mantissa width of a float.
The 2nd segment is the mantissa width of a double.
The 3rd segment is the fraction used for rounding
 the base-10 literal to a double.

Conversion to double rounds down, since the fraction is < 0.5:
111110|10000000 (double mantissa)

The subsequent cast to float rounds down, too, because of the
round-to-even rule for a fraction of exactly 0.5:
111110 (float mantissa)

But the conversion of base-10 directly to float rounds UP,
since the fraction is then > 0.5:
111110|10000000_011010111011...
111111

It happens when the 29 bits by which the double mantissa exceeds the float mantissa are 100…000 and the bits immediately before and after that run are 0. The error would occur for ~0.000000047% of numbers.
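A concrete sketch of such a double rounding, assuming an x86 target where D's real is the 80-bit extended type (on platforms where real is just double, the "direct" conversion would double-round as well):

```d
void main()
{
    // 1 + 2^-24 + 2^-53 is exactly representable in an 80-bit real,
    // but in neither double (the 2^-53 bit is past bit 52) nor float.
    real r = 1.0L + 0x1p-24L + 0x1p-53L;

    float direct = cast(float) r;              // one rounding: fraction > 0.5 ulp,
                                               // rounds UP to 1 + 2^-23
    float twice  = cast(float) cast(double) r; // two roundings: tie to double
                                               // (down, to even), then tie to float
                                               // (down, to even) => exactly 1.0f

    assert(direct == 1.0f + 0x1p-23f);
    assert(twice == 1.0f);
    assert(direct != twice);
}
```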

-- 
Marco