Autodecode?
August 16, 2020 (JN)
Related to this thread: https://forum.dlang.org/post/xtjzhkvszdiwvrmryubq@forum.dlang.org

I don't want to hijack it with my newbie questions. What is autodecode and why is it such a big deal? From what I've seen it's related to handling Unicode characters? And D has the wrong defaults?
August 16, 2020 (aberba)
On Sunday, 16 August 2020 at 20:53:41 UTC, JN wrote:
> Related to this thread: https://forum.dlang.org/post/xtjzhkvszdiwvrmryubq@forum.dlang.org
>
> I don't want to hijack it with my newbie questions. What is autodecode and why is it such a big deal? From what I've seen it's related to handling Unicode characters? And D has the wrong defaults?

https://forum.dlang.org/thread/qitnkf$2736$1@digitalmars.com?page=1
August 16, 2020 (Paul Backus)
On Sunday, 16 August 2020 at 20:53:41 UTC, JN wrote:
> Related to this thread: https://forum.dlang.org/post/xtjzhkvszdiwvrmryubq@forum.dlang.org
>
> I don't want to hijack it with my newbie questions. What is autodecode and why is it such a big deal? From what I've seen it's related to handling Unicode characters? And D has the wrong defaults?

For built-in arrays, the range primitives (empty, front, popFront, etc.) are implemented as free functions in the standard-library module `std.range.primitives`. [1]

For most arrays, these work the way you'd expect: empty checks if the array is empty, front returns `array[0]`, and popFront does `array = array[1..$]`.
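
For example, something like this quick sketch shows the primitives in action on an ordinary array:

import std.range.primitives;

void main()
{
    int[] xs = [1, 2, 3];
    assert(!xs.empty);     // empty: is the array length 0?
    assert(xs.front == 1); // front: same as xs[0]
    xs.popFront();         // popFront: same as xs = xs[1 .. $]
    assert(xs == [2, 3]);
}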

But for char[] and wchar[] specifically, `front` and `popFront` work differently. They treat the arrays as UTF-8 or UTF-16 encoded Unicode strings, and return/pop the first *code point* instead of the first *code unit*. In other words, they "automatically decode" the array.
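
For instance (a small sketch; the literal assumes a UTF-8 source file, which is the default):

import std.range.primitives;

void main()
{
    string s = "é"; // one code point, but two UTF-8 code units
    static assert(is(typeof(s[0]) == immutable(char))); // indexing sees a code unit
    static assert(is(typeof(s.front) == dchar));        // front decodes a whole code point
    assert(s.length == 2);       // length counts code units
    assert(s.front == '\u00E9'); // front returns the decoded 'é'
}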

This has a number of annoying consequences. New users get mysterious template errors in the middle of range pipelines complaining about a mismatch between `dchar` (the type of a code point) and `char` (the type of a code unit). Generic code that deals with arrays has to add special cases for char[] and wchar[]. Strings don't work correctly in betterC because Unicode decoding can throw an exception. [2] If you search the forums, you'll find plenty more complaints.
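
Here's a minimal sketch of the kind of mismatch that surprises people:

import std.algorithm : map;
import std.array : array;

void main()
{
    // ranges iterate a string by dchar, so the collected result is a dchar[]
    auto copied = "hello".map!(c => c).array;
    static assert(is(typeof(copied) == dchar[]));

    // char[] oops = "hello".map!(c => c).array; // Error: cannot convert dchar[] to char[]
}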

The intent behind autodecoding was to help programmers avoid common Unicode-related errors by doing "the right thing" by default. The problem is that (a) decoding to code points isn't always the right thing, and (b) autodecoding ended up causing a bunch of additional problems of its own.

[1] http://dpldocs.info/experimental-docs/std.range.primitives.html
[2] https://issues.dlang.org/show_bug.cgi?id=20139
August 16, 2020 (Steve)
On 8/16/20 4:53 PM, JN wrote:
> Related to this thread: https://forum.dlang.org/post/xtjzhkvszdiwvrmryubq@forum.dlang.org
> 
> I don't want to hijack it with my newbie questions. What is autodecode and why is it such a big deal? From what I've seen it's related to handling Unicode characters? And D has the wrong defaults?

Aside from what others have said, autodecode isn't really terrible as a default. But what IS terrible is the inconsistency.

Phobos says char[] is not an array, but the language says it is.

e.g.:

import std.range.primitives;

void main()
{
    char[] example = "hi".dup;
    static assert(!hasLength!(typeof(example)));           // Phobos: no length here!
    auto l = example.length;                                // dlang: um... yes, there is.
    static assert(!isRandomAccessRange!(typeof(example)));  // Phobos: no indexing (not a random-access range)!
    auto e = example[0];                                    // D: yes, you can index.
}

And probably my favorite WTF:

// (continuing inside main from the example above)
static assert(is(ElementType!(typeof(example)) == dchar)); // Phobos: char[] is a range of dchar!
foreach (e; example) static assert(is(typeof(e) == char)); // D: nope, it's an array of char

This leads to all kinds of fun stuff. For example, try chaining together several char[] arrays and then converting the result into an array. Surprise! It's a dchar[], and you just wasted a bunch of CPU cycles decoding them, not to mention the RAM to store the result.
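
A quick sketch of what I mean:

import std.array : array;
import std.range : chain;

void main()
{
    char[] a = "foo".dup;
    char[] b = "bar".dup;
    auto joined = chain(a, b).array;              // decodes the UTF-8 along the way
    static assert(is(typeof(joined) == dchar[])); // surprise: not a char[]
}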

But then Phobos, while telling you that all these things aren't true, goes behind your back and implements all kinds of special cases to deal with these narrow strings *using the array interfaces*, because that performs better.

We will be much better off once we're done with autodecoding. And in many cases, autodecoding is just fine! Most of the time, you only care about the string as a whole and not what it's made of. Many other languages do "autodecoding", and in fact their string type is opaque. But then they give you ways to use it that aren't silly (for example, concatenating two strings knows what the underlying types are and figures out the most efficient way possible). If `string` were a custom type and not an array, we probably wouldn't have so many issues with it.
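
A purely hypothetical sketch (made-up type, not a real proposal) of what an opaque string could look like:

// Hypothetical opaque string type: the representation stays private
struct Str
{
    private immutable(char)[] data; // UTF-8 storage as an implementation detail

    // concatenation knows the underlying encoding, so no decoding is needed
    Str opBinary(string op : "~")(Str rhs)
    {
        return Str(data ~ rhs.data);
    }

    // decoding to code points happens only when you explicitly ask for it
    auto byCodePoint()
    {
        import std.utf : byDchar;
        return data.byDchar;
    }
}

unittest
{
    auto s = Str("hello, ") ~ Str("world"); // just copies code units, no decoding
}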

-Steve