The Case Against Autodecode (page 3)

On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
> 7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autodecode does not play with that.

This just means that filenames mustn't be represented as strings; it's unrelated to auto decoding.

On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
> Based on what I've seen in previous conversations on auto-decoding over the past few years (be it in the newsgroup, on github, or at dconf), most of the core devs think that auto-decoding was a major blunder that we continue to pay for. But unfortunately, even if we all agree that it was a huge mistake and want to fix it, the question remains of how to do that without breaking tons of code - though since AFAIK, Andrei is still in favor of auto-decoding, we'd have a hard time going forward with plans to get rid of it even if we had come up with a good way of doing so. But I would love it if we could get rid of auto-decoding and clean up string handling in D.
>
> - Jonathan M Davis

Why not just try it in a separate test release? Only then can we know to what extent it actually breaks code, and what remedies we could come up with.

On Thursday, 12 May 2016 at 23:16:23 UTC, H. S. Teoh wrote:
> Therefore, autodecoding actually only produces intuitively correct results when your string has a 1-to-1 correspondence between grapheme and code point. In general, this is only true for a small subset of languages, mainly a few common European languages and a handful of others. It doesn't work for Korean, and doesn't work for any language that uses combining diacritics or other modifiers. You need byGrapheme to have the correct results.

In fact, even most European languages are affected if NFD normalization is used, which is the default on MacOS X.

And this is actually the main problem with it: It was introduced to make unicode string handling correct. Well, it doesn't, therefore it has no justification.
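
To make the NFC/NFD point concrete, here is a minimal sketch in D. The helper names (`countCodeUnits`, `countCodePoints`, `countGraphemes`) are made up for illustration; the Phobos calls are `std.utf.byCodeUnit`, `std.uni.byGrapheme`, `std.uni.normalize` and `std.range.walkLength`. It counts the same visible character, "é", in its precomposed and decomposed forms.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFD;
import std.utf : byCodeUnit;

// Hypothetical helpers, named here only for the example.
size_t countCodeUnits(string s)  { return s.byCodeUnit.walkLength; }
size_t countCodePoints(string s) { return s.walkLength; } // relies on autodecoding
size_t countGraphemes(string s)  { return s.byGrapheme.walkLength; }

unittest
{
    string nfc = "\u00E9";   // precomposed "é" (one code point)
    string nfd = "e\u0301";  // "e" followed by a combining acute accent

    // The UTF-8 encodings differ in length.
    assert(countCodeUnits(nfc) == 2);
    assert(countCodeUnits(nfd) == 3);

    // Autodecoding iterates code points, so the two forms disagree.
    assert(countCodePoints(nfc) == 1);
    assert(countCodePoints(nfd) == 2);

    // Only grapheme iteration sees one user-perceived character in both.
    assert(countGraphemes(nfc) == 1);
    assert(countGraphemes(nfd) == 1);

    // NFD normalization turns the first form into the second.
    assert(normalize!NFD(nfc) == nfd);
}
```

So a code-point count matches what a reader would call "one character" only for the precomposed form, which seems to be the situation described above.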

May 13, 2016

Re: The Case Against Autodecode

Posted by Marc Schütz
in reply to Jonathan M Davis

On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
> Ideally, algorithms would be Unicode aware as appropriate, but the default would be to operate on code units with wrappers to handle decoding by code point or grapheme. Then it's easy to write fast code while still allowing for full correctness. Granted, it's not necessarily easy to get correct code that way, but anyone who wants full correctness without caring about efficiency can just use ranges of graphemes. Ranges of code points are rare regardless.

char[], wchar[] etc. can simply be made non-ranges, so that the user has to choose between .byCodePoint, .byCodeUnit (or .representation as it already exists), .byGrapheme, or even higher-level units like .byLine or .byWord. Ranges of char, wchar however stay as they are today. That way it's harder to accidentally get it wrong.
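
For reference, a rough sketch of what choosing the unit explicitly looks like with today's Phobos. It assumes `std.utf.byDchar` as the code-point view (as far as I know, `std.uni.byCodePoint` is meant for ranges of graphemes rather than raw strings); the sample string is arbitrary.

```d
import std.range : walkLength;
import std.range.primitives : ElementType;
import std.string : representation;
import std.traits : Unqual;
import std.uni : byGrapheme, Grapheme;
import std.utf : byCodeUnit, byDchar;

unittest
{
    string s = "na\u00EFve"; // "naïve" with a precomposed "ï"

    // Today a bare string is a range of dchar (autodecoding).
    static assert(is(ElementType!string == dchar));

    // Each explicit view states its unit in the type.
    static assert(is(Unqual!(ElementType!(typeof(s.byCodeUnit))) == char));
    static assert(is(Unqual!(ElementType!(typeof(s.byDchar))) == dchar));
    static assert(is(ElementType!(typeof(s.byGrapheme)) == Grapheme));
    static assert(is(typeof(s.representation) == immutable(ubyte)[]));

    assert(s.byCodeUnit.walkLength == 6); // "ï" takes two UTF-8 code units
    assert(s.representation.length == 6);
    assert(s.byDchar.walkLength == 5);
    assert(s.byGrapheme.walkLength == 5);
}
```

With the proposal, passing `s` itself to a range algorithm would not compile, and one of these views would have to be spelled out at the call site.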

>
> Based on what I've seen in previous conversations on auto-decoding over the past few years (be it in the newsgroup, on github, or at dconf), most of the core devs think that auto-decoding was a major blunder that we continue to pay for. But unfortunately, even if we all agree that it was a huge mistake and want to fix it, the question remains of how to do that without breaking tons of code - though since AFAIK, Andrei is still in favor of auto-decoding, we'd have a hard time going forward with plans to get rid of it even if we had come up with a good way of doing so. But I would love it if we could get rid of auto-decoding and clean up string handling in D.

There is a simple deprecation path that's already been suggested. `isInputRange` and friends can output a helpful deprecation warning when they're called with a range that currently triggers auto-decoding.

On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:
> If you're serious about removing auto-decoding, which I think you and others have shown has merits, you have to have THE SIMPLEST migration path ever, or you will kill D. I'm talking a simple press of a button.

char[] is always going to be unsafe for UTF-8. I don't think we can remove it or auto-decoding, only discourage use of it. We need a String struct IMO, without length or indexing. Its front can do autodecoding, and it has a ubyte[] raw() property too. (Possibly the byte length of front can be cached for use in popFront, assuming it was faster). This would be a gradual transition.
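
A minimal sketch of what such a struct might look like, under a few assumptions not stated in the post: the name `String`, storage as a plain `string`, `raw()` returning `immutable(ubyte)[]`, and eager decoding of `front` so that its byte length is cached for `popFront`. It is only an illustration of the idea, not a proposed Phobos type.

```d
import std.range.primitives : isInputRange;
import std.utf : decode;

// Hypothetical type: an input range of code points with no length or indexing.
struct String
{
    private string data;      // UTF-8 code units not yet consumed
    private dchar current;    // cached decoded front
    private size_t frontLen;  // cached byte length of the front code point

    this(string s)
    {
        data = s;
        prime();
    }

    @property bool empty() const { return data.length == 0; }

    @property dchar front() const
    {
        assert(!empty, "front called on an empty String");
        return current;
    }

    void popFront()
    {
        assert(!empty, "popFront called on an empty String");
        data = data[frontLen .. $]; // reuse the cached length, no second decode
        prime();
    }

    // Undecoded view of the remaining bytes.
    @property immutable(ubyte)[] raw() const
    {
        return cast(immutable(ubyte)[]) data;
    }

    // Decodes the next code point; std.utf.decode throws UTFException on invalid input.
    private void prime()
    {
        frontLen = 0;
        if (data.length == 0)
            return;
        size_t i = 0;
        current = decode(data, i);
        frontLen = i;
    }
}

static assert(isInputRange!String);
```

Because the type has no `length` and no `opIndex`, code that wants byte-level access has to go through `raw()`, which is the "harder to misuse" property the post seems to be after.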

On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
> IIRC, Andrei talked in TDPL about how Java's choice to go with UTF-16 was worse than the choice to go with UTF-8, because it was correct in many more cases

UTF-16 was a migration from UCS-2, and UCS-2 was superior at the time.

On 5/13/2016 2:12 AM, Chris wrote:
> If autodecode is killed, could we have a test version asap? I'd be willing to test my programs with autodecode turned off and see what happens. Others should do likewise and we could come up with a transition strategy based on what happened.

You can avoid autodecode by using .byChar

On 5/12/2016 11:50 PM, Bill Hicks wrote:
> And I get called a troll and other names when I list half a dozen things wrong with D, my posts get removed/censored, etc, all because I try to inform people not to waste time with D because it's a broken and failed language.

Posts that engage in personal attacks and bring up personal issues about other forum members get removed. You're welcome to post here in a reasonably professional manner.

On 5/13/2016 3:43 AM, Marc Schütz wrote:
> On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
>> 7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autodecode does not play with that.
>
> This just means that filenames mustn't be represented as strings; it's unrelated to auto decoding.

It means much more than that; filenames are just an example.

I recently fixed MicroEmacs (my text editor) to assume the source is UTF-8, and display Unicode characters. But it still needs to work with dirty UTF-8 without throwing exceptions, modifying the text in-place, or other tantrums.
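
As an illustration of the "dirty UTF-8" problem, here is a small sketch with two hypothetical helpers (`autodecodingThrows` and `countCodeUnits` are names invented for this example). It relies on the autodecoding `front` from `std.range.primitives` throwing `std.utf.UTFException` on an invalid sequence, while `std.utf.byCodeUnit` passes the bytes through untouched.

```d
import std.range.primitives : empty, front, popFront;
import std.utf : byCodeUnit, UTFException;

// Returns true if walking `s` through the autodecoding range primitives throws.
bool autodecodingThrows(string s)
{
    try
    {
        for (auto r = s; !r.empty; r.popFront())
        {
            dchar c = r.front; // decodes one code point; throws on invalid UTF-8
        }
        return false;
    }
    catch (UTFException)
    {
        return true;
    }
}

// Walks the raw code units; validity is never checked and nothing is modified.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s.byCodeUnit)
        ++n;
    return n;
}

unittest
{
    // 0xFF can never appear in well-formed UTF-8.
    immutable(ubyte)[] bytes = [0x61, 0xFF, 0x62];
    auto dirty = cast(string) bytes;

    assert(!autodecodingThrows("abc"));
    assert(autodecodingThrows(dirty));
    assert(countCodeUnits(dirty) == 3);
}
```

An editor that has to survive arbitrary input would presumably stay on the code-unit side and decode only for display.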

On Friday, 13 May 2016 at 13:17:44 UTC, Walter Bright wrote:
> On 5/13/2016 2:12 AM, Chris wrote:
>> If autodecode is killed, could we have a test version asap? I'd be willing to test my programs with autodecode turned off and see what happens. Others should do likewise and we could come up with a transition strategy based on what happened.
>
> You can avoid autodecode by using .byChar

Hm. It would be difficult to make sure that my whole code base doesn't do something, somewhere, that triggers auto decode.

PS: Why do I get a "StopForumSpam error" every time I post today? Has anyone else experienced the same problem?

"StopForumSpam error: Socket error: Lookup error: getaddrinfo error: Name or service not known. Please solve a CAPTCHA to continue."
