May 12, 2016 The Case Against Autodecode
On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
> I am as unclear about the problems of autodecoding as I am about the necessity to remove curl. Whenever I ask I hear some arguments that work well emotionally but are scant on reason and engineering. Maybe it's time to rehash them? I just did so about curl, no solid argument seemed to come together. I'd be curious of a crisp list of grievances about autodecoding. -- Andrei

Here are some that are not matters of opinion.

1. Ranges of characters do not autodecode, but arrays of characters do. This is a glaring inconsistency.

2. Every time one wants an algorithm to work with both strings and ranges, you wind up special-casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special-case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays.

3. Wrapping an array in a struct with an alias this to an array turns off autodecoding, another special case.

4. Autodecoding is slow and has no place in high-speed string processing.

5. Very few algorithms require decoding.

6. Autodecoding has two choices when encountering invalid code units: throw, or produce an error dchar. Currently it throws, meaning no algorithm using autodecoding can be made nothrow.

7. Autodecoding cannot be used with Unicode paths/filenames, because it is legal (at least on Linux) to have invalid UTF-8 in filenames. It turns out that in the wild pure Unicode is not universal; there's lots of dirty Unicode that should remain unmolested, and autodecoding does not play well with that.

8. In my work with UTF-8 streams, dealing with autodecoding has caused me considerable extra work every time. A convenient timesaver it ain't.

9. Autodecoding cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecoding is there.

10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.

11. Indexing an array produces different results than autodecoding, another glaring special case.
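The distinction behind points 10 and 11 — indexing by code unit vs. iterating by decoded code point — is a property of UTF-8 itself, not of D, so it can be sketched in Python over a UTF-8 byte string (a rough analogue of D's char[]; variable names here are illustrative only):

```python
# UTF-8 bytes are the code units; decoding yields code points (D's dchar).
s = "héllo"                      # 5 code points
b = s.encode("utf-8")            # 6 UTF-8 code units: 'é' takes two bytes

# Indexing the byte array gives a code unit, not a character:
assert len(b) == 6
assert b[1] == 0xC3              # first byte of the two-byte encoding of 'é'

# Decoding gives code points, so the same index means something different:
assert s[1] == "é"
assert ord(s[1]) == 0xE9

# Random access by code point over the raw bytes would require a linear
# scan from the start, which is why a decoded view of a char array cannot
# itself be a random-access range.
```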
May 12, 2016 Re: The Case Against Autodecode
Posted in reply to Walter Bright | On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
> [Walter's list of grievances 1-11, quoted in full, snipped]
12. The result of autodecoding, a range of Unicode code points, is rarely actually useful, and code that relies on autodecoding is rarely actually universally correct. Graphemes are occasionally useful for a subset of scripts, and a subset of that subset has all graphemes mapped to single code points, but this only applies to some scripts/languages.
In the majority of cases, autodecoding provides only the illusion of correctness.
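Point 12's distinction between code points and graphemes can be demonstrated outside D as well; here is a small Python sketch (Python's str iterates by code point, much like an autodecoded D string):

```python
import unicodedata

# One user-perceived character ("é"), two canonically equivalent encodings:
composed   = "\u00e9"        # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"       # 'e' + U+0301 COMBINING ACUTE ACCENT

# Code-point iteration sees them differently, even though a reader sees
# the same grapheme in both cases:
assert len(composed) == 1
assert len(decomposed) == 2
assert composed != decomposed

# Normalization maps one to the other, confirming they are "the same text":
assert unicodedata.normalize("NFC", decomposed) == composed

# So "one code point == one character" holds only for strings that happen
# to be fully composed -- the subset-of-a-subset that point 12 describes.
```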
May 12, 2016 Re: The Case Against Autodecode
Posted in reply to Vladimir Panteleev | On Thu, May 12, 2016 at 08:24:23PM +0000, Vladimir Panteleev via Digitalmars-d wrote:
> On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
[...]
> > 1. Ranges of characters do not autodecode, but arrays of characters do. This is a glaring inconsistency.
> >
> > 2. Every time one wants an algorithm to work with both strings and ranges, you wind up special casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays.

Example of string special-casing leading to bugs:

https://issues.dlang.org/show_bug.cgi?id=15972

This particular issue highlights the problem quite well: one would hardly expect '#'.repeat(i) to return anything but a range of char. After all, how could a single char need to be "auto-decoded" to a dchar? Unfortunately, because Phobos algorithms assume autodecoding, the resulting range of char is not recognized as "string-like" data by .joiner, causing a compile error.

The workaround (described in the bug comments) also illustrates the inconsistency in handling ranges of char vs. ranges of dchar: writing .joiner("\n".byCodeUnit) actually fixes the problem, basically by explicitly disabling autodecoding. We can, of course, fix .joiner to recognize this case and handle it correctly, but the fact that using .byCodeUnit works perfectly proves that autodecoding is not necessary here. Which begs the question: why have autodecoding at all, and then require .byCodeUnit to work around the issues it causes?

T

--
It is widely believed that reinventing the wheel is a waste of time; but I disagree: without wheel reinventers, we would still be stuck with wooden horse-cart wheels.
May 12, 2016 Re: The Case Against Autodecode
Posted in reply to Vladimir Panteleev | On Thu, May 12, 2016 at 08:24:23PM +0000, Vladimir Panteleev via Digitalmars-d wrote:
[...]
> 12. The result of autodecoding, a range of Unicode code points, is rarely actually useful, and code that relies on autodecoding is rarely actually universally correct. Graphemes are occasionally useful for a subset of scripts, and a subset of that subset has all graphemes mapped to single code points, but this only applies to some scripts/languages.
>
> In the majority of cases, autodecoding provides only the illusion of correctness.

A range of Unicode code points is not the same as a range of graphemes (a grapheme is what a layperson would consider to be a "character"). Autodecoding returns dchar, a code point, rather than a grapheme. Therefore, autodecoding actually produces intuitively correct results only when your string has a 1-to-1 correspondence between grapheme and code point. In general, this is only true for a small subset of languages, mainly a few common European languages and a handful of others. It doesn't work for Korean, and it doesn't work for any language that uses combining diacritics or other modifiers. You need byGrapheme to get correct results.

So basically autodecoding, as currently implemented, fails to meet its goal of segmenting a string by "character" (i.e., grapheme), and yet imposes a performance penalty that is difficult to "turn off" (you have to sprinkle your code with byCodeUnit everywhere, and many Phobos algorithms just return a range of dchar anyway). Not to mention that a good number of string algorithms don't actually *need* autodecoding at all.

(One could make a case for auto-segmenting by grapheme, but that's even worse in terms of performance: it requires a non-trivial Unicode algorithm involving lookup tables, and may need memory allocation. At the end of the day, we're back to square one: iterate by code unit, and explicitly ask for byGrapheme where necessary.)

T

--
"I'm running Windows '98." "Yes." "My computer isn't working now." "Yes, you already said that." -- User-Friendly
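The combining-diacritics case above can be made concrete. The Python sketch below counts "clusters" with a deliberately naive rule (a new cluster starts at every code point whose canonical combining class is zero); real grapheme segmentation follows UAX #29, which is what D's byGrapheme implements, and handles far more (Hangul jamo, ZWJ emoji sequences, etc.):

```python
import unicodedata

# "e" + COMBINING ACUTE, then "o" + COMBINING DIAERESIS:
# 2 graphemes for a reader, but 4 code points.
s = "e\u0301o\u0308"

# Code-point count -- what an autodecoding-style iteration would see:
assert len(s) == 4

# Naive cluster count: combining marks (nonzero combining class) attach
# to the preceding base character instead of starting a new cluster.
# NOTE: illustration only; this is NOT full UAX #29 segmentation.
clusters = sum(1 for ch in s if unicodedata.combining(ch) == 0)
assert clusters == 2
```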
May 12, 2016 Re: The Case Against Autodecode
Posted in reply to Walter Bright | On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
> [Walter's list of grievances 1-11, quoted in full, snipped]
For me it is not about autodecoding. I would like to have a String type which does that. But what I am really pissed off about is that the current string type is an alias to immutable(char)[] (so it is not usable at all). This is a real problem for me, because it makes working on arrays of chars almost impossible.

Even char[] is unusable, so I am forced to use ubyte[], but that is really not an array of chars.

ATM D does not support full Unicode strings or even a basic array of chars :(.

I hope this will be fixed one day, so I could start to promote D in Czech; until then I am unable to do that.
May 12, 2016 Re: The Case Against Autodecode
Posted in reply to Daniel Kozak | On 5/12/2016 4:23 PM, Daniel Kozak wrote:
> But what I am really pissed off about is that the current string type is an alias to immutable(char)[] (so it is not usable at all). This is a real problem for me, because it makes working on arrays of chars almost impossible.
>
> Even char[] is unusable, so I am forced to use ubyte[], but that is really not an array of chars.
>
> ATM D does not support full Unicode strings or even a basic array of chars :(.
>
> I hope this will be fixed one day, so I could start to promote D in Czech; until then I am unable to do that.

I can't find any actionable information in this.
May 13, 2016 Re: The Case Against Autodecode
Posted in reply to Walter Bright | On Thu, 12 May 2016 13:15:45 -0700, Walter Bright <newshound2@digitalmars.com> wrote:

> 7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames.

More precisely, they are byte strings, with '/' reserved to separate path elements. While on an out-of-the-box Linux system nowadays everything is typically presented as UTF-8, there are still die-hards who use code pages, corrupted file systems, or incorrectly mounted network shares displaying with the wrong charset. It is safer to work with them as a ubyte[], which also bypasses autodecoding.

I'd like 'string' to mean valid UTF-8 in D as far as the encoding goes. A filename should not be a 'string'.

-- Marco
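Marco's point — a Linux filename is a byte string that need not be valid UTF-8 — is easy to demonstrate in Python, which hit the same design problem and solved it with the surrogateescape error handler (PEP 383) so that arbitrary filename bytes survive a decode/encode round trip:

```python
# On Linux, a filename is just bytes (only '/' and NUL are reserved); it
# need not be valid UTF-8. The byte 0xFF can never occur in valid UTF-8:
name = b"report\xff.txt"

# Decoding it strictly as UTF-8 fails -- which is exactly why an API that
# force-decodes filenames cannot represent every legal file on disk:
try:
    name.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# Python's escape hatch: smuggle the invalid bytes through as lone
# surrogates, so the original byte string round-trips unmolested.
text = name.decode("utf-8", errors="surrogateescape")
assert text.encode("utf-8", errors="surrogateescape") == name
```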
May 13, 2016 Re: The Case Against Autodecode
Posted in reply to Walter Bright | On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
> Here are some that are not matters of opinion.
If you're serious about removing auto-decoding, which I think you and others have shown has merit, you have to have THE SIMPLEST migration path ever, or you will kill D. I'm talking a simple press of a button.

I'm not exaggerating here. Python, a language which was much more popular than D at the time, split into two incompatible lines in 2008: Python 2, which had numerous Unicode problems, and Python 3.0, which fixed those problems. Almost eight years later, Python 2 is STILL the more popular version, despite Py3 having had five major point releases since and Python 2 getting only security patches. Think the tango vs phobos problem, only a little worse.

D is much less popular now than Python was at the time, and the Python 2 problems were more straightforward than the auto-decoding problem. You'll need a very clear migration path, years-long deprecations, and automatic tools to make the transition work, or else D's usage will be permanently damaged.
May 12, 2016 Re: The Case Against Autodecode
Posted in reply to Marco Leise | On 5/12/2016 4:52 PM, Marco Leise wrote:
> I'd like 'string' to mean valid UTF-8 in D as far as the
> encoding goes. A filename should not be a 'string'.
I would have agreed with you in the past, but more and more it just doesn't seem practical. UTF-8 is dirty in the real world, and D code will have to deal with it.
By dealing with it I mean not crashing, throwing exceptions, or having other tantrums when encountering it. Unless it matters, code should pass the invalid encodings along unmolested and without comment. For example, if you're searching for 'a' in a UTF-8 string, what does it matter if there are invalid encodings in that string?
For filenames/paths in particular, having redone the file/path code in Phobos, I realized that invalid encodings are completely immaterial.
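Walter's search example can be checked concretely. Because UTF-8 continuation and lead bytes all have the high bit set, an ASCII byte can never appear inside a multi-byte sequence, so a byte-level search is unaffected by invalid encodings. A Python sketch (on bytes, the analogue of working without decoding):

```python
# A byte string containing invalid UTF-8 (a stray 0xFF, and a truncated
# two-byte sequence 0xC3 at the end) around the ASCII byte we want:
data = b"\xffxa y\xc3"

# Byte-level search for 'a' works regardless of the surrounding garbage:
# ASCII bytes (< 0x80) never occur inside a multi-byte UTF-8 sequence,
# so no decoding -- and no validation -- is needed.
assert data.find(b"a") == 2

# A decode-first approach, by contrast, must either throw or mangle:
try:
    data.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
assert not decoded_ok
```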
May 13, 2016 Re: The Case Against Autodecode
Posted in reply to Jack Stouffer | On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:

> I'm not exaggerating here. Python, a language which was much more popular than D at the time, split into two incompatible lines in 2008: Python 2, which had numerous Unicode problems, and Python 3.0, which fixed those problems. Almost eight years later, Python 2 is STILL the more popular version, despite Py3 having had five major point releases since and Python 2 getting only security patches. Think the tango vs phobos problem, only a little worse.

To hammer this home a little more: Python 3 had a really useful library to abstract most of the differences automatically. Despite that, here is a list of the top 200 Python packages in 2011, three years after the fork, showing whether each supported Python 3:

https://web.archive.org/web/20110215214547/http://python3wos.appspot.com/

This is _three years_ later, and only 18 of the top 200 supported Python 3. And here it is now, eight years later, at 174 out of 200:

https://python3wos.appspot.com/
Copyright © 1999-2021 by the D Language Foundation