The Case Against Autodecode
May 12, 2016
Walter Bright
May 12, 2016
On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
> I am as unclear about the problems of autodecoding as I am about the necessity
> to remove curl. Whenever I ask I hear some arguments that work well emotionally
> but are scant on reason and engineering. Maybe it's time to rehash them? I just
> did so about curl, no solid argument seemed to come together. I'd be curious of
> a crisp list of grievances about autodecoding. -- Andrei

Here are some that are not matters of opinion.

1. Ranges of characters do not autodecode, but arrays of characters do. This is a glaring inconsistency.

2. Every time one wants an algorithm to work with both strings and ranges, you wind up special casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays.

3. Wrapping an array in a struct with an alias this to an array turns off autodecoding, another special case.

4. Autodecoding is slow and has no place in high speed string processing.

5. Very few algorithms require decoding.

6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow.

7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autodecode does not play well with that.

8. In my work with UTF-8 streams, dealing with autodecode has caused me considerable extra work every time. A convenient timesaver it ain't.

9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.

10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.

11. Indexing an array produces different results than autodecoding, another glaring special case.
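
Points 1, 10 and 11 can be checked directly. What follows is only a minimal sketch, assuming a Phobos build with autodecoding enabled (the default); the sample string and the helper name `frontVsIndex` are made up for illustration.

```d
import std.range.primitives : ElementType, front, hasLength, isRandomAccessRange;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// Points 1 and 11: indexing a string yields a UTF-8 code unit (char),
// while the range primitive front decodes a whole code point (dchar).
static assert(is(Unqual!(typeof(string.init[0])) == char));
static assert(is(ElementType!string == dchar));

// Point 10: seen as a range, an autodecoded string is not random access
// and does not report a length.
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);

// byCodeUnit opts out of decoding and restores random access.
alias CodeUnits = typeof("".byCodeUnit);
static assert(is(Unqual!(ElementType!CodeUnits) == char));
static assert(isRandomAccessRange!CodeUnits);

// Hypothetical helper: true when s[0] and s.front disagree.
bool frontVsIndex(string s)
{
    return s[0] != s.front;
}

void main()
{
    string s = "\u00E9t\u00E9";   // U+00E9 takes two code units in UTF-8
    assert(s.length == 5);        // 2 + 1 + 2 code units
    assert(s[0] == 0xC3);         // first code unit of U+00E9
    assert(s.front == 0xE9);      // decoded code point
    assert(frontVsIndex(s));
    assert(!frontVsIndex("abc")); // pure ASCII hides the difference
}
```

The last assertion is why this tends to slip through unittests: with ASCII-only test data both views agree.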
May 12, 2016
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
> > I am as unclear about the problems of autodecoding as I am about the necessity
> > to remove curl. Whenever I ask I hear some arguments that work well emotionally
> > but are scant on reason and engineering. Maybe it's time to rehash them? I just
> > did so about curl, no solid argument seemed to come together. I'd be curious of
> > a crisp list of grievances about autodecoding. -- Andrei
>
> Here are some that are not matters of opinion.
>
> 1. Ranges of characters do not autodecode, but arrays of characters do. This is a glaring inconsistency.
>
> 2. Every time one wants an algorithm to work with both strings and ranges, you wind up special casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays.
>
> 3. Wrapping an array in a struct with an alias this to an array turns off autodecoding, another special case.
>
> 4. Autodecoding is slow and has no place in high speed string processing.
>
> 5. Very few algorithms require decoding.
>
> 6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow.
>
> 7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autodecode does not play well with that.
>
> 8. In my work with UTF-8 streams, dealing with autodecode has caused me considerable extra work every time. A convenient timesaver it ain't.
>
> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.
>
> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.
>
> 11. Indexing an array produces different results than autodecoding, another glaring special case.

12. The result of autodecoding, a range of Unicode code points, is rarely actually useful, and code that relies on autodecoding is rarely actually, universally correct. Graphemes are occasionally useful for a subset of scripts, and a subset of that subset has all graphemes mapped to single code points, but this only applies to some scripts/languages.

In the majority of cases, autodecoding provides only the illusion of correctness.

May 12, 2016
On Thu, May 12, 2016 at 08:24:23PM +0000, Vladimir Panteleev via Digitalmars-d wrote:
> On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
[...]
> >1. Ranges of characters do not autodecode, but arrays of characters do.  This is a glaring inconsistency.
> >
> >2. Every time one wants an algorithm to work with both strings and ranges, you wind up special casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays.

Example of string special-casing leading to bugs:

	https://issues.dlang.org/show_bug.cgi?id=15972

This particular issue highlights the problem quite well: one would hardly expect '#'.repeat(i) to return anything but a range of char. After all, how could a single char need to be "auto-decoded" to a dchar? Unfortunately, due to Phobos algorithms assuming autodecoding, the resulting range of char is not recognized as "string-like" data by .joiner, thus causing a compile error.

The workaround (as described in the bug comments) also illustrates the inconsistency in handling ranges of char vs. ranges of dchar: writing .joiner("\n".byCodeUnit) will actually fix the problem, basically by explicitly disabling autodecoding.

We can, of course, fix .joiner to recognize this case and handle it correctly, but the fact that using .byCodeUnit works perfectly proves that autodecoding is not necessary here. Which begs the question, why have autodecoding at all, and then require .byCodeUnit to work around issues it causes?
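
For reference, the workaround looks roughly like this. It is a sketch in the spirit of the linked issue, not a copy of its test case; the helper name `triangle` and the sample data are assumptions.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : joiner, map;
import std.range : iota, repeat;
import std.utf : byCodeUnit;

// Hypothetical helper: rows of 1..n '#' characters separated by
// newlines, built lazily as a range of char.
auto triangle(int n)
{
    return iota(1, n + 1)
        .map!(i => '#'.repeat(i))
        // A plain "\n" is a range of dchar under autodecoding, and dchar
        // does not convert to the char elements of the rows. byCodeUnit
        // makes the separator a range of char as well.
        .joiner("\n".byCodeUnit);
}

void main()
{
    assert(triangle(3).equal("#\n##\n###".byCodeUnit));
}
```

Nothing in this pipeline ever needs a decoded code point; every element stays a char from start to finish.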


T

-- 
It is widely believed that reinventing the wheel is a waste of time; but I disagree: without wheel reinventers, we would still be stuck with wooden horse-cart wheels.
May 12, 2016
On Thu, May 12, 2016 at 08:24:23PM +0000, Vladimir Panteleev via Digitalmars-d wrote: [...]
> 12. The result of autodecoding, a range of Unicode code points, is rarely actually useful, and code that relies on autodecoding is rarely actually, universally correct. Graphemes are occasionally useful for a subset of scripts, and a subset of that subset has all graphemes mapped to single code points, but this only applies to some scripts/languages.
> 
> In the majority of cases, autodecoding provides only the illusion of correctness.

A range of Unicode code points is not the same as a range of graphemes (a grapheme is what a layperson would consider to be a "character"). Autodecoding returns dchar, a code point, rather than a grapheme.

Therefore, autodecoding actually only produces intuitively correct results when your string has a 1-to-1 correspondence between grapheme and code point. In general, this is only true for a small subset of languages, mainly a few common European languages and a handful of others.  It doesn't work for Korean, and doesn't work for any language that uses combining diacritics or other modifiers.  You need byGrapheme to have the correct results.

So basically autodecoding, as currently implemented, fails to meet its goal of segmenting a string by "character" (i.e., grapheme), and yet imposes a performance penalty that is difficult to "turn off" (you have to sprinkle your code with byCodeUnit everywhere, and many Phobos algorithms just return a range of dchar anyway). Not to mention that a good number of string algorithms don't actually *need* autodecoding at all.

(One could make a case for auto-segmenting by grapheme, but that's even worse in terms of performance (it requires a non-trivial Unicode algorithm involving lookup tables, and may need memory allocation). At the end of the day, we're back to square one: iterate by code unit, and explicitly ask for byGrapheme where necessary.)
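
To make the three levels concrete, here is a small sketch that counts the same string by code unit, by code point (what autodecoding gives you) and by grapheme. The `Counts` struct and `counts` helper are hypothetical names, introduced only for this illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Hypothetical result type: one string, counted three ways.
struct Counts
{
    size_t codeUnits;
    size_t codePoints;
    size_t graphemes;
}

Counts counts(string s)
{
    return Counts(
        s.byCodeUnit.walkLength, // UTF-8 code units
        s.walkLength,            // code points, i.e. what autodecoding yields
        s.byGrapheme.walkLength  // user-perceived characters
    );
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    assert(counts("e\u0301") == Counts(3, 2, 1));
    // Precomposed U+00E9: code points and graphemes happen to agree.
    assert(counts("\u00E9") == Counts(2, 1, 1));
}
```

Only the byGrapheme count matches what a reader sees in the first case; the autodecoded count is right only when the text happens to be precomposed.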


T

-- 
"I'm running Windows '98." "Yes." "My computer isn't working now." "Yes, you already said that." -- User-Friendly
May 12, 2016
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
> > I am as unclear about the problems of autodecoding as I am about the necessity
> > to remove curl. Whenever I ask I hear some arguments that work well emotionally
> > but are scant on reason and engineering. Maybe it's time to rehash them? I just
> > did so about curl, no solid argument seemed to come together. I'd be curious of
> > a crisp list of grievances about autodecoding. -- Andrei
>
> Here are some that are not matters of opinion.
>
> 1. Ranges of characters do not autodecode, but arrays of characters do. This is a glaring inconsistency.
>
> 2. Every time one wants an algorithm to work with both strings and ranges, you wind up special casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays.
>
> 3. Wrapping an array in a struct with an alias this to an array turns off autodecoding, another special case.
>
> 4. Autodecoding is slow and has no place in high speed string processing.
>
> 5. Very few algorithms require decoding.
>
> 6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow.
>
> 7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autodecode does not play well with that.
>
> 8. In my work with UTF-8 streams, dealing with autodecode has caused me considerable extra work every time. A convenient timesaver it ain't.
>
> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.
>
> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.
>
> 11. Indexing an array produces different results than autodecoding, another glaring special case.

For me it is not about autodecoding. I would like to have something like a String type which does that. But what really pisses me off is that the current string type is an alias to immutable(char)[] (so it is not usable at all). This is a real problem for me, because it makes working on arrays of chars almost impossible.

Even char[] is unusable. So I am forced to use ubyte[], but that is not really an array of chars.

ATM D does not support even full Unicode strings, or even a basic array of chars :(.

I hope this will be fixed one day, so I could start to expand D in Czech; until then I am unable to do that.
May 12, 2016
On 5/12/2016 4:23 PM, Daniel Kozak wrote:
> But what really pisses me off is that the current string type is an
> alias to immutable(char)[] (so it is not usable at all). This is a real problem
> for me, because it makes working on arrays of chars almost impossible.
>
> Even char[] is unusable. So I am forced to use ubyte[], but that is not really
> an array of chars.
>
> ATM D does not support even full Unicode strings, or even a basic array of chars :(.
>
> I hope this will be fixed one day, so I could start to expand D in Czech; until
> then I am unable to do that.

I can't find any actionable information in this.
May 13, 2016
Am Thu, 12 May 2016 13:15:45 -0700
schrieb Walter Bright <newshound2@digitalmars.com>:

> 7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames.

More precisely they are byte strings with '/' reserved to separate path elements. While on an out-of-the-box Linux nowadays everything is typically presented as UTF-8, there are still die-hards that use code pages, corrupted file systems or incorrectly bound network shares displaying with the wrong charset. It is safer to work with them as a ubyte[] and that also bypasses auto-decoding.

I'd like 'string' to mean valid UTF-8 in D as far as the encoding goes. A filename should not be a 'string'.
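
A rough sketch of what that looks like in practice: the path is split on '/' as raw bytes and never interpreted as text. The helper name `pathElements` and the sample bytes are made up for illustration.

```d
import std.algorithm.iteration : splitter;
import std.array : array;
import std.string : representation;

// Hypothetical helper: split a raw POSIX path on '/' without ever
// decoding the bytes.
const(ubyte)[][] pathElements(const(ubyte)[] path)
{
    return path.splitter(cast(ubyte) '/').array;
}

void main()
{
    // "/tmp/<0xE9>/f": the lone byte 0xE9 is an accented e in Latin-1,
    // but it is not valid UTF-8 on its own.
    const(ubyte)[] path = [0x2F, 0x74, 0x6D, 0x70, 0x2F, 0xE9, 0x2F, 0x66];
    auto parts = pathElements(path);

    assert(parts.length == 4); // "", "tmp", <0xE9>, "f"
    assert(parts[1] == "tmp".representation);
    assert(parts[2].length == 1 && parts[2][0] == 0xE9);
}
```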

-- 
Marco

May 13, 2016
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
> Here are some that are not matters of opinion.

If you're serious about removing auto-decoding, which I think you and others have shown has merits, you have to have THE SIMPLEST migration path ever, or you will kill D. I'm talking a simple press of a button.

I'm not exaggerating here. Python, a language which was much more popular than D at the time, came out with two versions in 2008: Python 2.6 which had numerous unicode problems, and Python 3.0 which fixed those problems. Almost eight years later, and Python 2 is STILL the more popular version despite Py3 having five major point releases since and Python 2 only getting security patches. Think the tango vs phobos problem, only a little worse.

D is much less popular now than Python was at the time, and Python 2's problems were more straightforward than the auto-decoding problem. You'll need a very clear migration path, years-long deprecations, and automatic tools in order to make the transition work, or else D's usage will be permanently damaged.
May 12, 2016
On 5/12/2016 4:52 PM, Marco Leise wrote:
> I'd like 'string' to mean valid UTF-8 in D as far as the
> encoding goes. A filename should not be a 'string'.

I would have agreed with you in the past, but more and more it just doesn't seem practical. UTF-8 is dirty in the real world, and D code will have to deal with it.

By dealing with it I mean not crashing, throwing exceptions, or throwing other tantrums when encountering it. Unless it matters, it should pass the invalid encodings along unmolested and without comment. For example, if you're searching for 'a' in a UTF-8 string, what does it matter if there are invalid encodings in that string?

For filenames/paths in particular, having redone the file/path code in Phobos, I realized that invalid encodings are completely immaterial.
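
The 'a' search can be sketched as follows, assuming the default behaviour where decoding an invalid sequence throws. The helper name `indexOfA` and the sample bytes are hypothetical.

```d
import std.algorithm.searching : countUntil;
import std.exception : assertThrown;
import std.range.primitives : front, popFront;
import std.utf : byCodeUnit, UTFException;

// Hypothetical helper: index of the first 'a' code unit, or -1 if none.
// It compares code units only, so invalid sequences are passed over
// without being decoded.
ptrdiff_t indexOfA(const(char)[] s)
{
    return s.byCodeUnit.countUntil('a');
}

void main()
{
    // "x", a stray 0xFF byte (never valid in UTF-8), then "ya".
    immutable(ubyte)[] raw = [0x78, 0xFF, 0x79, 0x61];
    auto dirty = cast(string) raw;

    assert(indexOfA(dirty) == 3);

    // Decoding the same data fails as soon as it reaches the bad byte.
    auto r = dirty;
    r.popFront();                       // step over 'x'
    assertThrown!UTFException(r.front); // decoding 0xFF throws
}
```

The code-unit search finds the 'a' and leaves the 0xFF byte exactly where it was.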

May 13, 2016
On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:
> I'm not exaggerating here. Python, a language which was much more popular than D at the time, came out with two versions in 2008: Python 2.6 which had numerous unicode problems, and Python 3.0 which fixed those problems. Almost eight years later, and Python 2 is STILL the more popular version despite Py3 having five major point releases since and Python 2 only getting security patches. Think the tango vs phobos problem, only a little worse.

To hammer this home a little more, Python 3 had a really useful library to abstract most of the differences automatically. But despite that, here is a list of the top 200 Python packages in 2011, three years after the fork, and whether they supported Python 3: https://web.archive.org/web/20110215214547/http://python3wos.appspot.com/

This is _three years_ later, and only 18 out of the top 200 supported Python 3.

And here it is now, eight years later, at 174 out of 200: https://python3wos.appspot.com/