April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #11 from Walter Bright <bugzilla@digitalmars.com> ---
https://github.com/D-Programming-Language/druntime/pull/1240

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #12 from Walter Bright <bugzilla@digitalmars.com> ---
(In reply to bearophile_hugs from comment #8)
> Another solution is to deprecate foreach iteration on strings, and require something like "foreach(c; mystring.byCharThrowing)" and similar things.

That's not a solution as I bet it breaks 50% of the programs out there.

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #13 from Walter Bright <bugzilla@digitalmars.com> ---
Vladimir, you bring up good points. I'll try to address them. First off, why do this?

1. much faster

2. string processing can be @nogc and nothrow. If you follow external discussions on the merits of D, the "D is no good because Phobos requires the GC" ALWAYS comes up, and sucks all the energy out of the conversation.

So, on to your points:

1. Replacement only happens when doing UTF decoding. Search-and-replace (S+R) doesn't have to do any conversion, and that's one of the things I want to fix in std.algorithm. The string fixes I've done in std.string avoid decoding as much as possible.
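
To illustrate the decode-avoidance idea, a minimal sketch (not the actual std.string code):

    // Searching for an ASCII character needs no decoding: well-formed
    // UTF-8 never places a byte below 0x80 inside a multi-byte sequence.
    ptrdiff_t indexOfAscii(const(char)[] haystack, char needle) nothrow @nogc
    {
        foreach (i, c; haystack)   // iterates code units, never decodes
            if (c == needle)
                return i;
        return -1;
    }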

2. Same thing. (Running normalization on passwords? What the hell?)

The replacement char thing was not invented by me; it is commonplace, as users don't like their documents being wholly rejected for one or two encoding errors.

I know that many programs try to guess the encoding of random text they get. Doing this by only reading a few characters, and assuming the rest, is a strange method if one cares about the integrity of the data.

Having to constantly re-sanitize data, at every step in the pipeline, is going to make D programs uncompetitive speed-wise.

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #14 from Vladimir Panteleev <thecybershadow@gmail.com> ---
(In reply to Walter Bright from comment #13)
> Vladimir, you bring up good points. I'll try to address them. First off, why do this?
> 
> 1. much faster

If I understand correctly, throwing an Error instead of an Exception would also solve the performance issues.
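
(For context, a minimal sketch of why that helps: D's nothrow only forbids Exception, not Error, so a decoder that reports bad UTF via an Error can still be inferred nothrow. UnicodeError and firstAscii below are made-up names for illustration, not Phobos symbols.)

    // nothrow forbids Exception but still permits Error:
    class UnicodeError : Error
    {
        this(string msg) @safe pure nothrow { super(msg); }
    }

    dchar firstAscii(const(char)[] s) nothrow
    {
        if (s.length == 0 || s[0] >= 0x80)
            throw new UnicodeError("not a single ASCII code unit");
        return s[0];
    }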

> 2. string processing can be @nogc and nothrow. If you follow external discussions on the merits of D, the "D is no good because Phobos requires the GC" ALWAYS comes up, and sucks all the energy out of the conversation.

Ditto, but the @nogc aspect can also be solved with the refcounted exceptions spec, which will fix the problem in general.

> So, on to your points:
> 
> 1. Replacement only happens when doing UTF decoding. Search-and-replace (S+R) doesn't have to do any conversion, and that's one of the things I want to fix in std.algorithm. The string fixes I've done in std.string avoid decoding as much as possible.

Inevitably, it is still very easy to accidentally use something that auto-decodes. There is no way to statically make sure that you don't (short of using a non-string type for text, which is impractical), and with this proposed change, there will be no run-time way to handle this either.

> 2. Same thing. (Running normalization on passwords? What the hell?)

I did not mean Unicode normalization - it was a joke (std.algorithm will "normalize" invalid UTF characters to the replacement character). But since .front on strings autodecodes, feeding a string to any generic range function in std.algorithm will cause auto-decoding (and thus, character substitution).

> The replacement char thing was not invented by me; it is commonplace, as users don't like their documents being wholly rejected for one or two encoding errors.

I know, I agree it's useful, but it needs to be opt-in.

> I know that many programs try to guess the encoding of random text they get. Doing this by only reading a few characters, and assuming the rest, is a strange method if one cares about the integrity of the data.

I don't see how this is relevant, sorry.

> Having to constantly re-sanitize data, at every step in the pipeline, is going to make D programs uncompetitive speed-wise.

I don't understand what you mean by this. You could say that any way to handle invalid UTF can be seen as a way of sanitizing data: there will always be a code path for what to do when invalid UTF is encountered. I would interpret "no sanitization" as not handling invalid UTF in any way (i.e. treating it in an undefined way).

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #15 from Walter Bright <bugzilla@digitalmars.com> ---
(In reply to Vladimir Panteleev from comment #14)
> If I understand correctly, throwing an Error instead of an Exception would also solve the performance issues.

It still allocates memory. But it's worth thinking about. Maybe assert()?


> Ditto, but the @nogc aspect can also be solved with the refcounted exceptions spec, which will fix the problem in general.

We'll see. That's still a ways off.


> > 2. Same thing. (Running normalization on passwords? What the hell?)
> 
> I did not mean Unicode normalization - it was a joke (std.algorithm will "normalize" invalid UTF characters to the replacement character). But since .front on strings autodecodes, feeding a string to any generic range function in std.algorithm will cause auto-decoding (and thus, character substitution).

That can be fixed as I suggested.


> > The replacement char thing was not invented by me; it is commonplace, as users don't like their documents being wholly rejected for one or two encoding errors.
> I know, I agree it's useful, but it needs to be opt-in.

Global opt-in for foreach is not feasible. However, one can add an algorithm "validate" which throws on invalid UTF, and put that at the start of a pipeline, as in:

    text.validate.A.B.C.D;
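
A minimal sketch of such a chainable check (the wrapper and its name "validated" are assumptions here; Phobos's std.utf.validate exists, but returns void and throws UTFException):

    import std.utf : validate; // Phobos: throws UTFException on invalid UTF

    // Hypothetical chainable wrapper: check the whole string up front,
    // then pass it through unchanged so the rest of the pipeline can
    // rely on well-formed UTF-8.
    string validated(string s)
    {
        validate(s);
        return s;
    }

    // text.validated.A.B.C.D;  // throws before B.C.D ever run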


> > I know that many programs try to guess the encoding of random text they get. Doing this by only reading a few characters, and assuming the rest, is a strange method if one cares about the integrity of the data.
> 
> I don't see how this is relevant, sorry.

You brought up guessing the encoding of XML text by reading the start of it: "what if it was some 8-bit encoding that only LOOKED like valid UTF-8?"

> > Having to constantly re-sanitize data, at every step in the pipeline, is going to make D programs uncompetitive speed-wise.
> 
> I don't understand what you mean by this. You could say that any way to handle invalid UTF can be seen as a way of sanitizing data: there will always be a code path for what to do when invalid UTF is encountered. I would interpret "no sanitization" as not handling invalid UTF in any way (i.e. treating it in an undefined way).

If you have a pipeline A.B.C.D and A throws on invalid UTF, then B.C.D are never executed. But if A does not throw, then B.C.D are guaranteed to be getting valid UTF, yet they still pay the penalty of the compiler assuming they can allocate memory and throw.

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #16 from Vladimir Panteleev <thecybershadow@gmail.com> ---
(In reply to Walter Bright from comment #15)
> It still allocates memory. But it's worth thinking about. Maybe assert()?

Sure.

> > I did not mean Unicode normalization - it was a joke (std.algorithm will "normalize" invalid UTF characters to the replacement character). But since .front on strings autodecodes, feeding a string to any generic range function in std.algorithm will cause auto-decoding (and thus, character substitution).
> 
> That can be fixed as I suggested.

Sorry, I'm not following. Which suggestion here will fix what in what way?

> Global opt-in for foreach is not feasible.

I agree - some libraries will expect one thing, and others another.

> However, one can add an algorithm
> "validate" which throws on invalid UTF, and put that at the start of a
> pipeline, as in:
> 
>     text.validate.A.B.C.D;

This is part of a solution. There also needs to be a way to ensure that validate was called, which is the hard part.

> You brought up guessing the encoding of XML text by reading the start of it: "what if it was some 8-bit encoding that only LOOKED like valid UTF-8?"

No, that's not what I meant.

UTF-8 and the old 8-bit encodings (ISO 8859-*, Windows-125*) both use the high bit of a byte to represent non-ASCII characters. Consider a program that expects a UTF-8 document but is actually fed one in an 8-bit encoding: it is possible (although unlikely) that the 8-bit-encoded text happens to form a valid UTF-8 stream. Thus, invalid UTF-8 can indicate a problem with the entire document, and not just with the immediate sequence of bytes.
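
A concrete sketch of the hazard (0xC3 0xA9 is "Ã©" in Windows-1252, but the single code point 'é', U+00E9, in UTF-8):

    import std.range : walkLength;
    import std.utf : validate;

    void main()
    {
        string bytes = "\xC3\xA9";     // "Ã©" if the source was Windows-1252
        validate(bytes);               // passes: also well-formed UTF-8
        assert(bytes.length == 2);     // two Windows-1252 characters...
        assert(bytes.walkLength == 1); // ...decoded as one UTF-8 code point
    }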

> If you have a pipeline A.B.C.D and A throws on invalid UTF, then B.C.D are never executed. But if A does not throw, then B.C.D are guaranteed to be getting valid UTF, yet they still pay the penalty of the compiler assuming they can allocate memory and throw.

OK, so you're saying that we can somehow automatically remove the cost of handling invalid UTF-8 if we know that the UTF-8 we're getting is valid? I don't see how this would work in practice, or how it would provide a noticeable benefit. Since the savings from removing an unexecuted code path are negligible, I assume you're talking about exception frames, but I still don't see how this applies. Could you elaborate, or is this improvement theoretical for now?

Besides, won't A's output be a range of dchar, so B, C and D will not autodecode with or without this change?

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #17 from Vladimir Panteleev <thecybershadow@gmail.com> ---
Let's see if I understand the situation correctly... let's say we have a chain:

str.a.b.c

So, str is a UTF-8 string, and a, b and c are range algorithms (they use .front/.popFront and provide .front/.popFront themselves).

If a/b/c don't throw anything themselves, the nothrow attribute will be inferred from the .front/.popFront of the range each of them consumes, right?

That means that if str.front can throw, c can't be nothrow. But if str.front is nothrow, then c CAN be nothrow.
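
A minimal sketch of that inference (a simplified stand-in, not std.algorithm's actual map):

    import std.range.primitives; // gives strings .front/.popFront/.empty

    // Being a template, this struct has its attributes inferred:
    // front is nothrow exactly when source.front is. A string's
    // autodecoding front may throw UTFException, so a chain rooted
    // in a string cannot be inferred nothrow.
    struct Map(alias fun, R)
    {
        R source;
        @property bool empty() { return source.empty; }
        @property auto front() { return fun(source.front); }
        void popFront() { source.popFront(); }
    }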

But what if we do this:

str.forceDecode.a.b.c

forceDecode doesn't use str.front - it reads str directly, code unit by code unit, and inserts replacement characters where it sees errors. This allows a, b and c to be nothrow.
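
Roughly like this (a sketch of the proposed forceDecode; only 1- and 2-byte sequences are handled, for brevity):

    // Reads code units directly and never throws; invalid sequences
    // become U+FFFD, so downstream stages can be nothrow.
    struct ForceDecode
    {
        const(char)[] s;

        @property bool empty() const nothrow { return s.length == 0; }

        @property dchar front() const nothrow
        {
            auto c = s[0];
            if (c < 0x80) return c;    // ASCII fast path
            if ((c & 0xE0) == 0xC0 && s.length >= 2 && (s[1] & 0xC0) == 0x80)
                return ((c & 0x1F) << 6) | (s[1] & 0x3F); // 2-byte sequence
            return 0xFFFD;             // replacement character
        }

        void popFront() nothrow
        {
            auto c = s[0];
            if ((c & 0xE0) == 0xC0 && s.length >= 2 && (s[1] & 0xC0) == 0x80)
                s = s[2 .. $];
            else
                s = s[1 .. $];         // ASCII, or skip one bad code unit
        }
    }

    auto forceDecode(const(char)[] s) nothrow { return ForceDecode(s); }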

Unless I'm wrong, I think this idea could work for opt-in replacement character substitution. Following the 90/10 law, it should be easy to insert "forceDecode" in the few relevant places as indicated by a profiler.

Does this proposal make sense?

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

Marc Schütz <schuetzm@gmx.net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |schuetzm@gmx.net

--- Comment #18 from Marc Schütz <schuetzm@gmx.net> ---
(In reply to Walter Bright from comment #15)
> If you have a pipeline A.B.C.D and A throws on invalid UTF, then B.C.D are never executed. But if A does not throw, then B.C.D are guaranteed to be getting valid UTF, yet they still pay the penalty of the compiler assuming they can allocate memory and throw.

When `assert()` is used, whatever cost there is will of course disappear with `-release`.

And IMO asserting is the right thing to do. Quoting the spec [1]:

"char[] strings are in UTF-8 format. wchar[] strings are in UTF-16 format. dchar[] strings are in UTF-32 format."

Note how it says "are in UTF-x format", not "should be". Therefore, a `string` not containing valid UTF-8 is by definition a bug.

Data with other (or unknown) encodings needs to be stored in `ubyte[]`.

[1] http://dlang.org/arrays.html#strings

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #19 from Vladimir Panteleev <thecybershadow@gmail.com> ---
(In reply to Vladimir Panteleev from comment #16)
> (In reply to Walter Bright from comment #15)
> > It still allocates memory. But it's worth thinking about. Maybe assert()?
> 
> Sure.

Wait, now I'm not sure. For some reason I was thinking of assert(false), which always stops execution. But continuing upon encountering invalid UTF-8 in release mode might result in bad outcomes as well.

The problem is that it's impossible to achieve 100% coverage and make sure that all Unicode-handling code in your program also handles invalid UTF-8 gracefully. Thus, an invalid-UTF-8 handling problem might not be caught in testing, but might cause an unpleasant situation in release mode (depending on what happens after the assert does NOT fire).
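
A sketch of that failure mode (firstCodePoint is a made-up name; full decoding elided):

    // In -release builds the assert is compiled out, so execution
    // continues past the check with corrupt data.
    dchar firstCodePoint(const(char)[] s)
    {
        assert((s[0] & 0xC0) != 0x80, "invalid UTF-8: starts mid-sequence");
        // With -release, a trail byte reaches this point and is treated
        // as if it began a sequence, silently yielding garbage.
        return s[0] < 0x80 ? s[0] : 0xFFFD; // full decoding elided
    }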

I don't feel too strongly about this, though; I think programs that operate on important data shouldn't run with -release anyway.

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #20 from Vladimir Panteleev <thecybershadow@gmail.com> ---
(In reply to Marc Schütz from comment #18)
> Data with other (or unknown) encodings needs to be stored in `ubyte[]`.

Have you tried using ubyte[] to process ASCII text? It's horrible: you have to cast at every step, and nothing in std.string works, even when it should.
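
For instance (a small sketch of the friction):

    import std.string : indexOf;

    void main()
    {
        ubyte[] raw = [104, 101, 108, 108, 111]; // "hello" as raw bytes

        // std.string functions want character types, so every call
        // needs a cast (and often a cast back):
        auto pos = (cast(const(char)[]) raw).indexOf('l');
        assert(pos == 2);
    }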

--