November 20, 2013
On Wednesday, November 20, 2013 08:51:16 Joseph Rushton Wakeling wrote:
> On 20/11/13 01:01, Andrei Alexandrescu wrote:
> > There's been recent discussion herein about what parameter validation method would be best for Phobos to adhere to.
> > 
> > Currently we are using a mix of approaches:
> > 
> > 1. Some functions enforce()
> > 
> > 2. Some functions just assert()
> > 
> > 3. Some (fewer I think) functions assert(0)
> > 
> > 4. Some functions don't do explicit checking, relying instead on lower-level enforcement such as null dereference and bounds checking to ensure safety.
> > 
> > Each method has its place. The question is what guidelines we put forward for Phobos code to follow; we're a bit loose about that right now.
> 
> Regarding enforce() vs. assert(), a good rule that was once suggested to
> me was that enforce() should be used for actual runtime checking (e.g.
> checking that the input to a public API function has correct properties),
> while assert() should be used to test for logical failures (i.e. checking
> that cases which should never arise really don't arise).
> 
> I've always followed that as a rule of thumb ever since.

When an assertion fails, it's a bug in your code. Assertions should _never_ be used for validating user input. So, if your function is asserting on the state of its input, then it is requiring that the caller give input which follows that contract, and it's a bug in the caller when they violate that contract by passing in bad input.

When your function uses enforce to validate its input, it is _not_ considered a bug when bad input is given. It _could_ be a bug in the caller, but the caller is not required to give valid input. When they give invalid input, they get to react to the exception that was thrown and handle the error appropriately. This works both when the input comes from outside the program (e.g. from a user or a file) and when it doesn't make sense for the caller to have validated the input before calling the function (e.g. because the validator function and the function doing the work would end up doing almost the same work, making it cheaper to have the function validate its input itself rather than providing a separate validator function). It also means that the function will _never_ have to operate on invalid input, since invalid input is always checked and rejected, which in turn makes it much harder to use the function incorrectly.

But ultimately, whether you use assertions or exceptions comes down to whether it's considered to always be a bug in the caller if the input is bad. DbC uses assertions and considers it a bug in the caller (since they violated their part of the contract), whereas defensive programming has the function protect itself and always check and throw on invalid input rather than assuming that the caller is going to provide valid input.
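
To make the distinction concrete, here is a minimal D sketch of the two styles (the daysInMonth functions are invented for illustration and are not Phobos functions):

import std.exception : enforce;

// DbC style: bad input is a bug in the caller, so it is checked with an
// assertion (which is compiled out in release builds).
int daysInMonth(int month)
{
    assert(month >= 1 && month <= 12, "month out of range");
    return [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31][month - 1]; // ignoring leap years
}

// Defensive style: bad input is an expected error condition, so it is
// rejected with an exception that the caller can catch and handle.
int daysInMonthChecked(int month)
{
    enforce(month >= 1 && month <= 12, "month out of range");
    return [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31][month - 1]; // ignoring leap years
}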

- Jonathan M Davis
November 20, 2013
On 2013-11-20 09:50, Walter Bright wrote:

> Important is deciding what the notions of "validated data" and
> "untrusted data" are.
>
> 1. Validated data should get asserts if it is found to be invalid.
>
> 2. Untrusted data should get exceptions thrown if it is found to be
> invalid (or return errors).
>
> For example, consider a utf string. If it has passed a validation check,
> then it becomes trusted data. Further processing on it should assert if
> it turns out to be invalid (because then you've got a programming bug).
>
> File open failures should always throw, and never assert, because the
> file is not part of the program and so is inherently not trusted.
>
> One way to distinguish validated from untrusted data is by using
> different types (or a naming convention, see Joel Spolsky's
> http://www.joelonsoftware.com/articles/Wrong.html).
>
> It is of major importance in a program to think about what APIs get
> validated arguments and what APIs get untrusted arguments.

How should we accomplish this? We can't replace:

void main (string[] args)

With

void main (UnsafeString[] args)

And break every application out there.

-- 
/Jacob Carlborg
November 20, 2013
On Wednesday, November 20, 2013 11:49:32 Jacob Carlborg wrote:
> On 2013-11-20 09:50, Walter Bright wrote:
> > Important is deciding what the notions of "validated data" and "untrusted data" are.
> > 
> > 1. Validated data should get asserts if it is found to be invalid.
> > 
> > 2. Untrusted data should get exceptions thrown if it is found to be
> > invalid (or return errors).
> > 
> > For example, consider a utf string. If it has passed a validation check, then it becomes trusted data. Further processing on it should assert if it turns out to be invalid (because then you've got a programming bug).
> > 
> > File open failures should always throw, and never assert, because the file is not part of the program and so is inherently not trusted.
> > 
> > One way to distinguish validated from untrusted data is by using different types (or a naming convention, see Joel Spolsky's http://www.joelonsoftware.com/articles/Wrong.html).
> > 
> > It is of major importance in a program to think about what APIs get validated arguments and what APIs get untrusted arguments.
> 
> How should we accomplish this? We can't replace:
> 
> void main (string[] args)
> 
> With
> 
> void main (UnsafeString[] args)
> 
> And break every application out there.

You'd do it the other way around by having something like

ValidatedString!char s = validateString("hello world");

ValidatedString would then avoid any extra validation when iterating over the characters, though I don't know how much of an efficiency gain that would actually be given that much of the validation occurs naturally when decoding or using stride. It would have the downside that any function which specializes on strings would likely have to then specialize on ValidatedString as well. So, while I agree with the idea in concept, I'd propose that we benchmark the difference in decoding and striding without the checks and see if there actually is much difference. Because if there isn't, then I don't think that it's worth going to the trouble of adding something like ValidatedString.
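
For concreteness, a rough sketch of what such a wrapper might look like (ValidatedString and validateString don't exist in Phobos; the names just follow the usage above):

import std.utf : validate;

// Construction validates once; afterwards the payload is treated as
// known-good UTF and never re-checked.
struct ValidatedString(Char)
{
    immutable(Char)[] payload;
}

ValidatedString!Char validateString(Char)(immutable(Char)[] str)
{
    validate(str);                     // throws a UTFException on invalid UTF
    return ValidatedString!Char(str);  // no re-validation needed afterwards
}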

- Jonathan M Davis
November 20, 2013
On Wednesday, November 20, 2013 11:45:57 Lars T. Kyllingstad wrote:
> On Wednesday, 20 November 2013 at 00:01:00 UTC, Andrei Alexandrescu wrote:
> > (c) A variety of text functions currently suffer because we don't make the difference between validated UTF strings and potentially invalid ones.
> 
> I think it is fair to always assume that a char[] is a valid UTF-8 string, and instead perform the validation when creating/filling the string from a non-validated source.

That doesn't work when strings are being created via concatenation and the like inside the program rather than simply coming from outside the program.

> Take std.file.read() as an example; it returns void[], but has a
> validating counterpart in std.file.readText().
> 
> I think we should use ubyte[] to a greater extent for data which is potentially *not* valid UTF.

Well, we've already discussed the possibility of using ubyte[] to indicate ASCII strings, and that makes a lot more sense IMHO, because then no decoding occurs (which is precisely what you want for ASCII), whereas with a string that's potentially invalid UTF, it's not that we don't want to decode it. It's just that we need to validate it when decoding it.

So, I'd argue that ubyte[] should be used when you want to operate on code units rather than code points, not because it has anything to do with validating code points.
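
A small illustration of the difference (std.string.representation gives the ubyte[] view of a string):

import std.range : walkLength;
import std.string : representation;

void main()
{
    string s = "häll"; // 4 code points, 5 UTF-8 code units

    // Iterating a string by code point decodes it, and decoding validates
    // the UTF-8 as it goes.
    assert(s.walkLength == 4);

    // The same data viewed as ubyte[] is just raw code units: no decoding
    // and no validation, which is what you want when treating the data as
    // plain bytes (e.g. known-ASCII processing).
    immutable(ubyte)[] bytes = s.representation;
    assert(bytes.length == 5);
}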

- Jonathan M Davis
November 20, 2013
On 2013-11-20 12:16, Jonathan M Davis wrote:

> You'd do it the other way around by having something like
>
> ValidatedString!char s = validateString("hello world");

Right.

> ValidatedString would then avoid any extra validation when iterating over the
> characters, though I don't know how much of an efficiency gain that would
> actually be given that much of the validation occurs naturally when decoding
> or using stride. It would have the downside that any function which
> specializes on strings would likely have to then specialize on ValidatedString
> as well. So, while I agree with the idea in concept, I'd propose that we
> benchmark the difference in decoding and striding without the checks and see if
> there actually is much difference. Because if there isn't, then I don't think
> that it's worth going to the trouble of adding something like ValidatedString.

It's not just a question of whether the string is valid UTF-8. There can be many other kinds of valid strings, or rather other functions that have additional requirements, like sanitized filenames, HTML/SQL-escaped strings, and so on. A sketch of how such requirements could be encoded in types follows below.
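
In the spirit of the Spolsky article Walter linked, something like this (SanitizedFileName and the functions around it are hypothetical):

// The wrapper type records that a value has already passed a given check.
struct SanitizedFileName { string value; }

// Only the function that performs the check constructs the wrapper...
SanitizedFileName sanitizeFileName(string raw)
{
    // (real sanitization omitted -- this is only a sketch)
    return SanitizedFileName(raw);
}

// ...and functions that require checked data take the wrapper type, so
// unchecked strings can't be passed in by accident.
void saveTo(SanitizedFileName name)
{
    // write to name.value
}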

-- 
/Jacob Carlborg
November 20, 2013
On 2013-11-20 11:38, Jonathan M Davis wrote:

> Unfortunately, I don't think that it scales at all to take the approach that
> Walter has suggested of having the API normally assert on input and provide
> helper functions which the caller can use to validate input when they deem
> appropriate. That has the advantage of giving the caller control over what is
> and isn't checked and avoiding unnecessary checks, but it also makes it much
> easier to misuse the API, and I would expect the average programmer to skip
> the checks in most cases. It very quickly becomes like using error codes
> instead of exceptions, except that in this case, instead of an error code
> being ignored, the data's validity wouldn't have even been checked in the first
> place, resulting in the function being called doing who-knows-what. And the
> resulting bugs could be very obvious, or they could be insidiously hard to
> detect.

I think Walter's suggestion requires the use of asserts:

bool isValid (Data data);

void process (Data data)
{
    assert(isValid(data));
    // process
}

The asserts should be on by default and removed in release builds. This would require DMD to ship two versions of Phobos, one with asserts enabled and one with them disabled. Then only when the -release flag is used will the version of Phobos with asserts disabled be used.

> Still, the most important point that I'd like to make is that I think we
> should lean towards validating input with enforce by default and then provide
> alternative means to avoid that validation rather than using assertions and
> DbC by default, because leaving the validation up to the caller in release and
> asserting in debug is going to lead to _far_ more bugs in code using Phobos,
> particularly when the result isn't immediately and obviously wrong when bad
> input is given. And the fact that by default, the assertions in Phobos won't
> be hit in calling code unless the Phobos function is templatized (because
> Phobos will have been compiled in release) makes using assertions that much
> worse.

DMD needs to ship with two versions of Phobos, one with assertions enabled and one with them disabled.

-- 
/Jacob Carlborg
November 20, 2013
Am Wed, 20 Nov 2013 08:49:28 +0100
schrieb Jacob Carlborg <doob@me.com>:

> What about distributing a version of druntime and Phobos with asserts enabled that is used by default (or with the -debug flag)? Then a version with asserts disabled would be used when the -release flag is used.
> 
> We probably also want it to be possible to use Phobos with asserts enabled even in release mode.

That is what LDC does, and with the -defaultlib switch it is easy to use the debug Phobos in release builds. Currently this flag is mostly used to link against the shared phobos2.so.

-- 
Marco

November 20, 2013
Am Wed, 20 Nov 2013 12:49:20 +0100
schrieb Jacob Carlborg <doob@me.com>:

> On 2013-11-20 12:16, Jonathan M Davis wrote:
> 
> > You'd do it the other way around by having something like
> >
> > ValidatedString!char s = validateString("hello world");
> 
> Right.
> 
> > ValidatedString would then avoid any extra validation when iterating over the characters, though I don't know how much of an efficiency gain that would actually be given that much of the validation occurs naturally when decoding or using stride. It would have the downside that any function which specializes on strings would likely have to then specialize on ValidatedString as well. So, while I agree with the idea in concept, I'd propose that we benchmark the difference in decoding and striding without the checks and see if there actually is much difference. Because if there isn't, then I don't think that it's worth going to the trouble of adding something like ValidatedString.
> 
> It's not just a question of whether the string is valid UTF-8. There can be many other kinds of valid strings, or rather other functions that have additional requirements, like sanitized filenames, HTML/SQL-escaped strings, and so on.

None of that is feasible. We can only hope to catch every case of user input (or untrusted data) and check it before passing it to Phobos APIs. That's why Phobos has functions both to validate UTF strings and to sanitize them on a best-effort basis.

So in my opinion Phobos should continue forward with assert instead of enforce. I/O functions, of course, have to use exceptions.

That said, I never thought of validating args[] before passing them to getopt or using them as filenames. Lesson learned, I guess?

-- 
Marco

November 20, 2013
On 11/20/2013 08:49 AM, Jacob Carlborg wrote:
> Would we accompany the assumeSorted with an assert in the function
> assuming something is sorted? We probably don't want to rely on convention.

We do rely on convention in any case:

import std.algorithm, std.range;

void main(){
    auto a = [1,2,3,4,5];
    auto s = sort(a);      // s is typed as SortedRange!(int[])
    swap(a[0],a[$-1]);     // mutate the underlying array behind the range's back
    // The static type still claims "sorted", but the data no longer is.
    assert(is(typeof(s)==SortedRange!(int[])) && !s.isSorted());
}

November 20, 2013
On 11/20/2013 12:57 PM, Jacob Carlborg wrote:
>
> bool isValid (Data data);
>
> void process (Data data)
> {
>      assert(isValid(data));
>      // process
> }


void process(Data data)
in { assert(isValid(data)); }
body {
    // process
}