September 29, 2014
On 9/28/2014 6:39 PM, Sean Kelly wrote:
> Well... suppose you design a system with redundancy such that an error in a
> specific process isn't enough to bring down the system.  Say it's a quorum
> method or whatever.  In the instance that a process goes crazy, I would argue
> that the system is in an undefined state but a state that it's designed
> specifically to handle, even if that state can't be explicitly defined at design
> time.  Now if enough things go wrong at once the whole system will still fail,
> but it's about building systems with the expectation that errors will occur.
> They may even be logic errors--I think it's kind of irrelevant at that point.
>
> Even in a network of communicating processes, one process getting into a bad
> state can theoretically poison the entire system, and you're often not in a
> position to simply shut down the whole thing and wait for a repairman.  And
> simply rebooting
> the system if it's a bad sensor that's causing the problem just means a pause
> before another failure cascade.  I think any modern program designed to run
> continuously (increasingly the typical case) must be designed with some degree
> of resiliency or self-healing in mind.  And that means planning for and limiting
> the scope of undefined behavior.

I've said that processes are different, because the scope of the effects is limited by the hardware.

If a system with threads that share memory cannot be restarted, there are serious problems with the design of it, because a crash and the necessary restart are going to happen sooner or later, probably sooner.

I don't believe that the way to get 6 sigma reliability is by ignoring errors and hoping. Airplane software is most certainly not done that way.

I recall Toyota got into trouble with their computer-controlled cars because of their approach to handling inevitable bugs and errors. One process controlled everything. When something unexpected went wrong, it kept right on operating, any unknown and unintended consequences be damned.

The way to get reliable systems is to design to accommodate errors, not pretend they didn't happen, or hope that nothing else got affected, etc. In critical software systems, that means shut down and restart the offending system, or engage the backup.

There's no other way that works.
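Concretely, at the process level, "restart, then engage the backup" might look like this minimal sketch (the ./worker and ./worker-backup executables are hypothetical, not anything from this thread):

import std.process : spawnProcess, wait;
import std.stdio : stderr;

void main()
{
    // "./worker" and "./worker-backup" are hypothetical executables.
    foreach (attempt; 1 .. 4)
    {
        auto pid = spawnProcess(["./worker"]);
        if (wait(pid) == 0)
            return;                              // clean exit, nothing to do
        stderr.writefln("worker died, restart %s of 3", attempt);
    }

    // Persistent failure: stop restarting the same thing and engage the backup.
    stderr.writeln("engaging backup");
    wait(spawnProcess(["./worker-backup"]));
}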
September 29, 2014
On 9/28/2014 6:17 PM, Sean Kelly wrote:
> On Sunday, 28 September 2014 at 22:00:24 UTC, Walter Bright wrote:
>>
>> I can't get behind the notion of "reasonably certain". I certainly would not
>> use such techniques in any code that needs to be robust, and we should not be
>> using such cowboy techniques in Phobos nor officially advocate their use.
>
> I think it's a fair stance not to advocate this approach.  But as it is I spend
> a good portion of my time diagnosing bugs in production systems based entirely
> on archived log data, and analyzing the potential impact on the system to
> determine the importance of a hot fix.  The industry seems to be moving towards
> lowering the barrier between engineering and production code (look at what
> Netflix has done for example), and some of this comes from an isolation model
> akin to the Erlang approach, but the typical case is still that hot fixing code
> is incredibly expensive and so you don't want to do it if it isn't necessary.
> For me, the correct approach may simply be to eschew assert() in favor of
> enforce() in some cases.  But the direction I want to be headed is the one
> you're encouraging.  I simply don't know if it's practical from a performance
> perspective.  This is still developing territory.

You've clearly got a tough job to do, and I understand you're doing the best you can with it. I know I'm hardcore and uncompromising on this issue, but that's where I came from (the aviation industry).

I know what works (airplanes are incredibly safe) and what doesn't work (Toyota's approach was in the news not too long ago). Deepwater Horizon and Fukushima are also prime examples of not dealing properly with modest failures that cascaded into disaster.
September 29, 2014
On Monday, 29 September 2014 at 02:57:03 UTC, Walter Bright wrote:
>
> I've said that processes are different, because the scope of the effects is limited by the hardware.
>
> If a system with threads that share memory cannot be restarted, there are serious problems with the design of it, because a crash and the necessary restart are going to happen sooner or later, probably sooner.

Right.  But if the condition that caused the restart persists, the process can end up in a cascading restart scenario.  Simply restarting on error isn't necessarily enough.


> I don't believe that the way to get 6 sigma reliability is by ignoring errors and hoping. Airplane software is most certainly not done that way.

I believe I was arguing the opposite.  More to the point, I think it's necessary to expect undefined behavior to occur and to plan for it.  I think we're on the same page here and just miscommunicating.


> I recall Toyota got into trouble with their computer-controlled cars because of their approach to handling inevitable bugs and errors. One process controlled everything. When something unexpected went wrong, it kept right on operating, any unknown and unintended consequences be damned.
>
> The way to get reliable systems is to design to accommodate errors, not pretend they didn't happen, or hope that nothing else got affected, etc. In critical software systems, that means shut down and restart the offending system, or engage the backup.

My point was that it's often more complicated than that.  There have been papers written on self-repairing systems, for example, and ways to design systems that are inherently durable when it comes to even internal errors.  I think what I'm trying to say is that simply aborting on error is too brittle in some cases, because it only deals with one vector--memory corruption that is unlikely to reoccur.  But I've watched always-on systems fall apart from some unexpected ongoing situation, and simply restarting doesn't actually help.
September 29, 2014
On 09/29/2014 02:47 AM, Walter Bright wrote:
> On 9/28/2014 4:18 PM, Joseph Rushton Wakeling via Digitalmars-d wrote:
>> I don't follow this point.  How can this approach work with programs
>> that are
>> built with the -release switch?
>
> All -release does is not generate code for assert()s. ...

(Euphemism for undefined behaviour.)

September 29, 2014
On 09/29/2014 06:06 AM, Timon Gehr wrote:
> On 09/29/2014 02:47 AM, Walter Bright wrote:
>> On 9/28/2014 4:18 PM, Joseph Rushton Wakeling via Digitalmars-d wrote:
>>> I don't follow this point.  How can this approach work with programs
>>> that are
>>> built with the -release switch?
>>
>> All -release does is not generate code for assert()s. ...
>
> (Euphemism for undefined behaviour.)
>

Also, -release additionally removes contracts, in particular invariant calls, and disables version(assert).
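For concreteness, a small made-up example (not from the thread) of what that means in practice: under -release the in contract, the plain assert, and the version(assert) block below all drop out of the generated code.

int divide(int a, int b)
in { assert(b != 0); }              // in contract: not generated under -release
body
{
    assert(a >= 0);                 // plain assert: also removed by -release

    version (assert)
    {
        // only compiled when assert checks are emitted, i.e. not under -release
        import std.stdio : writeln;
        writeln("checked build: contracts and asserts are active");
    }

    return a / b;                   // under -release, b == 0 is no longer caught above
}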
September 29, 2014
On 09/29/2014 12:59 AM, Walter Bright wrote:
> ...
>
>> Unless, of course, you're suggesting that we put this around every
>> main() function:
>>
>>     void main() {
>>         try {
>>             ...
>>         } catch(Exception e) {
>>             assert(0, "Unhandled exception: I screwed up");
>>         }
>>     }
>
> I'm not suggesting that Exceptions are to be thrown on programmer
> screwups - I suggest the OPPOSITE.
>

He does not suggest that Exceptions are to be thrown on programmer screw-ups, but rather that the thrown exception itself is the screw-up, with a possibly complex cause.

It is not:

if(screwedUp()) throw new Exception("");


It is rather:

void foo(int x){
    if(!test(x)) throw new Exception(""); // this may be an expected code path for some callers
}

void bar(){
    // ...
    int y = screwUp();
    foo(y); // yet it is unexpected here
}
September 29, 2014
On Sunday, 28 September 2014 at 22:59:46 UTC, Walter Bright wrote:
> If anyone is writing code that throws an Exception with "internal error", then they are MISUSING exceptions to throw on logic bugs. I've been arguing this all along.

Nothing wrong with it. It is quite common and useful for a non-critical web service to log the exception, re-throw something like "internal error", catch that internal error at the root, return the appropriate 5xx HTTP response, and keep going.
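A minimal sketch of that pattern, with invented names (Request, Response, doWork, process and handleRequest are placeholders, not from any particular framework):

import std.stdio : stderr;

struct Request {}
struct Response { int status; string text; }

void doWork(Request req)
{
    // application logic; any uncaught Exception here ends up in process()
}

void process(Request req)
{
    try { doWork(req); }
    catch (Exception e)
    {
        stderr.writeln("logged for the developer: ", e);  // keep the details
        throw new Exception("internal error");            // generic error to the root
    }
}

// the root handler maps the generic error to a 5xx and keeps serving
Response handleRequest(Request req)
{
    try { process(req); return Response(200, "ok"); }
    catch (Exception e) { return Response(500, e.msg); }
}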

You are arguing as if it is impossible to know whether the logic error is local to the handler, or not, with a reasonable probability. "Division by zero" is usually not a big deal, but it is a logic error. No need to shut down the service.

> I'm not suggesting that Exceptions are to be thrown on programmer screwups - I suggest the OPPOSITE.

It is impossible to verify what the source is. It might be a bug in a boolean expression leading to a throw when the system is ok.

assert()s should also not be left in production code. They are not for catching runtime errors, but for testing at the expense of performance.

Uncaught exceptions should be re-thrown higher up in the call chain to a different error level based on the possible impact on the system. Getting an unexpected mismatch exception in a form-validator is not a big deal. Getting out-of-bounds error in main storage is a big deal. Whether it is a big deal can only be decided at the higher level.
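For illustration, escalation by impact might look roughly like this; StorageException and dispatchFailure are invented names, not an existing API:

import std.stdio : stderr;

class StorageException : Exception
{
    this(string msg) { super(msg); }
}

// Called from a catch block near the root of the call chain.
void dispatchFailure(Exception e)
{
    // Escalate based on possible impact: anything touching main storage is a
    // "big deal" and becomes a critical failure that takes the service down.
    if (auto s = cast(StorageException) e)
        throw new Error("critical failure: " ~ s.msg);

    // A form-validator mismatch or other low-impact exception is logged,
    // stack trace included, and the service keeps going.
    stderr.writeln(e.toString());
}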

It is no doubt useful to be able to obtain a stack trace so that you can log it when an exception turns out to fall into the "big deal" category and therefore should be re-thrown as a critical failure. The deciding factor should be performance.


September 29, 2014
On 9/28/2014 9:31 PM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> Nothing wrong with it. It is quite common and useful for a non-critical web
> service to log the exception, re-throw something like "internal error", catch
> that internal error at the root, return the appropriate 5xx HTTP response, and
> keep going.

Lots of bad practices are commonplace.


> You are arguing as if it is impossible to know whether the logic error is local
> to the handler, or not, with a reasonable probability.

You're claiming to know that a program in an unknown and unanticipated state is really in a known state. It isn't.


> assert()s should also not be left in production code. They are not for catching
> runtime errors, but for testing at the expense of performance.

Are you really suggesting that asserts should be replaced by thrown exceptions? I suspect we have little common ground here.


> Uncaught exceptions should be re-thrown higher up in the call chain to a
> different error level based on the possible impact on the system. Getting an
> unexpected mismatch exception in a form-validator is not a big deal. Getting
> out-of-bounds error in main storage is a big deal. Whether it is a big deal can
> only be decided at the higher level.

A vast assumption here that you know in advance what bugs you're going to have and what causes them.


> It is no doubt useful to be able to obtain a stack trace so that you can log it
> when an exception turns out to fall into the "big deal" category and therefore
> should be re-thrown as a critical failure. The deciding factor should be
> performance.

You're using exceptions as a program bug reporting mechanism. Whoa camel, indeed!
September 29, 2014
On 9/28/2014 9:03 PM, Sean Kelly wrote:
> On Monday, 29 September 2014 at 02:57:03 UTC, Walter Bright wrote:
> Right.  But if the condition that caused the restart persists, the process can
> end up in a cascading restart scenario.  Simply restarting on error isn't
> necessarily enough.

When it isn't enough, use the "engage the backup" technique.


>> I don't believe that the way to get 6 sigma reliability is by ignoring errors
>> and hoping. Airplane software is most certainly not done that way.
>
> I believe I was arguing the opposite.  More to the point, I think it's necessary
> to expect undefined behavior to occur and to plan for it.  I think we're on the
> same page here and just miscommunicating.

Assuming that the program bug couldn't have affected other threads is relying on hope. Bugs happen when the program went into an unknown and unanticipated state. You cannot know, until after you debug it, what other damage the fault caused, or what other damage caused the detected fault.


> My point was that it's often more complicated than that.  There have been papers
> written on self-repairing systems, for example, and ways to design systems that
> are inherently durable when it comes to even internal errors.

I confess much skepticism about such things when it comes to software. I do know how reliable avionics software is done, and that stuff does work even in the face of all kinds of bugs, damage, and errors. I'll be betting my life on that tomorrow :-)

Would you bet your life on software that had random divide by 0 bugs in it that were just ignored in the hope that they weren't serious? Keep in mind that software is rather unique in that a single bit error in a billion bytes can render the software utterly demented.

Remember the Apollo 11 lunar landing, when the descent computer software started showing self-detected faults? Armstrong turned it off and landed manually. He wasn't going to bet his ass that the faults could be ignored. You and I wouldn't, either.


> I think what I'm
> trying to say is that simply aborting on error is too brittle in some cases,
> because it only deals with one vector--memory corruption that is unlikely to
> reoccur.  But I've watched always-on systems fall apart from some unexpected
> ongoing situation, and simply restarting doesn't actually help.

In such a situation, ignoring the error seems hardly likely to do any better.
September 29, 2014
On Monday, 29 September 2014 at 04:57:45 UTC, Walter Bright wrote:
> Lots of bad practices are commonplace.

This is not an argument; it is a postulate.

>> You are arguing as if it is impossible to know whether the logic error is local
>> to the handler, or not, with a reasonable probability.
>
> You're claiming to know that a program in an unknown and unanticipated state is really in a known state. It isn't.

It does not have to be known; it is sufficient that the fault is isolated, that it is unlikely to be global, or that it has low impact on long-term integrity.

> Are you really suggesting that asserts should be replaced by thrown exceptions? I suspect we have little common ground here.

No, regular asserts should not be caught except for mailing the error log to the developer. They are for testing only.

Pre/postconditions between subsystems are on a different level though. They should not be conflated with regular asserts.
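One way to keep those two levels apart, as a rough sketch with made-up names (storeRecord, rebalance): enforce stays in release builds at the subsystem boundary, while the internal assert is for testing only.

import std.exception : enforce;

// Subsystem boundary: the caller is outside this component's control, so the
// check survives release builds.
void storeRecord(int id, string payload)
{
    enforce(payload.length != 0, "empty payload passed across the subsystem boundary");
    // ...
}

// Internal consistency check: a failure here is a bug in this module, and the
// assert is meant for testing builds only.
private void rebalance(int[] heap, size_t i)
{
    assert(i < heap.length, "heap index out of range");
    // ...
}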

> A vast assumption here that you know in advance what bugs you're going to have and what causes them.

I know in advance that a "division-by-zero" error is of limited scope with high probability, or that an error in a strictly pure validator is of low impact with high probability. I also know that any sign of a flaw in a transaction engine is a critical error that warrants a shutdown.

We know in advance that all programs above a modest level of complexity will contain bugs, most of them innocent, and for many services they are not a good reason to shut the whole thing down.

If you have memory safety, reasonable isolation, and well-tested global data structures, it is most desirable to keep the system running as long as it is incapable of corrupting a critical database.

> You're using exceptions as a program bug reporting mechanism.

Uncaught exceptions are bugs and should be logged as such. If a form validator throws an unexpected exception then it is a bug. It makes the validation questionable, but does not affect the rest of the system. It is a non-critical bug that needs attention.

> Whoa camel, indeed!

By your line of reasoning, no software should ever be shipped without a formal proof, because it will most certainly be buggy and contain unspecified, undetected states.

Keep in mind that a working program, in the real world, is a program that provides reasonable output for reasonable input. Total correctness is a pipe dream; it is not within reach for most real programs, not even with formal proofs.