October 29, 2014
On 10/27/2014 1:54 PM, Sean Kelly wrote:
> On Friday, 24 October 2014 at 19:09:23 UTC, Walter Bright wrote:
>>
>> You can insert your own handler with core.assertHandler(myAssertHandler). Or
>> you can catch(Error). But you don't want to try doing anything more than
>> notification with that - the program is in an unknown state.
>
> Also be aware that if you throw an Exception from the assertHandler you could be
> violating nothrow guarantees.

Right.
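
For the record, a minimal sketch of both options (assuming present-day druntime, where the hook is the core.exception.assertHandler property and the handler must be nothrow; the logging details are illustrative):

import core.exception : assertHandler;
import core.stdc.stdlib : abort;
import std.stdio : stderr;

// Notification only: log and die. Past a failed assert the program is
// in an unknown state, so no recovery is attempted here.
void myAssertHandler(string file, size_t line, string msg) nothrow
{
    try stderr.writefln("assertion failed: %s(%s): %s", file, line, msg);
    catch (Exception) {} // the handler is nothrow, so swallow I/O errors
    abort();
}

void main()
{
    assertHandler = &myAssertHandler;

    // The other option: catch Error at the very top of the program,
    // again only to notify before terminating -- never to resume.
    try
    {
        // ... run the application ...
    }
    catch (Error e)
    {
        try stderr.writeln("fatal: ", e.msg);
        catch (Exception) {}
        abort();
    }
}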
October 30, 2014
On 2014-10-29 22:22, Walter Bright wrote:

> Assumptions are not guarantees.
>
> In any case, if the programmer knows that an assert error is restricted to
> a particular domain, and is recoverable, and wants to recover from it,
> use enforce(), not assert().

I really don't like "enforce". It encourages the use of plain Exception instead of a subclass.
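
The subclass form does exist, it just takes deliberate effort. A minimal sketch (assuming current Phobos, where std.exception.enforce accepts the exception type as a template argument -- older releases spelled this enforceEx; ConfigException and loadConfig are invented for the example):

import std.exception : enforce;

class ConfigException : Exception
{
    this(string msg, string file = __FILE__, size_t line = __LINE__)
    {
        super(msg, file, line);
    }
}

void loadConfig(string path)
{
    // The path of least resistance: throws plain Exception.
    enforce(path.length > 0, "empty config path");

    // The form that takes extra effort: throws the subclass, which
    // callers can catch specifically.
    enforce!ConfigException(path.length > 0, "empty config path");
}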

-- 
/Jacob Carlborg
October 31, 2014
On Thursday, 16 October 2014 at 19:53:42 UTC, Walter Bright wrote:
> On 10/15/2014 12:19 AM, Kagamin wrote:
>> Sure, software is one part of an airplane, like a thread is a part of a process.
>> When the part fails, you discard it and continue operation. In software it works
>> by rolling back a failed transaction. An airplane has some tricks to recover
>> from failures, but still it's a "no fail" design you argue against: it shuts
>> down parts one by one when and only when they fail and continues operation no
>> matter what until nothing works and even then it still doesn't fail, just does
>> nothing. The airplane example works against your arguments.
>
> This is a serious misunderstanding of what I'm talking about.
>
> Again, on an airplane, no way in hell is a software system going to be allowed to continue operating after it has self-detected a bug. Trying to bend the imprecise language I use into meaning the opposite doesn't change that.

To better depict the big picture as I see it:

You suggest that a system should shut down as soon as possible at the first sign of a failure that could affect it.

You give the hospital-in-a-hurricane example. But you don't praise the hospitals that shut down on failure; you praise the hospital that continues to operate in the face of an unexpected and uncontrollable disaster, in total contradiction with your suggestion to shut down ASAP.

You refer to an airplane's ability to continue operating through unexpected failures rather than shutting down ASAP, as if it supported your suggestion to shut down ASAP. This makes no sense; you contradict yourself.

Why didn't you praise the hospital that shut down? Why does nobody want airplanes diving into the ocean on the first suspicion of a problem? Because that's how unreliable systems work: they often stop working. Reliable systems work in a completely different way. They employ many tricks, and one big objective of those tricks is the ability to continue operating after a failure. All the effort put into airplane design has one purpose: to fight the immediate shutdown you defend as the only true way to operate, and which real reliable-system design explicitly rejects. How would an airplane without those tricks work? It would dive into the ocean on the first failure (and the crash investigation team would diagnose the failure afterwards), exactly as you suggest. Is that safe? It could fall on a city or a nuclear reactor. How does a real airplane work? A failure happens and it still flies, contrary to your suggestion to shut down on failure. That's how critical missions are done: they accept the risk of a greater disaster to complete the mission, and failures are diagnosed when appropriate.

That's why I think your examples contradict your proposal.
October 31, 2014
On Friday, 24 October 2014 at 18:47:59 UTC, H. S. Teoh via Digitalmars-d wrote:
> Basically, if you want a component to recover from a serious problem
> like a failed assertion, the recovery code should be in a *separate*
> component. Otherwise, if the recovery code is within the failing
> component, you have no way to know if the recovery code itself has been
> compromised, and trusting that it will do the right thing is very
> dangerous (and is what often leads to nasty security exploits). The
> watcher must be separate from the watched, otherwise how can you trust
> the watcher?

You make process isolation sound like a silver bullet, but failure can happen at any scale, from a temporary variable to the global network. You can't use process isolation to contain a failure larger than a process, and it's overkill for a failure at the scale of a temporary variable.
October 31, 2014
On Fri, Oct 31, 2014 at 08:15:17PM +0000, Kagamin via Digitalmars-d wrote:
> On Thursday, 16 October 2014 at 19:53:42 UTC, Walter Bright wrote:
> >On 10/15/2014 12:19 AM, Kagamin wrote:
> >>Sure, software is one part of an airplane, like a thread is a part of a process.  When the part fails, you discard it and continue operation. In software it works by rolling back a failed transaction. An airplane has some tricks to recover from failures, but still it's a "no fail" design you argue against: it shuts down parts one by one when and only when they fail and continues operation no matter what until nothing works and even then it still doesn't fail, just does nothing. The airplane example works against your arguments.
> >
> >This is a serious misunderstanding of what I'm talking about.
> >
> >Again, on an airplane, no way in hell is a software system going to be allowed to continue operating after it has self-detected a bug. Trying to bend the imprecise language I use into meaning the opposite doesn't change that.
> 
> To better depict the big picture as I see it:
> 
> You suggest that a system should shut down as soon as possible at the first sign of a failure that could affect it.
> 
> You give the hospital-in-a-hurricane example. But you don't praise the hospitals that shut down on failure; you praise the hospital that continues to operate in the face of an unexpected and uncontrollable disaster, in total contradiction with your suggestion to shut down ASAP.
> 
> You refer to an airplane's ability to continue operating through unexpected failures rather than shutting down ASAP, as if it supported your suggestion to shut down ASAP. This makes no sense; you contradict yourself.

You are misrepresenting Walter's position. His whole point was that once a single component has detected a consistency problem within itself, it can no longer be trusted to continue operating and therefore must be shut down. That, in turn, leads to the conclusion that your system design must include multiple, redundant, independent modules that perform that one function. *That* is the real answer to system reliability.

Pretending that a failed component can somehow fix itself is a fantasy. The only way you can be sure you are not making the problem worse is by having multiple redundant units that can perform each other's function. Then when one of the units is known to be malfunctioning, you turn it off and fall back to one of the other, known-to-be-good components.


T

-- 
Error: Keyboard not attached. Press F1 to continue. -- Yoon Ha Lee, CONLANG
October 31, 2014
On Fri, Oct 31, 2014 at 08:23:04PM +0000, Kagamin via Digitalmars-d wrote:
> On Friday, 24 October 2014 at 18:47:59 UTC, H. S. Teoh via Digitalmars-d wrote:
> >Basically, if you want a component to recover from a serious problem like a failed assertion, the recovery code should be in a *separate* component. Otherwise, if the recovery code is within the failing component, you have no way to know if the recovery code itself has been compromised, and trusting that it will do the right thing is very dangerous (and is what often leads to nasty security exploits). The watcher must be separate from the watched, otherwise how can you trust the watcher?
> 
> You make process isolation sound like a silver bullet, but failure can happen at any scale, from a temporary variable to the global network. You can't use process isolation to contain a failure larger than a process, and it's overkill for a failure at the scale of a temporary variable.

You're missing the point. The point is that a reliable system made of unreliable parts can only be reliable if you have multiple *redundant* copies of each component that are *decoupled* from each other.

The usual unit of isolation at the lowest level is the single process, because threads within a process have full access to memory shared by all threads. Therefore, they are not decoupled from each other, and you cannot put any confidence in the correct functioning of other threads once a single thread has become inconsistent. The only fail-safe solution is to have multiple redundant processes, so that when one process becomes inconsistent, you fail over to another, *decoupled* process that is known to be good.

This does not mean that process isolation is a "silver bullet" -- I never said any such thing. The same reasoning applies to larger components in the system as well. If you have a server that performs function X, and the server begins to malfunction, you cannot expect the server to fix itself -- because you don't know whether a hacker has rooted the server and is running exploit code instead of your application. The only 100% safe way to recover is to have another redundant server (or more) that also performs function X, shut down the malfunctioning server for investigation and repair, and in the meantime switch over to the redundant server to continue operations. You don't shut down the *entire* network unless all redundant components have failed.

The reason you cannot go below the process level as a unit of redundancy is coupling. The above design of failing over to a redundant module only works if the modules are completely decoupled from each other. Otherwise, you end up with a situation where you have two redundant modules M1 and M2, but both of them share a common helper module M3. Then if M1 detects a problem, you cannot be 100% sure it's not caused by a problem with M3, so if you just switch to M2, it will fail in the same way. Similarly, you cannot rule out that, while malfunctioning, M1 somehow damaged M3, thereby also making M2 unreliable. The only way to be 100% sure that failover will actually fix the problem is to make sure that M1 and M2 are completely isolated from each other (e.g., by having two redundant copies of M3 that are isolated from each other).

Since a single process is the unit of isolation in the OS, you can't go below this granularity: as I've already said, if one thread is malfunctioning, it may have trashed the data shared by all other threads in the same process, and therefore none of the other threads can be trusted to continue operating correctly. The only way to be 100% sure that failover will actually fix the problem, is to switch over to another process that you *know* is not coupled to the old, malfunctioning process.
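
To make the failover idea concrete, a minimal supervisor sketch (the worker binaries and their exit-code convention are invented for the example):

import std.process : spawnProcess, wait;
import std.stdio : writeln;

void main()
{
    // Two decoupled, redundant copies of the same function. They share
    // no memory with the supervisor or with each other, so a failure
    // in one cannot corrupt the other.
    auto workers = ["./worker-a", "./worker-b"];

    foreach (exe; workers)
    {
        auto pid = spawnProcess([exe]);
        if (wait(pid) == 0)
            return; // the job completed on a known-good unit

        // Non-zero exit: the worker self-detected an inconsistency and
        // shut itself down. Fail over to the redundant copy.
        writeln(exe, " failed; failing over");
    }
    writeln("all redundant units failed; shutting down for investigation");
}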

Attempting to have a process "fix itself" after detecting an inconsistency is unreliable -- you're leaving it up to chance whether the attempted recovery will actually work or make the problem worse. You cannot guarantee the recovery code itself hasn't been compromised by the failure -- because the recovery code exists in the same process, is vulnerable to the same problem that caused the original failure, and is vulnerable to memory corruption caused by malfunctioning code prior to the point the problem was detected. Therefore, the recovery code is not trustworthy and cannot be relied on to continue operating correctly. That kind of "maybe, maybe not" recovery is not what I'd want to put any trust in, especially in critical applications that can cost lives if things go wrong.


T

-- 
English has the lovely word "defenestrate", meaning "to execute by throwing someone out a window", or more recently "to remove Windows from a computer and replace it with something useful". :-) -- John Cowan
October 31, 2014
On Friday, 31 October 2014 at 20:33:54 UTC, H. S. Teoh via Digitalmars-d wrote:
> You are misrepresenting Walter's position. His whole point was that once
> a single component has detected a consistency problem within itself, it
> can no longer be trusted to continue operating and therefore must be
> shut down. That, in turn, leads to the conclusion that your system design
> must include multiple, redundant, independent modules that perform that
> one function. *That* is the real answer to system reliability.

In server software, such a component is a transaction/request. They are independent.

> Pretending that a failed component can somehow fix itself is a fantasy.

Traditionally a failed transaction is indeed rolled back. It's more of a business-logic requirement: a partially completed operation would confuse the user.
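
In D the rollback half falls out of a scope guard; a minimal sketch (the Db type and its methods are invented stand-ins for a real database API):

struct Db
{
    void begin()    { /* start a transaction */ }
    void commit()   { /* make the changes permanent */ }
    void rollback() { /* discard the partial changes */ }
}

void handleRequest(ref Db db)
{
    db.begin();
    scope(failure) db.rollback(); // runs only if we leave by throwing

    // ... perform the request's updates ...

    db.commit(); // reached only if nothing threw
}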
October 31, 2014
On Fri, Oct 31, 2014 at 09:11:53PM +0000, Kagamin via Digitalmars-d wrote:
> On Friday, 31 October 2014 at 20:33:54 UTC, H. S. Teoh via Digitalmars-d wrote:
> >You are misrepresenting Walter's position. His whole point was that once a single component has detected a consistency problem within itself, it can no longer be trusted to continue operating and therefore must be shut down. That, in turn, leads to the conclusion that your system design must include multiple, redundant, independent modules that perform that one function. *That* is the real answer to system reliability.
> 
> In server software, such a component is a transaction/request. They are independent.

You're using a different definition of "component". An inconsistency in a transaction is a problem with the input, not a problem with the program logic itself. If something is wrong with the input, the program can detect it and recover by aborting the transaction (rollback the wrong data). But if something is wrong with the program logic itself (e.g., it committed the transaction instead of rolling back when it detected a problem) there is no way to recover within the program itself.


> >Pretending that a failed component can somehow fix itself is a fantasy.
> 
> Traditionally a failed transaction is indeed rolled back. It's more of a business-logic requirement: a partially completed operation would confuse the user.

Again, you're using a different definition of "component".

A failed transaction is a problem with the data -- this is recoverable to some extent (that's why we have the ACID requirement of databases, for example). For this purpose, you vet the data before trusting that it is correct. If the data verification fails, you reject the request. This is why you should never use assert to verify data -- assert is for checking the program's own consistency, not for checking the validity of data that came from outside.
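
A small sketch of that division of labor (Request, withdraw, and the account logic are invented for the example):

import std.exception : enforce;

struct Request { int amount; }

int balance = 100; // program-internal state

void withdraw(Request req)
{
    // Outside data: validate with enforce. Bad input is expected, and
    // rejecting the request is a normal, recoverable outcome.
    enforce(req.amount > 0, "amount must be positive");
    enforce(req.amount <= balance, "insufficient funds");

    balance -= req.amount;

    // Internal invariant: check with assert. If this fires, the
    // program's own logic is broken -- nothing to recover from here.
    assert(balance >= 0, "balance went negative despite validation");
}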

A failed component, OTOH, is a problem with program logic. You cannot recover from that within the program itself, since its own logic has been compromised. You *can* rollback the wrong changes made to data by that malfunctioning program, of course, but the rollback must be done by a decoupled entity outside of that program. Otherwise you might end up causing even more problems (for example, due to the compromised / malfunctioning logic, the program commits the data instead of reverting it, thus turning an intermittent problem into a permanent one).


T

-- 
By understanding a machine-oriented language, the programmer will tend to use a much more efficient method; it is much closer to reality. -- D. Knuth
October 31, 2014
On 10/31/2014 2:31 PM, H. S. Teoh via Digitalmars-d wrote:
> On Fri, Oct 31, 2014 at 09:11:53PM +0000, Kagamin via Digitalmars-d wrote:
>> On Friday, 31 October 2014 at 20:33:54 UTC, H. S. Teoh via Digitalmars-d
>> wrote:
>>> You are misrepresenting Walter's position. His whole point was that
>>> once a single component has detected a consistency problem within
>>> itself, it can no longer be trusted to continue operating and
>>> therefore must be shut down. That, in turn, leads to the conclusion
>>> that your system design must include multiple, redundant, independent
>>> modules that perform that one function. *That* is the real answer to
>>> system reliability.
>>
>> In server software, such a component is a transaction/request. They are
>> independent.
>
> You're using a different definition of "component". An inconsistency in
> a transaction is a problem with the input, not a problem with the
> program logic itself. If something is wrong with the input, the program
> can detect it and recover by aborting the transaction (rollback the
> wrong data). But if something is wrong with the program logic itself
> (e.g., it committed the transaction instead of rolling back when it
> detected a problem) there is no way to recover within the program
> itself.
>
>
>>> Pretending that a failed component can somehow fix itself is a
>>> fantasy.
>>
>> Traditionally a failed transaction is indeed rolled back. It's more of
>> a business-logic requirement: a partially completed operation would
>> confuse the user.
>
> Again, you're using a different definition of "component".
>
> A failed transaction is a problem with the data -- this is recoverable
> to some extent (that's why we have the ACID requirement of databases,
> for example). For this purpose, you vet the data before trusting that it
> is correct. If the data verification fails, you reject the request. This
> is why you should never use assert to verify data -- assert is for
> checking the program's own consistency, not for checking the validity of
> data that came from outside.
>
> A failed component, OTOH, is a problem with program logic. You cannot
> recover from that within the program itself, since its own logic has
> been compromised. You *can* rollback the wrong changes made to data by
> that malfunctioning program, of course, but the rollback must be done by
> a decoupled entity outside of that program. Otherwise you might end up
> causing even more problems (for example, due to the compromised /
> malfunctioning logic, the program commits the data instead of reverting
> it, thus turning an intermittent problem into a permanent one).

This is a good summation of the situation.

November 01, 2014
On Friday, 31 October 2014 at 21:33:22 UTC, H. S. Teoh via Digitalmars-d wrote:
> You're using a different definition of "component". An inconsistency in
> a transaction is a problem with the input, not a problem with the
> program logic itself. If something is wrong with the input, the program
> can detect it and recover by aborting the transaction (rollback the
> wrong data).

Transactions roll back when there is contention for resources and/or any kind of integrity issue. That's why you have retries… so no, it is not only something wrong with the input; something is temporarily wrong with the situation overall.
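
That is, the usual bounded retry loop -- a sketch (ContentionException and the delegate shape are invented for the example):

// The transaction body rolls itself back and throws ContentionException
// when it loses a race for a shared resource.
class ContentionException : Exception
{
    this(string msg) { super(msg); }
}

void withRetry(void delegate() transaction, int maxAttempts = 3)
{
    foreach (attempt; 0 .. maxAttempts)
    {
        try
        {
            transaction();
            return; // committed
        }
        catch (ContentionException)
        {
            // Nothing was wrong with the input; the situation was
            // temporarily wrong. Already rolled back -- just try again.
        }
    }
    throw new Exception("transaction kept failing under contention");
}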