October 04, 2014
On 10/3/2014 10:00 AM, Joseph Rushton Wakeling via Digitalmars-d wrote:
> What I'm asking you to consider is a use-case, one that I picked quite
> carefully.  Without assuming anything about how the system is architected, if we
> have a telephone exchange, and an Error occurs in the handling of a single call,
> it seems to me fairly unarguable that it's essential to avoid this bringing down
> everyone else's call with it.  That's not simply a matter of convenience -- it's
> a matter of safety, because those calls might include emergency calls, urgent
> business communications, or any number of other circumstances where dropping
> someone's call might have severe negative consequences.

What you're doing is attempting to write a program with the requirement that the program cannot fail.

It's impossible.

If that's your requirement, the system needs to be redesigned so that it can accommodate the failure of the program.

(Ignoring bugs in the program is not accommodating failure, it's pretending that the program cannot fail.)


> As I'm sure you realize, I also picked that particular use-case because it's one
> where there is a well-known technological solution -- Erlang -- which has as a
> key feature its ability to isolate different parts of the program, and to deal
> with errors by bringing down the local process where the error occurred, rather
> than the whole system.  This is an approach which is seriously battle-tested in
> production.

As I (and Brad) have stated before, process isolation (shutting down the failed process and restarting it) is acceptable, because processes are isolated from each other.

Threads are not isolated from each other. They are not. Not. Not.


> As I said, I'm not asking you to endorse catching Errors in threads, or other
> gross simplifications of Erlang's approach.  What I'm interested in are your
> thoughts on how we might approach resolving the requirement for this kind of
> stability and localization of error-handling with the tools that D provides.
>
> I don't mind if you say to me "That's your problem" (which it certainly is:-),
> but I'd like it to be clear that it _is_ a problem, and one that it's important
> for D to address, given its strong standing in the development of
> super-high-connectivity server applications.

The only way to have super high uptime is to design the system so that failure is isolated, and the failed process can be quickly restarted or replaced. Ignoring bugs is not isolation, and hoping that bugs in one thread don't affect memory shared by other threads doesn't work.
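The isolate-and-restart design argued for above can be sketched in a few lines. This is an illustrative toy (in Python rather than D, for brevity); `exchange` and `handle_call` are invented names, and a negative call id stands in for an internal bug. The point is that the buggy handler dies in its own address space, so the other calls, and the exchange itself, are untouched:

```python
import subprocess
import sys

def handle_call(call_id):
    """Run one call handler in a fully isolated OS process.
    A negative id simulates an internal bug that kills the handler."""
    code = (
        f"cid = {call_id}\n"
        "assert cid >= 0, 'internal error in call handler'\n"
        "print('completed')\n"
    )
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True)
    if proc.returncode != 0:
        # The handler process died; its (possibly corrupted) memory died
        # with it. The exchange logs the failure and keeps serving.
        return "dropped"
    return proc.stdout.strip()

def exchange(call_ids):
    """A toy 'telephone exchange': one OS process per call, so a bug in
    one call cannot corrupt the state of any other call."""
    return {cid: handle_call(cid) for cid in call_ids}
```

With threads instead of processes, the `assert` failure in one handler would leave every other handler sharing memory with a program in an unknown state, which is exactly the situation the thread is arguing about.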

October 04, 2014
On Saturday, 4 October 2014 at 08:30:11 UTC, Walter Bright wrote:
> On 10/3/2014 3:27 PM, Piotrek wrote:
>> My point was that the broken speed indicators shut down the autopilot systems.
>
> The alternative is to have the autopilot crash the airplane. The autopilot cannot fly with compromised airspeed data.

Yes, I know. I just provided that example as a response to:

> Do you interpret airplane safety right? As I understand, airplanes are safe
> exactly because they recover from assert failures and continue operation.

And Paolo stated it's a bad example. Maybe it is, but I couldn't find a better one. That accident sticks in my head because its sequence of events shocked me more than any other accident story I've heard.

Piotrek
October 04, 2014
On Saturday, 4 October 2014 at 08:24:40 UTC, Paolo Invernizzi wrote:

>
> And that is still the only reasonable thing to do in that case.
>
> ---
> /Paolo

And I never said otherwise. See my response to Walter's post.

Piotrek
October 04, 2014
On Sat, 04 Oct 2014 09:14:03 +0000
eles via Digitalmars-d <digitalmars-d@puremagic.com> wrote:

> It knows what it is doing.
yes. processing garbage to generate more garbage.


October 04, 2014
On Saturday, 4 October 2014 at 08:39:47 UTC, Walter Bright wrote:

>
> It's really too bad that I've never seen any engineering courses on reliability.
>
> http://www.drdobbs.com/architecture-and-design/safe-systems-from-unreliable-parts/228701716

Thanks Walter. I was going to ask you about papers :) Maybe we need to mention the key points in a D guideline, or in a future book: "Effective D".

Piotrek
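The Dr. Dobbs article linked above is about building safe systems from unreliable parts via redundancy and independent checking. A minimal, hypothetical sketch of the triplex-voting idea (Python; the function name, numbers, and tolerance are invented for illustration): accept a sensor value only when independent readings agree, otherwise refuse to guess and fall back.

```python
def vote(readings, tolerance):
    """Triplex sensor voting: given three independent readings, return
    the median if at least two agree within `tolerance`; return None
    (meaning: distrust the data and engage the backup path) otherwise."""
    a, b, c = sorted(readings)
    if b - a <= tolerance or c - b <= tolerance:
        return b   # the median is vouched for by at least one neighbor
    return None    # no two sensors agree; the data is untrustworthy
```

With `vote([101.2, 101.5, 250.0], 1.0)` the outlier sensor is outvoted; with three mutually disagreeing readings the function refuses to produce a value, which is the "shut down rather than continue on garbage" behavior argued for elsewhere in this thread.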
October 04, 2014
On Saturday, 4 October 2014 at 08:45:57 UTC, Paolo Invernizzi wrote:

> Basically, the info-path regarding the joystick between two disconnected systems (the co-pilots) is cut in modern plane, so it's more difficult to check if their "output" (push/pull/etc) is coherent at the end.
>
> That, in my opinion, was the biggest problem in that tragedy.
> ---
> /Paolo

Yeah, I wish to know the rationale for asynchronous joysticks.

Piotrek
October 04, 2014
On 10/4/2014 1:40 AM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> On Saturday, 4 October 2014 at 08:25:22 UTC, Walter Bright wrote:
>> On 10/3/2014 9:10 AM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
>>> I think Walter forgets that you ensure integrity of a complex system of servers
>>> by utilizing a rock solid proven transaction database/task-scheduler for
>>> handling all critical information. If that fails, you probably should shut down
>>> everything, roll back to the last backup and reboot.
>>
>> You don't ensure integrity of anything by running software after it has
>> entered an unknown and unanticipated state.
>
> Integrity is ensured

Sorry, Ola, you've never written bug-free software, and nobody else has, either.


> by the transaction engine. The world outside of the
> transaction engine has NO WAY of affecting integrity.

Hardware fails, too.


> SAAB Gripen crashed in 1989 and 1993 due to control software,

Wikipedia sez these were caused by "pilot induced oscillations". http://en.wikipedia.org/wiki/Accidents_and_incidents_involving_the_JAS_39_Gripen#February_1989

In any case, fighter aircraft are not built to airliner safety standards.


> Eurofighter is wire
> controlled, you most likely cannot keep it stable without electronic control. So
> if it fails, you have to use the parachute. Bye, bye $100.000.000.

That doesn't mean there are no backups to the primary flight control computer.


> Anyway, failure should not be due to "asserts", that should be covered by
> program verification and formal proofs.

The assumption that "proof" means the code doesn't have bugs is charming, but still false.


> Failure can still happen if the stabilizing model is inadequate.

It seems we can't escape bugs.


> During peace time fighter jets stay grounded for many days every year due to
> technical issues, maybe as much as 50%. In war time they would be up fighting…
> So yes, you bet your life on it when you defend the air base. Your life is worth
> nothing in certain circumstances. It is contextual.

Again, warplanes are not built to airliner safety standards. They have different priorities.


>> I think you forget my background in designing critical flight controls
>> systems. I know what works, and the proof is the incredible safety of
>> airliners. Yeah, I know that's "appeal to authority", but I've backed it up, too.
>
> That's a marginal use scenario and software for critical control systems should
> not rely on asserts in 2014. Critical software should be formally proven correct.

Airframe companies are going to continue to rely on things that have a long, successful track record. It's pretty hard to argue with success.
October 04, 2014
On Saturday, 4 October 2014 at 08:45:57 UTC, Paolo Invernizzi wrote:
> On Saturday, 4 October 2014 at 08:28:40 UTC, Walter Bright wrote:
>> On 10/3/2014 11:00 AM, Piotrek wrote:

> That, in my opinion, was the biggest problem in that tragedy.

There is also this one:

http://www.flightglobal.com/news/articles/stall-warning-controversy-haunts-af447-inquiry-360336/


"It insists the design of the stall warning "misled" the pilots. "Each time they reacted appropriately the alarm triggered inside the cockpit, as though they were reacting wrongly. Conversely each time the pilots pitched up the aircraft, the alarm shut off, preventing a proper diagnosis of the situation."

SNPL's argument essentially suggests the on-off alarm might have incorrectly been interpreted by the crew as an indication that the A330 was alternating between being stalled and unstalled, when it was actually alternating between being stalled with valid airspeed data and stalled with invalid airspeed data."
October 04, 2014
On Saturday, 4 October 2014 at 09:36:37 UTC, Piotrek wrote:
> On Saturday, 4 October 2014 at 08:45:57 UTC, Paolo Invernizzi wrote:

> Yeah, I wish to know the rationale for asynchronous joysticks.

Not only that, the computer also *averaged* the inputs (I am really at a loss with that, in a plane: if there is a mountain in front of the plane, and one pilot pulls left while the other pilot pulls right, the computer makes the "smart" decision to fly straight ahead through the mountain.)

At least it could have let you know: "the other seat is giving me different inputs! I warn you, I am averaging the inputs!" It did not warn. As the right seat was pulling all the way back, the left seat saw only that the plane did not respond (or responded very little) to his own inputs. Following his inputs, he had no feedback about how the plane was actually reacting. Not knowing the other pilot was pulling back, all he saw was an unresponsive plane.

Anyway, the systems might be blamed. But the real issue there is the attitude of the pilot who never told the others that he had been pulling back for a quarter of an hour.
October 04, 2014
On Saturday, 4 October 2014 at 09:40:26 UTC, Walter Bright wrote:
> On 10/4/2014 1:40 AM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
>> On Saturday, 4 October 2014 at 08:25:22 UTC, Walter Bright wrote:
>>> On 10/3/2014 9:10 AM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:

> In any case, fighter aircraft are not built to airliner safety standards.

That concerns only the degrees of freedom of the pilot's inputs and the airplane's angles, not the standard of the software or other components.

It just trusts the pilot's assessment of the situation. After all, a missile is a far greater danger than a possible stall.