January 18, 2014
On Saturday, 18 January 2014 at 01:46:55 UTC, Walter Bright wrote:
> On 1/17/2014 4:44 PM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
>> Big systems have to live with bugs, it is inevitable that they run with bugs.
>
> It's a dark and stormy night. You're in a 747 on final approach, flying on autopilot.
>
> Scenario 1
> ----------
>
> The autopilot software was designed by someone who thought it should keep operating even if it detects faults in the software. The software runs into a null object when there shouldn't be one, and starts feeding bad values to the controls. The plane flips over and crashes, everybody dies. But hey, the software kept on truckin'!
>
> Scenario 2
> ----------
>
> The autopilot software was designed by Boeing. Actually, there are two autopilots, each independently developed, with different CPUs, different hardware, different algorithms, different languages, etc. One has a null pointer fault. A deadman circuit sees this, and automatically shuts that autopilot down. The other autopilot immediately takes over. The pilot is informed that one of the autopilots failed, and the pilot immediately shuts off the remaining autopilot and lands manually. The passengers all get to go home.
>
>
> Note that in both scenarios there are bugs in the software. Yes, there have been incidents with earlier autopilots where bugs caused the airplane to go inverted.
>
> Consider also the Toyota. My understanding from reading reports (admittedly journalists botch up the facts) is that a single computer controls the brakes, engine, throttle, ignition switch, etc. Oh joy. I wouldn't want to be in that car when it keeps on going despite having self-detected faults. It could, you know, start going at full throttle and ignore all signals to brake or turn off, only stopping when it crashes or runs out of gas.

You are running a huge website. Let's say for instance a social
network with more than a billion users.

Scenario 1
----------

The software was designed by someone who thought it should keep
operating even if it detects faults in the software. A bug arises
in some frontend and it starts corrupting data. Some monitoring
detects the issue, the code gets fixed, and the corrupted data
are recovered from backup. Users that ended up on that cluster
saw their accounts not working for a day, but everything is back
to normal the day after.

Scenario 2
----------

The software was designed by an ex-employee of Boeing. He knows
that he should make his software crash hard and soon. As soon as
any error is detected on a cluster, the cluster goes down.
Fortunately, no data is corrupted, but the load on that cluster
must now be handled by the other clusters. Soon enough, these
clusters overload and the whole website goes down. Fortunately,
no data were corrupted in the process, so nothing needs to be
restored from backup.



Different software, different needs. Ultimately, that distinction
is irrelevant anyway. In the case of null dereferences, the whole
possibility of these scenarios can be avoided by proper language
design.
January 18, 2014
On 1/17/2014 6:18 PM, Michel Fortin wrote:
> Implemented well, it makes it a compilation error. It works like this:
>
> - can't pass a likely-null value to a function that wants a not-null argument.
> - can't assign a likely-null value to a not-null variable.
> - can't dereference a likely-null value.
>
> You have to check for null first, and the check changes the value from
> likely-null to not-null in the branch taken when the pointer is valid.
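
For illustration, that idea can be approximated in today's D (which has no built-in null tracking) with a library wrapper; the NotNull struct and helper functions below are hypothetical names, not an existing Phobos type:

struct NotNull(T)
{
    private T* ptr;

    // The only way in: the caller must have checked the pointer already.
    this(T* p)
    {
        assert(p !is null, "NotNull constructed from a null pointer");
        ptr = p;
    }

    ref T get() { return *ptr; }      // no null check needed here
}

void use(NotNull!int p)               // "wants a not-null argument"
{
    p.get() = 42;                     // always safe to dereference
}

void caller(int* maybeNull)
{
    // use(maybeNull);                // does not compile: int* is not NotNull!int
    if (maybeNull !is null)
        use(NotNull!int(maybeNull));  // the explicit check is the conversion point
}

A language with real flow typing performs that narrowing in the compiler; the wrapper only simulates it at the API boundary.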


I was talking about runtime errors, in that finding the cause of a runtime null failure is not harder than finding the cause of any other runtime invalid value.

We all agree that detecting bugs at compile time is better.

January 18, 2014
On Saturday, 18 January 2014 at 01:58:13 UTC, Walter Bright wrote:
> I strongly, strongly disagree with the notion that critical systems should soldier on once they have entered an invalid state. Such is absolutely the wrong way to go about making a fault tolerant system. For hard evidence, I submit the safety record of airliners.

But then you have to define "invalid state", not in terms of language constructs, but in terms of the model the implementation is based on.

If your thread only uses thread-local memory and safe language features, it should be sufficient to spin down that thread and restart that sub-service. That's what fault tolerant operating systems do.

In a functional language it is easy: you may keep computing with "bottom" until it disappears. "bottom" OR true => true, "bottom" AND false => false. You might even succeed by having lazy evaluation over functions that would never halt.
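
As a loose sketch of that in D, a lazy parameter lets a non-terminating computation go unevaluated; note this only helps when the "bottom" sits in a position the evaluation never reaches:

// A computation that never produces a value; stands in for "bottom".
bool bottom()
{
    while (true) {}
}

// With a lazy parameter the argument is only evaluated if it is needed.
bool lazyOr(bool a, lazy bool b)
{
    return a ? true : b;
}

void main()
{
    assert(lazyOr(true, bottom()));   // true; bottom() is never forced
    // lazyOr(bottom(), true) would still hang: the trick only works when
    // evaluation never reaches the diverging argument.
}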

If you KNOW that a particular type is not going to have any adverse effect when taking a default object rather than null (you could probably even prove it in some cases), it does not produce an "invalid state". It might be an exceptional state, but that does not imply that it is invalid.

Some systems are in a pragmatic fuzzy state. Take fuzzy logic as an example (http://www.dmitry-kazakov.de/ada/fuzzy.htm), where you operate with two fluid dimensions, necessity and possibility, representing a space with the extremes false, true, contradiction and uncertain. There is no "invalid state" if the system is designed to be in a state of "best effort".
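
A very loose sketch of such a "best effort" value in D (a simplification for illustration, not the semantics of the Ada library linked above):

import std.algorithm : max, min;

struct Fuzzy
{
    double necessity;    // degree to which the statement must hold
    double possibility;  // degree to which the statement may hold

    Fuzzy opBinary(string op : "&")(Fuzzy rhs) const
    {
        return Fuzzy(min(necessity, rhs.necessity),
                     min(possibility, rhs.possibility));
    }

    Fuzzy opBinary(string op : "|")(Fuzzy rhs) const
    {
        return Fuzzy(max(necessity, rhs.necessity),
                     max(possibility, rhs.possibility));
    }
}

unittest
{
    // (0,0) ~ false, (1,1) ~ true, (0,1) ~ uncertain, (1,0) ~ contradiction.
    auto uncertain = Fuzzy(0.0, 1.0);        // "don't know" is still a usable state
    auto mostlyYes = Fuzzy(0.7, 0.9);
    auto combined  = uncertain & mostlyYes;  // best effort, never "invalid"
}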
January 18, 2014
On 1/17/2014 6:22 PM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> On Saturday, 18 January 2014 at 01:46:55 UTC, Walter Bright wrote:
>> The autopilot software was designed by someone who thought it should keep
>> operating even if it detects faults in the software.
>
> I would not write autopilot or life-support software in D. So that is kind of
> out-of-scope for the language. But:
>
> Keep the system simple, select a high level language and verify correctness by
> an automated proof system.
>
> Use 3 independently implemented systems and shut down the one that produces
> deviant values. That covers more ground than the unlikely null-pointers in
> critical systems. No need to self-detect anything.

I didn't mention that the dual autopilots also have a comparator on the output, and if they disagree they are both shut down. The deadman is an additional check. The dual system has proven itself; a third is not needed.


>> Consider also the Toyota. My understanding from reading reports (admittedly
>> journalists botch up the facts) is that a single computer controls the brakes,
>> engine, throttle, ignition switch, etc. Oh joy. I wouldn't want to be in that
>> car when it keeps on going despite having self-detected faults.
>
> So you would rather have the car drive off the road because the anti-skid
> software abruptly turned itself off during an emergency manoeuvre?

Please reread what I wrote. I said it shuts itself off and engages the backup, and if there is no backup, you have failed at designing a safe system.


> But would you stay in a car where the driver talks in a cell-phone while
> driving, or would you tell him to stop? Probably much more dangerous if you
> measured correlation between accidents and system features. So you demand
> perfection from a computer, but not from a human being that is exhibiting
> risk-like behaviour. That's an emotional assessment.
>
> The rational action would be to improve the overall safety of the system, rather
> than optimizing a single part. So spend the money on installing a cell-phone
> jammer and an accelerator limiter rather than investing in more computers.
> Clearly, the computer is not the weakest link, the driver is. He might not
> agree, but he is and he should be forced to exhibit low risk behaviour. Direct
> effort to where it has most effect.
>
> (From a system analytical point of view. It might not be a good sales tactic,
> because car buyers aren't that rational.)

I have experience with this stuff, Ola, from my years at Boeing designing flight critical systems. What I outlined is neither irrational nor emotionally driven, and has the safety record to prove its effectiveness.

I also ask that you please reread what I wrote - I explicitly do not demand perfection from a computer.

January 18, 2014
On 1/17/14 5:05 PM, Walter Bright wrote:
> On 1/17/2014 4:17 PM, Andrei Alexandrescu wrote:
>>> Even if you got rid of all the nulls and instead use the null object
>>> pattern, you're not going to find it any easier to track it down, and
>>> you're in even worse shape because now it can fail and you may not even
>>> detect the failure, or may discover the error much, much further from
>>> the source of the bug.
>>
>> How do you know all that?
>
> Because I've tracked down the cause of many, many null pointers, and
> I've tracked down the cause of many, many other kinds of invalid values
> in a variable. Null pointers tend to get detected much sooner, hence
> closer to where they were set.
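
To make the contrast being debated concrete, a deliberately contrived D sketch (the Logger/NullLogger names are made up for illustration):

import std.stdio : writeln;

class Logger
{
    void log(string msg) { writeln(msg); }
}

// The "null object" variant: a do-nothing stand-in used instead of null.
class NullLogger : Logger
{
    override void log(string msg) { /* silently drops the message */ }
}

void withNullObject()
{
    Logger l = new NullLogger();   // a bug left us without a real logger
    l.log("disk almost full");     // no crash, but the report is lost;
                                   // the fault may surface much later
}

void withNullPointer()
{
    Logger l = null;               // the same bug, as a plain null
    l.log("disk almost full");     // blows up right here, at first use
}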

I'm not sure at all. Don't forget you have worked on one particular category of programs for the last 15 years. Reactive/callback-based code can be quite different, as a poster noted. I'll check with the folks to confirm that.

The larger point I'm trying to make is that your position and mine require that we un-bias ourselves as much as we can.


Andrei


January 18, 2014
On 1/17/2014 6:12 PM, bearophile wrote:
> Walter Bright:
>
>> I strongly, strongly disagree with the notion that critical systems should
>> soldier on once they have entered an invalid state.
>
> The idea is to design the language and its type system (and static analysis
> tools) to be able to reduce the frequency (or probability) of such invalid
> states, because many of them are removed while you write the program.

Once again,

"I don't disagree that some form of static analysis that can detect null dereferencing at compile time (i.e. static analysis) is a good thing. I concur that detecting bugs at compile time is better."

January 18, 2014
On 1/17/14 5:58 PM, Walter Bright wrote:
> I strongly, strongly disagree with the notion that critical systems
> should soldier on once they have entered an invalid state. Such is
> absolutely the wrong way to go about making a fault tolerant system. For
> hard evidence, I submit the safety record of airliners.

You're arguing against a strawman, and it is honestly not pleasant that you reduced one argument to the other.

Andrei


January 18, 2014
On 1/17/2014 6:42 PM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> But then you have to define "invalid state",

An unexpected value is an invalid state.

January 18, 2014
On 1/17/2014 6:56 PM, Andrei Alexandrescu wrote:
> On 1/17/14 5:58 PM, Walter Bright wrote:
>> I strongly, strongly disagree with the notion that critical systems
>> should soldier on once they have entered an invalid state. Such is
>> absolutely the wrong way to go about making a fault tolerant system. For
>> hard evidence, I submit the safety record of airliners.
>
> You're arguing against a strawman, and it is honestly not pleasant that you
> reduced one argument to the other.


I believe I have correctly separated the various null issues into 4 separate ones, and have tried very hard not to conflate them.

January 18, 2014
On 1/17/2014 6:40 PM, deadalnix wrote:
> Different software, different needs. Ultimately, that distinction
> is irrelevant anyway. In the case of null dereferences, the whole
> possibility of these scenarios can be avoided by proper language
> design.

I've already agreed that detecting bugs at compile time is better.