January 18, 2014
On Sat, Jan 18, 2014 at 02:22:22AM +0000, digitalmars-d-bounces@puremagic.com wrote:
> On Saturday, 18 January 2014 at 01:46:55 UTC, Walter Bright wrote:
[...]
> >Consider also the Toyota. My understanding from reading reports (admittedly journalists botch up the facts) is that a single computer controls the brakes, engine, throttle, ignition switch, etc. Oh joy. I wouldn't want to be in that car when it keeps on going despite having self-detected faults.
> 
> So you would rather have the car drive off the road because the anti-skid software abruptly turned itself off during an emergency manoeuvre?
[...]

You missed his point. The complaint is that the car has a *single* software system that handles everything. That's a single point of failure. When that single software system fails, *everything* fails.

A fault-tolerant design demands at least two anti-skid software units, where the redundant unit will kick in when the primary one turns off or stops for whatever reason. So when a software fault occurs in the primary unit, it gets shut off, and the backup unit takes over and keeps the car stable.  You'd only crash in the event that *both* units fail at the same time, which is far less likely than a single unit failing.

This is better than having a single software system that tries to fix itself when it goes wrong, because the fact that something caused part of the code to crash (segfault, or whatever) is a sign that the system is no longer in a state anticipated by the engineers, so there's no guarantee it won't make things worse when it tries to fix itself. For example, it might be scrambled into a state where it keeps the accelerator on with no way to override it.

You need a *decoupled* redundant system to be truly confident that whatever fault took out the first system doesn't also affect the backup / self-repair system -- a guarantee a single software unit can't provide (for example, if the power supply to the unit fails, then whatever self-repair subsystem it has is also non-functional). That way, when the first unit goes wrong, it can simply be shut off safely, preventing it from making the problem worse, while the backup unit takes over and keeps things going.

To use a software example: if you have a single process that tries to fix itself when, say, a null pointer is dereferenced, then there's no guarantee that the error recovery code won't do something stupid, like format your disk. A null pointer showing up in an unexpected place proves the code has logic problems: it isn't in a state the engineers planned for, so who knows what else is wrong with it -- maybe the same bug that produced the null has also replaced a function pointer for displaying graphics with a pointer to the formatDisk function. If instead you have two redundant processes, one doing the real work and the other just sleeping, then when the first process segfaults due to a null pointer, the second one can kick into action -- since it hasn't been doing the same work as the first, it's likely still in a safe, consistent state, and it can safely take over and keep the service running.

This is the premise of high-availability systems: a primary server does the real work, with one or more redundant units standing by. When the primary dies (power loss, CPU overheating, a segfault that leaves it unresponsive, etc.), a watchdog timer triggers a failover to the second unit, minimizing service interruption. The failover detection code can then contact an administrator (email, SMS, etc.) to report that something went wrong with the first unit, and service continues uninterrupted while the first unit is repaired.
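To make the idea concrete, here's a minimal sketch of such a watchdog in D -- the ./worker path and the two-attempt policy are made up for illustration:

	import std.process : spawnProcess, wait;
	import std.stdio : stderr;

	void main() {
		foreach (attempt; 0 .. 2) {       // primary, then one backup
			auto pid = spawnProcess(["./worker"]);
			const status = wait(pid);     // blocks until the worker dies
			if (status == 0)
				return;                   // clean shutdown, nothing to do
			// A non-zero status (failed assert, segfault, etc.) means the
			// unit is in an unknown state: don't repair it, fail over.
			stderr.writeln("worker died with status ", status,
					"; failing over (attempt ", attempt + 1, ")");
		}
		stderr.writeln("all units failed; paging the administrator");
	}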

OTOH, if you have only a single unit and something goes wrong, there's a risk that the recovery code will go wrong too, so the entire unit stops functioning, and service is interrupted until it's repaired.


T

-- 
Study gravitation, it's a field with a lot of potential.
January 18, 2014
On 1/17/2014 6:48 PM, Andrei Alexandrescu wrote:
> On 1/17/14 5:05 PM, Walter Bright wrote:
>> On 1/17/2014 4:17 PM, Andrei Alexandrescu wrote:
>>>> Even if you got rid of all the nulls and instead use the null object
>>>> pattern, you're not going to find it any easier to track it down, and
>>>> you're in even worse shape because now it can fail and you may not even
>>>> detect the failure, or may discover the error much, much further from
>>>> the source of the bug.
>>>
>>> How do you know all that?
>>
>> Because I've tracked down the cause of many, many null pointers, and
>> I've tracked down the cause of many, many other kinds of invalid values
>> in a variable. Null pointers tend to get detected much sooner, hence
>> closer to where they were set.
>
> I'm not sure at all. Don't forget you have worked on a category of programs for
> the last 15 years. Reactive/callback-based code can be quite different, as a
> poster noted. I'll check with the folks to confirm that.

We've talked about the floating point issue, where nan values "poison" the results rather than raise a seg fault. I remembered later that a while back Don tried to change D to throw an exception when a floating point nan was encountered, specifically because this made it easier for him to track down the source of the nan. He said it was pretty hard to backtrack it from the eventual output.
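For reference, a minimal illustration of the poisoning behavior (D deliberately default-initializes floating point variables to NaN):

  import std.math : isNaN;
  import std.stdio : writeln;

  void main() {
    double x;             // default-initialized to double.nan
    double y = 2 * x + 1; // the NaN silently propagates
    writeln(y);           // prints "nan" -- nothing is thrown
    assert(isNaN(y));     // detection happens only where you check for it
  }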

In any case, I'm open to data that supports the notion that delayed detection of an error makes the source of the error easier to identify. That's not been my experience, and my intuition also finds that improbable.


> The larger point I'm trying to make is that your position and mine require
> that we un-bias ourselves as much as we can.

I agree with that. So I await data :-)
January 18, 2014
On 2014-01-18 02:41:58 +0000, Walter Bright <newshound2@digitalmars.com> said:

> We all agree that detecting bugs at compile time is better.

I guess so. But you're still unconvinced it's worth it to eliminate null dereferences?

-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca

January 18, 2014
On Saturday, 18 January 2014 at 02:48:38 UTC, Walter Bright wrote:
> I didn't mention that the dual autopilots also have a comparator on the output, and if they disagree they are both shut down. The deadman is an additional check. The dual system has proven itself, a third is not needed.

The pilot is engaged as the third.

There are situations where you cannot have a third "intelligent" agent take over, so you should have 3 systems and reboot and resync the one that diverges, but this is rather off topic. I don't think D is a language that should be used for those kinds of systems.
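A 2-of-3 arrangement like that boils down to majority voting on the outputs. A toy sketch, with int standing in for whatever values the units actually produce:

  // Toy 2-of-3 majority vote over redundant unit outputs. The unit
  // whose value loses the vote is the one that diverged; a real
  // system would then reboot and resync it.
  int vote(int a, int b, int c) {
    if (a == b || a == c) return a;
    assert(b == c, "all three units disagree");
    return b;
  }

  unittest {
    assert(vote(7, 7, 9) == 7); // unit three diverged
  }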

> Please reread what I wrote. I said it shuts itself off and engages the backup, and if there is no backup, you have failed at designing a safe system.

A car driver that is doing an emergency manoeuvre is not part of a safe system, indeed!

If you want one system to take over for another, you need a safe spot to do it in. Just disappearing instantly isn't optimal, because an instant change in responsiveness is a guarantee of failure.

In fact, being instantly disruptive is usually the wrong thing to do. You should spin down gracefully.

I don't see why you cannot do that with null pointers. You obviously can do it with division-by-zero errors. I think you associate null pointers with memory corruption, which truly is an invalid state for which you might want an immediate shutdown.

> I have experience with this stuff, Ola, from my years at Boeing designing flight critical systems. What I outlined is neither irrational nor emotionally driven, and has the safety record to prove its effectiveness.

In a very narrow field where the pilot is monitoring the system and can take over. The pilot is the ultimate source of failure (in a political sense). So you basically shut down the technology and blame the pilot if you end up with a crash. That only works if the computer has been made to replace a human being.
January 18, 2014
On 1/17/2014 7:05 PM, H. S. Teoh wrote:
> [...]

Thank you, a good explanation.

I don't know how the anti-skid brake system is designed. But on older systems, the brakes were a dual, mostly independent system: one circuit for the front brakes and another for the rear. Dual cylinders, dual reservoirs, etc. The brake pedal operated both cylinders. There was even a hydraulic comparator between the two that would turn on a red [brake] light on the dash if the pressures differed.

Last year, the rear brakes on my old truck developed a leak, and that light coming on was my first indication of trouble. The front brakes still worked fine, topping off the rear reservoir got the rear brakes working again temporarily, and I was able to ease the truck to the repair shop without difficulty.

It's a good example of how to build a safe, fault tolerant system.
January 18, 2014
On 1/17/2014 7:23 PM, Michel Fortin wrote:
> I guess so. But you're still unconvinced it's worth it to eliminate null
> dereferences?

I think it's a more nuanced problem than that. But I agree that compile-time detection of null reference bugs is better than runtime detection of them.

January 18, 2014
On Fri, Jan 17, 2014 at 07:11:16PM -0800, Walter Bright wrote: [...]
> We've talked about the floating point issue, where nan values "poison" the results rather than raise a seg fault. I remembered later that a while back Don tried to change D to throw an exception when a floating point nan was encountered, specifically because this made it easier for him to track down the source of the nan. He said it was pretty hard to backtrack it from the eventual output.
[...]

This is tangential to the discussion, but the nan issue got me thinking about how one might detect the source of a nan. One crude way might be to use a pass-thru function that asserts if its argument is a nan:

	import std.conv : to;
	import std.math : isNaN;

	real assumeNotNan(real v, string file=__FILE__, size_t line=__LINE__) {
		assert(!isNaN(v), "assumeNotNan failed at " ~ file ~ " line " ~
				to!string(line));
		return v;
	}

Then you can stick this inside a floating-point expression in the same way you'd stick assert(ptr !is null) in some problematic code in order to find where the null is coming from, to narrow down the source of the nan:

	real complicatedComputation(real x, real y, real z) {
		// Original code:
		//return sqrt(x) * y - z/(x-y) + suspiciousFunc(x,y,z);

		// Instrumented code (UFCS rawkz!):
		return sqrt(x).assumeNotNan * y
			- (z/(x-y)).assumeNotNan
			+ suspiciousFunc(x,y,z).assumeNotNan;
	}

Makes it *slightly* easier than having to break up a complex expression and introduce temporaries in order to insert asserts between terms. But, granted, not by much. Still, it reduces the pain somewhat.


T

-- 
Political correctness: socially-sanctioned hypocrisy.
January 18, 2014
On 1/17/2014 7:38 PM, Walter Bright wrote:
> But I agree that compile-time detection
> of null reference bugs is better than runtime detection of them.


BTW, the following program:

  class C { int a,b; }

  int test() {
    C c;
    return c.b;
  }

When compiled with -O:

  foo.d(6): Error: null dereference in function _D3foo4testFZi

It isn't much, only working on intra-function analysis and only when the optimizer is used, but it's something. It's been in dmd for a long time.
January 18, 2014
On 2014-01-17 21:42, Michel Fortin wrote:

> Andrei's post was referring at language/compiler changes too: allowing
> init to be defined per-class, with a hint about disabling init. I took
> the hint that modifying the compiler to add support for non-null was in
> the cards and proposed something more useful and less clunky to use.

Hehe, right, I kind of lost track of the discussion.

-- 
/Jacob Carlborg
January 18, 2014
On Saturday, 18 January 2014 at 06:10:20 UTC, Walter Bright wrote:
> On 1/17/2014 7:38 PM, Walter Bright wrote:
>> But I agree that compile-time detection
>> of null reference bugs is better than runtime detection of them.
>
>
> BTW, the following program:
>
>   class C { int a,b; }
>
>   int test() {
>     C c;
>     return c.b;
>   }
>
> When compiled with -O:
>
>   foo.d(6): Error: null dereference in function _D3foo4testFZi
>
> It isn't much, only working on intra-function analysis and only when the optimizer is used, but it's something. It's been in dmd for a long time.

But:
----
class C { int a, b; }

C create() pure nothrow {
	return null;
}

int test() pure nothrow {
	C c = create();
	return c.b;
}

void main() {
	test();
}
----

This prints nothing, even with -O.
Maybe the idea of using C? for nullable references would be worth implementing.
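
Not a language change, but for comparison: a library-level non-null wrapper is expressible today. A rough sketch (NotNull is a hypothetical name, not a Phobos type):
----
import std.exception : enforce;

struct NotNull(T) if (is(T == class)) {
    private T payload;

    this(T value) {
        enforce(value !is null, "null passed to NotNull");
        payload = value;
    }

    @property inout(T) get() inout { return payload; }
    alias get this;  // forward member access to the wrapped reference

    @disable this(); // no default construction: payload would stay null
}

unittest {
    class C { int a, b; }
    auto c = NotNull!C(new C);
    assert(c.b == 0); // member access forwards through get
}
----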