November 16, 2009
On Mon, 16 Nov 2009 12:48:51 -0800, Walter Bright wrote:

> bearophile wrote:
>> Walter Bright:
>>> I just wished to point out that it was not a *safety* issue.<
>> A safe system is not a program that switches itself off as soon as there's a small problem.
> 
> Computers cannot know whether a problem is "small" or not.

But designers who make the system can.


> Pretending a program hasn't failed when it has, and just "soldiering on", is completely unacceptable behavior in a system that must be reliable.

...

> If you've got a system that relies on the software continuing to function after an unexpected null seg fault, you have a VERY BADLY DESIGNED and COMPLETELY UNSAFE system. I really cannot emphasize this enough.

What is the 'scope' of "system"? Is it that if any component in a system fails, then all other components are also in an unknown, and therefore potentially unsafe, state?

For example, can one describe this scenario below as a single system or multiple systems...

"A software failure causes the cabin lights to be permanently turned on, so should the 'system' also assume that the toilets must no longer be flushed?"

Is the "system" the entire aircraft, i.e. all its components, or is there a set of systems involved here?

In the "set of systems" concept, is it possible that a failure of one system can have no impact on another system in the set, or must it be assumed that every system is reliant on all other systems in the same set?

-- 
Derek Parnell
Melbourne, Australia
skype: derek.j.parnell
November 16, 2009
Yigal Chripun wrote:
> Andrei Alexandrescu wrote:
>> Denis Koroskin wrote:
>>> On Mon, 16 Nov 2009 19:27:41 +0300, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
>>>
>>>> bearophile wrote:
>>>>> Walter Bright:
>>>>>
>>>>>> A person using alloca is expecting stack allocation, and that it goes away after the function exits. Switching arbitrarily to the gc will not be detected and may hide a programming error (asking for a gigantic piece of memory is not anticipated for alloca, and could be caused by an overflow or logic error in calculating its size).
>>>>>  There's another solution, that I'd like to see more often used in Phobos: you can add another function to Phobos, let's call it salloca (safe alloca) that does what Denis Koroskin asks for (it's a very simple function).
>>>>
>>>> Can't be written. Try it.
>>>>
>>>> Andrei
>>>
>>> It's tricky. It can't be written *without compiler support*, because alloca is treated specially by the compiler (the call to it is always inlined). Otherwise it could be written.
>>>
>>> I was thinking about proposing either an inline keyword in the language (one that would enforce function inlining, rather than suggesting it to the compiler), or always inlining all functions that make use of alloca. Without either of them, it is impossible to create wrappers around alloca (for example, one that creates arrays on the stack type-safely and without casts):
>>>
>>> T[] array_alloca(T)(size_t size) { ... }
>>>
>>> or one that would return GC-allocated memory when stack allocation fails:
>>>
>>> void* salloca(size_t size) {
>>>     void* ptr = alloca(size);
>>>     if (ptr is null) return (new void[size]).ptr;
>>>
>>>     return ptr;
>>> }
>>
>> The problem of salloca is that alloca's memory gets released when salloca returns.
>>
>> Andrei
> 
> template salloca(alias ptr, alias size) { // horrible name, btw
>   ptr = alloca(size);
>   if (ptr is null) ptr = (new void[size]).ptr;
> }
> 
> // use:
> void foo() {
>   int size = 50;
>   void* ptr;
>   mixin salloca!(ptr, size);
>   //...
> }
> 
> wouldn't that work?

mixin? Interesting. Probably it works.

Andrei
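
For reference, here is a sketch of what the mixin idea could look like in practice. It is not code from the thread: since a D template body may only contain declarations, not statements, the sketch adapts Yigal's alias-parameter template into a string mixin; the import location and the helper name are assumptions.

import core.stdc.stdlib : alloca;   // assumed location of alloca in druntime

// Builds code that is pasted into the *caller*, so alloca runs in the caller's
// own stack frame and the memory stays valid for the caller's lifetime.
template salloca(string ptrName, string sizeName)
{
    enum string salloca =
        ptrName ~ " = alloca(" ~ sizeName ~ ");"
      ~ "if (" ~ ptrName ~ " is null) "
      ~ ptrName ~ " = (new void[" ~ sizeName ~ "]).ptr;";
}

// use:
void foo()
{
    size_t size = 50;
    void* ptr;
    mixin(salloca!("ptr", "size"));  // stack allocation, GC fallback on failure
    // ... use ptr[0 .. size] ...
}

Because the mixin expands textually at the call site, the alloca call ends up in foo's own frame, which is exactly what an ordinary wrapper function cannot achieve.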
November 16, 2009
On Mon, Nov 16, 2009 at 9:48 PM, Walter Bright <newshound1@digitalmars.com> wrote:
> bearophile wrote:
>>
>> Walter Bright:
>>>
>>> I just wished to point out that it was not a *safety* issue.<
>>
>> A safe system is not a program that switches itself off as soon as there's a small problem.
>
> Computers cannot know whether a problem is "small" or not.
>
>> One Ariane rocket self-destructed (and destroyed an extremely important scientific satellite it was carrying, whose mission I still miss) because of this silly behaviour combined with the inflexibility of the Ada language.
>>
>> A reliable system is a system that keeps working correctly despite everything. If that is not possible, in real life you usually want "good enough" behaviour. For example, take a CT medical scanner in Africa: if the machine switches itself off at the smallest problem, they force it to start again, because they don't have money for a 100% perfect fix. So for them a machine that degrades slowly and gracefully is better. That's a reliable system, something more like your liver, which doesn't totally switch off as soon as it has a small problem (killing you quickly).
>
> This is how you make reliable systems:
>

You sure got all the answers...
November 16, 2009
I am sorry for having mixed the global reliability of a system into the discussion about non-nullable class references. It's my fault. Those are two very different topics, as Walter says. Here I give a few comments, but please try to keep the two things separated. If that's not possible, feel free to ignore this post...

Adam D. Ruppe:

>Would you have preferred it to just randomly do its own thing and potentially end up landing on people?<

Your mind is now working in terms of 0/1, but that's not how most things in the universe work. A guidance system designed with different principles might have guided it safely, with a small error in the trajectory that could have been corrected later in orbit.


>Even expensive, important pieces of equipment can always be replaced.<

The scientific equipment it was carrying was lost; no one has replaced it so far. It was very complex.


>What would you have it do? Carry on in the error state, doing Lord knows what? That's clearly unsafe.<

My idea was to have a type system that avoids such errors :-)


>Hospitals know their medical machines might screw up, so they keep a nurse on duty at all times who can handle the situation - restart the failed machine, or bring in a replacement before it kills someone.<

This is not how things are.


>I wouldn't say safer, though I will concede that it is easier to debug.<

A program that doesn't break in the middle of its run is safer if you have to use it for something more important than a video game :-)

-------------------

Walter Bright:

>Computers cannot know whether a problem is "small" or not.<

The system designer can explain to the computer what "small" means in the specific situation.


>This is how you make reliable systems:<

I'm a biologist, and I like biology-inspired designs. There is not just one way to design reliable systems that must work in the real world; biology shows several others. Today people are starting to copy nature in this regard too, for example by designing swarms of very tiny robots that can still perform a task even if some of the robots get damaged or stuck.


>Pretending a program hasn't failed when it has, and just "soldiering on", is completely unacceptable behavior in a system that must be reliable.<

Well, it's often a matter of degree. On Windows I have amateur-level image editing programs that sometimes hit a bug, and one of their windows "dies" or gets stuck. I can usually keep working a little in that program, save the work, and then restart it.


>The Ariane 5 had a backup system which was engaged, but the backup system had the same software in it, so failed in the same way. That is not how you make reliable systems.<

I have read enough about that case, and I agree that it was badly designed. But in our universe there is more than one way to design a reliable system.


>You're using two different definitions of the word "safe". Program safety is about not corrupting memory. System safety (i.e. reliability) is a completely different thing.<

I'd like my programs to be safer in the system-safety sense.


>If you've got a system that relies on the software continuing to function after an unexpected null seg fault, you have a VERY BADLY DESIGNED and COMPLETELY UNSAFE system. I really cannot emphasize this enough.<

My idea was to introduce ways to avoid nulls in the first place.
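
To make that concrete, here is a sketch (mine, not an existing Phobos type; all names are invented) of a library-level non-nullable reference: the null check happens once, at construction, so code that receives the wrapper never needs to check again.

// A reference wrapper that cannot hold null: checked once at construction.
struct NotNull(T) if (is(T == class))
{
    private T _ref;

    this(T r)
    {
        assert(r !is null, "NotNull constructed from null");
        _ref = r;
    }

    @disable this();                     // no default-constructed (null) instances

    @property inout(T) get() inout { return _ref; }
    alias get this;                      // implicitly usable wherever a T is expected
}

NotNull!T notNull(T)(T r) { return NotNull!T(r); }

// use:
class Widget { void draw() {} }

void render(NotNull!Widget w)
{
    w.draw();                            // no null check needed here
}

// render(notNull(new Widget));          // ok
// render(notNull(cast(Widget) null));   // fails fast, at the boundary

(A runtime check at the boundary is of course weaker than a real non-nullable type in the language, where the compiler rejects the null at compile time; and NotNull!T.init can still smuggle in a null.)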


>by aviation companies who take this issue extremely seriously.<

There are wonderful birds (albatrosses) that keep flying across thousands of kilometres (and singing and courting each other and laying large eggs) after 50+ years:
http://news.nationalgeographic.com/news/2003/04/0417_030417_oldestbird.html
They are biological systems far more complex than a modern aeroplane, made of subsystems (like the cells in their brains) that are not very reliable. They use a different design strategy to achieve that reliability.

Sorry for mixing two such unrelated topics; my second stupid mistake of the day.

Bye,
bearophile
November 16, 2009
Tomas Lindquist Olsen wrote:
> You sure got all the answers...

I had it beaten into my head by people who had 50 years of experience designing reliable airliners - what worked and what didn't work.

The consensus on what constitutes best practices for software reliability is steadily improving, but I still think the airliner companies are more advanced in that regard.

Even your car has a dual path design (for the brakes)!
November 16, 2009
bearophile wrote:
> They use a different design strategy to
> be so reliable.

My understanding (I am no biologist) is that biology achieves reliability by using redundancy, not by requiring individual components to be perfect.

The redundancy goes down to the DNA level, even.

Another way is that it uses quantity rather than quality. Many organisms produce millions of offspring in the hope that one or two survive.

Software, on the other hand, is notorious for one bit being wrong out of a billion rendering it completely useless. A strategy of independent redundancy is appropriate here.

For example, how would you write a program that would be expected to survive having a random bit in it flipped at random intervals?
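
One classic software answer to that question, for what it's worth, is the same trick hardware uses: modular redundancy with majority voting. A small sketch in D (names invented), keeping three copies of each word and out-voting a corrupted one on every read:

import std.stdio;

// Three copies of each word; reads take a bitwise majority vote, so any single
// flipped bit lives in only one copy and is out-voted and repaired.
struct TripleWord
{
    uint a, b, c;

    void write(uint v) { a = b = c = v; }

    uint read()
    {
        immutable majority = (a & b) | (a & c) | (b & c);
        a = b = c = majority;            // "scrub": restore the damaged copy
        return majority;
    }
}

void main()
{
    TripleWord w;
    w.write(0xDEADBEEF);
    w.b ^= 1 << 7;                       // simulate a random bit flip in one copy
    writefln("%08X", w.read());          // still prints DEADBEEF
}

This only protects the data the program chooses to protect; a flip in the code itself or in an unprotected register is still fatal, which is why flight systems add independent hardware channels on top.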
November 16, 2009
On Mon, 16 Nov 2009 20:39:57 +0300, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> Denis Koroskin wrote:
>> On Mon, 16 Nov 2009 19:27:41 +0300, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
>>
>>> bearophile wrote:
>>>> Walter Bright:
>>>>
>>>>> A person using alloca is expecting stack allocation, and that it goes away after the function exits. Switching arbitrarily to the gc will not be detected and may hide a programming error (asking for a gigantic piece of memory is not anticipated for alloca, and could be caused by an overflow or logic error in calculating its size).
>>>>  There's another solution, that I'd like to see more often used in Phobos: you can add another function to Phobos, let's call it salloca (safe alloca) that does what Denis Koroskin asks for (it's a very simple function).
>>>
>>> Can't be written. Try it.
>>>
>>> Andrei
>>  It's tricky. It can't be written *without compiler support*, because alloca is treated specially by the compiler (the call to it is always inlined). Otherwise it could be written.
>>  I was thinking about proposing either an inline keyword in the language (one that would enforce function inlining, rather than suggesting it to the compiler), or always inlining all functions that make use of alloca. Without either of them, it is impossible to create wrappers around alloca (for example, one that creates arrays on the stack type-safely and without casts):
>>  T[] array_alloca(T)(size_t size) { ... }
>>  or one that would return GC-allocated memory when stack allocation fails:
>>  void* salloca(size_t size) {
>>     void* ptr = alloca(size);
>>     if (ptr is null) return (new void[size]).ptr;
>>      return ptr;
>> }
>
> The problem of salloca is that alloca's memory gets released when salloca returns.
>
> Andrei

You missed the point of my post. I know it can't be implemented, and I said exactly that. I also mentioned two possible solutions to the issue.
November 16, 2009
Denis Koroskin wrote:
> On Mon, 16 Nov 2009 20:39:57 +0300, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> 
>> Denis Koroskin wrote:
>>> On Mon, 16 Nov 2009 19:27:41 +0300, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
>>>
>>>> bearophile wrote:
>>>>> Walter Bright:
>>>>>
>>>>>> A person using alloca is expecting stack allocation, and that it goes away after the function exits. Switching arbitrarily to the gc will not be detected and may hide a programming error (asking for a gigantic piece of memory is not anticipated for alloca, and could be caused by an overflow or logic error in calculating its size).
>>>>>  There's another solution, that I'd like to see more often used in Phobos: you can add another function to Phobos, let's call it salloca (safe alloca) that does what Denis Koroskin asks for (it's a very simple function).
>>>>
>>>> Can't be written. Try it.
>>>>
>>>> Andrei
>>>  It's tricky. It can't be written *without compiler support*, because alloca is treated specially by the compiler (the call to it is always inlined). Otherwise it could be written.
>>>  I was thinking about proposing either an inline keyword in the language (one that would enforce function inlining, rather than suggesting it to the compiler), or always inlining all functions that make use of alloca. Without either of them, it is impossible to create wrappers around alloca (for example, one that creates arrays on the stack type-safely and without casts):
>>>  T[] array_alloca(T)(size_t size) { ... }
>>>  or one that would return GC-allocated memory when stack allocation fails:
>>>  void* salloca(size_t size) {
>>>     void* ptr = alloca(size);
>>>     if (ptr is null) return (new void[size]).ptr;
>>>      return ptr;
>>> }
>>
>> The problem of salloca is that alloca's memory gets released when salloca returns.
>>
>> Andrei
> 
> You missed the point of my post. I know it can't be implemented, and I said exactly that. I also mentioned two possible solutions to the issue.

I see now. Apologies.

Andrei
November 17, 2009
Walter Bright:

>is that biology achieves reliability by using redundancy, not by requiring individual components to be perfect. The redundancy goes down to the DNA level, even. Another way is it uses quantity, rather than quality. Many organisms produce millions of offspring in the hope that one or two survive.<

Quantity is a form of redundancy.

In biology reliability is achieved in many ways.
Your genetic code is degenerate, so many single-letter mutations produce no change in the encoded protein; such mutations are neutral.

Bones and tendons use redundancy too, to be resilient, but they aren't flat organizations; they are hierarchies of structures inside larger structures at all levels, from the molecular level up. This allows a different distribution of failures, like that of earthquakes (many small ones, few large ones, following a power law).

A protein is a chain of small parts, and its function is partially determined by its shape. This shape is mostly self-assembled, but once in a while other proteins (chaperones) help a protein fold correctly, especially when the temperature is too high.

Most biological systems are able to self-repair; usually that means cells die and are replaced by duplication, and sometimes harder structures like bone are rebuilt. This happens at the sub-cellular level too: cells have many systems to repair and clean themselves, constantly destroying and rebuilding their parts at all levels. You can see it among neurons as well: your memory is encoded (among other things) in the connections between neurons, but neurons die. New connections among even very old neurons can be created to replace the missing wiring, keeping the distributed memory functional even 100 years after the events, in very old people.

Genetic information is encoded in multiple copies, and in bacteria it is sometimes distributed across the population. Reliability is also needed when genetic information is copied or read; it comes from a trade-off between the energy used for the copy, how reliable you want the read/copy to be, and how fast you want it (ribosomes and DNA polymerase are actually close to the theoretical optimum of this three-variable optimization; you can't do much better even in theory).

Control systems, like those in the brain, seek reliability in several different ways. One of them is encoding vectors across a small population of neurons: the final direction your finger points in is found by averaging those vectors. Parkinson's disease can kill 90% of the cells in certain areas, yet the patient can still move a hand to grab a glass of water (a little shakily, because the average is computed over far fewer vectors).

There is enough stuff to write more than one science popularization article :-)


>how would you write a program that would be expected to survive having a random bit in it flipped at random intervals?<

That's a nice question. The program and all its data are stored somewhere, usually in RAM, caches, and registers. How can you use a program if bits in your RAM can flip at random with a certain (low) probability? There are error-correcting RAM modules, based on redundancy codes such as Reed-Solomon; ECC memory is common enough today. Similar error-correction schemes can be added to the inner parts of the CPU too (and someone has probably done it, for example in CPUs that must work in space on satellites, where solar radiation is not shielded by the Earth's atmosphere).
I am sure related schemes can be used to check whether a CPU instruction has done its job or whether something has gone wrong during its execution. You can fix such things in hardware too.
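
To make the error-correction idea concrete, here is a tiny sketch (mine, not taken from any real ECC implementation) of the simplest such code, Hamming(7,4): 4 data bits are stored in 7 bits, and any single flipped bit can be located and corrected. Real ECC DRAM and Reed-Solomon use stronger codes built on the same principle; the function names are invented.

// Hamming(7,4): 4 data bits -> 7-bit codeword (parity bits at positions 1, 2, 4).
ubyte hammingEncode(ubyte data)          // data in the low 4 bits
{
    uint d1 = (data >> 0) & 1, d2 = (data >> 1) & 1;
    uint d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
    uint p1 = d1 ^ d2 ^ d4;              // covers positions 1,3,5,7
    uint p2 = d1 ^ d3 ^ d4;              // covers positions 2,3,6,7
    uint p3 = d2 ^ d3 ^ d4;              // covers positions 4,5,6,7
    // bit i-1 of the result holds position i: p1 p2 d1 p3 d2 d3 d4
    return cast(ubyte)(p1 | (p2 << 1) | (d1 << 2) | (p3 << 3)
                          | (d2 << 4) | (d3 << 5) | (d4 << 6));
}

ubyte hammingDecode(ubyte code)          // corrects any single flipped bit
{
    uint bit(uint pos) { return (code >> (pos - 1)) & 1; }
    uint s1 = bit(1) ^ bit(3) ^ bit(5) ^ bit(7);
    uint s2 = bit(2) ^ bit(3) ^ bit(6) ^ bit(7);
    uint s3 = bit(4) ^ bit(5) ^ bit(6) ^ bit(7);
    uint syndrome = s1 | (s2 << 1) | (s3 << 2);   // position of the bad bit, 0 if none
    if (syndrome != 0)
        code ^= cast(ubyte)(1 << (syndrome - 1)); // flip it back
    return cast(ubyte)(((code >> 2) & 1) | (((code >> 4) & 1) << 1)
                     | (((code >> 5) & 1) << 2) | (((code >> 6) & 1) << 3));
}

unittest
{
    foreach (ubyte d; 0 .. 16)
    {
        immutable c = hammingEncode(d);
        assert(hammingDecode(c) == d);                          // clean codeword
        foreach (shift; 0 .. 7)                                 // every single-bit flip
            assert(hammingDecode(cast(ubyte)(c ^ (1 << shift))) == d);
    }
}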

But there are other solutions besides fixing every error. Chips keep getting smaller and the power per transistor keeps going down, so eventually noise and errors will start to grow. Recently some people have realized that on the screen of a mobile phone you can tolerate a few wrongly decompressed pixels in a video if that lets the chip use only 1/10 of the normal energy. Sometimes you accept a few wrong pixels here and there if they let you keep watching videos twice as long. In the future CPUs will probably become less reliable, so software (mostly the operating system, I think) will need ways to correct those errors. This will keep programs globally reliable even with fast, low-powered CPUs. Molecular-scale adders will need software to fix their errors. Eventually this is going to look more and more like cellular biochemistry, with all its active redundancy :-)

There's no end to the amount of things you can say on this topic.

Bye,
bearophile
November 17, 2009
On Mon, 16 Nov 2009 12:48:51 -0800, Walter Bright <newshound1@digitalmars.com> wrote:

>
>If you've got a system that relies on the software continuing to function after an unexpected null seg fault, you have a VERY BADLY DESIGNED and COMPLETELY UNSAFE system. I really cannot emphasize this enough.

I have an example of such software: http://www.steinberg.net/en/products/audiopostproduction_product/nuendo4.html

It loads third-party plugins into the host process's address space, and consequently it may fail at any moment. The software's design is not the best ever, but it gives the user a last chance to save his work in case of a fatal error. This feature has saved my skin a couple of times.

>
>P.S. I worked for Boeing for years on flight critical systems. Normally I eschew credentialism, but I feel very strongly about this issue and wish to point out that my knowledge on this is based on decades of real world experience by aviation companies who take this issue extremely seriously.

Then, instead of sticking with Windows and the like, you may want to think about porting dmd to a more serious environment specifically designed for developing such systems. What about a real-time microkernel OS like this one: http://www.qnx.com/products/neutrino_rtos/ ?