Developing Mars lander software (page 2)

On Wed, 2014-02-19 at 02:30 -0800, Walter Bright wrote:
[…]
> PC compilers needed to support multiple pointer types. The 11 did not have segmented addresses, so this was irrelevant for the 11. Trying to retrofit unix compilers with near/far/huge turned out to not be so practical, at least I don't know anyone who tried it.

Indeed. I much preferred the PDP-11, UNIX v6 approach to segmentation over the 8086, CP/M, DOS approach. Though it wasn't fun when there was a segmentation violation of course, I hate violation. The only thing worse was bus error.

--
Russel.
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

February 19, 2014

Re: Developing Mars lander software

Posted by Xinok
in reply to Tolga Cakiroglu

On Wednesday, 19 February 2014 at 05:53:55 UTC, Tolga Cakiroglu wrote:
> On Wednesday, 19 February 2014 at 01:09:43 UTC, Xinok wrote:
>> On Wednesday, 19 February 2014 at 00:16:03 UTC, Tolga Cakiroglu wrote:
>>>
>>> TL;DR on the link, though: how are they detecting that a CPU has failed? Some information must be passed outside of the CPU to do this. The only solution that comes to my mind is that the main CPU changes a variable in external memory at every step, and the backup CPU checks it continuously to catch a failure immediately. But this would already require about 50% of the CPU's power.
>>>
>>> While thinking about this kind of backup system, it is really great to read that some people are actually building them.
>>>
>>
>> I'm assuming this has something to do with it:
>> https://en.wikipedia.org/wiki/Heartbeat_%28computing%29
>>
>> In clustered servers, the active node sends a continuous signal indicating it's still alive. This signal is referred to as a heartbeat. There's a standby node waiting to take over should it stop receiving this signal.
>
> I think only knowing that it has failed is not enough. Because the process is a landing, the other CPU should know where the process left off. With that heartbeat signal, the only option is that all sensor information must be sent to both CPUs continuously, and the sensor values should be enough to determine the next step to take. Then I think it can continue the process flawlessly.

I don't have experience with, or much knowledge of, these kinds of systems; I'm merely aware of the concepts. The process of one system taking over when another system fails is called failover [1]. Depending on the requirements, the system could be designed so the standby node continues from the last successful state of the failed node [2].

To quote the page on Wikipedia [2], "Most importantly, the application must store as much of its state on non-volatile shared storage as possible. Equally important is the ability to restart on another node at the last state before failure using the saved state from the shared storage."
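To make the quoted requirement a bit more concrete, here is a minimal sketch in Python of "restart on another node at the last state before failure". Everything in it is an assumption for illustration: the step names, the JSON file standing in for non-volatile shared storage, and the `run` helper are hypothetical and not taken from any real lander or cluster software.

```python
import json
import os
import tempfile

# Hypothetical step names for illustration only; not from any real system.
STEPS = ["deploy_parachute", "release_heat_shield", "fire_retro_rockets", "touchdown"]


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint,
    so a crash never leaves a half-written file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if nothing was saved yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0, "log": []}


def run(path, node, fail_before=None):
    """Run the remaining steps as `node`, saving state after each one.
    `fail_before` simulates a crash just before the step with that index."""
    state = load_checkpoint(path)
    while state["next_step"] < len(STEPS):
        i = state["next_step"]
        if fail_before == i:
            raise RuntimeError(f"{node} failed before {STEPS[i]}")
        state["log"].append(f"{node}:{STEPS[i]}")
        state["next_step"] = i + 1
        save_checkpoint(path, state)
    return state
```

If a "primary" node calls `run(path, "primary", fail_before=2)` and crashes, a "standby" node calling `run(path, "standby")` on the same file picks up at the third step instead of starting over. A real system would also have to deal with a step that was half-executed when the failure happened, which this sketch ignores.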

I would consider it likely that both systems run in conjunction, but the primary system is in control and the backup system merely "observes", ready to take over in an instant as soon as it no longer detects a heartbeat.
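As a rough illustration of the "observe, then take over" idea, here is a minimal Python sketch of the standby side of a heartbeat. It is only my guess at the general shape, not how any real flight computer works; the `HeartbeatMonitor` class, its method names, and the timeout value are all hypothetical.

```python
import time


class HeartbeatMonitor:
    """Standby-side view of the primary: remembers when the last heartbeat
    arrived and takes over once the silence exceeds a timeout."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout    # seconds of silence tolerated (assumed value)
        self.clock = clock        # injectable so the logic can be tested
        self.last_beat = clock()
        self.active = False       # becomes True once the standby takes over

    def beat(self):
        """Call whenever a heartbeat from the primary is received."""
        self.last_beat = self.clock()

    def poll(self):
        """Call periodically on the standby; returns True once it is in control."""
        if not self.active and self.clock() - self.last_beat > self.timeout:
            self.active = True
        return self.active
```

The primary would trigger `beat()` on the standby at a fixed interval (over a bus, a shared register, a network link, whatever the hardware offers), and the standby would call `poll()` in its own loop. Once `poll()` returns True the standby stays in control, even if a late heartbeat shows up afterwards.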

[1] https://en.wikipedia.org/wiki/Failover
[2] https://en.wikipedia.org/wiki/High-availability_cluster#Application_design_requirements

On Wednesday, 19 February 2014 at 05:53:55 UTC, Tolga Cakiroglu wrote:
> I think only knowing that it has failed is not enough. Because the process is a landing, the other CPU should know where the process left off. With that heartbeat signal, the only option is that all sensor information must be sent to both CPUs continuously, and the sensor values should be enough to determine the next step to take. Then I think it can continue the process flawlessly.

I don't think watching the video answered this, but it hinted that the second CPU would be inactive during landing. If something went wrong, the backup CPU would need to be woken up, at which point it would take in the readings from the different sensors to decide on actions (possibly it was intended only to land the rover, not to land the rover in the correct location).

What was interesting from the video is that the second CPU was going to be turned off for the landing and not used as a backup. A year before landing (I guess that means 3 months before launch) they decided to create the backup software in case the main CPU failed during landing. It didn't.
