Hans Zima
Adaptive Fault Tolerance for Many-Core Based Parallel Computation in Space

Hans Zima, Caltech, USA


Future missions of deep space exploration will require a high degree of autonomy supported by an enhanced on-board computational capability. Earth-based mission controllers will be unable to directly control distant spacecraft and robots to ensure timely precision and safety, or to support “opportunistic science” by capturing rapidly changing events, such as dust devils on Mars or volcanic eruptions on a remote moon in the solar system. Furthermore, the high data volume yielded by smart on-board instruments would overwhelm the limited bandwidth of spacecraft-Earth communication, necessitating on-board data analysis, filtering, and compression. Emerging many-core technology is expected to provide the low-power, high-performance computational capability needed to support such missions. Fault tolerance for such systems will face new challenges, but will also offer opportunities that are not available to current space-borne systems. We present an introspection-based approach that provides adaptive fault tolerance for on-board systems based on many-core technology. The goal is to enable a software system to become self-aware of its health, performance, and power consumption by monitoring its execution behaviour, reasoning about its internal state, making decisions or recommendations about appropriate changes of the system or system state when necessary, and supporting recovery from faults. Emphasis is placed on application-oriented fault tolerance that takes into account knowledge about the application and the algorithms and programs with which it is implemented. We have already implemented prototype versions of an introspection system for a cluster of Cell Broadband Engines and a Tilera Tile64 system; this work has been documented in a number of publications.
In this paper we focus on ongoing work at the Jet Propulsion Laboratory (JPL) that allows the user to categorize applications with respect to the required level of fault tolerance and to specify the mechanisms needed to implement it. In this context we are studying methods that provide automatic support for the generation of fault-tolerant software. Specifically, we describe how analysis of the control and data flow in source programs can be exploited to automatically generate correctness assertions for critical sections of a program, or to generate redundant code that can be embedded in a self-checking programming framework.
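As a rough illustration of the last point, the following minimal Python sketch shows what a critical section might look like once it is wrapped in redundant execution and guarded by a correctness assertion. It is not the JPL framework or its generated code: the names `self_checking`, `FaultDetected`, and `checked_sum`, as well as the range invariant used as the assertion, are hypothetical and chosen only for this example. In the approach described above such checks would be derived automatically from program analysis rather than written by hand.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class FaultDetected(Exception):
    """Raised when redundant results disagree or an assertion fails (hypothetical)."""


def self_checking(
    compute: Callable[[], T],
    assertion: Callable[[T], bool],
    copies: int = 2,
) -> T:
    """Run `compute` `copies` times, compare the results, then check `assertion`."""
    results = [compute() for _ in range(copies)]
    # Redundancy check: every copy must produce the same value.
    if any(r != results[0] for r in results[1:]):
        raise FaultDetected("redundant executions disagree")
    # Correctness assertion: the agreed value must satisfy the invariant.
    if not assertion(results[0]):
        raise FaultDetected("correctness assertion violated")
    return results[0]


def checked_sum(values: list[int]) -> int:
    """Critical section: sum of a non-empty integer list with a range invariant."""
    lo, hi, n = min(values), max(values), len(values)
    return self_checking(
        compute=lambda: sum(values),
        # Assumed invariant for illustration: n*min <= sum <= n*max.
        assertion=lambda s: n * lo <= s <= n * hi,
    )
```

A caller would catch `FaultDetected` and hand control to whatever recovery mechanism the chosen fault-tolerance level prescribes, for example re-execution or rollback to a checkpoint.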
This page (revision-4) was last updated on Friday, 7. October 2011, 12:04 by Kaiser Dana