[{Image src='zima-hans-3.jpg' caption='' width='500' alt='Hans Zima' class='image_left'}]''__Adaptive Fault Tolerance for Many-Core Based Parallel Computation in Space__''\\ \\
Hans Zima, Caltech, USA\\ \\
__Abstract:__
\\ \\
Future missions of deep space exploration will require a high degree of autonomy supported by an enhanced on-board
computational capability. Earth-based mission controllers will be unable to directly control distant spacecraft
and robots to ensure timely precision and safety, or to support “opportunistic science” by capturing rapidly
changing events, such as dust devils on Mars or volcanic eruptions on a remote moon in the solar system.
Furthermore, the high data volume yielded by smart on-board instruments would overwhelm the limited bandwidth
of spacecraft-Earth communication, making on-board data analysis, filtering, and compression a necessity. Emerging many-core
technology is expected to provide the low-power, high-performance computational capability needed for the
support of such missions. Fault tolerance for such systems will face new challenges, but will also offer opportunities
that are not available to current space-borne systems. We present an introspection-based approach that provides
adaptive fault tolerance for on-board systems based on many-core technology. The goal is to enable a software
system to become self-aware of its health, performance, and power consumption by monitoring its execution
behaviour, reasoning about its internal state, making decisions or recommendations about appropriate changes of
the system or system state when necessary, and supporting recovery from faults. Emphasis is placed on
application-oriented fault tolerance that takes into account knowledge about the application and the algorithms and
programs with which it is implemented. We have already implemented prototype versions of an introspection
system for a cluster of Cell Broadband Engines and a Tilera Tile64 system; this work has been documented in a
number of publications. In this paper, we focus on ongoing work at the Jet Propulsion Laboratory (JPL) that
allows the user to categorize applications with respect to the required level of fault tolerance and the mechanisms
required to implement such a specification. In this context we are studying methods that can provide automatic
support for the generation of fault-tolerant software.
Specifically, we will describe how automatic analysis of the control and data flow in source programs can be
exploited to generate correctness assertions for critical sections of a program, or to generate
redundant code that can be embedded in a self-checking programming framework.
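
As a rough illustration of the two mechanisms named above (a correctness assertion guarding a critical computation, and redundant execution inside a self-checking wrapper), the following minimal Python sketch may help. It is not the framework described in this work: the names `self_checking`, `FaultDetected`, and `make_faulty_sum` are hypothetical, the majority vote is one possible comparison strategy among several, and the assertion is written by hand here rather than generated from control and data flow analysis.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, TypeVar

T = TypeVar("T", bound=Hashable)


class FaultDetected(Exception):
    """Hypothetical error type: redundant executions could not agree."""


def self_checking(compute: Callable[[], T],
                  check: Callable[[T], bool],
                  replicas: int = 3) -> T:
    """Run `compute` redundantly and return a result the replicas agree on.

    Results that violate the correctness assertion `check` are discarded;
    the remaining results are compared by majority vote over all replicas.
    Results must be hashable so that they can be counted.
    """
    results = [compute() for _ in range(replicas)]
    valid = [r for r in results if check(r)]
    if not valid:
        raise FaultDetected("no replica satisfied the correctness assertion")
    value, votes = Counter(valid).most_common(1)[0]
    if votes <= replicas // 2:
        raise FaultDetected("no majority among the replicas")
    return value


def make_faulty_sum(data: Sequence[int], faulty_call: int) -> Callable[[], int]:
    """Build a summation whose `faulty_call`-th invocation suffers a bit flip."""
    calls = {"n": 0}

    def compute() -> int:
        calls["n"] += 1
        total = sum(data)
        if calls["n"] == faulty_call:
            total ^= 1 << 4  # simulated single-event upset in bit 4
        return total

    return compute


data = [3, 1, 4, 1, 5]
# Assertion: a sum of n values lies between n*min and n*max.
in_bounds = lambda s: min(data) * len(data) <= s <= max(data) * len(data)
result = self_checking(make_faulty_sum(data, faulty_call=2), check=in_bounds)
print(result)
```

In this sketch the corrupted second replica is rejected by the bounds assertion, and the two remaining replicas form the majority. A fault that stays within the asserted bounds would instead be caught by the vote, which is why the two mechanisms are combined here.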