The Dependability Solution Provider TM
WW Technology Group
© Copyright 2015 All rights reserved.
Designing Fault Management in
Spaceflight Architectures
Chris J. Walter
WW Technology Group [email protected]
(410) 418-4353
Challenges
• NASA architectures affected by trends in current
computing architectures
– Network centric
– Security vulnerabilities
– Lower voltages
– SWaP (size, weight, and power)
– Code reuse
• NASA demands
– Higher onboard processing
– Reusable missions and fault tolerance
Future Spacecraft Onboard Computing Needs
• Computation Category: Vision-based Algorithms with Real-Time Requirements
– Mission Need: Terrain Relative Navigation; Hazard Avoidance; Entry, Descent & Landing; Pinpoint Landing
– Objective of Computation: Conduct safe proximity operations around primitive bodies; land safely and accurately; achieve robust results within available timeframe as input to control decisions
– Flight Architecture Attribute: Severe fault tolerance and real-time requirements; fail-operational; high peak power needs
• Computation Category: Model-Based Reasoning Techniques for Autonomy
– Mission Need: Mission planning, scheduling & resource management; fault management in uncertain environments
– Objective of Computation: Contingency planning to mitigate execution failures; detect, diagnose and recover from faults
– Flight Architecture Attribute: High computational complexity; graceful degradation; memory usage (data movement) impacts energy management
• Computation Category: High Rate Instrument Data Processing
– Mission Need: High resolution sensors, e.g., SAR, Hyper-spectral
– Objective of Computation: Downlink images and products rather than raw data; opportunistic science
– Flight Architecture Attribute: Distributed, dedicated processors at sensors; less stringent fault tolerance
- Results from NASA study on High Performance Space Computing (HPSC)
Future Spacecraft Onboard Computing Needs
• Computation Category: Vision-based Algorithms with Real-Time Requirements
– Flight Architecture Attribute: Severe fault tolerance and real-time requirements; fail-operational; high peak power needs
• Computation Category: Model-Based Reasoning Techniques for Autonomy
– Flight Architecture Attribute: High computational complexity; graceful degradation; energy management
• Computation Category: High Rate Instrument Data Processing
– Flight Architecture Attribute: Distributed, dedicated processors at sensors; less stringent fault tolerance
Large Scale “System-of-Systems”
[Figure: a large-scale system-of-systems. Labeled elements: Communication Link, Processing Node, Constellation Cluster, Processing Cluster.]
WWTG has Evolved a Vision for Highly Reliable Distributed Systems
• Our vision defines a system framework coupled with a middleware infrastructure that facilitates the deployment of robust, autonomous distributed systems.
• Features of our approach include:
– Scalability - System Size, Complexity and Dependability
– Flexibility - System Composition and System Functionality
– Integrity - Analyzable and Verifiable System
– Heterogeneity - Diversity in hardware and software components
• These properties are provided by a cluster-based infrastructure that is applicable to many domains
– Embedded Control Systems
– Distributed Information Systems
Scalable “Systems Approach”
• Compositional, so that a specified set of methods, algorithms, and components can be used for construction in a customizable manner.
• Espouses the use of forethought rather than afterthought in anticipating requirements for real-time and dependable computing properties.
• Contains an architectural framework with
– well-defined levels of abstraction
– clear and clean interfaces between layers.
• A general fault/error model to provide robust fault tolerance properties that enhance flexibility and scalability.
• Well-defined error containment regions
– flexible, tailorable, quantifiable, analyzable
Scalable “Systems Approach”
• Provides an integrated view of component interactions beyond healthy process-level interactions
– failure semantics and tolerance/detection algorithms
• Uses system-level abstractions that can be recursively applied
– application programs
– distributed OS
– board, multi-board
– chip, multi-chip
A Scalable Clustering Approach
• Clustering technique can be used to group system resources into composable units
• System Framework provides guidance to system developers
– Allows for reasoned trade-offs between competing system aspects
• Performance, Fault Tolerance, Flexibility, Determinism
– Provides a structured approach for assembling required system services, resulting in a system that is:
• Analyzable
• Verifiable
• Testable
Reliable Platform Services
[Figure: Reliable Platform Services (RPS) layered architecture. Diagram labels:
– System Applications
– Reliable Platform Interface (RPI): App Services APIs, Sys Organization API, Health Monitoring API
– Application Services: Frame Scheduler Service, Data Integrity Service, Process Group Service
– Cluster Services (Synchronization, Application Service Management)
– Local Services (Scheduler, Networking, OS Services)
– Native Hardware, Operating System, and Vendor Device Drivers
– System organization: Local Resource Management, System Capability Management, Element Discovery, Initial Formation, Startup Sequencer
– Health monitoring: Local Resource Health Monitoring, System Capability Health Monitoring, Application Service Monitoring, RPS Component Monitoring]
Adaptiveness in Error Domain
Property Based Fault Tolerance
• Non-Functional properties are qualitative in nature and define
characteristics associated with the delivered service
– reliability, availability, safety, security
– scalability, flexibility, integrity, interoperability
• Functional properties are quantitative in nature and define what
services the system delivers
– communication
– resource discovery
– synchronization
– detection and reconfiguration
– process group management
– health monitoring
– scheduling
– etc.
Property Compositions
• BASIC PROPERTIES
– Functional (services delivered)
– Non-Functional (-ilities)
• COMPOSITE PROPERTIES
– Properties of the system as a whole rather than taken individually
– Composite (emergent) properties are a consequence of the relationships between system components
– Can be assessed/measured only after the components/services have been composed and integrated into a system
[Figure: basic properties (P1, P2, P3) combining into composite properties (CP2, CP3).]
Structured Service Hierarchy
[Figure: structured service hierarchy. Diagram labels:
– Services: Asynchronous Messaging, Resource Discovery, Discovery Services, Asynch Group Services, Synchronous Services, Data Integrity Services, Fault Management Services, Scheduling Services, Application Mgt Services
– Idealized Design Space
– Building Blocks: Theories of Time & Failure Models, System Models, Communication Primitives, Voting/Convergence Functions
– Building Blocks Specification & Verification
– Consistency of Specification Across Building Blocks
– Synergistic Formulation of Dependable Distributed Operations]
Framework Contains Services That Establish System Properties
• Establishes the necessary properties of bounded behavior for real-time and dependable computing
– Timeliness
• synchrony of operations
• deadline agreement
– Correctness
• group formations
• group management
– Resilience
• errors that can be tolerated
• Components that are used to implement the properties (COTS) can be exchanged, as long as the properties are maintained
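The exchangeability rule on this slide can be illustrated with a minimal sketch. It assumes that each required property can be reduced to a checkable bound; the names `RequiredProperties`, `SyncComponent`, and `can_exchange`, and the two bounds used, are hypothetical and do not come from the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequiredProperties:
    """Hypothetical bounds the framework must keep (illustrative only)."""
    max_skew_us: float          # timeliness: worst-case synchronization skew
    min_tolerated_faults: int   # resilience: number of errors that must be tolerated


@dataclass(frozen=True)
class SyncComponent:
    """Hypothetical description of a COTS component offered as a replacement."""
    name: str
    worst_case_skew_us: float
    tolerated_faults: int


def can_exchange(candidate: SyncComponent, required: RequiredProperties) -> bool:
    """A component may be swapped in only if every required property still holds."""
    return (
        candidate.worst_case_skew_us <= required.max_skew_us
        and candidate.tolerated_faults >= required.min_tolerated_faults
    )


if __name__ == "__main__":
    required = RequiredProperties(max_skew_us=50.0, min_tolerated_faults=1)
    for part in (
        SyncComponent("vendor-a", worst_case_skew_us=40.0, tolerated_faults=1),
        SyncComponent("vendor-b", worst_case_skew_us=80.0, tolerated_faults=2),
    ):
        print(part.name, "acceptable" if can_exchange(part, required) else "rejected")
```

The point of the sketch is that the check is made against the properties, not against the identity of the component being replaced.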
Example: System of Distributed Spacecraft
Reorganization of spacecraft for
accomplishing different mission goals
Fault Tolerant Element Discovery
• FT-ED “Cold-Start” facilitates dependable initial organization formation
[Figure: Cluster A formed from elements 1 to 5.]
• FT-ED “Warm-Start” facilitates dependable organization augmentation
[Figure: Cluster A, elements 1 to 5, augmented with elements 6 and 7.]
Use Case: High Dependability Multi-Clustered System
• An instantiation of the framework
– Supports a multi-cluster system
– Each cluster performs high dependability processing
– Clusters are interfaced to support:
• Highly dependable cluster interfaces
• Hierarchical processing across cluster boundaries
[Figure: two service stacks over Local Services (Scheduler, Networking, ..). One stack is labeled Intra-Cluster Synchronization and the other Inter-Cluster Synchronization; each also shows HM, CM, Process Groups, Interfaces, Data Integrity Services, Application Management, and Apps.]
Distributed Containment Regions
• Once properties are identified, DECRs are established and tailored to
provide the necessary degrees of dependability.
• DECRs with different levels of criticality can be supported
Distributed Containment Regions
• These regions can be organized in a variety of ways
– leader-follower
– peer-to-peer
– hierarchical
– combination of above
• Examples:
– define hardware vs. software error containment regions
– define regions of different criticality
• Approach is effective in dealing with COTS issues
– contain unknown or unspecified behaviors and failure semantics
Premeditated Composability
• Design Space is considered before composition
• Framework exists to support methodical construction at
run-time
• Capable of adapting
[Figure: a DESIGN SPACE enclosing four Operating Spaces.]
Strategy
• Creation of idealized design space
– encompasses CSR goals
– accommodates single system to multi-cluster system
– comprehensive error model that is tailorable to specific use case
• Establishing useful abstractions and relationships
– ECRs, Clusters, System-of-Systems
– component couplings and dependencies
• Composable service architecture
– inheritance of underlying established properties
• time (boundedness & accuracy)
• data (integrity & fault tolerance)
– streamlines the organization of layers
• system users/developers can work at most meaningful abstraction
layer
Example Use Case 1: COTS Based Dependable Cluster
[Figure: four COTS CPUs (1 to 4), each running a COTS RTOS Platform and RPS Middleware Processes, connected by the Network Infrastructure to form the Reliable Platform. In the Hosted App Space, replicas A-1, A-2, A-3, B-1, B-2, C-1, C-2, and C-3 each attach through an RPI. The RPS-Enabled Virtual Platform Space presents them as Replicated App A, Replicated App B, and Replicated App C.]
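The replicated-application arrangement in this use case can be sketched as follows. The replica counts (three for A, two for B, three for C) and the four CPUs come from the slide; the round-robin placement rule, the majority voter, and the names `place_replicas` and `majority` are assumptions made for illustration, since the transcript does not show which replica runs on which CPU or how outputs are compared.

```python
from collections import Counter
from typing import Hashable, Optional


def place_replicas(apps: dict[str, int], cpus: list[str]) -> dict[str, list[str]]:
    """Assign each app's replicas to distinct CPUs, continuing round-robin across apps."""
    placement: dict[str, list[str]] = {}
    next_cpu = 0
    for app, count in apps.items():
        if count > len(cpus):
            raise ValueError(f"{app}: {count} replicas cannot fit on {len(cpus)} distinct CPUs")
        # Consecutive CPU indices modulo len(cpus) are distinct while count <= len(cpus).
        placement[app] = [cpus[(next_cpu + i) % len(cpus)] for i in range(count)]
        next_cpu = (next_cpu + count) % len(cpus)
    return placement


def majority(outputs: list[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas, or None if there is none."""
    if not outputs:
        return None
    value, votes = Counter(outputs).most_common(1)[0]
    return value if votes > len(outputs) // 2 else None


if __name__ == "__main__":
    cpus = ["CPU1", "CPU2", "CPU3", "CPU4"]
    for app, hosts in place_replicas({"A": 3, "B": 2, "C": 3}, cpus).items():
        print(app, hosts)
    print(majority([7, 7, 9]))  # one faulty replica is outvoted
    print(majority([1, 2]))     # two replicas can detect a mismatch but not resolve it
```

Keeping replicas of one application on distinct CPUs means a single CPU failure removes at most one replica of any application, which is what lets the voter mask it.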
Improving Performance of Individual Node
• Reduce the lifetime operating and support costs of FPGA based systems, specifically the signal processing components. Related needs include:
– Reduction in cost of hardware selection
– Reduction in cost of hardware modification (e.g., minimize cost and schedule impact due to COTS Technology Refresh Evolutions)
• Reduce the development costs of FPGA based applications. Related needs include:
– Abstracted interfaces to external resources
– Cost effective application growth
– Solutions that will adapt to future changes and improvements to the underlying FPGA technology
Reconfigurable Fault Tolerance
[Figure: RLO Reconfiguration Triggers and Mission Modes. Trigger sources: Fault Tolerance Triggers, Radiation Hazard Triggers, Power Mgt Triggers, Load Monitoring Triggers, Performance Triggers, User Demand Triggers.]
Tools for Analysis and Certification
Fault Management Challenges
• We can see there are many types of flexible system
architectures to consider
• In order to make best use of resources there is a need to
employ dynamic redundancy techniques
• This requires intimate understanding of faults and errors
– use a possibilistic rather than a probabilistic strategy
• “Nearly impossible” means possible.
– Emphasize arbitrary errors rather than specific types
– Utilize concepts related to Byzantine Agreement
– Focus on narrowing windows of error arrival and
accumulation so that fault tolerant complexities do not
grow exponentially
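The emphasis on arbitrary errors and Byzantine Agreement can be made concrete with a small sketch. It uses two textbook results rather than anything specific to the framework: agreement among nodes exchanging unsigned messages needs at least 3f + 1 nodes to tolerate f arbitrary faults, and a fault-tolerant midpoint voter discards the f lowest and f highest readings before averaging the remaining extremes. The function names are hypothetical.

```python
def min_nodes_for_byzantine(f: int) -> int:
    """Smallest node count that tolerates f arbitrary faults without signed messages."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def fault_tolerant_midpoint(values: list[float], f: int) -> float:
    """Discard the f lowest and f highest values, then return the midpoint of what is left.

    With at least 3f + 1 inputs, of which at most f are arbitrary, the result
    always lies within the range of the correct inputs.
    """
    needed = min_nodes_for_byzantine(f)
    if len(values) < needed:
        raise ValueError(f"need at least {needed} values to tolerate {f} faults")
    ordered = sorted(values)
    survivors = ordered[f:len(ordered) - f]
    return (survivors[0] + survivors[-1]) / 2.0


if __name__ == "__main__":
    # Three plausible readings and one arbitrary value from a faulty node.
    readings = [10.0, 10.2, 9.9, 1000.0]
    print(min_nodes_for_byzantine(1))
    print(fault_tolerant_midpoint(readings, f=1))
```

The voter makes no assumption about what the faulty value looks like, which is the possibilistic stance the slide argues for: it bounds the effect of any f bad inputs instead of estimating how likely a particular bad input is.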
EDICT Tools
• Model-based engineering platform
• Coherent aspect-specific views of organization and behavior
• Integration of architectural and analytical models of systems and their constituent components/services
[Figure: EDICT Aspects and Augmentations (Safety, Behavior, Structure, Dependability, Performance, Security) connected to Architecture and Analysis Views in Simulink, AADL, and UML/SysML.]
Structural Architecture Visualization
Architecture Browser provides a graphical view of architecture models
[Figure: Architecture Browser screenshot. Callouts: Component Hierarchy, Component Connections, Software Components, Hardware Components, Externals.]
Structural and Behavioral Views
• Architecture Browser
provides many views
• Views show data of
concern in the context of
the overall architecture
– Data elements and usage
– Data/Control flows and
interaction sequences
– Property assignments
• Aspect specific
augmentations are also
shown
– Safety criticalities
EDICT Tools Support Many Modeling and Analysis Features for Verification
• Architecture Modeling
– Architectural Flows
– Timelines and Events
• Error Propagation Analysis
• Safety Tagging and Visualization
• Performance and Schedulability Analysis
• Requirements Tagging and Architectural Tracing
• Simulink Integration for Application Verification
[Figure: example fault tree. Top event: Pilot Blackout due to excessive acceleration. Other events: Control System Failure, Sensor Feedback Error, Control Law Failure, Sensor Produces Incorrect Value, Sensor Fails to Produce a Value, Control Law Design Error, Control Law Run-time Error.]
[Figure: example architecture model. Components: Display, User Input Processing, System Control, Sensor Filtering, Data Recording, Device Control, Device Actuator, User Display, Input Device, Sensor Device, connected by Network links annotated “Asym”.]
Example Analysis:
Fault Aware Fault Trees
Challenge
• Fault trees are one of the FM mechanisms most widely used by practitioners, both as a visualization/communication medium and as a quantitative analysis tool for building mission-critical systems.
• Fault tree analysis is often conducted in an ad hoc manner and is unable to provide high-confidence results.
• The major problem is that with manual fault tree construction, the resulting trees can be incomplete and failure-event relationships misrepresented.
• As systems and their interface complexities grow rapidly, the problem has only worsened. In a remarkably large number of failure events, fault management (FM) inappropriately applied to mitigate the effect of an anomaly actually increased its severity. Therefore we must pay meticulous attention to the misuse of FM methods.
Goal: Fault-Class-Aware Fault-tree
Generation & Analysis
• Go beyond mechanical translation and extend the method to consider the impact of:
– Awareness of fault class and Fault Management (FM) coverage limitations during tree generation.
– Prioritizing fault-class-oriented decomposition over pure architectural decomposition.
• Go beyond faults in application systems
– Model-based FM scheme checking to assess whether the scheme is appropriate.
– Vigilance about critical faults in the use of FM schemes.
– Impact assessment of the exposure to faults that are not covered due to inappropriate FM application.
Multifaceted Fault Reference Model
[Figure: faults classified along eight dimensions.]
• by phase of occurrence: development fault, operational fault
• by system boundary: internal fault, external fault
• by dimension: hardware fault, software fault
• by persistence: permanent fault, transient fault
• by cause: physical fault, design fault
• by objective: malicious fault, benign fault
• by intent: deliberate fault, non-deliberate fault
• by capability: accidental fault, incompetence fault
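The eight classification dimensions of the reference model map naturally onto a small data structure. The sketch below encodes each dimension as an enumeration with the two values named on the slide. The `FaultClass` record, the `shared_by_identical_replicas` rule, and the two example faults are illustrative assumptions, not part of the model as presented.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    DEVELOPMENT = "development fault"
    OPERATIONAL = "operational fault"


class Boundary(Enum):
    INTERNAL = "internal fault"
    EXTERNAL = "external fault"


class Dimension(Enum):
    HARDWARE = "hardware fault"
    SOFTWARE = "software fault"


class Persistence(Enum):
    PERMANENT = "permanent fault"
    TRANSIENT = "transient fault"


class Cause(Enum):
    PHYSICAL = "physical fault"
    DESIGN = "design fault"


class Objective(Enum):
    MALICIOUS = "malicious fault"
    BENIGN = "benign fault"


class Intent(Enum):
    DELIBERATE = "deliberate fault"
    NON_DELIBERATE = "non-deliberate fault"


class Capability(Enum):
    ACCIDENTAL = "accidental fault"
    INCOMPETENCE = "incompetence fault"


@dataclass(frozen=True)
class FaultClass:
    """One point in the eight-dimensional classification space."""
    phase: Phase
    boundary: Boundary
    dimension: Dimension
    persistence: Persistence
    cause: Cause
    objective: Objective
    intent: Intent
    capability: Capability


def shared_by_identical_replicas(fault: FaultClass) -> bool:
    """Illustrative rule (an assumption): a software design fault is present in
    every identical copy of the software, so plain replication does not cover it."""
    return fault.dimension is Dimension.SOFTWARE and fault.cause is Cause.DESIGN


# Hypothetical example: a latent software design error.
sw_design_fault = FaultClass(
    Phase.DEVELOPMENT, Boundary.INTERNAL, Dimension.SOFTWARE, Persistence.PERMANENT,
    Cause.DESIGN, Objective.BENIGN, Intent.NON_DELIBERATE, Capability.ACCIDENTAL,
)

# Hypothetical example: a radiation-induced upset in one hardware unit.
hw_transient_fault = FaultClass(
    Phase.OPERATIONAL, Boundary.EXTERNAL, Dimension.HARDWARE, Persistence.TRANSIENT,
    Cause.PHYSICAL, Objective.BENIGN, Intent.NON_DELIBERATE, Capability.ACCIDENTAL,
)

if __name__ == "__main__":
    for label, fault in (("software design", sw_design_fault), ("hardware transient", hw_transient_fault)):
        print(label, "-> shared by identical replicas:", shared_by_identical_replicas(fault))
```

Tagging each basic event with a record like this is one way a tree generator could decide whether two replicas may be treated as independent or must share a single event.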
Misleading Fault Tree w/o Fault Awareness (ARIANE-5)
[Figure: fault tree. Top event: FT Inertial System failure, which requires both the Primary SRI failure and the Secondary SRI failure. Each SRI failure occurs on an ADIRU device failure or an Air data software failure.]
Leaf probabilities, for each SRI: $P_{ADIRU} = 10^{-4}$, $P_{dataSW} = 10^{-4}$
$P_{inertialSys} = (1 - (1 - P_{ADIRU})(1 - P_{dataSW}))^2 \approx 4 \times 10^{-8}$
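Fault Tree with Fault Awareness
[Figure: fault tree. Top event: FT Inertial System failure, which occurs on an ADIRU device failure or an Air data software failure. The ADIRU device failure requires the ADIRU device to fail in both the Primary SRI and the Secondary SRI; the Air data software failure is a single event.]
Leaf probabilities: $P_{ADIRU} = 10^{-4}$ for each SRI, $P_{dataSW} = 10^{-4}$
$P_{inertialSys} = 1 - (1 - P_{ADIRU}^2)(1 - P_{dataSW}) \approx 1 \times 10^{-4}$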
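Fault Tree with Augmentation
[Figure: fault tree. Top event: FT Inertial System failure, which occurs on an ADIRU device failure or an Air data software failure. The ADIRU device failure requires both the Primary ADIRU failure and the Secondary ADIRU failure; the Air data software failure requires both the Primary version failure and the Secondary version failure.]
Leaf probabilities, for each primary and secondary leaf: $P_{ADIRU} = 10^{-4}$, $P_{dataSW} = 10^{-4}$
$P_{inertialSys} = 1 - (1 - P_{ADIRU}^2)(1 - P_{dataSW}^2) \approx 2 \times 10^{-8}$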
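The three inertial-system fault trees can be checked numerically. The sketch below is a minimal evaluation assuming independent basic events, AND gates as products of probabilities, and OR gates as one minus the product of complements, with the leaf value of 10^-4 taken from the slides. The helper names `p_and` and `p_or` are hypothetical.

```python
import math


def p_and(*probabilities: float) -> float:
    """Probability that all independent input events occur (AND gate)."""
    return math.prod(probabilities)


def p_or(*probabilities: float) -> float:
    """Probability that at least one independent input event occurs (OR gate)."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


P_ADIRU = 1e-4    # ADIRU device failure, per unit
P_DATA_SW = 1e-4  # air data software failure, per version

# Without fault awareness: each SRI fails on a device OR software failure,
# and the two SRIs are treated as independent (approximately 4e-8).
naive = p_and(p_or(P_ADIRU, P_DATA_SW), p_or(P_ADIRU, P_DATA_SW))

# With fault awareness: both SRIs run the same software, so the software
# failure is a single shared event (approximately 1e-4).
aware = p_or(p_and(P_ADIRU, P_ADIRU), P_DATA_SW)

# With augmentation: primary and secondary run different software versions,
# so both versions must fail (approximately 2e-8).
augmented = p_or(p_and(P_ADIRU, P_ADIRU), p_and(P_DATA_SW, P_DATA_SW))

if __name__ == "__main__":
    print(f"without fault awareness: {naive:.1e}")
    print(f"with fault awareness:    {aware:.1e}")
    print(f"with augmentation:       {augmented:.1e}")
    print(f"naive tree underestimates by a factor of about {aware / naive:.0f}")
```

The comparison shows why the first tree is misleading: treating a shared software fault as two independent events understates the failure probability by more than three orders of magnitude.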
Summary of Major Points
• Fault Management in spaceflight architectures is a many-dimensional problem
• Reliable Platform (RP) property-based architecture with hierarchical clustering shown to be effective
• RP FM strategies can be implemented in many ways
• Reconfigurable Fault Tolerance can accelerate performance and provide adaptive fault tolerance
• Clusters can be distributed and arranged in various hierarchical configurations
• Local fault management can be flexible and customizable
• Modeling fault effects and impact on system reliability to avoid incorrect assessments of dependability
• Need good modeling and analysis tools (use EDICT!)