
The Dependability Solution Provider TM

WW Technology Group

WW Technology Group

© Copyright 2015 All rights reserved.

Designing Fault Management in

Spaceflight Architectures

Chris J. Walter

WW Technology Group [email protected]

(410) 418-4353


Challenges

• NASA architectures are affected by trends in current computing architectures
  – Network centric
  – Security vulnerabilities
  – Lower voltages
  – SWaP (size, weight, and power)
  – Code reuse
• NASA demands
  – Higher onboard processing
  – Reusable missions and fault tolerance


Future Spacecraft Onboard Computing Needs

• Computation category: Vision-based Algorithms with Real-Time Requirements
  – Mission need: Terrain Relative Navigation; Hazard Avoidance; Entry, Descent & Landing; Pinpoint Landing
  – Objective of computation: conduct safe proximity operations around primitive bodies; land safely and accurately; achieve robust results within the available timeframe as input to control decisions
  – Flight architecture attribute: severe fault tolerance and real-time requirements; fail-operational; high peak power needs
• Computation category: Model-Based Reasoning Techniques for Autonomy
  – Mission need: mission planning, scheduling & resource management; fault management in uncertain environments
  – Objective of computation: contingency planning to mitigate execution failures; detect, diagnose and recover from faults
  – Flight architecture attribute: high computational complexity; graceful degradation; memory usage (data movement) impacts energy management
• Computation category: High Rate Instrument Data Processing
  – Mission need: high resolution sensors, e.g., SAR, hyper-spectral
  – Objective of computation: downlink images and products rather than raw data; opportunistic science
  – Flight architecture attribute: distributed, dedicated processors at sensors; less stringent fault tolerance

Source: results from NASA study on High Performance Space Computing (HPSC)


Future Spacecraft Onboard Computing Needs

Flight architecture attributes by computation category:

• Vision-based Algorithms with Real-Time Requirements
  – Severe fault tolerance and real-time requirements
  – Fail-operational
  – High peak power needs
• Model-Based Reasoning Techniques for Autonomy
  – High computational complexity
  – Graceful degradation
  – Energy management
• High Rate Instrument Data Processing
  – Distributed, dedicated processors at sensors
  – Less stringent fault tolerance


Large Scale “System-of-Systems”

[Figure: large-scale system-of-systems. Labels: Communication Link, Processing Node, Constellation Cluster, Processing Cluster.]


WWTG has Evolved a Vision for Highly Reliable Distributed Systems

• Our vision defines a system framework coupled with a middleware infrastructure that facilitates the deployment of robust, autonomous distributed systems.
• Features of our approach include:
  – Scalability: System Size, Complexity and Dependability
  – Flexibility: System Composition and System Functionality
  – Integrity: Analyzable and Verifiable System
  – Heterogeneity: Diversity in hardware and software components
• These properties are provided by a cluster-based infrastructure that is applicable to many domains
  – Embedded Control Systems
  – Distributed Information Systems


Scalable “Systems Approach”

• Compositional, so that a specified set of methods, algorithms, and components can be used for construction in a customizable manner.
• Espouses the use of forethought rather than afterthought in anticipating requirements for real-time and dependable computing properties.
• Contains an architectural framework with
  – well-defined levels of abstraction
  – clear and clean interfaces between layers.
• A general fault/error model to provide robust fault tolerance properties that enhance flexibility and scalability.
• Well-defined error containment regions
  – flexible, tailorable, quantifiable, analyzable


Scalable “Systems Approach”

• Provides an integrated view of component interactions beyond healthy process-level interactions
  – failure semantics and tolerance/detection algorithms
• Uses system-level abstractions that can be recursively applied
  – application programs
  – distributed OS
  – board, multi-board
  – chip, multi-chip


A Scalable Clustering Approach

• A clustering technique can be used to group system resources into composable units
• The System Framework provides guidance to system developers
  – Allows for reasoned trade-offs between competing system aspects
    • Performance, Fault Tolerance, Flexibility, Determinism
  – Provides a structured approach for assembling required system services, resulting in a system that is:
    • Analyzable
    • Verifiable
    • Testable


Reliable Platform Services

[Figure: Reliable Platform Services (RPS) block diagram. Recoverable labels:]
• System Applications
• Application Services: Frame Scheduler Service, Data Integrity Service, Process Group Service
• Reliable Platform Interface (RPI): App Services APIs, Sys Organization API, Health Monitoring API
• Local Resource Management, System Capability Management, Element Discovery, Initial Formation, Startup Sequencer
• Local Resource Health Monitoring, System Capability Health Monitoring, Application Service Monitoring, RPS Component Monitoring
• Cluster Services (Synchronization, Application Service Management)
• Local Services (Scheduler, Networking, OS Services)
• Native Hardware, Operating System, and Vendor Device Drivers


Adaptiveness in Error Domain


Property Based Fault Tolerance

• Non-functional properties are qualitative in nature and define characteristics associated with the delivered service
  – reliability, availability, safety, security
  – scalability, flexibility, integrity, interoperability
• Functional properties are quantitative in nature and define what services the system delivers
  – communication
  – resource discovery
  – synchronization
  – detection and reconfiguration
  – process group management
  – health monitoring
  – scheduling
  – etc.


Property Compositions

• BASIC PROPERTIES
  – Functional (services delivered)
  – Non-Functional (-ilities)
• COMPOSITE PROPERTIES
  – Properties of the system as a whole rather than of its parts taken individually
  – Composite (emergent) properties are a consequence of the relationships between system components
  – Can be assessed/measured only after the components/services have been composed and integrated into a system

[Figure: basic properties (P1, P2, P3) combining into composite properties (CP2, CP3).]


Structured Service Hierarchy

[Figure: structured service hierarchy. Recoverable labels:]
• Services: Resource Discovery, Discovery Services, Asynchronous Messaging, Asynch Group Services, Synchronous Services, Data Integrity Services, Fault Management Services, Scheduling Services, Application Mgt Services
• Building Blocks: Communication Primitives, Voting/Convergence Functions
• Idealized Design Space: Theories of Time & Failure Models, System Models
• Building Blocks Specification & Verification
• Consistency of Specification Across Building Blocks
• Synergistic Formulation of Dependable Distributed Operations


Framework Contains Services That Establish System Properties

• Establishes the necessary properties of bounded behavior for real-time and dependable computing
  – Timeliness
    • synchrony of operations
    • deadline agreement
  – Correctness
    • group formations
    • group management
  – Resilience
    • errors that can be tolerated
• Components that are used to implement the properties (COTS) can be exchanged, as long as properties are maintained


Example: System of Distributed Spacecraft

Reorganization of spacecraft for accomplishing different mission goals


Fault Tolerant Element Discovery

• FT-ED “Cold-Start” facilitates dependable initial organization formation
• FT-ED “Warm-Start” facilitates dependable organization augmentation

[Figure: two panels. Cold start: Cluster A shown with elements 1 to 5. Warm start: Cluster A shown with elements 1 to 7.]


Use Case: High Dependability Multi-Clustered System

• An instantiation of the framework
  – Supports a multi-cluster system
  – Each cluster performs high dependability processing
  – Clusters are interfaced to support:
    • Highly dependable cluster interfaces
    • Hierarchical processing across cluster boundaries

[Figure: two service stacks above Local Services (Scheduler, Networking, ..), one labelled Intra-Cluster Synchronization and one labelled Inter-Cluster Synchronization. Each stack also shows HM, CM, Process Groups, Interfaces, Data Integrity Services, Application Management, and Apps.]


Distributed Containment Regions

• Once properties are identified, DECRs are established and tailored to provide the necessary degrees of dependability.
• DECRs with different levels of criticality can be supported


Distributed Containment Regions

• These regions can be organized in a variety of ways
  – leader-follower
  – peer-to-peer
  – hierarchical
  – combination of above
• Examples:
  – define hardware vs. software error containment regions
  – define regions of different criticality
• Approach is effective in dealing with COTS issues
  – contain unknown or unspecified behaviors and failure semantics


Premeditated Composability

• Design space is considered before composition
• Framework exists to support methodical construction at run-time
• Capable of adapting

[Figure: a DESIGN SPACE containing four regions, each labelled Operating Space.]


Strategy

• Creation of idealized design space
  – encompasses CSR goals
  – accommodates single system to multi-cluster system
  – comprehensive error model that is tailorable to specific use case
• Establishing useful abstractions and relationships
  – ECRs, Clusters, System-of-Systems
  – component couplings and dependencies
• Composable service architecture
  – inheritance of underlying established properties
    • time (boundedness & accuracy)
    • data (integrity & fault tolerance)
  – streamlines the organization of layers
    • system users/developers can work at the most meaningful abstraction layer


Example Use Case 1: COTS Based Dependable Cluster

[Figure: four COTS CPUs (CPU 1 to CPU 4), each running a COTS RTOS Platform with RPS Middleware Processes, connected by a Network Infrastructure. Together they form the Reliable Platform, an RPS-Enabled Virtual Platform Space. The Hosted App Space above it holds Replicated App A (A-1, A-2, A-3), Replicated App B (B-1, B-2), and Replicated App C (C-1, C-2, C-3); every replica attaches through an RPI.]
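The slide does not say how the Reliable Platform combines the outputs of replicas such as A-1, A-2, and A-3. One common option, assumed here purely for illustration, is a strict-majority vote over the replica outputs. The function name `majority_vote` and the sample values are hypothetical and are not part of RPS or the RPI.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas.

    Returns None when no value is held by more than half of the replicas,
    so the caller can treat the round as undecided instead of guessing.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


# Three replicas of a hypothetical application; one replica is faulty.
agreed = majority_vote([42, 42, 41])

# No two replicas agree, so there is no majority.
undecided = majority_vote([1, 2, 3])
```

With three replicas this masks a single wrong output. It does not by itself handle replicas that report different values to different observers, which is the arbitrary-error case the deck raises later.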


Improving Performance of Individual Node

• Reduce the lifetime operating and support costs of FPGA based systems, specifically the signal processing components. Related needs include:
  – Reduction in cost of hardware selection
  – Reduction in cost of hardware modification (e.g., minimize cost and schedule impact due to COTS Technology Refresh Evolutions)
• Reduce the development costs of FPGA based applications. Related needs include:
  – Abstracted interfaces to external resources
  – Cost effective application growth
  – Solutions that will adapt to future changes and improvements to the underlying FPGA technology


Reconfigurable Fault Tolerance


[Figure: reconfiguration triggers. Labels: Fault Tolerance Triggers, Radiation Hazard Triggers, Power Mgt Triggers, Load Monitoring Triggers, Performance Triggers, User Demand Triggers, RLO Reconfiguration Triggers, Mission Modes.]


Tools for Analysis and Certification


Fault Management Challenges

• There are many types of flexible system architectures to consider
• To make the best use of resources, dynamic redundancy techniques are needed
• This requires an intimate understanding of faults and errors
  – Use a possibilistic rather than a probabilistic strategy
    • “Nearly impossible” means possible.
  – Emphasize arbitrary errors rather than specific types
  – Utilize concepts related to Byzantine Agreement
  – Focus on narrowing the windows of error arrival and accumulation so that fault tolerance complexity does not grow exponentially
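To make the cost of tolerating arbitrary errors concrete, the sketch below uses the classical Byzantine Agreement results for unauthenticated ("oral") messages: agreement with f arbitrarily faulty nodes needs at least 3f + 1 nodes and f + 1 rounds, and the recursive oral-messages algorithm sends a number of messages that grows steeply with f. The function names are illustrative and are not taken from the deck or from any WWTG product.

```python
def min_nodes(f: int) -> int:
    """Smallest node count that tolerates f arbitrary (Byzantine) faults."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_faults(n: int) -> int:
    """Largest number of arbitrary faults that n nodes can tolerate."""
    if n < 1:
        raise ValueError("n must be positive")
    return (n - 1) // 3


def oral_message_count(n: int, f: int) -> int:
    """Messages sent by the recursive oral-messages algorithm OM(f) on n nodes.

    Round 1 sends n-1 messages; each later round multiplies the previous
    round's count by one fewer remaining recipient.
    """
    total = 0
    per_round = 1
    for k in range(f + 1):
        per_round *= n - 1 - k
        total += per_round
    return total


# Cost of agreement as the number of simultaneous faults grows.
cost_table = [(f, min_nodes(f), oral_message_count(min_nodes(f), f))
              for f in range(1, 4)]
for f, n, msgs in cost_table:
    print(f"f={f}: nodes={n}, rounds={f + 1}, messages={msgs}")
```

The jump in message count between f = 1 and f = 3 is one way to read the last bullet: if detection and recovery keep the number of faults that can accumulate inside one window small, the agreement machinery stays small as well.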


EDICT Tools

• Model-based engineering platform
• Coherent aspect-specific views of organization and behavior
• Integration of architectural and analytical models of systems and their constituent components/services

[Figure: EDICT Architecture and Analysis Views. Labels: EDICT Aspects, Augmentations, Safety, Behavior, Structure, Dependability, Performance, Security, Simulink, AADL, UML/SysML.]


Structural Architecture Visualization

Architecture Browser provides a graphical view of architecture models

[Figure: Architecture Browser screenshot. Callouts: Component Hierarchy, Component Connections, Software Components, Hardware Components, Externals.]


Structural and Behavioral Views

• Architecture Browser provides many views
• Views show data of concern in the context of the overall architecture
  – Data elements and usage
  – Data/Control flows and interaction sequences
  – Property assignments
• Aspect-specific augmentations are also shown
  – Safety criticalities


EDICT Tools Support Many Modeling and Analysis Features for Verification

• Architecture Modeling
  – Architectural Flows
  – Timelines and Events
• Error Propagation Analysis
• Safety Tagging and Visualization
• Performance and Schedulability Analysis
• Requirements Tagging and Architectural Tracing
• Simulink Integration for Application Verification

[Figure: example fault tree. Events: Pilot Blackout due to excessive acceleration; Control System Failure; Sensor Feedback Error; Control Law Failure; Sensor Produces Incorrect Value; Sensor Fails to Produce a Value; Control Law Design Error; Control Law Run-time Error.]

[Figure: example architecture model. Labels: Display, User Input Processing, System Control, Sensor Filtering, Data Recording, Device Control, Device Actuator, User Display, Input Device, Sensor Device, Network (four links), Asym (four annotations).]


Example Analysis: Fault Aware Fault Trees


Challenge

• Fault trees are among the FM mechanisms most widely used by practitioners, both as a visualization/communication medium and as a quantitative analysis tool for building mission-critical systems.
• Fault tree analysis is often conducted in an ad hoc manner and cannot provide high-confidence results.
• The major problem is that with manual fault tree construction, the resulting trees can be incomplete and failure-event relationships misrepresented.
• As systems and their interface complexities grow rapidly, the problem has only worsened. In a remarkably large number of failure events, fault management (FM) inappropriately applied to mitigate the effect of an anomaly actually increased its severity. Therefore we must pay meticulous attention to the misuse of FM methods.


Goal: Fault-Class-Aware Fault-tree Generation & Analysis

• Go beyond mechanical translation and extend the method to consider the impact of:
  – Awareness of fault class and Fault Management (FM) coverage limitations during tree generation.
  – Prioritizing fault-class-oriented decomposition over pure architectural decomposition.
• Go beyond faults in application systems
  – Model-based FM scheme checking to assess whether the scheme is appropriate
  – Vigilance about critical faults in the use of FM schemes.
  – Assessment of the impact of exposure to faults that are not covered due to inappropriate FM application.


Multifaceted Fault Reference Model

[Figure: a fault classified along eight facets.]
• by phase of occurrence: development fault, operational fault
• by system boundary: internal fault, external fault
• by dimension: hardware fault, software fault
• by persistence: permanent fault, transient fault
• by cause: physical fault, design fault
• by objective: malicious fault, benign fault
• by intent: deliberate fault, non-deliberate fault
• by capability: accidental fault, incompetence fault
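One way to use the reference model in tooling is to treat each facet as a field that must take one of its two values. The sketch below is an assumption about how such a record could look; the names `FAULT_FACETS` and `make_fault_class` and the sample classification are illustrative and do not come from the deck or from EDICT.

```python
# Facets and values as listed in the Multifaceted Fault Reference Model.
FAULT_FACETS = {
    "phase of occurrence": ("development", "operational"),
    "system boundary": ("internal", "external"),
    "dimension": ("hardware", "software"),
    "persistence": ("permanent", "transient"),
    "cause": ("physical", "design"),
    "objective": ("malicious", "benign"),
    "intent": ("deliberate", "non-deliberate"),
    "capability": ("accidental", "incompetence"),
}


def make_fault_class(choices: dict) -> dict:
    """Validate that every facet is present and holds one of its allowed values."""
    missing = FAULT_FACETS.keys() - choices.keys()
    unknown = choices.keys() - FAULT_FACETS.keys()
    if missing or unknown:
        raise ValueError(f"missing facets: {sorted(missing)}, "
                         f"unknown facets: {sorted(unknown)}")
    for facet, value in choices.items():
        if value not in FAULT_FACETS[facet]:
            raise ValueError(f"{value!r} is not a valid value for {facet!r}")
    return dict(choices)


# Hypothetical classification of a latent software design defect.
software_design_fault = make_fault_class({
    "phase of occurrence": "development",
    "system boundary": "internal",
    "dimension": "software",
    "persistence": "permanent",
    "cause": "design",
    "objective": "benign",
    "intent": "non-deliberate",
    "capability": "accidental",
})
```

A record like this gives a fault tree generator something to decompose on besides the architecture, which is the fault-class-oriented decomposition the preceding slide asks for.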


Misleading Fault Tree w/o Fault Awareness (ARIANE-5)

[Figure: fault tree. Top event: FT Inertial System failure, which requires both Primary SRI failure and Secondary SRI failure. Each SRI failure has two basic events: ADIRU device failure and Air data software failure.]

Basic event probabilities (for each SRI): $P_{ADIRU} = 10^{-4}$, $P_{dataSW} = 10^{-4}$

$P_{inertialSys} = \left(1 - (1 - P_{ADIRU})(1 - P_{dataSW})\right)^2 \approx 4 \times 10^{-8}$


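Fault Tree with Fault Awareness

[Figure: fault tree. Node labels: FT Inertial System failure, ADIRU device failure, Primary SRI failure, Secondary SRI failure, ADIRU device failure, Air data software failure. The tree carries two ADIRU device failure events and a single Air data software failure event.]

Basic event probabilities: $P_{ADIRU} = 10^{-4}$ (two events), $P_{dataSW} = 10^{-4}$ (one event)

$P_{inertialSys} = 1 - (1 - P_{ADIRU}^2)(1 - P_{dataSW}) \approx 1 \times 10^{-4}$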


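Fault Tree with Augmentation

[Figure: fault tree. Top event: FT Inertial System failure. Branch 1: ADIRU device failure, from Primary ADIRU failure and Secondary ADIRU failure. Branch 2: Air data software failure, from Primary version failure and Secondary version failure.]

Basic event probabilities: $P_{ADIRU} = 10^{-4}$ (two events), $P_{dataSW} = 10^{-4}$ (two events)

$P_{inertialSys} = 1 - (1 - P_{ADIRU}^2)(1 - P_{dataSW}^2) \approx 2 \times 10^{-8}$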
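The three inertial-system fault trees can be checked numerically. The sketch below assumes independent basic events and uses the slide values of 10^-4 for both basic event probabilities; the helper names `p_or` and `p_and` are illustrative and are not an EDICT API.

```python
def p_or(*probs: float) -> float:
    """Probability that at least one of several independent events occurs."""
    none_occur = 1.0
    for p in probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


def p_and(*probs: float) -> float:
    """Probability that all of several independent events occur."""
    all_occur = 1.0
    for p in probs:
        all_occur *= p
    return all_occur


P_ADIRU = 1e-4    # ADIRU device failure, per unit
P_DATA_SW = 1e-4  # air data software failure

# Without fault awareness: each SRI fails on a device OR software failure,
# and the two SRIs are treated as independent.
p_sri = p_or(P_ADIRU, P_DATA_SW)
p_misleading = p_and(p_sri, p_sri)

# With fault awareness: both devices must fail, but the software failure
# is a single event that defeats both SRIs at once.
p_fault_aware = p_or(p_and(P_ADIRU, P_ADIRU), P_DATA_SW)

# With augmentation: two software versions, each failing independently.
p_augmented = p_or(p_and(P_ADIRU, P_ADIRU), p_and(P_DATA_SW, P_DATA_SW))

for name, value in [("misleading", p_misleading),
                    ("fault-aware", p_fault_aware),
                    ("augmented", p_augmented)]:
    print(f"{name:12s} {value:.2e}")
```

The misleading tree understates the failure probability by more than three orders of magnitude relative to the fault-aware tree, because it counts the shared software failure as two independent events. The augmented design recovers the 10^-8 range only if the two software versions really do fail independently.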


Summary of Major Points

• Fault Management in Spaceflight architectures is a many-dimensional problem
• Reliable Platform (RP) property-based architecture with hierarchical clustering shown to be effective
• RP FM strategies can be implemented in many ways
• Reconfigurable Fault Tolerance can accelerate performance and provide adaptive fault tolerance
• Clusters can be distributed and arranged in various hierarchical configurations
• Local fault management can be flexible and customizable
• Modeling fault effects and their impact on system reliability avoids incorrect assessments of dependability
• Need good modeling and analysis tools (use EDICT!)