ETSI NFV#13 NFV Resiliency Presentation - Ali Kafel - Stratus

Ali Kafel, VP of Business Development

Ensuring High Availability and Resiliency for NFV

Monday 15th February 2016, 3.00 - 6.00pm

Croke Park, Dublin 3, Ireland

MOVING IT TO THE FIELD (CO-LOCATED WITH ETSI NFV#13)

The details of this presentation are covered in this White Paper:

http://www.slideshare.net/akafel/nfv-resiliency-whitepaper-ali-kafel-stratus-technologies

Agenda

• Why We Need Resiliency vs High Availability
• Achieving Resiliency Management for NFV
• Proof point – ETSI PoC#35

Stratus Technologies

Trusted Name in Fault Tolerant Computing for 35 years

• Hardware Fault Tolerance (1980 - Present): proprietary platforms; ftServer on Intel platforms
• Software Fault Tolerance (2008 - Present): everRun Enterprise, 12,000+ installed
• Stratus Fault Tolerant Cloud (2015 - Present): resilient cloud technologies, based on proven SW infrastructure

Why the need for Resiliency in NFV

• It is no longer only about voice services: certain data and video services need HA and Resiliency more than voice
• Even “mature” cloud technologies still lack HA and Resiliency

Downtime per year

Uptime      Hours    Mins     Secs
99.9%       8.76     525.6    31536
99.99%      -        52.56    3154
99.999%     -        5.256    315.4
99.9999%    -        0.526    31.54
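The downtime figures can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes a 365-day year (which matches the 31536 secs entry for 99.9%), and the function name `downtime_per_year` is made up for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year, 31,536,000 seconds


def downtime_per_year(uptime_percent):
    """Return yearly downtime as (hours, minutes, seconds) for an uptime percentage."""
    seconds = SECONDS_PER_YEAR * (1 - uptime_percent / 100)
    return seconds / 3600, seconds / 60, seconds


# Print one row per availability level used on the slide.
for pct in (99.9, 99.99, 99.999, 99.9999):
    hours, mins, secs = downtime_per_year(pct)
    print(f"{pct}%: {hours:.3f} h, {mins:.3f} min, {secs:.1f} s")
```

Each extra "nine" divides the yearly downtime budget by ten, which is why six nines leaves only about half a minute per year.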

Defining Reliability, Availability and Resiliency

Reliability
• How long a system performs its intended function
• MTBF = total time in service / number of failures

Availability
• % of time equipment is in an operable state, i.e. service accessibility and service continuity
• Availability (A) = Uptime / (Uptime + Downtime)
• A = MTBF / (MTBF + MTTR)

Resiliency
• The ability to recover quickly from failures and return to the original form / state, maintaining an operable state plus QoS
• Resiliency (R) = Availability (A) + QoS

What you need is R, not just A, because, for example:
• A 99.999% application that fails once a week for just 1 sec and disrupts active services is not resilient and not acceptable
• A 99.9999% application that causes increased latency during a fault is not acceptable
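A quick calculation shows why availability alone can be misleading. The sketch below applies A = MTBF / (MTBF + MTTR) to the once-a-week, one-second failure example. It assumes MTBF is measured as the uptime between failures, and the function name `availability` is made up for this example.

```python
def availability(mtbf, mttr):
    """A = MTBF / (MTBF + MTTR); both arguments must use the same time unit."""
    return mtbf / (mtbf + mttr)


SECONDS_PER_WEEK = 7 * 24 * 3600

# One failure per week with a 1-second recovery: the uptime between failures
# is one week minus the second spent recovering (an assumption of this sketch).
a = availability(mtbf=SECONDS_PER_WEEK - 1, mttr=1)

print(f"availability = {a:.6%}")
print("meets five nines:", a >= 0.99999)
```

The application clears the five-nines bar on paper, yet it drops active services every week. The availability figure says nothing about that disruption, which is the gap the QoS term in R is meant to cover.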

Resiliency Management cannot be done in the VNFs, because you cannot manage what you cannot see

• Over 80% of system failure modes are not directly visible by the VNFs
• Infrastructure decoupling hides the information required to take actions on faults from VNFs

[Figure: what VNFs depend on and are exposed to.
 Components: VNFs, Virtualized Resources, VNFM, VIM, SDNC-OL, SDNC-UL, Shared Storage, Shared Network, NFVI Fabric, Node HW (C/N/S), Node SW (C/N/S), Virtualization SW (vC, vN, vS), Facility Infra, DCIM, Access Networks, Core Networks.
 Fault sources: HW Faults, SW Faults, Config Faults, Migrations, Upgrades, Performance Faults, Resource Depletion, Fault Impacts, External Dependencies.]

Resiliency management can be “designed in” in multiple ways, but it’s best done in the Software Infrastructure

Comparison rows: Costs & Resources, Pros, Cons

In the Hardware
Stack: Applications / VNFs, Operating Environment, Hardware
Pros
• Transparent – no code change
• Fast & simple deployment
• No special app software
Cons
• Very expensive
• Inefficient utilization
• Special hardware
• Rigid

In the Applications
Stack: Applications / VNFs, Middleware, Hardware
Pros
• App specific state can be customized
Cons
• Can’t detect & manage all infrastructure faults
• Code written for resiliency increased by ~40%
• Most developers don’t have resiliency experience
• More complex & longer time to develop

In the Software Infrastructure
Stack: Applications / VNFs, Resilience Layer, Hardware
Pros
• Broader & faster fault detection and correlation
• Faster and simpler application development
• Transparent – no code changes
• Multiple levels of resiliency
Cons
• Needs to be adaptable to a wide range of application architectures

Benefits:
• Reduces Development & Verification time
• Lower Risks
• Faster time to market

Resiliency Management: It’s Complex, Multi-Dimensional and More than Just Fault Management

Resiliency Management includes
• Fault Management
• Availability Management
• Configuration Management

Stages
• Detection (Prediction)
• Localization
• Isolation
• Remediation (Service restoration)
• Recovery (Redundancy restoration)

Resiliency depends on multiple factors
• Speed of service restoration & redundancy restoration
• State management: service continuity
  • “Key state” versus “All state”
• Redundancy mode: resource consumption / cost
• Application performance impact

Multi-dimensional aspects of Resiliency
Two Key Elements: Service Restoration and State Protection

State Protection
• Remembering the preceding events in a given sequence of interactions within the application
• All or partial?

Service Restoration (or Failover)
• Ensuring that service is restored either through a fast restart or failover to an active secondary or hot standby
• The speed of Service Restoration depends on the type of application

Some applications need State Protection; most applications need fast Service Restoration


Multi-dimensional aspects of Resiliency
State Protection versus Service Restoration

Types of State Protection
• Full state protection
• Key state protection
• No state protection (Stateless)

State Management has implications on
• Transparency
• Performance
• Resources

Service Restoration Speed

Slow (mins): “Cold Standby”
• Re-instantiation after failure: no standby
• State protection: key state stored on disk, e.g. “OSS, Billing”
• No state protection: start from reset, e.g. “Web server”

Medium (secs): “Warm Standby”
• Pre-instantiated before failure: failover to running standby
• State protection: failover + key state reload, key state stored in RAM or disk, e.g. “email, SMS”
• No state protection: failover, e.g. “vCE Router Forwarder”

Fast (msecs): “Hot Standby or Active-Active”
• State protection: failover, full VM state in RAM, e.g. “Voice control, Router Control”
• No state protection: failover, e.g. “vPE Router Forwarder”

Chart axes: Service Accessibility, Service Continuity

To do Fast Remediation you need
• Pre-instantiation
• State management




Cold Restart versus Hot Standby or Active-Active: it’s like surviving a heart attack versus preventing one

Fault Tolerant Systems Provide Service Continuity, Even During Failures

Cold Restart (Instant HA)
• Re-instantiation after failure: no standby
• Failure, then customer-affecting application outage, then app restart, then normal
• All state is lost
• Heart attack analogy: immense pain, loss of consciousness, loss of bodily control, temporary brain loss

Hot Standby or Active-Active (Fault Tolerant)
• Fully protected, then failure, then backup activated (unprotected), then restored to fully protected redundancy
• All state is preserved

Timeline scale: msecs, secs, mins, hours, days



State protection: Guaranteeing Globally Consistent State

Different ways to describe StatePointing
• Active-standby synchronous VM replication
• Also known as checkpointing with I/O barrier, I/O lock-stepping or buffering

What does it guarantee?
• Application transparency
• The I/O barrier prevents all external communications from the speculative execution prior to state replication
• Consistent VM memory replica between the active and the hot standby, at the confirmed StatePoint

We call it StatePointing (VM replication)
Providing Service Continuity with fast Service Restoration

• VM instances are paired between primary and secondary hosts in the cloud infrastructure
• State of the primary (active) is captured regularly and applied to the secondary (hot standby)
• StatePoint™ = VM Checkpoint + I/O StateStepping
  • Provides globally consistent state
• Fast service restoration from the most recent StatePoint upon primary failover to secondary
• Automatic redundancy restoration through third host instantiation
• If the primary host fails, it automatically switches to the secondary host

[Figure: StatePoint timeline across the Active Host, the Hot Standby Host and a Third Host (created post primary failure). Guest run epochs N-1, N, N+1 and N+2 alternate with StatePoints SP N-1, SP N and SP N+1; the Third Host starts the guest from image and receives SP N+1 through SP N+X.]

[Figure: StatePoint sequence between the Active host (Guest VM, QEMU, QEMU Monitor, Network Egress Queue) and the Hot Standby host (QEMU, QEMU Monitor, snapshots) for StatePoints n-1, n and n+1, showing active-standby and egress network traffic. Egress packets P1 to P5 are enqueued, an I/O barrier is inserted for each StatePoint (n-1, n, n+1), and queued packets are held while a barrier is still on and transmitted once it is removed.]

Legend
• Egress Network Queue Barrier: prevents transmission of queued egress packet(s) until the barrier is removed
• PCR (Pause, Capture, Resume): phases of the StatePoint process when VM execution is suspended

Note: For simplicity, n-2 interactions are not shown.
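The egress barrier idea can be modeled in a few lines. The sketch below is a simplified illustration and not the Stratus or QEMU implementation: the class `EgressQueue` and the function `run_epochs` are hypothetical names, and the model assumes a barrier is queued at the end of each epoch and removed once the standby confirms that epoch's StatePoint.

```python
from collections import deque


class EgressQueue:
    """Toy model: outbound packets wait behind per-epoch barriers."""

    def __init__(self):
        self._queue = deque()   # entries: ("pkt", payload) or ("barrier", epoch)
        self.transmitted = []   # packets actually released to the network

    def enqueue(self, payload):
        self._queue.append(("pkt", payload))

    def insert_barrier(self, epoch):
        # Everything queued ahead of this marker was produced during `epoch`.
        self._queue.append(("barrier", epoch))

    def remove_barrier(self, epoch):
        # Called when the standby has confirmed the StatePoint for `epoch`.
        if ("barrier", epoch) not in self._queue:
            raise ValueError(f"no barrier queued for epoch {epoch}")
        # Release packets up to that barrier; packets behind it stay held.
        # Any older barrier still queued is dropped, since confirming a later
        # StatePoint covers the earlier ones in this model.
        while True:
            kind, value = self._queue.popleft()
            if kind == "pkt":
                self.transmitted.append(value)
            elif value == epoch:
                break


def run_epochs(packets_per_epoch, fail_before_ack=None):
    """Run epochs in order; optionally fail before one epoch is confirmed."""
    queue = EgressQueue()
    for epoch, packets in enumerate(packets_per_epoch):
        for payload in packets:
            queue.enqueue(payload)
        queue.insert_barrier(epoch)
        if epoch == fail_before_ack:
            break  # primary fails: unconfirmed packets are never sent
        queue.remove_barrier(epoch)
    return queue.transmitted


print(run_epochs([["P1", "P2"], ["P3"], ["P4", "P5"]]))
print(run_epochs([["P1", "P2"], ["P3"], ["P4", "P5"]], fail_before_ack=1))
```

In the failure run only the packets from confirmed epochs leave the host, so the outside world never sees output from state the standby does not hold. That property is what makes the replicated state globally consistent.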

Multiple levels of resiliency
Ensures flexibility and resource optimization based on applications

Deliver Availability as an infrastructure service to virtual and cloud ecosystems

While every VNF needs Fault Management, not all need state protection

Example VNFs: Firewall, MME, IMS, Web Server

VNF types
• Monolithic VNFs
• De-composed VNFs (separate control and forwarding elements)
  • VNF-C Control Element: stateful
  • VNF-C Forwarding Elements: stateless fast path

Modes of protection
• Fault Tolerant (includes State protection)
• High Availability (no State protection)
• Unprotected
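One way to picture per-component protection levels is a small selection rule. The sketch below is an assumption made for illustration, not the Stratus policy: the names `Protection`, `VnfComponent` and `protection_for` are hypothetical, and the rule simply maps stateful components to Fault Tolerant and stateless ones to High Availability unless they are marked as not needing protection.

```python
from dataclasses import dataclass
from enum import Enum


class Protection(Enum):
    FAULT_TOLERANT = "Fault Tolerant (includes state protection)"
    HIGH_AVAILABILITY = "High Availability (no state protection)"
    UNPROTECTED = "Unprotected"


@dataclass(frozen=True)
class VnfComponent:
    name: str
    stateful: bool
    needs_protection: bool = True


def protection_for(component):
    """Pick a protection mode for one VNF component (illustrative rule only)."""
    if not component.needs_protection:
        return Protection.UNPROTECTED
    if component.stateful:
        return Protection.FAULT_TOLERANT
    return Protection.HIGH_AVAILABILITY


components = [
    VnfComponent("control element", stateful=True),
    VnfComponent("forwarding element", stateful=False),
    VnfComponent("test tool", stateful=False, needs_protection=False),
]
for component in components:
    print(f"{component.name}: {protection_for(component).value}")
```

Choosing the mode per component rather than per VNF is what allows a de-composed VNF to pay for state protection only on its control element.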


Protection with Application transparency, no code changes
Resiliency Functionality in the NFVI nodes & managed in the MANO

[Figure: NFV stack.
 Infrastructure: Commodity High Volume Networking, Commodity Hyper Scale COTS Computing, Commodity High Volume Storage, Virtualization, Stratus Node Resiliency Services (NRS).
 Linux guests: EPC, PCRF, HSS, IMS…, Optical Transport Control Plane, L3 Routing Control Plane, L2 Switching Control Plane, Billing, Customer Care, NOC.
 Groups: NFV, Virtualized SDN, Virtualized OSS/BSS.
 Management: Orchestration, MANO, OpenStack environment, Stratus Resiliency Management Services (RMS).]

The Stratus approach has implemented enhancements in KVM and plug-ins in OpenStack to make it seamless for the VNFs


Benefits of Resiliency Management
that includes Fault Management, Availability Management and Configuration Management

SW Infrastructure Resiliency Management
• Fault protection for all applications, no required code changes for most apps
• State Protection, offering globally consistent state
• Multiple levels of Resiliency – Software Defined Availability (SDA): control vs. forwarding element, stateful vs. stateless, etc.

Benefits:
• Reduces Development & Verification time
• Lower Risks
• Faster time to market

The Stratus-led PoC (ETSI PoC#35)

Participants of PoC#35

Availability Management with Stateful Fault Tolerance
• Demonstrated at:
  • NFV World Congress, May 6-8, San Jose, CA
  • OpenStack Summit, May 2015, Vancouver, Canada
  • SDN World Congress, Oct 2015, Dusseldorf, Germany
• Completed 7/31/2015, final report submitted

http://nfvwiki.etsi.org/index.php?title=Availability_Management_with_Stateful_Fault_Tolerance

What we proved with PoC#35

• OpenStack based VIM mechanisms alone are insufficient for supporting carrier grade resiliency; Stratus Cloud Technology addresses that and provided stateful failover, enabling service continuity with acceptable QoS
  • Service Restoration in milliseconds
  • Redundancy Restoration in seconds
• Any non-resilient VNF can be made resilient instantly with no code change (as long as it is OpenStack ready; there is no standard way to package VNFs)
• Multiple levels of Resiliency can be easily provided using Software Defined Resiliency in the infrastructure, based on the application’s requirements for state and service restoration speed

Thank You!