MTBF Presentation

Upload: satyanneshi-er

Post on 24-Feb-2018


TRANSCRIPT

  • 7/25/2019 Mtbf Presentation

    1/81

    2004 Cisco Systems, Inc. All rights reserved. Printed in USA.

    Presentation_ID.scr

    1 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    AVAILABILITY MEASUREMENT

    SESSION NMS-2201


    Agenda

    Introduction

    Availability Measurement Methodologies

    Trouble Ticketing

    Device Reachability: ICMP (Ping), SA Agent, COOL

    SNMP: Uptime, Ping-MIB, COOL, EEM, SA Agent

    Application

    Developing an Availability Culture


    Associated Sessions

    NMS-1N01: Intro to Network Management

    NMS-1N02: Intro to SNMP and MIBs

    NMS-1N04: Intro to Service Assurance Agent

    NMS-1N41: Introduction to Performance Management

    NMS-2042: Performance Measurement with Cisco IOS

    ACC-2010: Deploying Mobility in HA Wireless LANs

    NMS-2202: How Cisco Achieved HA in Its LAN

    RST-2514: HA in Campus Network Deployments

    NMS-4043: Advanced Service Assurance Agent

    RST-4312: High Availability in Routing

    INTRODUCTION

    WHY MEASURE AVAILABILITY?


    Why Measure Availability?

    1. Baseline the network

    2. Identify areas for network improvement

    3. Measure the impact of improvement projects


    Why Should We Care About Network Availability?

    Where are we now? (baseline)

    Where are we going? (business objectives)

    How best do we get from where we are now to where we are going? (improvements)

    What if we can't get there from here?


    Why Should We Care About Network Availability?

    Recent studies by Sage Research determined that US-based service providers encountered:

    Percent of downtime that is unscheduled: 44%

    18% of customers experience over 100 hours of unscheduled downtime, or an availability of 98.5%

    Average cost of network downtime per year: $21.6 million, or $2,169 per minute!

    Downtime Costs too Much!!!

    SOURCE: Sage Research, IP Service Provider Downtime Study: Analysis of Downtime Causes, Costs and Containment Strategies, August 17, 2001, Prepared for Cisco SPLOB


    Cause of Network Outages

    Source: Gartner Group

    Software and Application: 40%
      Software issues
      Performance and load
      Scaling

    User Error and Process: 40%
      Change management
      Process consistency

    Technology: 20%
      Hardware
      Links
      Design
      Environmental issues
      Natural disasters


    Top Three Causes of Network Outages

    Congestive degradation

    Capacity (unanticipated peaks)

    Solutions validation

    Software quality

    Inadvertent configuration change

    Change management

    Network design

    WAN failure (e.g., major fiber cut or carrier failure)

    Power

    Critical services failure (e.g., DNS/DHCP)

    Protocol implementations and misbehavior

    Hardware fault


    Method for Attaining a Highly-Available Network

    Or a Road to Five Nines

    Establish a standard measurement method

    Define business goals as related to metrics

    Categorize failures, root causes, and improvements

    Take action for root cause resolution and improvement implementation


    Where Are We Going? Or What Are Your Business Goals?

    Financial: ROI, Economic Value Added, Revenue/Employee

    Productivity

    Time to market

    Organizational mission

    Customer perspective: Satisfaction, Retention, Market Share

    Define Your End-State. What Is Your Goal?


    Why Availability for Business Requirements?

    Availability as a basis for productivity data

    Measurement of total-factor productivity

    Benchmarking the organization

    Overall organizational performance metric

    Availability as a basis for organizational competency

    Availability as a core competency

    Availability improvement as an innovation metric

    Resource allocation information

    Identify defects

    Identify root cause

    Measure MTTR tied to process


    It Takes a Design Effort to Achieve HA

    Hardware and Software Design

    Network and Physical Plant Design

    Process Design

    INTRODUCTION

    WHAT IS NETWORK AVAILABILITY?



    What Is High Availability?

    Availability | Downtime per Year (24x7x365)
    99.9999%     | 30 seconds
    99.999%      | 5 minutes
    99.990%      | 53 minutes
    99.950%      | 4 hours 23 minutes
    99.900%      | 8 hours 46 minutes
    99.500%      | 1 day 19 hours 48 minutes
    99.000%      | 3 days 15 hours 36 minutes

    High Availability Means an Average End User Will Experience Less than Five Minutes Downtime per Year
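The downtime figures in the table follow directly from the availability percentage. A minimal sketch, assuming a 24x7x365 service year, to reproduce them:

```python
def annual_downtime_minutes(availability_pct):
    """Minutes of downtime per year implied by an availability percentage,
    assuming a 24x7x365 (525,600-minute) service year."""
    minutes_per_year = 365 * 24 * 60
    return (1 - availability_pct / 100.0) * minutes_per_year

# "Five nines" (99.999%) allows about 5.26 minutes of downtime per year
print(round(annual_downtime_minutes(99.999), 2))
# 99.990% allows roughly 53 minutes
print(round(annual_downtime_minutes(99.990)))
```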


    Availability Definition

    Availability definition is based on business objectives

    Is it the user experience you are interested in measuring?

    Are some users more important than others?

    Availability groups?

    Definitions of different groups

    Exceptions to the availability definition

    e.g., the CEO should never experience a network problem


    How You Define Availability

    Define availability perspective (customer, business, etc.)

    Define availability groups and levels of redundancy

    Define an outage

    Define impact to network

    Ensure SLAs are compatible with outage definition

    Understand how maintenance windows affect outage definition

    Identify how to handle DNS and DHCP within the definition of a Layer 3 outage

    Examine component level sparing strategy

    Define what to measure

    Define measurement accuracy requirements


    Network Design

    What Is Reliability?

    Reliability is often used as a general term that refers to the quality of a product

    Failure rate

    MTBF (Mean Time Between Failures) or

    MTTF (Mean Time To Failure)

    Engineered availability

    Reliability is defined as the probability of survival (or no failure) for a stated length of time


    MTBF Defined

    MTBF stands for Mean Time Between Failures

    MTTF stands for Mean Time To Failure

    This is the average length of time between failures (MTBF) or, to a failure (MTTF)

    More technically, it is the mean time to go from an OPERATIONAL STATE to a NON-OPERATIONAL STATE

    MTBF is usually used for repairable systems, and MTTF is used for non-repairable systems

    MTTR stands for Mean Time to Repair


    One Method of Calculating Availability

    Availability = MTBF / (MTBF + MTTR)

    What is the availability of a computer with MTBF = 10,000 hrs. and MTTR = 12 hrs?

    A = 10000 / (10000 + 12) = 99.88%

    Annual uptime

    8,760 hrs/year X (0.9988) = 8,749.5 hrs

    Conversely, annual DOWN time is

    8,760 hrs/year X (1 - 0.9988) = 10.5 hrs
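The same arithmetic as a small sketch:

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

a = availability(10000, 12)          # ~0.9988
annual_uptime = 8760 * a             # ~8749.5 hours
annual_downtime = 8760 * (1 - a)     # ~10.5 hours
```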


    Networks Consist of Series-Parallel

    Combinations of in-series and redundant components

    [RBD figure: reliability block diagram with component A in series with a redundant pair B1/B2 (1-of-2), component C, a redundant group D1/D2/D3 (2-of-3), and components E and F in series]
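Series-parallel availability can be computed block by block: in-series components multiply availabilities, while redundant (pure active parallel) components multiply their unavailabilities. A minimal sketch; the component availability values are hypothetical:

```python
def series(*avail):
    """All in-series components must be up: availabilities multiply."""
    result = 1.0
    for a in avail:
        result *= a
    return result

def parallel(*avail):
    """Pure active parallel: up unless every redundant component is down."""
    down = 1.0
    for a in avail:
        down *= (1.0 - a)
    return 1.0 - down

# hypothetical: a redundant pair (99.9% each) in series with two 99.95% devices
pair = parallel(0.999, 0.999)          # 0.999999 for the redundant block
total = series(0.9995, pair, 0.9995)
```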


    More Complex Redundancy

    Pure active parallel

    All components are on

    Standby redundant

    Backup components are not operating

    Perfect switching

    Switch-over is immediate and without fail

    Switch-over reliability

    The probability of switchover when it is not perfect

    Load sharing

    All units are on and workload is distributed


    MEASURING THE PRODUCTION NETWORK


    Reliability or Engineered Availability vs. Measured Availability

    1. Reliability is an engineered probability of the network being available

    2. Measured availability is the actual outcome produced by physically measuring the engineered system over time

    Calculations Are Similar: Both Are Based on MTBF and MTTR


    Some Types of Availability Metrics

    Mean Time to Repair (MTTR)

    Impacted User Minutes (IUM)

    Defects per Million (DPM)

    MTBF (Mean Time Between Failure)

    Performance (e.g. latency, drops)


    Back to How Availability Is Calculated?

    Availability (%) is calculated by tabulating end user outage time, typically on a monthly basis

    Some customers prefer to use DPM (Defects per Million) to represent network availability

    Availability (%) = (Total User Time - Total User Outage Time) X 100 / Total User Time

    DPM = Total User Outage Time X 10^6 / Total User Time

    Total User Time = Total # of End Users X Time in Reporting Period

    Total User Outage Time = Sum of (# of End Users X Outage Time) over All the Incidents in the Reporting Period

    Ports or Connections May Be Substituted for End Users
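A minimal sketch of both formulas, using the 100-customer example worked later in the deck (8 customers down for 24 hours each over a year):

```python
def availability_pct(total_user_time, total_user_outage_time):
    """Availability (%) = (Total User Time - Total User Outage Time) x 100 / Total User Time."""
    return (total_user_time - total_user_outage_time) / total_user_time * 100

def dpm(total_user_time, total_user_outage_time):
    """DPM = Total User Outage Time x 10^6 / Total User Time."""
    return total_user_outage_time / total_user_time * 1e6

total_time = 100 * 24 * 365   # 100 users for one year, in user-hours
outage_time = 8 * 24          # 8 users down 24 hours each

print(round(availability_pct(total_time, outage_time), 4))   # ~99.9781
print(round(dpm(total_time, outage_time), 1))                # ~219.2
```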


    Defects per Million

    Started with mass produced items like toasters

    For PVCs:

    DPM = Sum(#conns * outage minutes) / Sum(#conns * total minutes) x 10^6

    For SVCs or phone calls:

    DPM = (#existing calls lost + #new calls blocked) / total calls attempted x 10^6

    For connectionless traffic (application dependent):

    DPM = Sum(#end users * outage minutes) / Sum(#end users * total minutes) x 10^6

    NETWORK AVAILABILITY COLLECTION METHODS

    TROUBLE TICKETING METHODS


    Availability Improvement Process

    Step I

    Validate data collection/calculation methodology

    Establish network availability baseline

    Set high availability goals

    Step II

    Measure uptime ongoing

    Track defects per million (DPM) or IUM or availability (%)

    Step III

    Track customer impact for each ticket/MTTR

    Categorize DPM by reason code and begin trending

    Identify initiatives/areas for a focus to eliminate defects


    Data Collection/Analysis Process

    Understand current data collection methodology

    Customer internal ticket database

    Manual

    Monthly collection of network performance data and export of the following fields to a spreadsheet or database system:

    Outage start time (date/time)

    Service restore time (date/time)

    Problem description

    Root cause

    Resolution

    Number of customers impacted

    Equipment model

    Component/part

    Planned maintenance activity/unplanned activity

    Total customers/ports on network


    Network Availability Results

    Methodology and assumptions must be documented

    Network availability should include:

    Overall % network availability (baseline/trending)

    Conversion of downtime to DPM by:

    Planned and unplanned

    Root cause

    Resolution

    Equipment type

    Overall MTTR

    MTTR by:

    Root cause

    Resolution

    Equipment type

    Results are not necessarily limited to the above but should be customized based on your network and requirements


    Availability Metrics: Reviewed

    Network has 100 customers

    Time in reporting period is one year or 24 hours x 365 days

    8 customers have 24 hours down time per year

    Availability = 1 - (8 x 24) / (100 x 24 x 365) = 0.99978082

    DPM = (8 x 24 x 10^6) / (100 x 24 x 365) = 219.2 failures for every 1 million user hours

    MTBF = (24 x 365) / 8 = 1095 (hours)

    MTTR = 1095 x (1 - 0.99978082) / 0.99978082 = 0.24 (hours)


    TROUBLE TICKETING METHOD

    SAMPLE OUTPUT


    Network Availability

    [Illustrative chart: Overall Network Availability (Planned/Unplanned) by month, July through June, on a scale from 99.50% to 100.00%, with key takeaways annotated]


    Platform Related DPM Comparison (Illustrative)

    Platform related DPM contributed 13% of total DPM in September

    Platform DPM includes events from: Backbone, NAS, PG, POP, Radius Server, VPN Radius Server

    All other events are included in the Other category

    Breakdown of Platform Related DPM

    Network Access Server (NAS) accounts for 50% of the total Platform related DPM in September

    Private Access Gateway (PG) showing significant decrease over the past 3 months

    DPM                    | June | July | Aug  | Sept
    Backbone               | 1.5  | .8   | 15.7 | 2.3
    NAS                    | 21.7 | 19.4 | 27   | 26.1
    PG                     | 26   | 59.6 | 56.8 | 18.9
    POP                    | 0    | 3.9  | .5   | 1.6
    Radius Server          | 0    | 0    | 1.2  | .3
    VPN Radius             | 0    | 8.8  | 2.8  | 3.4
    Total Platform Related | 49.2 | 82.5 | 104  | 52.6

    DPM              | June  | July  | Aug   | Sept
    Other            | 339.5 | 424.9 | 394.7 | 362.2
    Platform Related | 49.2  | 82.5  | 104   | 52.6
    Total DPM        | 388.7 | 507.4 | 498.7 | 414.8
    99.99% Target    | --    | --    | --    | 100

    [Illustrative chart: total DPM by month, June through December, on a 0-600 scale; the 100 DPM target corresponds to 99.99% availability]


    DPM by Cause

    [Illustrative chart and table: DPM by cause (Config/SW, HW, Other, Power, Environmental, Human Error, Unknown), December through May, on a 0-2500 scale; monthly totals of 1964.8, 1641.9, 1293.1, 1226, 1202.2, and 3789.3 DPM]


    MTTR Analysis: Hardware Faults

    Number of faults increased slightly in September; however, MTTR decreased

    49% of faults resolved in < 1 hour in September

    11% of faults resolved in > 24 hours, with an additional 3% > 100 hours

    Produced for each fault type

    [Illustrative charts for Router HW: MTTR in hours by month, Jun through Dec (values shown include 12.42, 15.1, 8.49, and 7.19), and number of faults by month broken down by time-to-resolve bucket: < 1 Hr, 1-4 Hr, 4-12 Hr, 12-24 Hr, > 24 Hr, > 100 Hr]


    Trouble Ticketing Method

    Pros

    Easy to get started

    No network overhead

    Outages can be categorized based on event

    Cons

    Some internal subjective/consistency process issues

    Outages may occur that are not included in the trouble ticketing systems

    Resources needed to scrub data and create reports

    May not work with existing trouble ticketing system/process

    Network Availability Collection Methods

    AUTOMATED FAULT MANAGEMENT EVENTS METHOD


    Availability Improvement Process

    Step I

    Determine availability goals

    Validate fault management data collection

    Determine a calculation methodology

    Build software package to use customer event log

    Step II

    Establish network availability baseline

    Measure uptime on an ongoing basis

    Step III

    Track root cause and customer impact

    Begin trending of availability issues

    Identify initiatives and areas of focus to eliminate defects


    Event Log Example

    Fri Jun 15 11:05:31 2001 Debug: Looking for message header ...

    Fri Jun 15 11:05:33 2001 Debug: Message header is okay

    Fri Jun 15 11:05:33 2001 Debug: $(LDT) -> "06152001110532"

    Fri Jun 15 11:05:33 2001 Debug: $(MesgID) -> "100013"

    Fri Jun 15 11:05:33 2001 Debug: $(NodeName) -> "ixc00asm"

    Fri Jun 15 11:05:33 2001 Debug: $(IPAddr) -> "10.25.0.235"

    Fri Jun 15 11:05:33 2001 Debug: $(ROCom) -> "xlr8ed!"

    Fri Jun 15 11:05:33 2001 Debug: $(RWCom) -> "s39o!d%"

    Fri Jun 15 11:05:33 2001 Debug: $(NPG) -> "CISCO-Large-special"

    Fri Jun 15 11:05:33 2001 Debug: $(AlrmDN) -> "aSnmpStatus"

    Fri Jun 15 11:05:33 2001 Debug: $(AlrmProp) -> "system"

    Fri Jun 15 11:05:33 2001 Debug: $(OSN) -> "Testing"

    Fri Jun 15 11:05:33 2001 Debug: $(OSS) -> "Normal"

    Fri Jun 15 11:05:33 2001 Debug: $(DSN) -> "SNMP_Down"

    Fri Jun 15 11:05:33 2001 Debug: $(DSS) -> "Agent_Down"

    Fri Jun 15 11:05:33 2001 Debug: $(TrigName) -> "NodeStateUp"

    Fri Jun 15 11:05:33 2001 Debug: $(BON) -> "nl-ping"

    Fri Jun 15 11:05:33 2001 Debug: $(TrapGN) -> "-2"

    Fri Jun 15 11:05:33 2001 Debug: $(TrapSN) -> "-2"

    Event Log

    Analysis of events received from the network devices

    Analysis of accuracy of the data


    Calculation Methodology: Example

    Primary events are device down/up

    Down time is calculated based on device-type outage duration

    Availability is calculated based on the total number of device types, the total time, and the total down time

    MTTR numbers are calculated from average duration of downtime

    With MTTR, the shortest and longest outage provide a simplified curve
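One way to sketch the down/up event pairing this methodology relies on; the event record format here is hypothetical, not the NMS log format shown earlier:

```python
def total_downtime(events):
    """Sum downtime by pairing each device's 'down' event with its next 'up'.

    `events` is an iterable of (device, state, timestamp) tuples, where
    state is "down" or "up" and timestamp is in seconds.
    """
    went_down = {}
    downtime = 0.0
    for device, state, ts in sorted(events, key=lambda e: e[2]):
        if state == "down":
            went_down.setdefault(device, ts)   # ignore repeated downs
        elif state == "up" and device in went_down:
            downtime += ts - went_down.pop(device)
    return downtime

events = [("rtr1", "down", 100), ("rtr1", "up", 160),
          ("sw1", "down", 50), ("sw1", "up", 80)]
print(total_downtime(events))   # 90.0 seconds (60 + 30)
```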


    Automated Fault Management Methodology

    Pros

    Outage duration and scope can be fairly accurate

    Can be implemented within an NMS fault management system

    No additional network overhead

    Cons

    Requires an excellent change management/provisioning process

    Requires an efficient and effective fault management system

    Requires custom development

    Does not account for routing problems

    Not a true end-to-end measure


    NETWORK AVAILABILITY DATA COLLECTION

    SAMPLE OUTPUT


    Automated Fault Management: Example Reports

    Device Type    | # of Devices | Count of Incidents | Total Down Time (hhh:mm:ss) | % Down | % Up     | Shortest Outage | Mean Time to Repair | Longest Outage | Events per Device
    Host Totals    | 2389         | 801                | 202:27:27                   | .0673% | 99.9327% | 0:00:19         | 0:20:47             | 27:48:46       | 24.4
    Network Totals | 4732         | 1673               | 430:02:03                   | .1309% | 99.8691% | 0:00:24         | 0:22:36             | 09:49:35       | 14.9
    Other Totals   | 897          | 173                | 212:29:46                   | .0509% | 99.9491% | 0:00:17         | 0:26:07             | 42:16:10       | 16.8
    GRAND TOTAL    | 8018         | 2647               | 844:59:16                   | .0830% | 99.9170% | 0:00:20         | 0:23:10             | 26:38:11       | 18.7


    Automated Fault Management: Example Reports (2)

    [Illustrative pie charts:

    Number of Managed Devices: Network Totals 59%, Host Totals 30%, Other Totals 11%

    Count of Incidents: Network Totals 63%, Host Totals 30%, Other Totals 7%

    Total Down Time: Network Totals 51%, Host Totals 24%, Other Totals 25%]

    Network Availability Collection Methods

    ICMP ECHO (PING) AND SNMP AS DATA GATHERING TECHNIQUES


    Data Gathering Techniques

    ICMP ping

    Link and device polling (SNMP)

    Embedded RMON

    Embedded event management

    Syslog messages

    COOL


    Data Gathering Techniques: ICMP Reachability

    Method definition:

    Central workstation or computer configured to send ping packets to the network edges (devices or ports) to determine reachability

    How:

    Edge interfaces and/or devices are defined and pinged on a determined interval

    Unavailability:

    Pre-defined non-response from the interface
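A minimal polling-loop sketch. The probe function is injected so the same loop works for ICMP ping (e.g., wrapping the system `ping` command), TCP connect tests, or anything else; the host names and probe below are hypothetical:

```python
import time

def poll_reachability(hosts, probe, rounds=1, interval_s=0):
    """Poll each host `rounds` times with `probe(host) -> bool`;
    return a per-host count of non-responses (the unavailability events)."""
    failures = {h: 0 for h in hosts}
    for _ in range(rounds):
        for h in hosts:
            if not probe(h):
                failures[h] += 1
        if interval_s:
            time.sleep(interval_s)
    return failures

# hypothetical probe; in practice this might shell out to `ping -c 1 -W 1 host`
reachable = lambda host: host != "edge-rtr-2"
print(poll_reachability(["edge-rtr-1", "edge-rtr-2"], reachable, rounds=3))
```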


    Availability Measurement Through ICMP

    Periodic ICMP Test: Periodic Pings to Network Devices; Periodic Pings to Network Leaf Nodes


    Data Gathering Techniques: ICMP Reachability

    Pros

    Fairly accurate network availability

    Accounts for routing problems

    Can be implemented for fairly low network overhead

    Cons

    Point to multipoint implies not a true end-to-end measure

    Availability granularity limited by ping frequency

    Maintenance of device database: must have a solid change management and provisioning process


    Data Gathering Techniques: Link and Device Status

    Method definition:

    SNMP polling and trapping on links, edge ports, or edge devices

    How:

    An agent is configured to SNMP poll and tabulate outage times for defined devices or links; a database maintains outage times and total service time; sometimes trap information is used to augment this method by providing more accurate information on outages

    Unavailability:

    Pre-defined, non-redundant links, ports, or devices that are down


    Polling Interval vs. Sample Size

    Polling interval is the rate at which data is collected from the network

    Polling interval = 1 / Sampling rate

    The smaller the polling interval, the more detailed (granular) the data collected

    Example: polling data once every 15 minutes provides 4 times the detail (granularity) of polling once an hour

    A smaller polling interval does not necessarily provide a better margin of error

    Example: polling once every 15 minutes for one hour has the same margin of error as polling once an hour for 4 hours


    Link and Device Status Method

    Method definition:

    SNMP polling and trapping on links, edge ports, or edge devices

    How:

    Utilizing existing NMS systems that are currently SNMP polling to tabulate outage times for defined devices or links

    A database maintains outage times and total service time

    SNMP trap information is also used to augment this method by providing more accurate information on outages


    Link and Device Status Method

    Pros

    Outage duration and scope can be fairly accurate

    Utilizes existing NMS systems

    Low network overhead

    Cons

    No canned SW to do this; custom development required

    Maintaining element device database is challenging

    Requires an excellent change mgmt and provisioning process

    Does not account for routing problems

    Not a true end-to-end measure


    CISCO SERVICE ASSURANCE AGENT (SA AGENT)


    Service Assurance Agent

    Method definition:

    SA Agent is an embedded feature of Cisco IOS software and requires configuration of the feature on routers within the customer network; use of the SA Agent can provide for a rapid, cost-effective deployment without additional hardware probes

    How:

    A data collector creates SA Agents on the routers to monitor certain network/service performances; the data collector then collects this data from the routers, aggregates it, and makes it available

    Unavailability:

    Pre-defined paths with reporting on non-redundant links, ports, or devices that are down within a path


    Case Study: Financial Institution (Collection)

    [Figure: SA Agent Collectors measuring paths to Remote Sites, DNS, and Internet Web Sites]


    Availability Using Network-Based Probes

    Availability = 1 - (Probes with No Response / Total Probes Sent)

    DPM = (Probes with No Response / Total Probes Sent) x 10^6

    DPM equations used with network-based probes as input data

    Probes can be a simple ICMP Ping probe, a modified Ping to test specific applications, or Cisco IOS SA Agent

    DPM will be for connectivity between 2 points on the network, the source and destination of the probe

    Source of probe is usually a management system and the destinations are the devices managed

    Can calculate DPM for every device managed


    Availability Using Network-Based Probes: Example

    Network probe is a ping

    10000 probes are sent between management system and managed device

    1 probe failed to respond

    Availability = 1 - 1/10000 = 0.9999

    DPM = (1/10000) x 10^6 = 100, i.e., 100 probes out of 1 million will fail
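The probe arithmetic above as a sketch:

```python
def probe_availability(total_sent, no_response):
    """Availability = 1 - (probes with no response / total probes sent)."""
    return 1 - no_response / total_sent

def probe_dpm(total_sent, no_response):
    """DPM = (probes with no response / total probes sent) x 10^6."""
    return no_response / total_sent * 1e6

print(round(probe_availability(10000, 1), 4))   # 0.9999
print(round(probe_dpm(10000, 1)))               # 100
```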


    Sample Size

    Sample size is the number of samples that have been collected

    The more samples collected, the higher the confidence that the data accurately represents the network

    Confidence (margin of error) is defined by m = 1/sqrt(sample size)

    Example: data is collected from the network every 1 hour

    After one day: m = 1/sqrt(24) = 0.2041

    After one month: m = 1/sqrt(24 x 31) = 0.0367
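A quick check of the margin-of-error numbers (illustrative Python; m = 1/sqrt(sample size) as on the slide, with hourly collection):

```python
from math import sqrt

# Margin of error shrinks as the square root of the sample size:
# hourly collection gives 24 samples after a day, 24 * 31 after a month.

def margin_of_error(sample_size: int) -> float:
    return 1 / sqrt(sample_size)

print(round(margin_of_error(24), 4))       # after one day  -> 0.2041
print(round(margin_of_error(24 * 31), 4))  # after one month -> 0.0367
```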


    Service Assurance Agent

    Pros

    Accurate network availability for defined paths

    Accounts for routing problems

    Implementation with very low network overhead

    Cons

    Requires a system to collect the SAA data

    Requires implementation in the router configurations

    Availability granularity limited by polling frequency

    Definition of the critical network paths to be measured

    COMPONENT OUTAGE ONLINE MEASUREMENT (COOL)



    COOL Objectives

    To automate the measurement, to increase operational efficiency and reduce operational cost

    To measure the outage as close to the source of outage events as possible, to pinpoint the cause of the outages

    To cope with a large number of network elements without causing system and network performance degradation

    To maintain measurement data reliably in the presence of element failure or network partition

    To support simplicity in deployment, configuration, and data collection (autonomous measurement)


    COOL Features

    (Diagram: COOL embedded in the access router; the NMS, NetTools, 3rd-party tools, C-NOTE, and PNL access the outage data; customer equipment attached)

    Automated real-time measurement; autonomous measurement

    Outage data stored in router

    Open access via Outage Monitor MIB

    Event notification filtering


    COOL Features (Cont.)

    Supports NMS or tools for such applications as:

    Calculation of software or hardware MTBF, MTTR, and availability per object, device, or network

    Verification of customers' SLA

    Troubleshooting in real time

    Two-tier framework:

    Reduces performance impact on the router

    Provides scalability to the NMS

    Makes it easy to deploy

    Provides flexibility to availability calculation

    (Diagram: two-tier framework; outage monitoring and measurement run in the access and core routers via COOL and the Outage Monitor MIB, while outage correlation and calculation run in the NMS; customer equipment attaches to the access routers)


    Access Router Outage Model

    Type A - Physical Entity Objects: component hardware or software failure, including the failure of line card, power supplies, fan, switch fabric, and so on

    Type B - Interface Objects: interface hardware or software failure, loss of signal

    Type C - Remote Objects: failure of remote device (customer equipment or peer networking device) or link in-between

    Type D - Software Objects: failure of software processes running on the RPs and line cards

    (Diagram: access router with RP, power, fan, and physical/logical interfaces; links to customer equipment, MUX/hub/switch, and a peer router; network management system monitoring)


    Outage Characterization

    Data Definition

    Defect threshold: a value across which the object is considered to be defective (service degradation or complete outage)

    Duration threshold: the minimum period beyond which an outage needs to be reported (given SLA)

    Start time: when the object outage starts

    End time: when the outage ends

    (Timeline: a down event crossing the defect threshold marks the start time, the up event marks the end time; the outage duration between them is reported when it exceeds the duration threshold)
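The duration-threshold rule can be sketched as follows (illustrative Python; the 5-second threshold is a hypothetical SLA value, not from the presentation):

```python
# An outage is reported only when its duration (end time - start time)
# meets or exceeds the duration threshold; shorter blips are filtered.

DURATION_THRESHOLD_SEC = 5.0  # hypothetical SLA-derived threshold

def reportable_outage(start_time: float, end_time: float,
                      threshold: float = DURATION_THRESHOLD_SEC) -> bool:
    return (end_time - start_time) >= threshold

assert reportable_outage(100.0, 110.0)      # 10 s outage: reported
assert not reportable_outage(100.0, 102.0)  # 2 s blip: filtered out
```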


    Architecture

    Outage Manager: measurement metrics, customer interfaces, measurement methods

    Internal component outage detector: Fault Manager (IOS) as event source (callbacks, syslog); optional CPU usage detection

    Remote component outage detector: customer equipment detection function (Ping, SAA APIs)

    Data table structure: Outage Component Table, Event History Table, Event Map Table, Process Map Table, Remote Component Map Table

    HA and persistent data store: time stamp, temp event data, and crash reason in NVRAM; outage data in ATA Flash

    Outage Monitor MIB: SNMP polling and SNMP notification

    Configuration: CLI with customer authentication


    Outage Data: AOT and NAF

    Requirements of measurement metrics:

    Enable calculation of MTTR, MTBF, availability, and SLA assessment

    Ensure measurement efficiency in terms of resources (CPU, memory, and network bandwidth)

    Measurement metrics per object:

    AOT: Accumulated Outage Time since measurement started

    NAF: Number of Accumulated Failures since measurement started

    AOT = 20 and NAF = 2

    (Timeline: Router 1 goes down twice due to system crashes, 10 minutes each)
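A minimal sketch of the AOT/NAF bookkeeping, reproducing the Router 1 example (Python, illustrative):

```python
# AOT (Accumulated Outage Time) is the sum of outage durations since
# measurement started; NAF (Number of Accumulated Failures) is their count.
# Router 1 example: two 10-minute system crashes.

def aot_naf(outage_minutes):
    return sum(outage_minutes), len(outage_minutes)

aot, naf = aot_naf([10, 10])
print(aot, naf)  # AOT = 20, NAF = 2
```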


    Outage Data: AOT and NAF

    Object containment model: Router Device contains Line Card, Physical Interface, Logical Interface

    Containment-independent property:

    Router Device: AOT = 20, NAF = 2 (two 10-minute system crashes)

    Interface 1: AOT = 7, NAF = 1 (one 7-minute interface failure)

    Service Affecting: AOT = 27, NAF = 3 (device and interface outages combined)


    Example: MTTR

    Find MTTR for Object i

    MTTRi = AOTi/NAFi = 14/2 = 7 min

    (Timeline: Object i fails twice during the measurement interval (T2 - T1), with times to repair (TTR) of 10 min and 4 min)


    Example: MTBF and MTTF

    Find MTBF and MTTF for Object i

    MTBFi = (T2 - T1)/NAFi = 1,400,000/2 = 700,000 min

    MTTFi = MTBFi - MTTRi = (T2 - T1 - AOTi)/NAFi = 700,000 - 7 = 699,993 min

    (Timeline: two failures of 10 min and 4 min over a measurement interval (T2 - T1) = 1,400,000 min; TTR = time to repair, TTF = time to failure, TBF = time between failures)


    Example: Availability and DPM

    Find availability and DPM for Object i

    Availability (%) = MTBF/(MTBF + MTTR) x 100

    Availability = (700,000/700,007) x 100 = 99.999%

    DPMi = [AOTi/(T2 - T1)] x 10^6 = 10 DPM

    (Timeline: same two failures, 10 min and 4 min, over a measurement interval of 1,400,000 min)
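The worked examples for MTTR, MTBF/MTTF, and availability/DPM can be reproduced together (Python, illustrative):

```python
# Deriving the per-object metrics from AOT/NAF and the measurement
# interval, using the example numbers: AOT = 14 min, NAF = 2,
# T2 - T1 = 1,400,000 min.

def outage_metrics(aot: float, naf: int, interval: float) -> dict:
    mttr = aot / naf                    # Mean Time To Repair
    mtbf = interval / naf               # Mean Time Between Failures
    mttf = mtbf - mttr                  # Mean Time To Failure
    availability = mtbf / (mtbf + mttr) * 100
    dpm = aot / interval * 1_000_000    # Defects Per Million
    return dict(mttr=mttr, mtbf=mtbf, mttf=mttf,
                availability=availability, dpm=dpm)

m = outage_metrics(aot=14, naf=2, interval=1_400_000)
print(m)  # mttr=7, mtbf=700000, mttf=699993, availability~99.999, dpm~10
```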


    Planned Outage Measurement

    To capture operational CLI commands, both reload and forced switchover

    There is a simple rule to derive an upper bound of the planned outage:

    If there is no NVRAM soft crash file, check the reboot reason or switchover reason

    If it is reload or forced switchover, it can be considered an upper bound of the planned outage

    (Diagram: send break, reload, and forced switchover are operation-caused outages; reload and forced switchover give an upper bound of the planned outage)


    Event Filtering

    Flapping interface detection and filtering:

    A faulty interface state can keep changing up and down

    May cause virtual network disconnection

    May cause an event storm, with hundreds of messages for each flapping event

    May make the object MTBF unreasonably low due to frequent short failures

    This unstable condition needs to get the operator's attention

    COOL detects the flapping status by:

    Catching very short outage events (less than the duration threshold)

    Incrementing the event counter

    Flapping status, if the counter goes over the flapping threshold (3 events) within the short period (1 sec); sends a notification

    Stable status, if it falls back below the threshold; sends another notification
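A rough sketch of the flap-detection idea (Python, illustrative; the thresholds of 3 events within 1 second follow the slide, but the windowing details here are an assumption):

```python
# Declare an object "flapping" when at least FLAP_THRESHOLD short
# up/down events land inside any WINDOW_SEC window, instead of
# emitting one notification per event.

FLAP_THRESHOLD = 3
WINDOW_SEC = 1.0

def is_flapping(event_times):
    """event_times: sorted timestamps (seconds) of short up/down events."""
    for start in event_times:
        in_window = [t for t in event_times
                     if start <= t < start + WINDOW_SEC]
        if len(in_window) >= FLAP_THRESHOLD:
            return True
    return False

assert is_flapping([0.0, 0.2, 0.4])      # 3 events in 1 s -> flapping
assert not is_flapping([0.0, 2.0, 4.0])  # spaced events -> stable
```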


    Data Persistency and Redundancy

    Data persistency

    To avoid data loss due to link outage or a crash of the router itself

    Data redundancy

    To continue the outage measurement after the switchover

    To retain the outage data even if the RP is physically replaced

    (Diagram: on both the active and standby RP, COOL keeps outage data in RAM and copies it to persistent storage in NVRAM and Flash through periodic and event-driven updates; the active RP's data is copied to the standby RP)


    Outage Monitor MIB

    CISCO-OUTAGE-MONITOR-MIB: iso.org.dod.internet.private.enterprise.cisco.ciscoMgmt.ciscoOutageMIB = 1.3.6.1.4.1.9.9.280

    cOutageObjectTable: Object-Type; Object-Index; Object-Status; Object-AOT; Object-NAF

    cOutageHistoryTable: Object-Type; Object-Index; Event-Reason-Index; Event-Time; Event-Interval

    Event Reason Map Table (event description)

    Remote Object Map Table (remote object description)

    Related MIB tables: entPhysicalTable in ENTITY-MIB (physical entity object description); ifTable in IF-MIB (interface object description); cpmProcessTable in CISCO-PROCESS-MIB (process object description, via the process MIB map)


    Configuration

    Config CLI: run; add; removal; filtering-enable

    Show CLI: show event-table; show object-table

    (Diagram: Cisco IOS configuration drives COOL, which updates the object table and event table; results are displayed via the MIB and show CLI; customer equipment detection function included)


    Enabling COOL

    Obtain authorization file:

    ari#dir
    Directory of disk0:/
      1  -rw-  19014056  Oct 29 2003 16:09:28 +00:00  gsr-k4p-mz.120-26.S.bin
    128057344 bytes total (109051904 bytes free)
    ari#copy tftp disk0:
    Address or name of remote host []? 88.1.88.9
    Source filename []? auth_file
    Destination filename [auth_file]?
    Accessing tftp://88.1.88.9/auth_file...
    Loading auth_file from 88.1.88.9 (via FastEthernet1/2): !
    [OK - 705 bytes]
    705 bytes copied in 0.532 secs (1325 bytes/sec)

    Enable COOL:

    ari#clear cool persist-files
    ari#conf t
    Enter configuration commands, one per line. End with CNTL/Z.
    ari(config)#cool run
    ari(config)#^Z
    ari#wr mem
    Building configuration...
    [OK][OK][OK]


    COOL

    Pros

    Accurate network availability for devices, components, and software

    Accounts for routing problems

    Implementation with low network overhead.

    Enables correlation between active and passive availabilitymethodologies

    Cons

    Only a few systems currently have the COOL feature

    Requires implementation in the router configurations of production devices

    Availability granularity limited by polling frequency

    New Cisco IOS Feature


    Network Availability Collection Methods

    APPLICATION LAYER MEASUREMENT


    Application Reachability

    Similar to ICMP Reachability

    Method definition:

    Central workstation or computer configured to send packets that mimic application packets

    How:

    Agents on client and server computers sending and collecting data

    Fire Runner, Ganymede Chariot, Gyra Research, Response Networks, Vital Signs Software, NetScout, custom application queries on customer systems

    Installing special probes located on user and server subnets to send, receive, and collect data; NikSun and NetScout

    Unavailability:

    Pre-defined QoS definition


    Application Reachability

    Pros

    Actual application availability can be understood

    QoS, by application, can be factored into the availability measurement

    Cons

    Depending on scale, potential high overhead and cost can be expected

    DATA COLLECTION FOR ROOT CAUSE ANALYSIS (RCA) OF NETWORK OR DEVICE DOWNTIME


    Data Gathering Techniques

    Alarm and event

    History and statistics

    Set thresholds in router configuration

    Configure SNMP trap to be sent when a MIB variable rises above and/or falls below a given threshold

    Alleviates need for frequent polling

    Not an availability methodology by itself, but can add valuable information and customization to the data collection method

    Cisco IOS Embedded RMON


    Data Gathering Techniques

    Provide information on what the router is doing

    Categorized by feature and severity level

    User can configure Syslog logging levels

    User can configure Syslog messages to be sent as SNMP traps

    Not an availability methodology by itself, but can add valuable information and customization to the data collection method

    Syslog Messages


    Expression and Event MIB

    Expression MIB

    Allows you to create new SNMP objects based upon formulas

    MIB persistence is supported: a MIB's SNMP data persists across reloads

    Delta and wildcard support allows you to:

    Calculate utilization for all interfaces with one expression

    Calculate errors as a percentage of traffic

    Event MIB

    Allows you to create custom notifications and log them and/or send them as SNMP traps or informs

    MIB persistence is supported: a MIB's SNMP data persists across reloads

    Can be used to test objects on other devices

    More flexible than RMON events/alarms

    RMON is tailored for use with counter objects


    Data Gathering Techniques

    Underlying philosophy:

    Embed intelligence in routers and switches to enable a scalable and distributed solution, with OPEN interfaces for NMS/EMS leverage of the features

    Mission statement:

    Provide robust, scalable, powerful, and easy-to-use embedded managers to solve problems such as syslog and event management within Cisco routers and switches

    Embedded Event Manager


    Embedded Event Manager (Cont.)

    Development goal: predictable, consistent, scalable management

    Distributed

    Independent of central management system

    Control is in the customer's hands

    Customization

    Local programmable actions:

    Triggered by specific events


    Cisco IOS Embedded Event Manager: Basic Architecture (v1)

    (Diagram: event detectors feed EEM; a syslog event detector consumes syslog events, an SNMP event detector consumes SNMP data, and other event detectors handle other events; EEM policies apply network knowledge and trigger actions such as notify, switchover, and reload)


    EEM Versions

    EEM Version 1

    Allows policies to be defined using the Cisco IOS CLI applet

    The following policy actions can be established:

    Generate prioritized syslog messages

    Generate a CNS event for upstream processing by Cisco CNS devices

    Reload the Cisco IOS software

    Switch to a secondary processor in a fully redundant hardware configuration

    EEM Version 2

    EEM Version 2 adds programmable actions using the Tcl subsystem within Cisco IOS

    Includes more event detectors and capabilities
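As a rough illustration of an EEM Version 1-style applet, the fragment below reacts to a matching syslog event with a prioritized syslog message; the applet name, pattern, and message text are hypothetical, and exact CLI syntax varies by Cisco IOS release:

```
! Hypothetical EEM applet: log a message when a link up/down
! syslog event is seen (illustrative sketch only).
event manager applet NOTIFY-LINK-CHANGE
 event syslog pattern "UPDOWN"
 action 1.0 syslog msg "EEM: link state change detected"
```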


    EEM Version 2 Architecture

    (Diagram: IOS subsystems such as the POSIX process manager, IOS process watchdog, syslog daemon, system manager, watchdog sysmon, HA redundancy facility, timer services, counters, interface counters and stats, and SNMP publish events to the Embedded Event Manager server through event detectors; applications publish application events through an application-specific event detector; EEM policies subscribe to receive events and implement policy actions through the Tcl shell)

    More event detectors!

    Define policies or programmable local actions using Tcl

    Register policy with EEM Server

    Events trigger policy execution

    Tcl extensions for CLI control and defined actions


    What Does This Mean to the Business?

    Better problem determination: widely applicable scripts from Cisco engineering and TAC

    Automated local action triggered by events

    Automated data collection

    Faster problem resolution: reduces the "next time it happens, please collect..." cycle

    Better diagnostic data to Cisco engineering

    Faster identification and repair

    Less downtime: reduced susceptibility and Mean Time to Repair (MTTR)

    Better service: responsiveness

    Prevent recurrence: higher availability

    Not an availability methodology by itself, but can add valuable information and customization to the data collection method

    INSTILLING AN AVAILABILITY CULTURE


    Putting an Availability Program into Practice

    Track network availability

    Identify defects

    Identify root cause and implement fix

    Reduce operating expense by eliminating non-value-added work

    How much does an outage cost today? How much can I save through process and product enhancements?


    How Do I Start?

    1. What are you using now?

    a. Add or modify trouble ticketing analysis

    b. Add or improve active monitoring method

    2. Process: analyze the data!

    a. What caused an outage?

    b. Can a root cause be identified and addressed?

    3. Implement improvements or fixes

    4. Measure the results

    5. Back to step 1: are other metrics needed?


    If You Have a Network Availability Method

    Use the current method and metric for improvement

    Don't try to change completely

    Use incremental improvements

    Develop additional methods to gather data as identified

    Concentrate on understanding unavailability causes; all unavailability causes should be classified at a minimum under:

    Change, SW, HW, power/facility, or link

    Identify the actions to correct unavailability causes

    e.g., network design, customer process change, HW MTBF improvement, etc.


    Multilayer Network Design

    (Diagram: multilayer design with access, distribution, and core/backbone layers, a server farm, WAN/Internet/PSTN edges, and building-block additions; SA Agent measures between access and distribution)


    Multilayer Network Design

    (Same diagram; SA Agent measures between servers and WAN users)


    Multilayer Network Design

    (Same diagram; COOL for high-end core devices)


    Multilayer Network Design

    (Same diagram; trouble ticketing methodology)

    AVAILABILITY MEASUREMENT SUMMARY


    Summary

    The availability metric is governed by your business objectives

    Availability measurement's primary goals are:

    To provide an availability baseline (maintain)

    To help identify where to improve the network

    To monitor and control improvement projects

    Can you identify "Where you are now?" for your network?

    Do you know "Where you are going?" as network-oriented business objectives?

    Do you have a plan to take you there?


    Complete Your Online Session Evaluation!

    WHAT: Complete an online session evaluation and your name will be entered into a daily drawing

    WHY: Win fabulous prizes! Give us your feedback!

    WHERE: Go to the Internet stations located throughout the Convention Center

    HOW: Winners will be posted on the onsite Networkers Website; four winners per day


    Recommended Reading

    Performance and Fault Management
    ISBN: 1-57870-180-5

    High Availability Network Fundamentals
    ISBN: 1-58713-017-3

    Network Performance Baselining
    ISBN: 1-57870-240-2

    The Practical Performance Analyst
    ISBN: 0-07-912946-3


    Recommended Reading (Cont.)

    The Visual Display of Quantitative Information

    by Edward Tufte (ISBN: 0-9613921-0)

    Practical Planning for Network Growth

    by John Blommers (ISBN: 0-13-206111-2)

    The Art of Computer Systems Performance Analysis

    by Raj Jain (ISBN: 0-421-50336-3)

    Implementing Global Networked Systems Management: Strategies and Solutions

    by Raj Ananthanpillai (ISBN: 0-07-001601-1)

    Information Systems in Organizations: Improving Business Processes

    by Richard Maddison and Geoffrey Darnton (ISBN: 0-412-62530-X)

    Integrated Management of Networked Systems: Concepts, Architectures, and Their Operational Application

    by Hegering, Abeck, Neumair (ISBN: 1558605711)


    Appendix A: Acronyms

    AVG: Average
    ATM: Asynchronous Transfer Mode
    DPM: Defects Per Million
    FCAPS: Fault, Config, Acct, Perf, Security
    GE: Gigabit Ethernet
    HA: High Availability
    HDLC: High-Level Data Link Control
    HSRP: Hot Standby Routing Protocol
    IPM: Internet Performance Monitor
    IUM: Impacted User Minutes
    MIB: Management Information Base
    MTBF: Mean Time Between Failure
    MTTR: Mean Time to Repair
    RME: Resource Manager Essentials
    RMON: Remote Monitor
    SA Agent: Service Assurance Agent
    SNMP: Simple Network Management Protocol
    SPF: Single Point of Failure; Shortest Path First (routing protocol)
    TCP: Transmission Control Protocol


    BACKUP SLIDES


    ADDITIONAL RELIABILITY SLIDES



    Network Design: What Is Reliability?

    Reliability is often used as a general term that refers to the quality of a product

    Failure Rate

    MTBF (Mean Time Between Failures) or

    MTTF (Mean Time to Failure)

    Availability


    Reliability Defined

    Reliability:

    1. The probability of survival (or no failure) for a stated length of time

    2. Or, the fraction of units that will not fail in the stated length of time

    A mission time must be stated

    Annual reliability is the probability of survival for one year


    Availability Defined

    Availability:

    1. The probability that an item (or network, etc.) is operational, and ready to go, at any point in time

    2. Or, the expected fraction of time it is operational; annual uptime is the amount (in days, hrs., min., etc.) the item is operational in a year

    Example: for 98% availability, the annual uptime is 0.98 * 365 days = 357.7 days


    MTBF Defined

    MTBF stands for Mean Time Between Failure

    MTTF stands for Mean Time to Failure

    This is the average length of time between failures (MTBF) or, to a failure (MTTF)

    More technically, it is the mean time to go from an operational state to a non-operational state

    MTBF is usually used for repairable systems, and MTTF is used for non-repairable systems


    How Reliable Is It?

    MTBF reliability:

    R = e^(-MTBF/MTBF) = e^(-1) = 36.8%

    MTBF reliability is only about 37%; that is, roughly 63% of your HARDWARE fails before the MTBF!

    But remember, failures are still random!


    MTTR Defined

    MTTR stands for Mean Time to Repair

    or

    MRT (Mean Restore Time)

    This is the average length of time it takes to repair an item

    More technically, it is the mean time to go from a non-operational state to an operational state


    One Method of Calculating Availability

    Availability = MTBF / (MTBF + MTTR)

    What is the availability of a computer with MTBF = 10,000 hrs. and MTTR = 12 hrs?

    A = 10000 / (10000 + 12) = 99.88%

    122 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Uptime

    Annual uptime

    8,760 hrs/year X (0.9988) = 8,749.5 hrs

    Conversely, annual DOWNtime is

    8,760 hrs/year X (1 - 0.9988) = 10.5 hrs
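The availability and uptime arithmetic on these two slides can be sketched in a few lines of Python (the function and variable names are mine, not from the slides):

```python
# Availability from MTBF/MTTR, then annual uptime/downtime,
# reproducing the slide arithmetic.
HOURS_PER_YEAR = 8760

def availability(mtbf_hrs, mttr_hrs):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hrs / (mtbf_hrs + mttr_hrs)

a = availability(10_000, 12)                 # ~0.9988, as on the slide
uptime_hrs = HOURS_PER_YEAR * a              # ~8,749.5 hrs/year
downtime_hrs = HOURS_PER_YEAR * (1 - a)      # ~10.5 hrs/year
```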


    123 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Systems

    Components In-Series

    Components In-Parallel (Redundant)

    [RBD (Reliability Block Diagram): in-series, Component 1 followed by Component 2; in-parallel, Component 1 and Component 2 stacked as redundant paths]

    124 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    In-Series

    [Timeline diagram: Part 1 and Part 2 in series; the in-series system is up only while BOTH parts are up, so any single part going down takes the system down]


    125 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    In-Parallel

    [Timeline diagram: Part 1 and Part 2 in parallel; the in-parallel system is down only when BOTH parts are down at the same time]

    126 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    In-Series MTBF

    COMPONENT 1

    MTBF = 2,500 hrs.

    MTTR = 10 hrs.

    COMPONENT 2

    MTBF = 2,500 hrs.

    MTTR = 10 hrs.

    Component Failure Rate = 1/2500 = 0.0004

    System Failure Rate = 0.0004 + 0.0004 = 0.0008

    System MTBF = 1/(0.0008) = 1,250 hrs.
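The series rule above (failure rates add; system MTBF is the reciprocal of the summed rate) can be sketched as a small Python function (name is mine):

```python
def series_mtbf(mtbfs):
    # In series, component failure rates (1/MTBF) add;
    # system MTBF is the reciprocal of the summed rate.
    total_rate = sum(1.0 / m for m in mtbfs)
    return 1.0 / total_rate

system = series_mtbf([2500, 2500])   # 1,250 hrs, as on the slide
```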


    127 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    In-Series Reliability

    Component ANNUAL Reliability:

    R = e^(-8760/2500) = 0.03

    System ANNUAL Reliability:

    R = 0.03 X 0.03 = 0.0009

    COMPONENT 1

    MTBF = 2,500 hrs.

    MTTR = 10 hrs.

    COMPONENT 2

    MTBF = 2,500 hrs.

    MTTR = 10 hrs.

    128 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    In-Series Availability

    Component Availability:

    A = 2500 / (2500 + 10) = 0.996

    System Availability:

    A = 0.996 X 0.996 = 0.992

    COMPONENT 1

    MTBF = 2,500 hrs.

    MTTR = 10 hrs.

    COMPONENT 2

    MTBF = 2,500 hrs.

    MTTR = 10 hrs.
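Both in-series results (availabilities and annual reliabilities multiply across components) can be checked with a short sketch; the helper names are mine:

```python
import math

def availability(mtbf, mttr):
    # Steady-state availability = MTBF / (MTBF + MTTR)
    return mtbf / (mtbf + mttr)

def annual_reliability(mtbf, hours=8760):
    # Exponential model: R(t) = e^(-t/MTBF)
    return math.exp(-hours / mtbf)

def series(probs):
    # In series, per-component probabilities multiply
    result = 1.0
    for p in probs:
        result *= p
    return result

a_sys = series([availability(2500, 10)] * 2)     # ~0.992, as on the slide
r_sys = series([annual_reliability(2500)] * 2)   # ~0.0009, as on the slide
```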


    129 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    In-Parallel MTBF

    System MTBF*:

    = 2500/1 + 2500/2 = 3,750 hrs.

    COMPONENT 1

    MTBF = 2,500 hrs.

    COMPONENT 2

    MTBF = 2,500 hrs.

    In general*, System MTBF = Σ (i = 1 to n) MTBF/i

    *For 1-of-n Redundancy of n Identical Components with NO Repair or Replacement of Failed Components

    130 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    1-of-4 Example

    In general*, System MTBF = Σ (i = 1 to n) MTBF/i

    = 2500/1 + 2500/2 + 2500/3 + 2500/4 = 5,208 hrs.

    *For 1-of-n Redundancy of n Identical Components with NO Repair or Replacement of Failed Components
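The 1-of-n formula (identical components, no repair) is a harmonic sum; a minimal sketch with an invented function name:

```python
def parallel_mtbf_no_repair(mtbf, n):
    # 1-of-n redundancy of n identical components with NO repair
    # or replacement: system MTBF = sum over i=1..n of MTBF/i
    return sum(mtbf / i for i in range(1, n + 1))

mtbf_1of2 = parallel_mtbf_no_repair(2500, 2)   # 3,750 hrs
mtbf_1of4 = parallel_mtbf_no_repair(2500, 4)   # ~5,208 hrs
```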


    131 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    In-Parallel Reliability

    COMPONENT 1

    MTBF = 2,500 hrs.

    MTTR = 10 hrs.

    COMPONENT 2

    MTBF = 2,500 hrs.

    MTTR = 10 hrs.

    Component ANNUAL Reliability:

    R = e^(-8760/2500) = 0.03

    System ANNUAL Reliability:

    R = 1 - [(1-0.03) X (1-0.03)] = 1 - 0.94 = 0.06

    (each (1-0.03) term is a component's unreliability)

    132 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    In-Parallel Availability

    COMPONENT 1

    MTBF = 2,500 hrs.

    MTTR = 10 hrs.

    COMPONENT 2

    MTBF = 2,500 hrs.

    MTTR = 10 hrs.

    Component Availability:

    A = 2500 / (2500 + 10) = 0.996

    System Availability:

    A = 1 - [(1-0.996) X (1-0.996)] = 1 - 0.000016 = 0.999984

    (each (1-0.996) term is a component's unavailability)
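The parallel rule (unavailabilities multiply) in a few lines of Python; the function name is mine:

```python
def parallel_availability(avails):
    # Pure active parallel: the system is down only when ALL
    # components are down, so unavailabilities multiply:
    # A_sys = 1 - prod(1 - A_i)
    unavail = 1.0
    for a in avails:
        unavail *= (1.0 - a)
    return 1.0 - unavail

a_comp = 2500 / (2500 + 10)                       # ~0.996
a_sys = parallel_availability([a_comp, a_comp])   # ~0.999984
```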


    133 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Complex Redundancy

    [Diagram: components 1, 2, 3, ..., n in parallel; the system needs m of the n operational (m-of-n redundancy)]

    Examples:

    1-of-2

    2-of-3

    2-of-4

    8-of-10

    Pure Active Parallel

    134 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    More Complex Redundancy

    Pure active parallel

    All components are on

    Standby redundant

    Backup components are not operating

    Perfect switching

    Switch-over is immediate and without fail

    Switchover reliability

    The probability of switchover when it is not perfect

    Load sharing

    All units are on and workload is distributed


    135 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Networks Consist of Series-Parallel

    Combinations of in-series and redundant components

    [RBD diagram: A in series with a 1-of-2 redundant pair (B1, B2), then C, then a 2-of-3 redundant group (D1, D2, D3), then E and F]
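The two rules compose: reduce each redundant group to one equivalent availability, then multiply along the series chain. A minimal sketch, using an illustrative three-element path and an assumed 0.996 availability per element (both are my assumptions, not the slide's figures):

```python
def series(avails):
    # In series, availabilities multiply
    p = 1.0
    for a in avails:
        p *= a
    return p

def parallel(avails):
    # In parallel (1-of-n active), unavailabilities multiply
    u = 1.0
    for a in avails:
        u *= (1.0 - a)
    return 1.0 - u

# Hypothetical path: A, then a 1-of-2 pair (B1, B2), then C
a_path = series([0.996, parallel([0.996, 0.996]), 0.996])   # ~0.992
```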

    136 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Failure Rate

    The number of failures per time:

    Failures/hour

    Failures/day

    Failures/week

    Failures/10^6 hours

    Failures/10^9 hours, called FITs (Failures In Time)


    137 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Approximating MTBF

    13 units are tested in a lab for 1,000 hours with 2 failures occurring

    Another 4 units were tested for 6,000 hours with 1 failure occurring

    The failed units are repaired (or replaced)

    What is the approximate MTBF?

    138 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Approximating MTBF (Cont.)

    MTBF = (13*1000 + 4*6000) / (2 + 1)

    = 37,000 / 3

    = 12,333 hours
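The approximation above (total unit-hours of operation divided by total failures) generalizes to any set of test runs; a sketch with an invented helper:

```python
def approx_mtbf(test_runs):
    """test_runs: iterable of (units, hours_each, failures) tuples.
    MTBF ~ total unit-hours of operation / total failures observed."""
    total_hours = sum(units * hours for units, hours, _ in test_runs)
    total_failures = sum(failures for _, _, failures in test_runs)
    return total_hours / total_failures

mtbf = approx_mtbf([(13, 1000, 2), (4, 6000, 1)])   # ~12,333 hrs
```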


    139 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Modeling

    Distributions

    Normal

    Log-Normal

    Weibull

    Exponential

    [Plots: frequency vs. time-to-failure for each distribution, with the MTBF marked on each curve]

    140 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Constant Failure Rate

    The Exponential Distribution

    The exponential function:

    f(t) = λe^(-λt), t > 0

    Failure rate, λ, IS CONSTANT

    λ = 1/MTBF

    If MTBF = 2,500 hrs., what is the failure rate?

    λ = 1/2500 = 0.0004 failures/hr.


    141 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    The Bathtub Curve

    [Curve: failure rate vs. time]

    Infant Mortality: DECREASING Failure Rate

    Useful Life Period: CONSTANT Failure Rate

    Wear-Out: INCREASING Failure Rate

    142 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    The Exponential Reliability Formula

    Commonly used for electronic equipment

    The exponential reliability formula:

    R(t) = e^(-λt) or R(t) = e^(-t/MTBF)


    143 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Calculating Reliability

    A certain Cisco router has an MTBF of 100,000 hrs; what is the annual reliability?

    Annual reliability is the reliability for one year, or 8,760 hrs

    R = e^(-8760/100000) = 91.6%

    This says that the probability of no failure in one year is 91.6%; or, 91.6% of all units will survive one year
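This annual-reliability arithmetic, together with the earlier e^(-1) observation for survival to the MTBF, fits in one short sketch (helper name is mine):

```python
import math

def reliability(mtbf_hrs, t_hrs):
    # Exponential model with constant failure rate 1/MTBF:
    # R(t) = e^(-t/MTBF)
    return math.exp(-t_hrs / mtbf_hrs)

annual = reliability(100_000, 8760)        # ~0.916: 91.6% survive one year
at_mtbf = reliability(100_000, 100_000)    # e^(-1), ~0.368: only ~37% reach the MTBF
```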

    ADDITIONAL TROUBLE TICKETING SLIDES

    144 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2


    145 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Essential Data Elements

    Parameter | Format | Description

    Date | dd/mmm/yy | Date ticket issued
    Ticket | Alphanumeric | Trouble ticket number
    Start Date | dd/mmm/yy | Date of fault
    Start Time | hh:mm | Time of fault
    Resolution Date | dd/mmm/yy | Date of resolution
    Resolution Time | hh:mm | Time of resolution
    Customers Impacted | Integer | Number of customers that lost service; number impacted or names of customers impacted
    Problem Description | String | Outline of the problem
    Root Cause | String | HW, SW, process, environmental, etc.
    Component/Part/SW Version | Alphanumeric | For HW problems include product ID; for SW include release version
    Type | Planned/Unplanned | Identify if the event was due to planned maintenance activity or unplanned outage
    Resolution | String | Description of action taken to fix the problem

    Note: Above is the minimum data set; however, if other information is captured it should be provided

    146 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    HA Metrics/NAIS Synergy

    [Flow diagram:]

    Trouble tickets (definitions, data accuracy, collection processes) feed operational process and procedures, and data analysis

    Analysis covers: network reliability improvement analysis, problem management, fault management, resiliency assessment, change management, performance management, availability management

    Baseline availability; determine DPM (Defects Per Million) by: planned/unplanned, root cause, resolution, equipment, MTTR

    Analyzed trouble ticket data: referral for process/procedural improvement and referral for analysis


    ADDITIONAL SA AGENT SLIDES

    147 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    148 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    SA Agent: How It Works

    1. User configures Collectors through Mgmt Application GUI

    2. Mgmt Application provisions Source routers with Collectors

    3. Source router measures and stores performance data, e.g.:

    Response time

    Availability

    4. Source router evaluates SLAs, sends SNMP Traps

    5. Source router stores latest data point and 2 hours of aggregated points

    6. Application retrieves data from Source routers once an hour

    7. Data is written to a database

    8. Reports are generated

    SNMP

    Management Application SA Agent


    149 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    SAA Monitoring IP Core

    R1

    R3

    R2

    IP Core

    P1

    P2

    P3

    Management System

    150 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Monitoring Customer IP Reachability

    P1-Pn Service Assurance Agent ICMP Polls to a Test Point in the IP Core

    [Diagram: routers P1-PN in customer networks Nw1-NwN sending polls to test points TP1-TPx in the IP core]


    151 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Service Assurance Agent Features

    Measures Service Level Agreement (SLA) metrics

    Packet loss

    Response time

    Throughput

    Availability

    Jitter

    Evaluates SLAs

    Proactively sends notification of SLA violations

    152 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    SA Agent Impact on Devices

    Low impact on CPU utilization

    18k memory per SA agent

    SAA rtr low-memory


    153 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Monitored Network Availability Calculation

    Not calculated:

    Already have availability baseline

    Fault type, frequency and downtime may be more useful

    Faults directly measured from management system(s)

    154 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Monitored Network Availability

    Assumptions

    All connections below IP are fixed

    Management systems can be notified of all fixed connection state changes

    All (L2) events impact on IP (L3) service


    ADDITIONAL COOL SLIDES

    155 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    156 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    CLIs

    [no] cool run

    [no] cool interface interface-name(idb)

    [no] cool physical-FRU-entity entity-index (int)

    [no] cool group-interface group-objectID(string)

    [no] cool add-cpu objectID threshold duration

    [no] cool remote-device dest-IP(paddr) obj-descr(string) rate(int) repeat(int) [local-ip(paddr) mode(int) ]

    [no] cool if-filter group-objectID (string)

    Configuration CLI Commands

    Router#show cool event-table [] (displays all if not specified)

    Router#show cool object-table [] (displays all object types if not specified)

    Router#show cool fru-entity

    Display CLI Commands

    Router#clear cool event-table

    Router#clear cool persistent-files

    Exec CLI Commands


    157 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Measurement Example: Router Device Outage

    Reload (Operational), Power Outage, or Device H/W Failure

    Type: interface(1), physicalEntity(2), process(3), and remoteObject(4)

    Index: the corresponding MIB table index; if physicalEntity(2), index in the ENTITY-MIB

    Status: Up(1), Down(2)

    Last-change: last object status change time

    AOT: Accumulated Outage Time (sec)

    NAF: Number of Accumulated Failures

    158 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Measurement Example:

    Cisco IOS S/W Outage

    Standby RP in Slot 0 crash using Address Error (4) test crash; AdEL exception: it is caused purely by Cisco IOS S/W

    Standby RP crash using Jump to Zero (5) test crash; Bp exception: it can be caused by S/W, H/W, or operation


    159 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Measurement Example: Linecard Outage

    Add a Linecard

    Reset the Linecard

    Down Event Captured

    Up Event Captured

    AOT and NAF Updated

    160 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Measurement Example: Interface Outage

    12406-R1202(config)#cool group-interface ATM2/0.
    12406-R1202(config)#no cool group-interface ATM2/0.3

    sh cool object 1 | include ATM2/0.
    33 1 1054859087 0 0 0 ATM2/0.1
    35 1 1054859088 0 0 0 ATM2/0.2
    39 1 1054859090 0 0 0 ATM2/0.4
    41 1 1054859090 0 0 0 ATM2/0.5

    12406-R1202(config)#interface ATM2/0
    12406-R1202(config-if)#shut
    show cool event-table
    **** COOL Event Table ****
    type index event time-stamp interval hist_id object-name
    1 33 1 1054859105 18 1 ATM2/0.1
    1 35 1 1054859106 18 2 ATM2/0.2
    1 39 1 1054859107 17 3 ATM2/0.4
    1 41 1 1054859108 18 4 ATM2/0.5

    12406-R1202(config)#interface ATM2/0
    12406-R1202(config-if)#no shut
    show cool event-table
    **** COOL Event Table ****
    type index event time-stamp interval hist_id object-name
    1 33 0 1054859146 41 1 ATM2/0.1
    1 35 0 1054859147 41 2 ATM2/0.2
    1 39 0 1054859149 42 3 ATM2/0.4
    1 41 0 1054859150 42 4 ATM2/0.5

    sh cool object 1 | include ATM2/0.
    33 1 1054859087 0 41 1 ATM2/0.1
    35 1 1054859088 0 41 1 ATM2/0.2
    39 1 1054859090 0 42 1 ATM2/0.4
    41 1 1054859090 0 42 1 ATM2/0.5

    Configure to Monitor All the Interfaces Which Include the ATM2/0. String, Except ATM2/0.3


    Object Table

    Shut ATM2/0 Interface Down

    Down Event Captured

    Up Event Captured

    No Shut ATM2/0 Interface

    Object Table Shows AOT and NAF


    161 2004 Cisco Systems, Inc. All rights reserved.

    NMS-22019627_05_2004_c2

    Measurement Example: Remote Device Outage

    12406-R1202(config)#cool remote-device 1 50.1.1.2 remobj.1 30 2 50.1.1.1 1
    12406-R1202(config)#cool remote-device 2 50.1.2.2 remobj.2 30 2 50.1.2.1 1
    12406-R1202(config)#cool remote-device 3 50.1.3.2 remobj.3 30 2 50.1.3.1 1

    sh cool object-table 4 | include remobj
    1 1 1054867061 0 0 remobj.1
    2 1 1054867063 0 0 remobj.2
    3 1 1054867065 0 0 remobj.3

    12406-R1202(config)#interface ATM2/0
    12406-R1202(config-if)#shut

    12406-R1202(config)#interface ATM2/0
    12406-R1202(config-if)#no shut

    4 2 5 1054867105 42 2 remobj.2
    4 1 5 1054867108 47 3 remobj.1
    4 3 5 1054867130 65 10 remobj.3

    4 1 4 1054867171 63 1 remobj.1
    4 3 4 1054867193 63 8 remobj.3
    4 2 4 1054867200 95 10 remobj.2

    sh cool object-table 4 | include remobj
    1 1 1054867061 63 1 remobj.1
    2 1 1054867063 63 1 remobj.2
    3 1 1054867065 95 1 remobj.3

    3 Remote Devices Are Added

    Object Table

    Shut Down the Interface Link Between the Remote Device and Router

    Down Event Captured

    Up Event Captured

    Object Table Shows AOT and NAF

    No Shut the Interface Link