jeff sly - case study nagios @ nu skin


  • 8/11/2019 Jeff Sly - Case Study Nagios @ Nu Skin

    1/58

    Jeff Sly

    Principal IT Architect

    Case Study

    Nagios @ Nu Skin


    Who is in the Audience?

    How many of you are:

    Suppliers of Nagios or some value add-on for Nagios?

    Customers using Nagios?

    Just implementing Nagios or expanding an implementation?

    Using Nagios XI?


    Who is Nu Skin?


    Our Technology Footprint

    Ecommerce: Home grown

    Applications: Java, EJB, ABAP, .Net

    Databases: Oracle, MySQL, MSSQL

    OS: HP-UX, Red Hat, Windows, VMware

    ERP: SAP Supply Chain, CRM, FI

    Datacenters: 6 locations in 6 countries

    Offices: 50 countries


    Monitoring Goals

    Monitoring presents operations with a completely integrated global view.

    Good monitoring is proactive; it helps teams prevent problems from becoming outages.

    Good monitoring helps minimize outage downtime, quickly identify the root cause, and contact the correct people.


    Centralized Monitoring System


    Our Monitoring History

    We tried for 10 years


    Do it all in One Tool Projects

    One Monitoring Tool to rule them all:

    Mercury SiteScope

    Remedy Help Desk

    HP OpenView

    Quest Foglight

    Home grown (several)

    One monitoring person

    He decided to quit!


    Could never get everything

    All Failed

    We always gave up! Why?

    Servers and agents that were proprietary

    Huge footprint, inefficient performance

    Steep learning curve

    Very expensive

    Updates costly and very time consuming

    System Administrators like their own scripts, where they can see what they are doing


    Resulting Monitoring Issues

    Tried to make Operations the clearing house for all warnings and alerts from 10+ tools

    Operations was overwhelmed

    Took 4 process steps and lots of software to notify of critical failures

    Most Administrators set up their own private monitoring to receive warnings

    Many false notifications

    Late notifications


    As Is (start of project)

    Our Business Customers were Unhappy


    Old Monitoring Work Flow

    Four steps to notify system administrator


    Step 1: Everything Emails Operations

    [Diagram labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database; Email, HelpDesk, Error]


    Step 2: Operations Opens Email

    [Diagram labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database; Email, HelpDesk, Error]


    Step 3: Operations Checks Source

    [Diagram labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database; Email, HelpDesk, Error]


    Step 4: Operations Calls Admin

    [Diagram labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database; Email, HelpDesk, Error]


    Inventory of Existing Checks

    Regular Expression found on Web Page Monitoring

    HTTP Check - Up or Down

    Ping Host Up or Down

    PORT monitoring

    FTP checking

    SMTP checking

    SNMP monitoring - no trap catching yet

    Radius

    DNS monitoring

    Disk Space monitoring

    CPU and Load Average monitoring

    Memory Monitoring


    Inventory of Existing Checks

    Service monitoring

    Transaction monitoring - page load times, performance graph

    Website click through (Webinject not working)

    Log File monitor - parse for Errors

    Java HEAP, Thread, Threadlock monitoring

    Apache thread and worker count monitors

    Ecommerce shop monitors

    Email can send and receive

    SQL query ODBC (catalog ODBC had bugs)
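
    As an illustration of the first item in this inventory, a regular expression found on a web page, here is a minimal Python sketch of such a check. It is not the plugin used at Nu Skin; the function names and the example pattern are hypothetical. It assumes only the usual Nagios convention that a check reports 0 for OK and 2 for CRITICAL.

```python
import re
from urllib.request import urlopen


def check_body(body, pattern):
    """Return (exit_code, message): 0 if the pattern is found in the page, else 2."""
    if re.search(pattern, body):
        return 0, f"HTTP OK - pattern {pattern!r} found"
    return 2, f"HTTP CRITICAL - pattern {pattern!r} not found"


def check_url(url, pattern, timeout=10.0):
    """Fetch the page and apply check_body; a failed fetch counts as CRITICAL."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return 2, f"HTTP CRITICAL - {exc}"
    return check_body(body, pattern)
```

    Keeping the pattern match separate from the fetch makes the matching logic easy to exercise without a live web server.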


    To Be

    Happy Customers


    Key Ideas

    1. MoM

    2. Tool Requirements

    3. Shared Ownership

    4. Lowest Level

    5. Nagios Monitor Method


    Idea 1: MoM

    Our first breakthrough was the idea that even though we needed a centralized view for all monitoring, that did not mean all monitoring had to be done by one monitoring tool.

    We had to pick a Manager of the Monitors (MoM) to bring together the best-of-breed monitoring.


    MoM - according to Gartner


    Idea 2: Tool Requirements

    Open: not proprietary and closed

    Mainstream: wanted good native support and strong community

    Interface: to 3rd Party Monitoring

    Flexible: adapt to many types of monitoring

    Efficient: minimal footprint on production servers, not chatty on network

    Notification: granular control

    Reliable: good clean architecture

    Usability: GUI interface, reporting


    Idea 3: Shared Ownership

    Core team:

    Operation of Monitoring Environment: backups, upgrades, & custom plug-ins

    Monitoring Experts

    Training

    Monitoring leads in Development & Admin teams:

    Set up own monitors

    Keep own monitors current

    Adjust monitors

    If something is not monitored, it is not the core team's fault


    Operations Owned Monitoring

    [Diagram labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database; Email, HelpDesk, Error]


    Team Leads Own Monitoring

    [Diagram labels: Network, System Scripts, SAP, Asia, Europe, Web, Database, Operations]


    How to Guides


    How to Setup NRPE - HPUX


    Idea 4: Lowest Level

    Handle alerts at the lowest possible level in the organization

    Only forward alerts if not handled at lower levels before they become critical
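
    The forwarding rule above can be sketched in a few lines of Python. This is only an illustration: the chain of levels, the 15-minute interval, and all names are assumptions, not values from the talk. An acknowledged alert stays at the level that handled it, and an unhandled one moves up one level per elapsed interval.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical escalation chain, lowest level first (illustrative names only).
ESCALATION_CHAIN = ["team_lead", "team_manager", "operations"]


@dataclass
class Alert:
    service: str
    minutes_open: int
    acknowledged: bool


def notify_level(alert: Alert, minutes_per_level: int = 15) -> Optional[str]:
    """Return who should be notified now, or None if the alert is already handled."""
    if alert.acknowledged:
        return None  # handled at a lower level, so nothing is forwarded
    # Each unanswered interval moves the alert one level up, capped at the top.
    step = min(alert.minutes_open // minutes_per_level, len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[step]
```

    In Nagios itself this behaviour is normally expressed with escalation definitions and acknowledgements rather than custom code; the sketch only shows the decision being made.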


    Handle events at lowest level

    [Diagram labels: Network, System Scripts, SAP, Asia, Europe, Web, Database, Operations]


    Only forward unhandled alerts

    [Diagram labels: Network, System Scripts, SAP, Asia, Europe, Web, Database]


    Idea 5: Nagios Monitor Method

    Choose the Nagios Monitoring Method

    Active Check from Nagios Server (normal)

    Active Check performed by remote client: NRPE, NSClient

    Passive Check: listen to 3rd party monitors: NSCA
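
    For the passive method, a third-party monitor hands its result to Nagios instead of being polled. A minimal Python sketch of building one service check result in the tab-delimited form that the send_nsca client reads on standard input (host, service description, return code, plugin output) follows. The function name and the host and service names in the usage note are hypothetical.

```python
# Standard Nagios plugin return codes.
STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def nsca_line(host, service, state, output):
    """Build one tab-delimited service check result for send_nsca."""
    # Tabs and newlines separate fields and records, so flatten them in the message.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{STATES[state]}\t{clean}\n"
```

    For example, nsca_line("db01", "Tablespace", "WARNING", "91% full") yields a line that a third-party tool could pipe to send_nsca, and the host and service names must match objects already defined on the Nagios server.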


    Active Local Check

    [Diagram labels: Nagios, HTTP or Ping, DB, DBMonitor, Web, Unix, Win]


    Active Remote Check - UX

    [Diagram labels: Nagios, CPU, RAM (NRPE), DB, DBMonitor, Web, Unix, Win]


    Active Remote Check - Win

    [Diagram labels: Nagios, CPU, RAM (NSClient), DB, DBMonitor, Web, Unix, Win]


    Passive 3rd Party Alert

    [Diagram labels: Nagios, 3rd Party Alert NSCA, 3rd Party Check DB, DB, DBMonitor, Web, Unix, Win]


    Bonus Idea - Tune

    Tune the database

    Add Ram Drive


    Tune the Database

    Modify contents of the /etc/my.cnf [mysqld] section.

    tmp_table_size=524288000

    max_heap_table_size=524288000

    table_cache=768

    set-variable=max_connections=100

    wait_timeout=7800

    query_cache_size = 12582912

    query_cache_limit=80000

    thread_cache_size = 4

    join_buffer_size = 128K

    http://web3us.com - Info on: MySQL Tuning, Nagios Tuning
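
    The larger values in the [mysqld] section are given in bytes, which makes them hard to read at a glance. The short Python sketch below converts the three byte-valued settings from the slide into MiB; the helper name is an assumption, and the settings dictionary simply copies the slide's numbers.

```python
MIB = 1024 * 1024

# Byte values copied from the [mysqld] settings on the slide.
settings = {
    "tmp_table_size": 524288000,
    "max_heap_table_size": 524288000,
    "query_cache_size": 12582912,
}


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / MIB


for name, value in settings.items():
    print(f"{name}: {to_mib(value):g} MiB")
```

    This shows the in-memory temporary table limits are 500 MiB each and the query cache is 12 MiB.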

    RAM Drive

    Create a RAM disk for Nagios temporary files

    I created a ramdisk by adding the following entry to the /etc/fstab file:

    none /mnt/ram tmpfs size=500M 0 0

    Mount the disk using the following commands

    # mkdir -p /mnt/ram; mount /mnt/ram

    Verify the disk was mounted and created

    # df -k

    Modify the /usr/local/nagios/etc/nagios.cfg file with the following

    tuned parameters

    temp_file=/mnt/ram/nagios.tmp

    temp_path=/mnt/ram

    status_file=/mnt/ram/status.dat

    precached_object_file=/mnt/ram/objects.precache

    object_cache_file=/mnt/ram/objects.cache
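
    After editing nagios.cfg it is easy to miss one of the five settings. The following Python sketch, which is not part of the original procedure, scans the text of a configuration file and reports any of the tuned keys that do not point under the RAM disk mount. The function name is hypothetical and the /mnt/ram default follows the slide.

```python
from pathlib import PurePosixPath
from typing import Dict

# The five nagios.cfg settings moved to the RAM disk on the slide.
RAM_KEYS = {"temp_file", "temp_path", "status_file",
            "precached_object_file", "object_cache_file"}


def paths_outside_ramdisk(cfg_text: str, mount: str = "/mnt/ram") -> Dict[str, str]:
    """Return the tuned keys whose value is not located under the RAM disk mount."""
    mount_path = PurePosixPath(mount)
    problems = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and lines that are not key=value
        key, value = (part.strip() for part in line.split("=", 1))
        if key in RAM_KEYS:
            path = PurePosixPath(value)
            if path != mount_path and mount_path not in path.parents:
                problems[key] = value
    return problems
```

    An empty result means every tuned setting that is present points at the RAM disk. The sketch does not check that the mount itself exists, which is what the df -k step is for.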


    Implementation Methodology

    Site Survey

    Inventory existing monitors

    Proof of concept

    Build new environment

    Migrate monitors from each platform to Nagios, one at a time

    Integrate OEM, and to send monitors to Nagios


    Three Project Phases

    Deliver something useful in each phase

    Build a level at a time


    Phase I

    1. Set up a pilot of Nagios XI using Trial License.

    2. Set up Foglight monitoring of JVM (Java Virtual Machine).

    3. Purchase NagiosXI and Consulting Support

    4. Bring in a consultant for two weeks to help set up the architecture and help us work with the system.

    5. Documentation Web Site for Nagios learnings and How to guides

    6. Define a set of standards and guidelines to follow to help aid an effective monitoring process.

    7. Backups running on Production Nagios Server

    8. Set up services which aren't being caught right now and move a few of the important services over to the new Nagios XI monitoring system.

    9. Test Nagios plugins and server performance


    Phase II

    1. Migrate off of Sitescope 6 and shut down

    2. Migrate off of Sitescope 8 and shut down

    3. Decommission Foglight

    4. Clean up the old monitoring server

    5. Migrate the network team from old Nagios to core NagiosXI system

    6. Set up standby NagiosXI system, cron to replicate weekly

    7. Research missing alerts and add them to the new NagiosXI system


    Phase III

    1. Implement Global Monitoring

    Add monitors for existing international systems

    Add monitors using JMX to monitor Java servers

    Nagios Remote Plugin Executor (NRPE) to monitor remotely

    Remote Monitoring for Windows Servers (NSClient++)

    Implement notification and escalation of alerts

    Add monitors for critical business functions


    Phase III continued

    2. Corporate Enhancements

    Request recurring down time enhancement from Ethan Galstad

    Automate refresh of NagiosXI standby system

    Build Network Map

    Retire Windows SiteScope

    Add monitors for phone systems

    Add monitors to data center (UPS, Temperature, Humidity)

    Integrate to SAP Tidal monitoring tool


    Phase III continued

    3. Business

    Business review and approve SLA (using business terms)

    Monitor both the Business Functions and the individual point devices that provide the Business Function

    Follow the Sun with Eyes on Glass.

    Training

    How to set up alerts

    How to receive alerts

    How to report on performance graphs

    Create a new Dashboard for HelpDesk and International IT Staff


    Inventory of Monitor Checks

    Qty | Things we figured out how to do | Nagios Solution
    50 | Regular Expression found on Web Page Monitoring | HTTP Check
    170 | HTTP Check - Up or down | HTTP Check
    600 | Ping Host Up or down | Nagios Check alive
    100 | PORT monitoring | Check TCP port #
    10 | FTP checking | Nagios FTP plugin
    8 | SMTP checking | Nagios SMTP plugin
    5 | SNMP monitoring - no trap catching yet | Not Using
    4 | Radius | Nagios plugin, difficult
    16 | DNS monitoring | Nagios Check DNS
    250 | Disk Space monitoring | NSClient, NRPE - Nagios Disk plugin
    170 | CPU and Load Average monitoring | NSClient, NRPE - Custom Linux plugin
    170 | Memory Monitoring | NSClient, NRPE - Custom Linux plugin
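
    Several rows above rely on custom plugins run through NRPE. As a rough illustration of what such a plugin looks like, here is a minimal Python sketch of a disk space check. It follows the standard Nagios plugin convention of return codes 0 to 3 and optional performance data after a "|" character. The thresholds and function names are assumptions, and this is not the plugin used at Nu Skin.

```python
import shutil

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_pct, warn=80.0, crit=90.0):
    """Map a percentage used to a Nagios state using the given thresholds."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the filesystem holding path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    used_pct = 100.0 * usage.used / usage.total
    state = classify(used_pct, warn, crit)
    # Text after "|" is performance data, which Nagios can graph.
    perfdata = f"used_pct={used_pct:.1f}%;{warn:g};{crit:g}"
    return state, f"DISK {LABELS[state]} - {path} is {used_pct:.1f}% used | {perfdata}"
```

    A real plugin would print the status line and exit with the returned code, for example by calling sys.exit with it, so that NRPE can pass both back to the Nagios server.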


    Inventory continued

    Qty | Things we figured out how to do | Nagios Solution
    170 | Memory Monitoring | NSClient, NRPE - Custom Linux plugin
    80 | Service monitoring | NSClient, NRPE with bash shell script
    30 | Transaction monitoring - page load times - performance data graphs | Custom using Selenium Scripts
    30 | Website click through (webinject not working) | Custom using mechanize
    10 | Log File monitor - parse for Errors | NRPE - script parse log files
    6 | Java HEAP, Thread, Threadlock monitoring | Java Management Extensions (JMX)
    8 | Apache thread and worker count monitors | Custom plugin Apache statics
    18 | ShopApp and SignupApp monitors | HTTP Check, Custom app status page
    5 | Email can send and receive | Custom Nagios plugin
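
    The log file row above describes an NRPE script that parses logs for errors. A minimal Python sketch of that idea follows. The error pattern, the thresholds, and the function name are all assumptions chosen for illustration, not the script from the talk.

```python
import re

# Hypothetical pattern; a real check would use whatever the application logs.
ERROR_PATTERN = re.compile(r"ERROR|FATAL|Exception")


def scan_log(lines, warn_at=1, crit_at=5):
    """Return (exit_code, message) based on how many lines match the error pattern."""
    hits = [line.rstrip() for line in lines if ERROR_PATTERN.search(line)]
    if len(hits) >= crit_at:
        code, label = 2, "CRITICAL"
    elif len(hits) >= warn_at:
        code, label = 1, "WARNING"
    else:
        code, label = 0, "OK"
    last = f", last: {hits[-1]}" if hits else ""
    return code, f"LOG {label} - {len(hits)} matching line(s){last}"
```

    The function accepts any iterable of lines, so it can be passed an open file object. A production check would also need to remember how far it read on the previous run so that old errors are not reported again.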


    Nagios XI Interface


    Data Centers in 7 Countries


    Goal: Quick Notification & Recovery from Outage

    Type of Monitor: Notification of outages with details on which system is down, so we know who to contact

    Solution: Migrate from Sitescope, Openview to NagiosXI

    IT Operations


    Goal: Prevention of outage

    Type of Monitor: Warnings about conditions before outages occur, allow for corrective actions that will prevent likely outages

    Solution: Migrate from Sitescope, Openview to NagiosXI, Integrate OEM SAP and Scripts with Nagios

    IT Team Managers


    Summary

    1. MoM ~ Manager of Managers: allow specialized tools

    2. Tool Requirements, enough but not all

    3. Ownership for implementation, shared

    4. Handle alerts, lowest level in organization

    5. Choose Nagios monitoring method


    Tips, Tricks & Demos

    Nagios XI Large Implementation

    Day 3, 2:00 Track 3 (Nate Broderick)

    3 Demos

    Performance challenges and solutions

    Integrating monitoring solutions Oracle

    Migrating from BAC & Foglight

    Customization

    Graphing, and more.