nagios xi best practices

Nagios XI Best Practices

By Troy

About Me
• Tech Support Contractor for Nagios Enterprises
    • Based in Australia
    • Typically cover UTC+10 from 9am to 5pm
• Nagios & XI Dev (Box293)
• Nagios MVP3

What's Covered In This Talk
• Getting the most from Nagios XI
• Time saving information
• Configuration practices
    • Object definitions
    • Backend setup
    • Performance enhancements

Nagios XI Server Internals

Nagios XI License Entitlements
• XI license entitles you to 3 instances:
    • Production
    • Test & Dev (T&D)
    • Disaster Recovery (DR)
• License activation is tied to the IP address of each XI host

What's Monitoring Nagios XI?
• How would you know your XI server died?
    • “Nagios XI Server” Monitoring Wizard
• DR instance monitors production instance
    • Production instance is UP & HEALTHY
• Production instance monitors DR instance
    • DR instance is UP & HEALTHY

localhost services
• Do you know how your XI server is performing?
• Basic local services are included in XI base
• You should ideally be monitoring:
    • Service Status (check_init_service)
        • crond, httpd, mysql, ndo2db, npcd, ntpd, postgresql, snmptrapd, snmptt

localhost services
• File Counts (check_file_count)
    • NPCD Perfdata spool directory
    • xidpe spool directory
    • Check results folder
    • snmptt spool folder
• nagios user account has not expired (check_pass_expire.pl)
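
The file count checks on this slide can be sketched as below. This is a minimal Python illustration of the idea, not the check_file_count plugin itself: the function name, the "at or above" threshold semantics and the perfdata label are assumptions made for the example.

```python
import os

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_file_count(directory, warn, crit):
    """Return (exit_code, message) for the number of regular files in directory.

    Hypothetical helper for illustration only; warn and crit are
    "at or above" thresholds.
    """
    try:
        with os.scandir(directory) as entries:
            count = sum(1 for entry in entries if entry.is_file())
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - cannot read {directory}: {exc}"

    # Performance data after the pipe lets the file count be graphed over time.
    perfdata = f"files={count};{warn};{crit};0"
    if count >= crit:
        return CRITICAL, f"CRITICAL - {count} files in {directory} | {perfdata}"
    if count >= warn:
        return WARNING, f"WARNING - {count} files in {directory} | {perfdata}"
    return OK, f"OK - {count} files in {directory} | {perfdata}"
```

A wrapper script would print the message and exit with the returned code. A growing count in a spool directory usually means the process that should be draining it has stalled.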

localhost services
• root mailbox size (box293_check_mbox)
• MySQL / MariaDB
    • Database tables crashed? (box293_check_mysql_table_status)
    • Date/Time correct? (box293_check_mysql_date)

localhost services
• Overall Load (check_load)

• Memory Free – Physical (check_memory)

• Swap Usage (check_swap)

• Disk Free (check_disk)

Date and Timezone!
• Configure Timezone
    • Admin > Manage System Config
• Sync with trusted time source
• VM? Don’t sync with hypervisor!
• Can be the source of confusing problems

CPU
• CPU Cores vs Speed!
• Not everything is multi-threaded
    • 3.4 GHz vs 2.2 GHz
• Number of cores is still important
• Refer to XI hardware requirements

Memory
• Enough memory to cope in a major outage
    • Event handlers consume memory quickly (+GB in a matter of minutes in a major outage)
    • Have at least 50% more memory than needed
• Refer to XI hardware requirements

RAM Disk
• Lots of little files created/deleted/updated
• Using a RAM Disk:
    • Reduces disk I/O & load
    • Speeds up processing of performance data
    • Speeds up processing of spooled check results
    • Speeds up nagios restarts
• Refer to official procedure

Solid State Disk (SSD)
• Greatly improves overall performance
• Complements RAM Disk
• Helps reads/writes with:
    • Logs
    • Database
    • Performance Graphs
    • Reports

SSD vs RAID?
• SSD beats* a spinning disk RAID set
    • *Depends on how much money you have
• Still need to RAID1 SSD for redundancy!
• SSD may not give you the required capacity
    • 3.8TB SAS SSD now available!!!

rrdcached
• Enabling rrdcached accumulates the spooled performance data; after x amount of time it is processed into the backend RRD files
    • Reduces disk I/O
    • There can be a delay in data appearing in graphs
• Refer to official procedure

Offloaded MySQL / MariaDB
• Data constantly written to databases
    • Historical and Configuration
• Offload to separate server to reduce load
• Don't forget to monitor offloaded server!!!
    • Disk/CPU/Memory/Tables/Service
    • Refer to earlier slides
• Refer to official procedure

Mod-Gearman
• Used for offloading plugins to workers
• Plugins need to be installed on all workers
• Be aware of plugins that use /tmp files!
• XI 2014 onwards uses Core 4
    • Core 4 has its own workers (only local workers)
    • nagios.cfg “check_workers” option
• Refer to official procedure

Disaster Recovery
• Failover and High Availability Solutions for Nagios XI
    • Andy Brist - NWC2014 – Failover & HA Solutions
• What is really important in a disaster?
• Plan and test

Backups!!!
• Admin > System Backups
    • Schedule backups of XI
    • Location can be local, FTP, SSH
    • Remote location recommended
• Manual Backups
    • Local Backup Archives via Admin menu
    • /usr/local/nagiosxi/scripts/backup_xi.sh

Restoring Backups
• Official Backup and Restore procedure
    • Brings system back online with ease
    • Great for migrating from old XI to new XI
    • Also good for:
        • DR
        • Test & Dev

Configuration

Intervals - Host vs Services
• Host down HARD = service notifications suppressed
    • What happens when host and services use the same check intervals?
    • Unnecessary notifications get sent :(
    • Make host go down HARD quicker than its services!
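
The timing rule on this slide can be checked with a little arithmetic. The sketch below assumes the usual Nagios behaviour that, after the first non-OK result, an object is rechecked every retry_interval until max_check_attempts is reached. The function name and the example values are illustrative assumptions, not XI defaults, and the model ignores check execution time and the delay before the first failure is noticed.

```python
def minutes_to_hard(max_check_attempts, retry_interval):
    """Minutes from the first failed check until the HARD state is reached."""
    # The first failure counts as attempt 1; every further attempt
    # waits one retry_interval (in minutes).
    return (max_check_attempts - 1) * retry_interval


# Illustrative values only (assumptions, not XI defaults).
host_hard = minutes_to_hard(max_check_attempts=3, retry_interval=1)     # 2 minutes
service_hard = minutes_to_hard(max_check_attempts=5, retry_interval=1)  # 4 minutes

# The host should reach HARD first so that service notifications are suppressed.
host_goes_hard_first = host_hard < service_hard
```

If both objects used the same attempts and retry interval, the two values would be equal and the services could notify before the host is confirmed down.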

Service Dependencies
• When a master service goes down:
    • Prevents notifications from being sent
    • Prevents service checks from executing
• Make master service go down HARD quicker than dependent services!
    • Otherwise dependencies are pointless
• Master service e.g. Ping or NRPE Version

Disable Service Checks?
• host_down_disable_service_checks
    • Nagios Core 4.1.x feature (XI 5)
    • System wide setting
    • Reduces load on XI host
    • Think of it as automatic service dependencies on their own hosts
    • Service dependencies ignored if host is down

Check Intervals - Be Realistic
• Does it need to be checked every 5 minutes?
    • Disk Free Space – every 60 minutes perhaps?
    • Too long = no performance data
• Different intervals to spread the load
    • 3, 5, 7 minute intervals
    • 58, 60, 62 minute intervals
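
One way to see why slightly different intervals spread the load: in a simplified model where every check starts at minute 0 and runs exactly on its interval, checks with different intervals only all coincide at the least common multiple of those intervals. This is an idealised sketch (real Nagios scheduling also staggers checks), and the function name is an assumption.

```python
from functools import reduce
from math import gcd


def coincidence_period(intervals):
    """Minutes between the moments when all the given check intervals line up."""
    # Least common multiple of all intervals, built up pairwise.
    return reduce(lambda a, b: a * b // gcd(a, b), intervals)


same = coincidence_period([5, 5, 5])    # 5: all three checks fire together every time
spread = coincidence_period([3, 5, 7])  # 105: all three only line up every 105 minutes
```

The same reasoning applies to the 58, 60 and 62 minute group for checks that only need to run about once an hour.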

Notification & Check Intervals
• Nagios decides whether it is allowed to send a notification each time a service check returns a HARD state
    • e.g. 15 minute check and 60 minute notification
    • Internal scheduling may cause 14min 55sec to pass, 4 x 14:55 = 59min 40sec … it’s < 60min!
    • Notification not sent until 75min!
• Scheduling is geared +/- to reduce load!
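
The arithmetic on this slide can be reproduced with a short sketch. It assumes the simplified model described above: a notification can only go out when a check runs, and only once the full notification interval has elapsed since the previous one. The function name is hypothetical.

```python
def seconds_until_renotify(check_gap_s, notification_interval_s):
    """Seconds until the first check at which the notification interval has elapsed."""
    # Ceiling division: number of checks needed to reach or pass the interval.
    checks_needed = -(-notification_interval_s // check_gap_s)
    return checks_needed * check_gap_s


exact = seconds_until_renotify(15 * 60, 60 * 60)        # 3600 s: sent at 60 min
early = seconds_until_renotify(14 * 60 + 55, 60 * 60)   # 4475 s: sent at 74 min 35 s
```

With checks landing 5 seconds early, the fourth check falls at 59min 40sec, just short of the interval, so the notification waits for the fifth check at roughly 75 minutes.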

Use Hostgroups!
• Assign ONE service to a hostgroup of common servers
    • Windows Servers
    • Linux Servers
• Consistent monitoring, standards enforced!
• Directive changes - all hosts get updated
• Reduces management overhead

Use Contact Groups!
• Use contact groups in all definitions
• Makes it easy when staff join/leave
    • Just add/remove the contact from groups
    • Reduces administrative overhead
    • Enforces your company policy
• Similar principle to host groups

Configuration Wizards
• Pros
    • Great for getting up and running quickly
    • No need to learn how a plugin works
• Cons
    • Creates individual services
    • More work later when enforcing “standards”

Templates
• Common settings applied to objects
    • Helps enforce standards
    • Reduces administrative overhead
    • Layer multiple templates
    • Can be additive or ignore inheritance
• XI Config Wizard objects use templates
    • Example of common icmp check

User Macros – resource.cfg
• $USERx$ macros are good for common items like a username or password
• Allows passwords with a ! exclamation mark
• Values not visible in object definitions
• $USER1$
    • /usr/local/nagios/libexec

Custom Object Variables
• Allows you to create your own variables
• Can be defined in host or service objects
• e.g. hosts have their own check_nt password
    • Define _CHECK_NT_PASSWORD in host object
    • In command definitions reference it as:
        • $_HOSTCHECK_NT_PASSWORD$
• VERY POWERFUL!
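
The naming rule behind the example on this slide can be sketched in a few lines: the leading underscore of the variable is dropped, the name is upper-cased, and it is prefixed with `$_HOST` or `$_SERVICE` depending on where it was defined. The helper below is purely illustrative (its name and signature are assumptions); Nagios does this expansion itself.

```python
def custom_macro_name(custom_variable, object_type="HOST"):
    """Build the macro used to reference a custom object variable.

    Hypothetical helper for illustration; object_type is "HOST" or "SERVICE".
    """
    if not custom_variable.startswith("_"):
        raise ValueError("custom object variables must start with an underscore")
    if object_type not in ("HOST", "SERVICE"):
        raise ValueError("object_type must be 'HOST' or 'SERVICE'")
    # Drop the leading underscore and upper-case the remainder.
    return f"$_{object_type}{custom_variable[1:].upper()}$"


macro = custom_macro_name("_CHECK_NT_PASSWORD")  # "$_HOSTCHECK_NT_PASSWORD$"
```

The same variable defined on a service object would be referenced with the `$_SERVICE` prefix instead.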

MRTG Clean Configs
• Your MRTG configs may be collecting more than you think
    • /etc/mrtg/conf.d/*.cfg files
    • Created by Network Switch / Router Wizard
    • Comment out unused ports
        • About 37 lines per port
    • Comment out unused non-interfaces (VLANs)

Plugins – Compiled vs Scripts
• Compiled runs quicker
    • Official nagios-plugins are compiled
    • “Custom modifications” require re-compiling
• Scripts run slower, consume more resources
    • Perl plugins known to consume +CPU +RAM
• “nice” can reduce impact of plugins
• Check Profiler component by box293

Backend API - Read Only User
• API provides you with URLs for use in third party products without needing user/pass
    • Requires a user account to be created
    • Account should be READ ONLY

Performance Data Tool
• Component developed by box293
    • Allows you to manipulate RRD files
    • Great for merging RRD data
    • Can also delete old RRD files for old services
    • View raw data in tables
• Find it in the Nagios Exchange

Thank you!

What Is Your Best Practice?

Any Questions?
