Nagios XI Best Practices
About Me•Tech Support Contractor for Nagios Enterprises
• Based in Australia• Typically cover UTC+10 from 9am to 5pm
•Nagios & XI Dev (Box293)•Nagios MVP3
What's Covered In This Talk•Getting the most from Nagios XI•Time saving information•Configuration practices
• Object definitions• Backend setup• Performance enhancements
Nagios XI License Entitlements•XI license entitles 3 instances:
• Production • Test & Dev (T&D)• Disaster Recovery (DR)
•License activation is tied to IP Address of each XI host
What's Monitoring Nagios XI? •How would you know your XI server died?
• “Nagios XI Server” Monitoring Wizard•DR instance monitors production instance
• Production instance is UP & HEALTHY•Production Instance monitors DR instance
• DR instance is UP & HEALTHY
localhost services •Do you know how your XI server is performing?•Basic local services are included in XI base•You should ideally be monitoring:
• Service Status (check_init_service)• crond, httpd, mysql, ndo2db, npcd, ntpd, postgresql, snmptrapd, snmptt
localhost services • File Counts (check_file_count)
• NPCD Perfdata spool directory• xidpe spool directory• Check results folder• snmptt spool folder
• nagios user account has not expired• (check_pass_expire.pl)
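The file count checks above can be sketched as a small plugin. This is a hypothetical illustration, not the check_file_count plugin used in the talk: the function name, thresholds and output wording are assumptions. It only follows the standard Nagios plugin conventions of exit codes 0/1/2/3 and performance data after a pipe character.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_file_count(directory, warn, crit):
    """Return (exit_code, status_line) for the number of files in directory.

    Hypothetical helper: warn and crit are file-count thresholds.
    """
    try:
        with os.scandir(directory) as entries:
            count = sum(1 for entry in entries if entry.is_file())
    except OSError as exc:
        return UNKNOWN, f"FILES UNKNOWN - cannot read {directory}: {exc}"

    if count >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif count >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"

    # Text after the pipe is performance data: label=value;warn;crit
    status = f"FILES {label} - {count} files in {directory}"
    return code, f"{status} | files={count};{warn};{crit}"
```

A wrapper script would print the status line and exit with the returned code. A spool directory that keeps growing usually means the process that should be emptying it has stalled.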
localhost services • root mailbox size
• (box293_check_mbox)
• MySQL / MariaDB• Database tables crashed?
• (box293_check_mysql_table_status)• Date/Time correct?
• (box293_check_mysql_date)
localhost services • Overall Load (check_load)
• Memory Free – Physical (check_memory)
• Swap Usage (check_swap)
• Disk Free (check_disk)
Date and Timezone!•Configure Timezone
• Admin > Manage System Config•Sync with trusted time source•VM? Don’t sync with hypervisor!•Can be the source of confusing problems
CPU •CPU Cores vs Speed!•Not everything is multi-threaded
• 3.4 GHz vs 2.2 GHz•Number of cores is still important
•Refer to XI hardware requirements
Memory•Enough memory to cope in a major outage
• Event handlers consume memory quickly (+GB in a matter of minutes in a major outage)
• Have at least 50% more memory than needed•Refer to XI hardware requirements
RAM Disk•Lots of little files created/deleted/updated•Using a RAM Disk:
• Reduces disk I/O & load• Speeds up processing of performance data
• Speeds up processing of spooled check results
• Speeds up nagios restarts•Refer to official procedure
Solid State Disk (SSD)•Greatly improves overall performance•Complements RAM Disk•Helps reads/writes with:
• Logs• Database• Performance Graphs• Reports
SSD vs RAID ?•SSD beats* a spinning disk RAID set
• *Depends on how much money you have•Still need to RAID1 SSD for redundancy!•SSD may not give you the required capacity
• 3.8TB SAS SSD now available !!!
rrdcached•With rrdcached enabled, spooled performance data accumulates and is written to the backend RRD files after a set amount of time
• Reduces Disk I/O• Can be a delay in data appearing in graphs
•Refer to official procedure
Offloaded MySQL / MariaDB•Data constantly written to databases
• Historical and Configuration•Offload to separate server to reduce load•Don't forget to monitor offloaded server!!!
• Disk/CPU/Memory/Tables/Service • Refer to earlier slides
•Refer to official procedure
Mod-Gearman•Used for offloading plugins to workers•Plugins need to be installed on all workers•Be aware of plugins that use /tmp files!•XI 2014 onwards uses Core 4
• Core 4 has its own workers (only local workers)
• nagios.cfg “check_workers” option•Refer to official procedure
Disaster Recovery•Failover and High Availability Solutions for Nagios XI
• Andy Brist - NWC2014 – Failover & HA Solutions
•What is really important in a disaster?•Plan and test
Backups!!!•Admin > System Backups
• Schedule backups of XI• Location can be local, FTP, SSH• Remote location recommended
•Manual Backups• Local Backup Archives via Admin menu• /usr/local/nagiosxi/scripts/backup_xi.sh
Restoring Backups•Official Backup and Restore procedure
• Brings system back online with ease• Great for migrating from old XI to new XI• Also good for:
• DR• Test & Dev
Intervals - Host vs Services•Host down HARD = service notifications suppressed
• What happens when host and services use the same check intervals?
• Unnecessary Notifications get sent :(
• Make host go down HARD quicker than its services!
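A rough way to compare how quickly a host and its services reach a HARD state is sketched below. It is a simplified model: it assumes the first failed check counts as attempt 1, that each further attempt happens one retry_interval later, and it ignores scheduling jitter, check latency and on-demand checks. The values 3 and 5 are illustrative, not recommendations from the talk.

```python
def minutes_to_hard(max_check_attempts, retry_interval):
    """Minutes from the first failed check until the state goes HARD.

    Simplified model: attempt 1 is the first failure, and the remaining
    attempts are spaced retry_interval minutes apart.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    return (max_check_attempts - 1) * retry_interval


# Illustrative values only.
host_minutes = minutes_to_hard(max_check_attempts=3, retry_interval=1)
service_minutes = minutes_to_hard(max_check_attempts=5, retry_interval=1)

if host_minutes < service_minutes:
    print("Host goes HARD first: service notifications are suppressed")
else:
    print("Services may go HARD first: expect unnecessary notifications")
```

The same arithmetic applies to a master service and the services that depend on it.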
Service Dependencies•When a master service goes down:
• Prevents notifications from being sent
• Prevents service checks from executing
•Make master service go down HARD quicker than dependent services!
• Otherwise dependencies are pointless
•Master service examples: Ping or NRPE Version
Disable Service Checks ?•host_down_disable_service_checks
• Nagios Core 4.1.x feature (XI 5)• System wide setting• Reduces load on XI host• Think of it as automatic service dependencies on their own hosts
• Service dependencies ignored if host is down
Check Intervals - Be Realistic•Does it need to be checked every 5 minutes?
• Disk Free Space – every 60 minutes perhaps?
• Too long = no performance data
•Different intervals to spread the load• 3, 5, 7 minute intervals• 58, 60, 62 minute intervals
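The effect of staggered intervals can be illustrated with some idealized arithmetic. The sketch below assumes every check starts at the same moment and runs exactly on schedule, which Nagios does not do in practice, so treat the numbers as intuition only. The helper name is hypothetical.

```python
from functools import reduce
from math import gcd


def minutes_until_all_align(intervals):
    """Least common multiple of the check intervals, in minutes.

    In this idealized model it is how often all checks fire together.
    """
    return reduce(lambda a, b: a * b // gcd(a, b), intervals)


for intervals in ([5, 5, 5], [3, 5, 7], [60, 60, 60], [58, 60, 62]):
    minutes = minutes_until_all_align(intervals)
    print(intervals, "all fire together every", minutes, "minutes")
```

Identical intervals line up on every cycle, while slightly different intervals rarely coincide, so the load is spread out.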
Notification & Check Intervals •Nagios decides whether it is allowed to send a notification each time a service check returns a HARD state
• e.g. 15 minute check and 60 minute notification
• Internal scheduling may cause 14min 55sec to pass, 4 x 14:55 = 59min 40sec … it’s < 60min!
• Notification not sent until 75min!•Scheduling is geared +/- to reduce load!
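The arithmetic above can be reproduced with a few lines of code. This is a sketch under one assumption taken from the slide: a repeat notification is only considered when a check result arrives, so it goes out at the first check at or after the notification interval. The function name is hypothetical.

```python
def next_notification_after(check_gap_s, notification_interval_s):
    """Seconds after the previous notification until the next one is sent.

    Assumes notifications are only evaluated when a check result arrives.
    """
    if check_gap_s <= 0:
        raise ValueError("check_gap_s must be positive")
    elapsed = 0
    while elapsed < notification_interval_s:
        elapsed += check_gap_s
    return elapsed


gap = 14 * 60 + 55                  # checks actually land 14min 55sec apart
delay = next_notification_after(gap, 60 * 60)
print(divmod(delay, 60))            # (74, 35): roughly the 75 minutes noted above
```

With checks exactly 15 minutes apart the same function returns 3600 seconds, which is why a few seconds of scheduling drift can push the notification back by a whole check interval.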
Use Hostgroups!•Assign ONE service to a hostgroup of common servers
• Windows Servers• Linux Servers
•Consistent monitoring, standards enforced!•Directive changes - all hosts get updated •Reduces management overhead
Use Contact Groups!•Use contact groups in all definitions•Makes it easy when staff join/leave
• Just add/remove the contact from groups• Reduces administrative overhead• Enforces your company policy
•Similar principle to host groups
Configuration Wizards•Pros
• Great for getting up and running quickly• No need to learn how a plugin works
•Cons• Creates individual services • More work later when enforcing “standards”
Templates•Common settings applied to objects
• Helps enforce standards• Reduces administrative overhead• Layer multiple templates• Can be additive or ignore inheritance
•XI Config Wizard objects use templates
• Example of common icmp check
User Macros – resource.cfg•$USERx$ macros are good for common items like a username or password•Allows passwords with an exclamation mark (!)•Values not visible in object definitions
•$USER1$• /usr/local/nagios/libexec
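Why an exclamation mark in a password is a problem can be shown with a simplified model of how a service's check_command is split into arguments. The sketch below is not Nagios code and does not model escaping. The command name, the $USER5$ slot and the password are hypothetical.

```python
# Hypothetical value that would live in the user macro file.
USER_MACROS = {"$USER5$": "pa!ss"}


def split_check_command(check_command):
    """Split 'name!arg1!arg2' into the command name and its $ARGn$ values."""
    name, *args = check_command.split("!")
    return name, args


def expand(command_line, args):
    """Substitute $ARGn$ and $USERx$ macros into a command line template."""
    for number, value in enumerate(args, start=1):
        command_line = command_line.replace(f"$ARG{number}$", value)
    for macro, value in USER_MACROS.items():
        command_line = command_line.replace(macro, value)
    return command_line


# Password passed as an argument: the "!" splits it in two.
name, args = split_check_command("check_demo!pa!ss")
print(expand("$USER1$/check_demo -p $ARG1$", args))

# Password held in a user macro: it never passes through the "!" split.
print(expand("$USER1$/check_demo -p $USER5$", []))
```

In the first case only the part of the password before the exclamation mark reaches the plugin. In the second case the full value is substituted.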
Custom Object Variables•Allows you to create your own variables•Can be defined in host or service objects•e.g. hosts have their own check_nt password
• Define _CHECK_NT_PASSWORD in host object• In command definitions reference it as:
• $_HOSTCHECK_NT_PASSWORD$•VERY POWERFUL!
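The naming rule in the example above (a host variable _CHECK_NT_PASSWORD is referenced as $_HOSTCHECK_NT_PASSWORD$) can be written down as a small helper. This is only an illustration of the naming convention. The function is hypothetical and is not part of Nagios.

```python
def custom_macro_name(object_type, variable_name):
    """Return the macro used to reference a custom object variable.

    object_type is "host" or "service"; variable_name must start with "_".
    """
    prefixes = {"host": "_HOST", "service": "_SERVICE"}
    if object_type not in prefixes:
        raise ValueError("object_type must be 'host' or 'service'")
    if not variable_name.startswith("_"):
        raise ValueError("custom variable names start with an underscore")
    # Drop the leading underscore and upper-case the rest.
    return f"${prefixes[object_type]}{variable_name[1:].upper()}$"


print(custom_macro_name("host", "_CHECK_NT_PASSWORD"))
# $_HOSTCHECK_NT_PASSWORD$
```

A variable defined on a service follows the same pattern with the _SERVICE prefix.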
MRTG Clean Configs•Your MRTG configs may be collecting more than you think
• /etc/mrtg/conf.d/*.cfg files• Created by Network Switch / Router Wizard
• Comment out unused ports • About 37 lines per port
• Comment out unused non-interfaces (VLANs)
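Commenting out a port by hand means touching every line that belongs to it, so a script can help. The sketch below assumes a cfgmaker-style layout, where each directive is written as Keyword[target]: and continuation lines are indented. The function name and the target naming are hypothetical. Back up the file before trying anything like this.

```python
import re


def comment_out_target(lines, target):
    """Return a copy of lines with every directive for target commented out.

    Indented continuation lines that follow a matching directive are
    commented out as well.
    """
    directive = re.compile(r"^\w+\[" + re.escape(target) + r"\]:")
    result = []
    in_target = False
    for line in lines:
        if directive.match(line):
            in_target = True
        elif line[:1] in (" ", "\t") and line.strip():
            pass  # continuation line: keep the current state
        else:
            in_target = False
        result.append("#" + line if in_target else line)
    return result
```

Typical use would be to read one of the files under /etc/mrtg/conf.d/ into a list of lines, call the function once per unused port, and write the result back.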
Plugins – Compiled vs Scripts•Compiled runs quicker
• Official nagios-plugins are compiled• “Custom modifications” require re-compiling
•Scripts run slower, consume more resources
• Perl plugins known to consume +CPU +RAM
•“nice” can reduce impact of plugins•Check Profiler component by box293
Backend API - Read Only User•API provides you with URLs for use in third party products without needing user/pass
• Requires a user account to be created• Account should be READ ONLY
Performance Data Tool•Component developed by box293
• Allows you to manipulate RRD files• Great for merging RRD data• Can also delete old RRD files for old services
• View raw data in tables•Find it in the Nagios Exchange