Beyond Pretty Charts, Analytics for the Rest of Us. Toufic Boubez, DevOps Days Silicon Valley...
DESCRIPTION
Current monitoring tools are clearly reaching the limit of their capabilities. That's because these tools are based on fundamental assumptions that are no longer true, such as assuming that the underlying system being monitored is relatively static, or that the behavioral limits of these systems can be defined by static rules and thresholds. Interest in applying analytics and machine learning to detect anomalies in dynamic web environments is gaining steam. However, understanding which algorithms should be used to identify and predict anomalies accurately within all the data we generate is not so easy.
This talk builds on an Open Space discussion that was started at DevOps Days Austin. We will begin with a brief definition of the types of anomalies commonly found in dynamic data center environments and then discuss some of the key elements to consider when thinking about anomaly detection, such as:
• Understanding your data and the two main approaches for analyzing operations data: parametric and non-parametric methods
• The importance of context
• Simple data transformations that can give you powerful results
TRANSCRIPT
Beyond The Pretty Charts
Analytics for the rest of us
Toufic Boubez, Ph.D.
Co-Founder, CTO
Metafor Software
2
Toufic intro – who I am
• Co-Founder/CTO, Metafor Software
• Co-Founder/CTO, Layer 7 Technologies
– API Management
– Acquired by Computer Associates in 2013
• I escaped
• Building large-scale software systems for 20 years (I’m older than I look, I know!)
3
Why this talk?
• DevOps Days Austin: Open Space talk
– Blog: http://metaforsoftware.com/beyond-the-pretty-charts-a-report-from-devopsdays-in-austin/
• Five major discussion points/lessons learned
• Note: no labels on charts – on purpose!!
• Note: real data
4
Wall of charts
5
1. We’ve moved beyond static thresholds
• Most current monitoring tools assume that the underlying system is relatively static, so we can surround it with static thresholds and rules. BUT:
– So what if my unicorn usage is at 91%, and has been stable at 91% for a while?
– I’d much rather know if it’s at 60% and has been rapidly increasing over the last few hours.
6
Need more better analytics
• Thresholds won’t help you in this case
• Need some more dynamic analytics
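To make the unicorn example concrete, here is a minimal sketch that compares a static threshold with a simple trend check. The function names, the 90% limit, and the slope limit of 1 point per sample are all assumptions chosen for illustration; a least-squares slope is only one of many ways to capture "rapidly increasing".

```python
def static_threshold_alert(samples, limit=90.0):
    """Classic rule: alert when the latest sample is above a fixed limit."""
    return samples[-1] > limit


def slope_per_step(samples):
    """Least-squares slope of the samples against their index."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def trend_alert(samples, max_slope=1.0):
    """Alert when usage climbs faster than max_slope points per sample."""
    return slope_per_step(samples) > max_slope


# Hypothetical data: one metric parked at 91%, one climbing steadily to 60%.
stable_high = [91.0] * 12
rising = [32.5 + 2.5 * i for i in range(12)]

print(static_threshold_alert(stable_high), trend_alert(stable_high))
print(static_threshold_alert(rising), trend_alert(rising))
```

The static rule fires on the stable 91% series and stays silent on the climbing one, which is the opposite of what we would want to be woken up for.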
7
2. Context is really important
– Do I really want to be alerted when I know someone is performing maintenance or backups?
– Is there an event that caused the change in behaviour (e.g. new deploy)?
– Correlate your event line with your monitoring
Down for maintenance?
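One possible way to bring context into alerting is sketched below: suppress anomalies that fall inside a known maintenance window, and attach recent events (such as a deploy) to the ones that remain. The function names, the 30-minute lookback, and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def in_maintenance(ts, windows):
    """True if the timestamp falls inside any (start, end) maintenance window."""
    return any(start <= ts < end for start, end in windows)


def nearby_events(ts, events, lookback=timedelta(minutes=30)):
    """Names of events logged within the lookback period before the timestamp."""
    return [name for when, name in events if ts - lookback <= when <= ts]


def triage(anomaly_ts, windows, events):
    """Decide what to do with an anomaly given the event line."""
    if in_maintenance(anomaly_ts, windows):
        return "suppressed: maintenance window"
    causes = nearby_events(anomaly_ts, events)
    if causes:
        return "alert, possibly related to: " + ", ".join(causes)
    return "alert"


# Hypothetical event line: a nightly backup window and an afternoon deploy.
windows = [(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 3, 0))]
events = [(datetime(2024, 1, 1, 14, 5), "deploy")]

print(triage(datetime(2024, 1, 1, 2, 30), windows, events))
print(triage(datetime(2024, 1, 1, 14, 20), windows, events))
print(triage(datetime(2024, 1, 1, 18, 0), windows, events))
```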
8
3. Know your data!!
– You need to understand the statistical properties of your data, and where it comes from, in order to determine what kind of analytics to use.
• For example, it’s important to know if your data is normally distributed.
• http://codeascraft.com/2013/06/11/introducing-kale/
• https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py
– Three-sigma, Grubbs and other algorithms assume normal distribution
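A rough illustration of why the normality assumption matters: the sketch below applies the same three-sigma rule to a bell-shaped data set and to a skewed one. This is not Skyline's code; the helper names are made up, and both data sets are synthetic, built from evenly spaced quantiles of a normal and an exponential distribution so the result is repeatable.

```python
import math
from statistics import NormalDist, mean, pstdev


def skewness(data):
    """Sample skewness: roughly 0 for symmetric data, positive for a long right tail."""
    m, s = mean(data), pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)


def three_sigma_outliers(data):
    """Points more than three standard deviations away from the mean."""
    m, s = mean(data), pstdev(data)
    return [x for x in data if abs(x - m) > 3 * s]


# Synthetic data (an assumption for illustration, not real metrics).
n = 1000
quantiles = [(i + 0.5) / n for i in range(n)]
normal_like = [NormalDist().inv_cdf(q) for q in quantiles]
skewed = [-math.log(1.0 - q) for q in quantiles]  # exponential shape

for name, data in (("normal-like", normal_like), ("skewed", skewed)):
    flagged = len(three_sigma_outliers(data))
    print(name, "skewness:", round(skewness(data), 2), "flagged:", flagged)
```

Nothing in the skewed set is actually anomalous, since every point comes from the same distribution, yet the three-sigma rule flags several times as many points there. A quick skewness check (or simply looking at a histogram) before choosing an algorithm can save a lot of false alarms.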
9
What’s normal?
10
What’s my distribution?
11
Another common distribution
12
4. Is all data important to collect?
– Two camps:
• Data is data, let’s collect and analyze everything and figure out the trends.
• Not all data is important, so let’s figure out what’s important first and understand the underlying model so we don’t waste resources on the rest.
– Similar to the very public bun fight between Noam Chomsky and Peter Norvig
• http://norvig.com/chomsky.html
– Unresolved as far as I know
13
Do we need both metrics?
14
5. We all want to automate
• Having humans in the way of detecting and solving DevOps issues doesn’t scale.
• At some point, we need systems that can detect anomalies before problems become critical, and take appropriate action.
15
Open Loop Control System: Heating your house – the wrong way!
• Steps:
– Tweak heater input
– Get to ideal temperature
– Lock gas valve
– Hope nothing changes
[Diagram: Controller (gas valve), System (heater), Sensor (thermometer)]
16
Closed Loop Control System: Heating your house – the right way
• Steps:
– Set the desired temperature
– Sit back and let the system deal with changes
[Diagram: feedback loop with Controller (gas valve), System (heater), Sensor (thermometer); delta = desired temperature (+) minus current temperature (-)]
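The difference between the two heating setups can be sketched with a toy simulation. Everything here is an assumption made for illustration: the room model, the GAIN and LOSS constants, and the simple on/off thermostat rule standing in for the feedback controller.

```python
GAIN = 0.05  # degrees gained per unit of heater output per step (assumed)
LOSS = 0.05  # fraction of the indoor/outdoor difference lost per step (assumed)


def step(temp, heat, outside):
    """Advance the toy room model by one time step."""
    return temp + GAIN * heat - LOSS * (temp - outside)


def locked_valve(fixed_heat):
    """No feedback: the heater output never changes."""
    return lambda temp: fixed_heat


def thermostat(desired, max_heat=40.0):
    """Feedback: heat only while the measured temperature is below the target."""
    return lambda temp: max_heat if temp < desired else 0.0


def simulate(controller, outside_temps, start=20.0):
    """Run the controller against a sequence of outside temperatures."""
    temp, history = start, []
    for outside in outside_temps:
        temp = step(temp, controller(temp), outside)
        history.append(temp)
    return history


# The valve is tuned for 5 degrees outside, then the weather turns to -5.
outside = [5.0] * 100 + [-5.0] * 200
without_feedback = simulate(locked_valve(15.0), outside)
with_feedback = simulate(thermostat(20.0), outside)

print(round(without_feedback[99], 1), round(without_feedback[-1], 1))
print(round(min(with_feedback), 1), round(max(with_feedback), 1))
```

With the valve locked, the room sits at the ideal temperature only until the weather changes, then drifts to a new and much colder equilibrium. With feedback, the temperature keeps hovering around the target without anyone touching the valve.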
17
What’s missing to get to self-healing systems
• We have most of the tools already
• Need to add:
– Error tracking (anomaly detection)
– Corrective action
[Diagram: the same feedback loop applied to infrastructure. Controller: Puppet, Chef, CFEngine…; System: My Infrastructure; Sensor: Nagios, Cacti, Zabbix…; delta between desired state (+) and current state (-), marked "?"]
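As a very rough sketch of those two missing pieces, the snippet below computes the delta between a desired state and a current state and turns it into a list of corrective actions. The state keys, values, and function names are hypothetical; a real system would take the desired state from a configuration management tool and the current state from monitoring.

```python
def state_delta(desired, current):
    """Map each drifting key to a (current value, desired value) pair."""
    return {
        key: (current.get(key), want)
        for key, want in desired.items()
        if current.get(key) != want
    }


def corrective_actions(delta):
    """Describe one corrective action per drifting key."""
    return [
        f"set {key}: {have!r} -> {want!r}"
        for key, (have, want) in sorted(delta.items())
    ]


# Hypothetical states for illustration only.
desired = {"nginx": "running", "workers": 4}
current = {"nginx": "stopped", "workers": 4}

for action in corrective_actions(state_delta(desired, current)):
    print(action)
```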
18
How much data do we need?
• Trend towards higher and higher sampling rates in data collection
• Reminds me of Jorge Luis Borges’ story about Funes the Memorious
– Perfect recollection of the slightest details of every instant of his life, but lost the ability for abstraction
• Our brain works on abstraction
– We notice patterns BECAUSE we can abstract
19
The danger of over-abstraction
+
= comfortable?
20
So, how much data DO you need?
– You don’t need more resolution than twice your highest frequency (Nyquist-Shannon sampling theorem)
– Most of the algorithms for analytics will smooth, average, filter, and pre-process the data.
– Watch out for correlated metrics (e.g. used vs. available memory)
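For the correlated-metrics point, a quick check is to compute the Pearson correlation between two series before feeding both into an analysis. The sketch below uses made-up memory numbers and an assumed cutoff of 0.95; since available memory is just total minus used, the two series carry the same information.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical samples, in GB, on a 16 GB host.
total = 16.0
used = [4.0, 5.5, 7.0, 6.5, 9.0, 12.0]
available = [total - u for u in used]

r = pearson(used, available)
redundant = abs(r) > 0.95  # assumed cutoff for "these tell the same story"
print(round(r, 3), redundant)
```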
21
More?
• I want to talk more about analytics, in more depth, but time’s up!!
– (Actually John won’t let me)
• Come talk to me during the breaks!
• Thank you!