beyond pretty charts, analytics for the rest of us. toufic boubez devops days silicon valley...

Beyond The Pretty Charts

Analytics for the rest of us

Toufic Boubez, Ph.D.
Co-Founder, CTO
Metafor Software

Toufic intro – who I am

• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
– API Management
– Acquired by Computer Associates in 2013
• I escaped
• Building large scale software systems for 20 years (I’m older than I look, I know!)

Why this talk?

• DevOps Days Austin: Open Space talk
– Blog: http://metaforsoftware.com/beyond-the-pretty-charts-a-report-from-devopsdays-in-austin/

• Five major discussion points/lessons learned

• Note: no labels on charts – on purpose!!
• Note: real data



Wall of charts

1. We’ve moved beyond static thresholds

• Most current monitoring tools assume that the underlying system is relatively static, so we can surround it with static thresholds and rules. BUT:
– So what if my unicorn usage is at 91%, and has been stable at 91% for a while?
– I’d much rather know if it’s at 60% and has been rapidly increasing over the last few hours.
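
To make the 91% vs. 60% point concrete, here is a minimal Python sketch comparing a fixed threshold with a simple trend check. The function names, the 90% limit, the slope cutoff and the sample data are all made up for illustration; they do not come from any particular monitoring tool.

```python
from typing import Sequence


def static_threshold_alert(samples: Sequence[float], limit: float = 90.0) -> bool:
    """Classic rule: fire when the latest sample is above a fixed limit."""
    return samples[-1] > limit


def slope_per_sample(samples: Sequence[float]) -> float:
    """Least-squares slope of the samples against their index (units per sample)."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def trend_alert(samples: Sequence[float], max_slope: float = 2.0) -> bool:
    """Dynamic rule: fire when usage climbs faster than max_slope per sample."""
    return slope_per_sample(samples) > max_slope


# Hypothetical data, one sample every 15 minutes.
stable_high = [91.0] * 13                      # flat at 91%
rising = [30.0 + 2.5 * i for i in range(13)]   # climbs from 30% to 60%
```

With this data the static rule fires on `stable_high` and stays silent on `rising`, while the trend rule does the opposite, which is the behaviour the slide is asking for.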

Need more better analytics

• Thresholds won’t help you in this case
• Need some more dynamic analytics

2. Context is really important

– Do I really want to be alerted when I know someone is performing maintenance or backups?

– Is there an event that caused the change in behaviour (e.g. new deploy)?

– Correlate your event line with your monitoring

Down for maintenance?
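
One way to bring context into alerting is to check every alert against an event line before paging anyone. The sketch below is only an illustration: the `Event` class, the `QUIET_KINDS` set, the 30-minute lookback and the timestamps are all assumptions, not part of any real tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

QUIET_KINDS = {"maintenance", "backup"}  # assumption: event kinds that mute alerts


@dataclass(frozen=True)
class Event:
    kind: str
    start: datetime
    end: datetime


def annotate_alert(when, events, lookback=timedelta(minutes=30)):
    """Attach event context to an alert raised at time `when`."""
    # Events still running when the alert fired (e.g. a backup in progress).
    in_progress = [e for e in events if e.start <= when <= e.end]
    # Events that already ended but started shortly before (e.g. a new deploy).
    recent = [e for e in events if e.end < when and when - e.start <= lookback]
    return {
        "suppressed": any(e.kind in QUIET_KINDS for e in in_progress),
        "context": [e.kind for e in in_progress + recent],
    }


# Hypothetical event line for one day.
events = [
    Event("backup", datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 3, 0)),
    Event("deploy", datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 5)),
]
during_backup = annotate_alert(datetime(2024, 1, 1, 2, 30), events)
after_deploy = annotate_alert(datetime(2024, 1, 1, 14, 20), events)
quiet_time = annotate_alert(datetime(2024, 1, 1, 10, 0), events)
```

Here the alert during the backup is suppressed, the alert after the deploy still fires but carries "deploy" as context, and the mid-morning alert has no context at all.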

3. Know your data!!

– You need to understand the statistical properties of your data, and where it comes from, in order to determine what kind of analytics to use.

• For example, it’s important to know if your data is normally distributed.

• http://codeascraft.com/2013/06/11/introducing-kale/
• https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py
– Three-sigma, Grubbs and other algorithms assume normal distribution
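
A rough Python sketch of why the normality assumption matters. This is not the Skyline code; the helper names, the skewness cutoff of 0.5 and the sample data are made up, and a skewness check is only a crude heuristic, not a formal normality test.

```python
import statistics


def skewness(data):
    """Population skewness: 0 for symmetric data, large for long tails."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    return sum((x - mean) ** 3 for x in data) / (len(data) * std ** 3)


def is_roughly_symmetric(data, max_abs_skew=0.5):
    """Crude sanity check before trusting normal-based algorithms."""
    return abs(skewness(data)) <= max_abs_skew


def three_sigma_outliers(data):
    """Values more than three standard deviations from the mean."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    return [x for x in data if abs(x - mean) > 3 * std]


# Hypothetical latency samples (ms) with a long right tail.
latencies = [10] * 50 + [12] * 30 + [15] * 15 + [40] * 4 + [200]
# Hypothetical symmetric samples.
symmetric = [8, 9, 9, 10, 10, 10, 11, 11, 12]
```

On the long-tailed sample, three-sigma flags only the 200 and treats the four 40s as normal, because the tail itself inflates the mean and standard deviation. The skewness check would have warned that the normal assumption does not hold.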

What’s normal?

What’s my distribution?

Another common distribution

4. Is all data important to collect?

– Two camps:
• Data is data, let’s collect and analyze everything and figure out the trends.
• Not all data is important, so let’s figure out what’s important first and understand the underlying model so we don’t waste resources on the rest.

– Similar to the very public bun fight between Noam Chomsky and Peter Norvig

• http://norvig.com/chomsky.html

– Unresolved as far as I know

Do we need both metrics?

5. We all want to automate

• Having humans in the way of detecting and solving DevOps issues doesn’t scale.

• At some point, we need systems that can detect anomalies before problems become critical, and take appropriate action.

Open Loop Control System: Heating your house – the wrong way!

• Steps:
– Tweak heater input
– Get to ideal temperature
– Lock gas valve
– Hope nothing changes

[Diagram: Controller (gas valve), System (heater), Sensor (thermometer)]

Closed Loop Control System: Heating your house – the right way

• Steps:
– Set the desired temperature
– Sit back and let the system deal with changes

[Diagram: desired temperature (+) and current temperature (-) are compared to give a delta; Controller (gas valve), System (heater), Sensor (thermometer)]
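
The difference between the two heating approaches can be sketched with a toy simulation in Python. Everything here is an assumption for illustration: the room model, the constants `HEATER_GAIN` and `HEAT_LOSS`, the proportional gain `kp`, and the outside temperature dropping from 5 to -10 halfway through.

```python
HEATER_GAIN = 2.0   # degrees added per step at full valve (made-up constant)
HEAT_LOSS = 0.05    # fraction of the indoor/outdoor gap lost per step (made-up)


def open_loop(desired, current):
    """Valve locked at the setting that was tuned for 5 degrees outside."""
    return 0.375


def closed_loop(desired, current, kp=0.45):
    """Proportional control: valve opening follows the delta."""
    delta = desired - current
    return kp * delta


def simulate(controller, steps=200, desired=20.0):
    """Return the room temperature after `steps` steps of a toy room model."""
    temp = desired
    for step in range(steps):
        # The weather changes halfway through the run.
        outside = 5.0 if step < steps // 2 else -10.0
        valve = min(1.0, max(0.0, controller(desired, temp)))
        temp += HEATER_GAIN * valve - HEAT_LOSS * (temp - outside)
    return temp


open_final = simulate(open_loop)
closed_final = simulate(closed_loop)
```

In this toy model the locked valve lets the room drift down to roughly 5 degrees once the weather changes, while the feedback version stays within about two degrees of the target. A purely proportional controller like this one always leaves a small residual offset; it is used here only because it is the shortest thing that closes the loop.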

What’s missing to get to self-healing systems

[Diagram: the same feedback loop applied to infrastructure. Controller: Puppet, Chef, CFEngine, …; System: my infrastructure; Sensor: Nagios, Cacti, Zabbix, …; desired state (+) and current state (-) give a delta; a “?” marks the missing piece]

• We have most of the tools already
• Need to add:
– Error tracking (anomaly detection)
– Corrective action
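
A minimal Python sketch of what closing that loop could look like. It does not use any real Puppet, Chef or Nagios API: `read_state` stands in for the sensor, `apply_fix` for the controller, and the state keys and `max_rounds` limit are all hypothetical.

```python
def compute_delta(desired, current):
    """Error tracking: every key whose current value differs from the desired one."""
    return {
        key: (current.get(key), wanted)
        for key, wanted in desired.items()
        if current.get(key) != wanted
    }


def reconcile(desired, read_state, apply_fix, max_rounds=3):
    """Corrective action: keep fixing the delta until it is empty or we give up."""
    for _ in range(max_rounds):
        delta = compute_delta(desired, read_state())
        if not delta:
            return True
        for key, (_, wanted) in delta.items():
            apply_fix(key, wanted)
    return not compute_delta(desired, read_state())


# Hypothetical in-memory "infrastructure" used in place of real tools.
infrastructure = {"nginx": "stopped", "workers": 2}
desired_state = {"nginx": "running", "workers": 4}

healed = reconcile(
    desired_state,
    read_state=lambda: dict(infrastructure),            # sensor stand-in
    apply_fix=lambda k, v: infrastructure.update({k: v}),  # controller stand-in
)
```

The `max_rounds` cap is a deliberate choice: if the corrective action does not stick, the loop stops and reports failure so a human can step in.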

How much data do we need?

• Trend towards higher and higher sampling rates in data collection

• Reminds me of Jorge Luis Borges’ story about Funes the Memorious
– Perfect recollection of the slightest details of every instant of his life, but lost the ability for abstraction
• Our brain works on abstraction
– We notice patterns BECAUSE we can abstract

The danger of over-abstraction

[Figure: … + … = comfortable?]

So, how much data DO you need?

– You don’t need more resolution than twice your highest frequency (Nyquist-Shannon sampling theorem)

– Most of the algorithms for analytics will smooth, average, filter, and pre-process the data.

– Watch out for correlated metrics (e.g. used vs. available memory)
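
Two of the points above can be sketched in a few lines of Python: the sampling interval implied by Nyquist, and a correlation check for redundant metrics. The helper names, the 0.95 cutoff and the memory figures are made up for illustration.

```python
import math


def max_sampling_interval(shortest_period_s):
    """Nyquist bound: sample at least twice per shortest cycle you care about.

    Strictly, the interval should be a little shorter than this bound.
    """
    return shortest_period_s / 2


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def is_redundant(xs, ys, cutoff=0.95):
    """Two metrics carry nearly the same information if |r| is close to 1."""
    return abs(pearson(xs, ys)) >= cutoff


# Hypothetical host with 16 GB of memory.
total_gb = 16.0
used_gb = [4.0, 6.5, 7.0, 9.5, 12.0, 11.0]
available_gb = [total_gb - u for u in used_gb]

interval_s = max_sampling_interval(600)  # fastest pattern of interest: 10 minutes
r = pearson(used_gb, available_gb)
```

If the fastest pattern you care about repeats every 10 minutes, sampling every 5 minutes is the bound and per-second collection adds little. Used and available memory come out perfectly anti-correlated, so one of the two is enough.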

More?

• I want to talk more about analytics, in more depth, but time’s up!!
– (Actually John won’t let me)
• Come talk to me during the breaks!
• Thank you!
