Logging Makes Perfect - Riemann, Elasticsearch and Friends


Upload: itamar

Post on 27-Jan-2017


TRANSCRIPT

Page 1: Logging makes perfect - Riemann, Elasticsearch and friends

Logging Makes Perfect

Itamar Syn-Hershko http://code972.com

@synhershko    

Real-world monitoring and visualizations

Page 2: Logging makes perfect - Riemann, Elasticsearch and friends

Me?

• Itamar Syn-Hershko / @synhershko
• Blog at http://code972.com
• Long-time OSS contributor
• Lucene.NET PMC and lead committer
• Elasticsearch consulting partner
• Building software to catch fraudsters @ Forter

Page 3: Logging makes perfect - Riemann, Elasticsearch and friends
Page 4: Logging makes perfect - Riemann, Elasticsearch and friends
Page 5: Logging makes perfect - Riemann, Elasticsearch and friends

• Purchase Amount: $15K
• Billing Neighborhood: Washington (high income)
• Credit Card Type: Platinum+
• IP Location: Kibuve, Rwanda
• Shipping Neighborhood: Virginia (Low income)
• Past Purchase Amounts: < $100
• Billing Address: Ivy League School
• Shipping type: Expedited Shipping

Sean Smith, Macbook Air x 3

DECLINED

Page 6: Logging makes perfect - Riemann, Elasticsearch and friends

• Purchase Amount: $2,450
• Billing Neighborhood: Washington (high income)
• Credit Card Type: Platinum+
• IP Location: Kibuve, Rwanda
• Shipping Neighborhood: Virginia (Low income)
• Past Purchase Amounts: < $100
• Billing Address: Ivy League School
• Phone: (202) 222-3850

Sean is an officer, serving overseas in Rwanda. His shipping address matches the pattern of diplomat mailboxes for US embassies and his billing address is still linked to his old college accommodation.

How did we know?  

 

Sean Smith, Macbook Air x 3

BEHAVIORAL DATA

REAL-TIME APPROVED

Page 7: Logging makes perfect - Riemann, Elasticsearch and friends

Different systems, different SLAs

   

                    Performance       Data Loss     Business Logic
TX Processing       Low Latency       Nope          Best Effort
Stream Processing   High Throughput   Best Effort   Best Effort
Batch Processing    High Volume       Nope          Reconciliation

Page 8: Logging makes perfect - Riemann, Elasticsearch and friends

Forter’s TX Processing API

Forter

Transaction details

Decision (approve / decline)

milliseconds

Page 9: Logging makes perfect - Riemann, Elasticsearch and friends

Achieving Low Latencies

For comparison:

• A few network hops between AWS availability zones (~1ms)
• JVM young garbage collection of a few hundred MBs (~20ms)
• Read by primary key from SSD storage (~30ms)
• US West Coast to East Coast network latency (~100ms)
• JVM full garbage collection of a few hundred MBs (~500ms-1000ms)

Page 10: Logging makes perfect - Riemann, Elasticsearch and friends

Distributed Systems

Page 11: Logging makes perfect - Riemann, Elasticsearch and friends

Our stack

 nginx  (lua)    

expressjs  (nodejs)    

Storm  (java)      

Stability  

Code  Changes  

Page 12: Logging makes perfect - Riemann, Elasticsearch and friends

Monitoring and Alerting

Page 13: Logging makes perfect - Riemann, Elasticsearch and friends
Page 14: Logging makes perfect - Riemann, Elasticsearch and friends

What do we monitor?

• Latencies
  – Benchmark tools are useless here
  – Always prefer percentiles, measure the long tail
  – Distinguish real traffic from synthetic requests
• Availability
  – No non-200 HTTP status codes returned
• High / suspicious metrics
• Errors, exceptions
• Wrong processing (data loss?)
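Why percentiles rather than averages: a minimal Python sketch, not taken from the deck. The latency numbers are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank method.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 97 fast requests and 3 slow ones.
latencies_ms = [20] * 97 + [900, 1200, 1500]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean_ms}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

With this data the mean sits well above the median yet far below what the slowest requests actually experience; only the high percentile exposes the long tail.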

Page 15: Logging makes perfect - Riemann, Elasticsearch and friends

Detection

Page 16: Logging makes perfect - Riemann, Elasticsearch and friends

Filter & route

Page 17: Logging makes perfect - Riemann, Elasticsearch and friends

alert  

Page 18: Logging makes perfect - Riemann, Elasticsearch and friends

Diagnostics

Page 19: Logging makes perfect - Riemann, Elasticsearch and friends

• Pingdom probes - low coverage, reliable
• CloudWatch, collectd - fast, no root cause
• Internal probes - better coverage, false alarms
• libbeat and logstash for monitoring 3rd party applications - sometimes noisy, false alarms
• App events (exceptions) - too noisy, root cause

Page 20: Logging makes perfect - Riemann, Elasticsearch and friends

Redundancy of detection sources

... is important!

• Different concerns
• Varying accuracy and verbosity
• High availability
• Correlation

Page 21: Logging makes perfect - Riemann, Elasticsearch and friends

Automatic Latency Reporting: NodeJS

georg (named after Georg Friedrich Bernhard Riemann) is a Node.js library that acts as an in-process Riemann agent. The library can send Node.js-specific and application-specific metrics to Riemann.

https://github.com/forter/georg

Page 22: Logging makes perfect - Riemann, Elasticsearch and friends
Page 23: Logging makes perfect - Riemann, Elasticsearch and friends

Apache Storm Basics

Page 24: Logging makes perfect - Riemann, Elasticsearch and friends

Automatic Latency Reporting: Apache Storm

https://github.com/forter/riemann-storm-monitor

Page 25: Logging makes perfect - Riemann, Elasticsearch and friends

The ELK Stack

Queue  (Redis)  

“Shippers”  

“Indexer”  

Page 26: Logging makes perfect - Riemann, Elasticsearch and friends

The ELK Stack + Riemann

Queue  (Redis)  

“Shippers”  

Page 27: Logging makes perfect - Riemann, Elasticsearch and friends

• Detect failures
• Expect different latencies from different components, dynamically
• Tolerate spikes
• High / low watermarks
• Reshape event streams (split, merge, …)

Done via Riemann, an event stream processor
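The high / low watermark idea, sketched in plain Python rather than Riemann config. `WatermarkAlarm` and the thresholds are hypothetical; the point is only that an alarm raised above the high mark clears when the value drops below the low mark, so readings hovering in between do not flap.

```python
class WatermarkAlarm:
    """Raise above `high`, clear only below `low`; values in between keep the current state."""

    def __init__(self, low, high):
        if low >= high:
            raise ValueError("low must be below high")
        self.low = low
        self.high = high
        self.raised = False

    def observe(self, value):
        """Return "raise" or "clear" on a state transition, otherwise None."""
        if not self.raised and value > self.high:
            self.raised = True
            return "raise"
        if self.raised and value < self.low:
            self.raised = False
            return "clear"
        return None

# Hypothetical metric readings (e.g. queue depth).
alarm = WatermarkAlarm(low=100, high=200)
readings = [90, 210, 150, 205, 120, 95, 180]
transitions = []
for value in readings:
    change = alarm.observe(value)
    if change is not None:
        transitions.append((value, change))

print(transitions)
```

Of the seven readings only two produce a transition; the values between the two marks are absorbed silently.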

Page 28: Logging makes perfect - Riemann, Elasticsearch and friends

Riemann

Page 29: Logging makes perfect - Riemann, Elasticsearch and friends

; Listen on the local interface over TCP (5555), UDP (5555) & websockets
; (5556)
(let [host "127.0.0.1"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server  {:host host}))

; Expire old events from the index every 5 seconds.
(periodically-expire 5)

(let [index (index)]
  ; Inbound events will be passed to these streams:
  (streams
    (default :ttl 60
      ; Index all events immediately.
      index

      ; Log expired events.
      (expired
        (fn [event] (info "expired" event))))))

Page 30: Logging makes perfect - Riemann, Elasticsearch and friends

Filter tests using a state machine

Page 31: Logging makes perfect - Riemann, Elasticsearch and friends

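Filter tests using a state machine

; Trigger a PagerDuty incident when a test fails; resolve it when it passes again.
(defn pagerduty-test-dispatch [key]
  (let [pd (pagerduty key)]
    ; React only to state changes, assuming tests start out "passed".
    (changed-state {:init "passed"}
      (where (state "passed") (:resolve pd))
      (where (state "failed") (:trigger pd)))))

; Route events tagged "apisRegression" to the dispatcher.
(tagged "apisRegression"
  (pagerduty-test-dispatch "1234567892ed295d91"))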
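What the state machine buys you, as a plain Python sketch. This is an illustration of the idea, not Riemann's implementation: `ChangedState` and `dispatch` are hypothetical names, and keying by host and service is an assumption made for this example.

```python
class ChangedState:
    """Forward an event only when its state differs from the previous state for that (host, service)."""

    def __init__(self, init, on_change):
        self.init = init            # state assumed before the first event arrives
        self.on_change = on_change  # callback invoked on each transition
        self.last = {}              # (host, service) -> last seen state

    def __call__(self, event):
        key = (event["host"], event["service"])
        previous = self.last.get(key, self.init)
        self.last[key] = event["state"]
        if event["state"] != previous:
            self.on_change(event)

actions = []

def dispatch(event):
    """Stand-in for the PagerDuty trigger / resolve calls."""
    if event["state"] == "failed":
        actions.append(("trigger", event["host"], event["service"]))
    elif event["state"] == "passed":
        actions.append(("resolve", event["host"], event["service"]))

stream = ChangedState(init="passed", on_change=dispatch)
for state in ["passed", "failed", "failed", "failed", "passed"]:
    stream({"host": "10.0.0.1", "service": "probe1", "state": state})

print(actions)
```

Five test results produce one trigger and one resolve; the repeated failures in the middle do not page anyone again.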

Page 32: Logging makes perfect - Riemann, Elasticsearch and friends

Re-open manually resolved alert

Page 33: Logging makes perfect - Riemann, Elasticsearch and friends

Re-open manually resolved alert

(defn pagerduty-test-dispatch [key]
  (let [pd (pagerduty key)]
    (sdo
      ; Resolve once, when the state changes back to "passed".
      (changed-state {:init "passed"}
        (where (state "passed") (:resolve pd)))
      ; While failing, re-trigger at most once per 60 seconds per host and service.
      (where (state "failed")
        (by [:host :service]
          (throttle 1 60 (:trigger pd)))))))

(tagged "apisRegression"
  (pagerduty-test-dispatch "1234567892ed295d91"))
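The throttling half of this config, sketched in Python under stated assumptions. `Throttle` is a hypothetical fixed-window limiter written for this note, not Riemann's code, and time is passed in explicitly so the example is deterministic. As read from the slide, the intent is that a still-failing test keeps triggering about once a minute, so an incident someone resolved by hand is opened again.

```python
class Throttle:
    """Pass at most `limit` events per `window` seconds for each key; drop the rest."""

    def __init__(self, limit, window, sink):
        self.limit = limit
        self.window = window
        self.sink = sink
        self.windows = {}  # key -> (window_start, events_passed_in_window)

    def __call__(self, key, event, now):
        start, count = self.windows.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # a new window begins
        if count < self.limit:
            self.sink(event)
            count += 1
        self.windows[key] = (start, count)

sent = []
throttle = Throttle(limit=1, window=60, sink=sent.append)
key = ("10.0.0.1", "probe1")  # throttled separately per host and service

# A probe that keeps failing, reporting at these (hypothetical) timestamps in seconds.
for t in [0, 10, 30, 59, 60, 90, 125]:
    throttle(key, {"state": "failed", "time": t}, now=t)

print([event["time"] for event in sent])
```

Seven failure events become three triggers, one per 60-second window.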

Page 34: Logging makes perfect - Riemann, Elasticsearch and friends

Riemann Events

Page 35: Logging makes perfect - Riemann, Elasticsearch and friends

Riemann’s Index

Key (host + service)   Event                TTL
10.0.0.1-redisfree     {"metric":"5"}       60
10.0.0.1-probe1        {"state":"failed"}   300
10.0.0.2-probe1        {"state":"passed"}   300

Page 36: Logging makes perfect - Riemann, Elasticsearch and friends

Riemann Ghosts

• "Expired events"
• Use case: Detect dead services
• Use case: Cron stopped working (e.g. backup)
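How a TTL index produces these "ghosts", as a small Python sketch. `Index` is a hypothetical stand-in for Riemann's index, the service name `backup-cron` is invented for the example, and timestamps are passed in by hand: an entry that is not refreshed before its TTL runs out is removed and surfaced as an expired event.

```python
class Index:
    """Latest event per (host, service); entries expire when not refreshed within their TTL."""

    def __init__(self):
        self.entries = {}  # (host, service) -> (event, deadline)

    def update(self, event, now):
        key = (event["host"], event["service"])
        self.entries[key] = (event, now + event["ttl"])

    def expire(self, now):
        """Remove and return the events whose deadline has passed."""
        dead = [key for key, (_, deadline) in self.entries.items() if deadline <= now]
        return [self.entries.pop(key)[0] for key in dead]

index = Index()
index.update({"host": "10.0.0.1", "service": "redisfree", "ttl": 60}, now=0)
index.update({"host": "10.0.0.1", "service": "backup-cron", "ttl": 300}, now=0)
index.update({"host": "10.0.0.1", "service": "redisfree", "ttl": 60}, now=50)  # heartbeat

expired_at_100 = index.expire(now=100)  # redisfree was refreshed at t=50
expired_at_120 = index.expire(now=120)  # redisfree has gone silent
expired_at_300 = index.expire(now=300)  # the cron job never reported again

print(expired_at_100, expired_at_120, expired_at_300)
```

The silence itself becomes the signal: nothing has to report a failure for the dead service or the stalled cron job to be noticed.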

Page 37: Logging makes perfect - Riemann, Elasticsearch and friends

Riemann Overloaded

Key (IP + cookie)   Event                    TTL
199.25.1.1-1234     {"state":"loaded"}       300
199.25.2.1-4567     {"state":"downloaded"}   300
199.25.3.1-8901     {"state":"loaded"}       300

Page 38: Logging makes perfect - Riemann, Elasticsearch and friends

Something failed, now what?

• Always prefer auto-healing
  – Humans are slow
  – But beware of a domino effect
• Fail fast and automatic failover
  – Code throws exception, logs it and moves forward
  – Fatal exceptions will cause the process to restart (upstart)
  – Auto scaling groups keep enough machines up
  – ELB will phase out dead stacks
  – HTTP fencing (commercial products)

Page 39: Logging makes perfect - Riemann, Elasticsearch and friends

Alert

• When human intervention is required
• Urgent: immediate action
  – PagerDuty (phone calls, SMS, push notifications)
• Low urgency: events are logged and reviewed the morning after
  – PagerDuty (phone push notifications)
  – Slack
  – Email

Page 40: Logging makes perfect - Riemann, Elasticsearch and friends

Diagnostics

Page 41: Logging makes perfect - Riemann, Elasticsearch and friends

The ELK stack

For real-time analytics and dashboards

Page 42: Logging makes perfect - Riemann, Elasticsearch and friends

The ELK Stack + Riemann

Queue  (Redis)  

“Shippers”  

Page 43: Logging makes perfect - Riemann, Elasticsearch and friends
Page 44: Logging makes perfect - Riemann, Elasticsearch and friends

Analyzing sleep patterns :)

* Times are Israel Day Time

Page 45: Logging makes perfect - Riemann, Elasticsearch and friends

Storm topology visualization & timing

Page 46: Logging makes perfect - Riemann, Elasticsearch and friends

Storm timelines

Page 47: Logging makes perfect - Riemann, Elasticsearch and friends

Alerting via anomaly detection

Page 48: Logging makes perfect - Riemann, Elasticsearch and friends

To conclude…

• Fail fast
  – With back-off
• Log everything
• Alerting via state machines
• By host, service, time
• Alerts are useless without proper debug tools
• Try to self-heal when possible.
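One reading of "fail fast, with back-off", sketched in Python. The deck does not show its own retry code, so `backoff_delays` and `call_with_backoff` are hypothetical helpers, and the base, factor and cap values are arbitrary: each failed attempt waits longer than the last, and once the delays run out the error is allowed to propagate.

```python
def backoff_delays(base, factor, cap, attempts):
    """Delays in seconds before each retry: base, base*factor, ... never above `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]

def call_with_backoff(operation, delays, sleep):
    """Try `operation`; on an exception wait the next delay and retry."""
    for delay in delays:
        try:
            return operation()
        except Exception:
            sleep(delay)
    return operation()  # final attempt: any exception propagates to the caller

# Usage with a fake operation that fails twice, and a fake sleep that records the waits.
waits = []
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = call_with_backoff(flaky, backoff_delays(1, 2, 30, 5), sleep=waits.append)
print(result, waits)
```

In real use `sleep` would be `time.sleep`; injecting it here keeps the example instant and testable.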

Page 49: Logging makes perfect - Riemann, Elasticsearch and friends

Thank you

Questions?

Itamar Syn-Hershko http://code972.com

@synhershko / @ForterTech