uc berkeley detecting large-scale system problems by mining console logs wei xu ( 徐葳 )* ling...

56
UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐徐 )* Ling Huang Armando Fox* David Patterson* Michael Jordan* 1 *UC Berkeley Intel Labs Berkeley Jan 18, 2009, MSRA

Upload: maryann-hood

Post on 26-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

1

UC Berkeley

Detecting Large-Scale System Problems by Mining Console Logs

Wei Xu (徐葳 )* Ling Huang†

Armando Fox* David Patterson* Michael Jordan*

*UC Berkeley † Intel Labs Berkeley

Jan 18, 2009, MSRA

Page 2: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

2

Why console logs?

• Detecting problems in large scale Internet services often requires detailed instrumentation

• Instrumentation can be costly to insert & maintain• Often combine open-source building blocks that are not

all instrumented• High code churn

• Can we use console logs in lieu of instrumentation?

+ Easy for developer, so nearly all software has them

– Imperfect: not originally intended for instrumentation

Page 3: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

3

The ugly console log

HODIE NATUS EST RADICI FRATER

today unto the Root a brother is born.

* "that crazy Multics error message in Latin." http://www.multicians.org/hodie-natus-est.html

Page 4: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

4

Result preview

200 nodes, >24 million lines of logs

Abnormal log segments A single page visualization

ParseDetect

Visualize

• Fully automatic process without any manual input

Page 5: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

5

Outline

Key ideas and general methodology• Methodology illustrated with Hadoop case study

• Four step approach• Evaluation

• An extension: supporting online detection• Challenges• Two stage detection• Evaluation

• Related work and future directions

Page 6: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

6

Our approach and contribution

Machine Learning VisualizationParsing

Feature Creation

• A general methodology for processing console logs automatically (SOSP’09)

• A novel method for online detection (ICDM ’09)• Validation on real systems• Working on production data from real companies

Page 7: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

7

Key insights for analyzing logs

• The log contains the necessary information to create features• Identifiers• State variables• Correlations among messages (sequences)

NORMAL

receiving blk_1received blk_1

receiving blk_2

ERROR

• Console logs are inherently structured• Determined by log printing statement

Page 8: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

8

Outline

• Key ideas and general methodology Methodology illustrated with Hadoop case study

• Four step approach• Evaluation

• An extension: supporting online detection• Challenges• Two stage detection• Evaluation

• Related work and future directions

Page 9: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

9

• Non-trivial in object oriented languages• String representation of objects• Sub-classing

• Our Solution• Type inference on source code• Pre-compute all possible message templates

Step 1: Parsing

• Free text → semi-structured text• Basic ideas

Receiving block blk_1

Log.info(“Receiving block ” + blockId);

Receiving block (.*) [blockId]

Type: Receiving blockVariables: blockId(String)=blk_1

Page 10: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

10

Parsing: Scale parsing with map-reduce

0 20 40 60 80 1000

5

10

15

20

Tim

e us

ed (

min

)

Number of nodes

24 Million lines of console logs= 203 nodes * 48 hours

Page 11: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

11

Parsing: Language specific parsers

• Source code based parsers• Java (OO Language)• C / C++ (Macros and complex format strings)• Python (Scripting/dynamic languages)

• Binary based parsers• Java byte code• Linux + Windows binaries using printf format strings

Page 12: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

Step 2: Feature creation - Message count vector

• Identifiers are widely used in logs• Variables that identify objects manipulated by the

program• file names, object keys, user ids

• Grouping by identifiers • Similar to execution traces

• Identifiers can be discovered automatically

12

receiving blk_1

receiving blk_1

received blk_1received blk_1

receiving blk_2

received blk_2

receiving blk_2

Page 13: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■

■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■

Feature creation – Message count vector example

Receiving blk_1

Receiving blk_2

Received blk_2

Receiving blk_1

Received blk_1

Received blk_1

Receiving blk_2

13

0 1 2 0 0 2 0 0 0 0 0 0 0 0 2 2

0 0 1 2 0 0 2 0 0 0 0 0 0 0 12

• Numerical representation of these “traces”• Similar to bag of words model in information retrieval

blk_1

blk_2

Page 14: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

Step 3: Machine learning – PCA anomaly detection

• Most of the vectors are normal• Detecting abnormal vectors

• Principal Component Analysis (PCA) based detection• PCA captures normal patterns in these vectors

• Based on correlations among dimensions of the vectors

14NORMAL

receiving blk_1received blk_1

receiving blk_2

ERROR

0 1 2 0 0 2 0 0 0 0 0 0 0 0 2 2

Page 15: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

15

Evaluation setup

• Experiment on Amazon’s EC2 cloud• 203 nodes x 48 hours• Running standard map-reduce jobs• ~24 million lines of console logs• ~575,000 HDFS blocks

• 575,000 vectors• ~ 680 distinct ones• Manually labeled each distinct cases

• Normal/abnormal• Tried to learn why it is abnormal• For evaluation only

Page 16: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

PCA detection results

16

Anomaly Description Actual Detected1 Forgot to update namenode for deleted block 4297 42972 Write block exception then client give up 3225 32253 Failed at beginning, no block written 2950 29504 Over-replicate-immediately-deleted 2809 27885 Received block that does not belong to any file 1240 12286 Redundant addStoredBlock request received 953 9537 Trying to delete a block, but the block no longer exists on data node 724 6508 Empty packet for block 476 4769 Exception in receiveBlock for block 89 8910 PendingReplicationMonitor timed out 45 4511 Other anomalies 108 107

Total anomalies 16916 16808Normal blocks 558223

Description False Positives1 Normal background migration 13972 Multiple replica ( for task / jobdesc files ) 349

Total 1746False Positives

How can we make the results easy for operators to understand?

Page 17: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

Step 4: Visualizing results with decision tree

17

OK

1

1

1

1

0

0

0

writeBlock # received exception

# Starting thread to transfer block # to #

#: Got exception while serving # to #:#

Unexpected error trying to delete block #\. BlockInfo Not found in volumeMap

addStoredBlock request received for # on # size # But it does not belong to any file

# starting thread to transfer block # to #

#Verification succeeded for #

Receiving block # src: # dest: #

ERROR0

<=2

0

0

0

0

0

>=3

>=1

>=3

>=1

>=1

>=1

>=1

>=1

<=2

ERROR

ERROR

ERROR

ERROR

OK

OK

OK

Page 18: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

18

The methodology is general

• Surveyed a number of software apps• Linux kernel, OpenSSH• Apache, MySQL, Jetty• Cassandra, Nutch

• Our methods apply to virtually all of them• Same methodology worked on multiple

features / ML algorithms• Another feature “state count vector”• Graph visualization

• WIP: Analyzing production logs at Google

Page 19: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

19

Outline

• Key ideas and general methodology• Methodology illustrated with Hadoop case study

• Four step approach• Evaluation

An extension: supporting online detection• Challenges• Two stage detection• Evaluation

• Related work and future directions

Page 20: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

What makes it hard to do online detection?

PCA Detection VisualizationParsing

Feature Creation

Message count vector feature is based on sequences, NOT individual messages

20

Page 21: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

Online detection: when to make detection?

• Cannot wait for the entire trace• Can last arbitrarily long time

• How long do we have to wait?• Long enough to keep correlations• Wrong cut = false positive

• Difficulties• No obvious boundaries• Inaccurate message ordering • Variations in session duration

21

receiving blk_1

received blk_1

reading blk_1

reading blk_1

deleting blk_1

deleted blk_1

receiving blk_1

received blk_1

Time

Page 22: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

Frequent patterns help determine session boundaries

• Key Insight: Most messages/traces are normal• Strong patterns

• “Make common paths fast”• Tolerate noise

22

Page 23: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

23

Two stage detection overview

Frequent pattern based filtering

OK

PCA DetectionOK

ERROR

Dominant cases

Non-patternNormal cases

Real anomalies

Parsing

Free text logs200 nodes

Page 24: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

Stage 1 - Frequent patterns (1): Frequent event sets

24

receiving blk_1

received blk_1reading blk_1

deleting blk_1

deleted blk_1

receiving blk_1

received blk_1

Time

Coarse cut by time

Find frequent item set

Refine time estimation

reading blk_1

error blk_1

Repeat until all patterns found

PCA Detection

Page 25: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

25

Stage 1 - Frequent patterns (2) : Modeling session duration time

• Cannot assume Gaussian• 99.95th percentile estimation is off by half• 45% more false alarms

• Mixture distribution• Power-law tail + histogram head

DurationDuration

Cou

nt

Pr(

X>

=x)

Page 26: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

Frequent pattern matching filters most of the normal events

Frequent pattern based filtering

OK

PCA DetectionOK

ERROR

Dominant cases

Non-patternNormal cases

Real anomalies

Parsing*

Free text logs200 nodes

100%86%

14% 13.97%

0.03%

26

Page 27: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

27

Frequent patterns in HDFS

Frequent Pattern 99.95th percentile Duration

% of messages

Allocate, begin write 13 sec 20.3%

Done write, update metadata 8 sec 44.6%

Delete - 12.5%

Serving block - 3.8%

Read exception - 3.2%

Verify block - 1.1%

Total 85.6%

• Covers most messages• Short durations

(Total events ~20 million)

Page 28: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

28

Detection latency

• Detection latency is dominated by the wait time

Single event pattern

Frequent pattern (matched)

Frequent pattern (timed out)

Non pattern events

Page 29: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

29

Detection accuracy

True Positives

False Positives

False Negatives

Precision Recall

Online 16,916 2,748 0 86.0% 100.0%

Offline 16,808 1,746 108 90.6% 99.3%

(Total trace = 575,319)

• Ambiguity on “abnormal”• Manual labels: “eventually normal”• > 600 FPs in online detection as very long latency• E.g. a write session takes >500sec to complete

(99.99th percentile is 20sec)

Page 30: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

30

Outline

• Key ideas and general methodology• Methodology illustrated with Hadoop case study

• Four step approach• Evaluation

• An extension: supporting online detection• Challenges• Two stage detection• Evaluation

Related work and future directions

Page 31: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

31

Related work: system monitoring and problem diagnosis

Heavier instrumentation

Numerical tracesSystemHistory[Cohen05]Fingerprints[Bodik10][Lakhina04]

Finer granularity data

Statistical debugging[Zheng06]AjaxScope [Kiciman07]

Path-based problem detectionPinpiont [Chen04]

Source/binary •Aspect-oriented programming•Assertions

Replay-basedDeterministic replay [Altekar09]

Configurable Tracing frameworks•Dtrace•Xtrace [Fonseca07]•VM-based tracing [King05]

Event as time series[Hellerstein02][Lim08][Yamanishi05]

Alternative loggingPeerPressure[Wang04]Heat-ray [Dunagan09][Kandula09]

Runtime predicatesD3S [Liu08]

(Passive) data collection•Syslog•SNMP•Chukwa [Boulon08]

Page 32: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

32

Related work: log parsing

• Rule based and scripts• Microsoft log parser, splunk, …

• Frequent pattern based• [Vaarandi04] R. Vaarandi. A breadth-first algorithm for mining frequent patterns from event

logs. In proceedings of INTELLCOMM, 2004.

• Hierarchical• [Fu09] Q. Fu, et al. Execution Anomaly Detection in Distributed Systems through

Unstructured Log Analysis. In Proc. of ICDM 2009• [Makanju09] A. A. Makanju, et al. Clustering event logs using iterative partitioning. In Proc.

of KDD ’09• [Fisher08] K. Fisher, et al. From dirt to shovels: fully automatic tool generation from ad hoc

data. In Proc. of POPL ’08

• Key benefit of using source code• Handles rare messages, which is likely to be more

useful in problem detection

Page 33: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

33

Related work: modeling logs

• As free texts• [Stearley04] J. Stearley. Towards informatic analysis of syslogs. In Proc. of CLUSTER,

2004.

• Visualize global trends• [CretuCioCarlie08] G. Cretu-Ciocarlie et al, Hunting for problems with Artemis, In Proc of

WASL 2008• [Tricaud08] S. Tricaud, Picviz: Finding a Needle in a Haystack, In Proc. of WASL 2008

• A single event stream (time series)• [Ma01] S. Ma, et.al.. Mining partially periodic event patterns with unknown periods. In

Proceedings of the 17th International Conference on Data Engineering, 2001. • [Hellerstein02] J. Hellerstein, et.al. Discovering actionable patterns in event data, 2002.• [Yamanishi05] K. Yamanishi, et al. Dynamic syslog mining for network failure monitoring. In

Proceedings of ACM KDD, 2005.

• Multiple streams / state transitions• [Tan08] J. Tan, et al. SALSA: Analyzing Logs as State Machines. In Proc. of WASL 2008• [Fu09] Q. Fu, et al. Execution Anomaly Detection in Distributed Systems through

Unstructured Log Analysis. In Proc. of ICDM 2009

Page 34: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

34

Related work: techniques that inspired this project

• Path based analysis• [Chen04] M. Y. Chen, et al. Path-based failure and evolution management. In Proceedings

of NSDI, 2004. • [Fonseca07] R. Fonseca, et.al. Xtrace: A pervasive network tracing framework. In In NSDI,

2007.

• Anomaly detection and kernel methods• [Gartner03] T. Gartner. A survey of kernels for structured data. SIGKDD Explore, 2003.• [Twining03] C. J. Twining, et al. The use of kernel principal component analysis to model

data distributions. Pattern Recognition, 36(1):217–227, January 2003.

• Vector space information retrieval• [Papineni 01] K. Papineni. Why inverse document frequency? In NAACL, 2001.• [Salton87] G. Salton et al. Term weighting approaches in automatic text retrieval. Cornell

Technical report, 1987.

• Using source code as alternative info• [Tan07] L. Tan, et.al.. /*icomment: bugs or bad comments?*/. In Proceedings of ACM SOSP,

2007.

Page 35: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

35

Future directions: console log mining

• Distributed log stream processing• Handle large scale cluster + partial failures

• Integration with existing monitoring systems• Allows existing detection algorithm to apply directly

• Logs from the entire software stack• Handling logs across software versions• Allowing feedback from operators• Providing suggestions on improving logging

practice

Page 36: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

36

Beyond console logs: textual information in SW dev-operation

Documentation

Comments

Bug tracking

Version control logs

Unit testing

Operation docs

Scripts

Configurations

Problem tickets

Makefiles

Versioning diffs

Continuous integration

Profiling Monitoring data

Platform counters

Console logs

Core dumps

CodePaper

Page 37: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

37

Summary

http://www.cs.berkeley.edu/~xuw/Wei Xu <[email protected]>

Machine Learning VisualizationParsing

Feature Creation

• A general methodology for processing console logs automatically (SOSP’09)

• A novel method for online detection (ICDM ’09)• Validation on real systems

Page 38: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

38

Backup slides

Page 39: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

39

Rare messages are important

Receiving block blk_100 src: … dest:...

Receiving block blk_200 src: … dest:...

Receiving block blk_100 src: …dest:…

Received block blk_100 of size 49486737 from …

Received block blk_100 of size 49486737 from …

Pending Replication Monitor timeout block blk_200

Page 40: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

40

Compare to natural language (semantics) based approach

• We use console log solely as a trace of program execution

• Semantic meanings sometimes misleading• Does “Error” really mean it?• Missing link between the message and the actual

program execution

• Hard to completely eliminate manual search

Page 41: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

41

Compare to text based parsing

• Existing work on inferring structure of log messages without source codes• Patterns in strings• Words weighing • Learning “grammar” of logs• Pros -- Don’t need source code analysis• Cons -- Unable to handle rare messages

• 950 possible, only <100 appeared in our log

• Per-operation features require highly accurate parsing

Page 42: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

42

Better logging practices

• We do not require developers to change their logging habits

• Use logging• Traces vs. error reporting• Distinguish your messages from others’

• printf(“%d”, i);

• Use a logging framework• Levels, timestamps etc.

• Use correct levels (think in a system context)• Unnecessary WARNINGs and ERRORs

Page 43: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

43

Ordering of events

• Message count vectors do not capture any ordering/timing information

• Cannot guarantee the global ordering on distributed systems– Needs significant change to the logging infrastructure

Page 44: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

44

Too many alarms -- what can operators do with them?

• Operators see clusters of alarms instead of individual ones• Easy to cluster similar alarms• Most of the alarms are the same

• Better understanding on problems• Deterministic bugs in the systems

• Provides more informative bug reports • Either fix the bug or the log

• Due to transient failures• Clustered around single nodes/subnets/time

• Better understanding of normal patterns

Page 45: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

45

Case studies

• Surveyed a number of software apps• Linux kernel, OpenSSH• Apache, MySQL, Jetty• Hadoop, Cassandra, Nutch• Our methods apply to virtually all of them

Page 46: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

46

Log parsing process

Page 47: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

Step 1: Log parsingChallenges in object oriented languages

47

starting: (.*) [transact][Transaction] [Participant.java:345]

xact (.*) is (.*) [tid, state] [int, String]

LOG.info(“starting:” + transact);

starting: xact 325 is PREPARINGstarting: xact 325 is STARTING at Node1:1000

starting: xact 325 is PREPARING

starting: (.*) [transact][Transaction] [Participant.java:345]

xact (.*) is (.*) at (.*) [tid, state, node] [int, String, Node]

starting: xact 325 is STARTING at Node1:1000

Page 48: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

48

Intuition behind PCA

Page 49: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

49

Another decision tree

Page 50: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

50

Existing work on textual log analysis

• Frequent item set mining– R. Vaarandi. SEC - a lightweight event correlation tool. IEEE Workshop on IP Operations

and Management, 2002.–

• Temporal properties analysis– Y. Liang, A. Sivasubramaniam, and J. Moreira. Filtering failure logs for a bluegene/L

prototype. In Proceedings of IEEE DSN, 2005. – C. Lim, et.al. A log mining approach to failure analysis of enterprise telephony systems. In

Proceedings of IEEE DSN, 2008.

• Statistical modeling of logs– R. K. Sahoo, et.al.. Critical event prediction for proactive management in largescale

computer clusters. In Proceedings of ACM KDD, 2003.

Page 51: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

51

Real world logging tools

• Generation• Log4j, Jakarta commons logging, logback, SLF4J, ...• Standard web servers log format (W3C, Apache)

• Collection - syslog and beyond• syslog-ng (better availability)• nsyslog, socklog… (better security)

• Analysis– Query languages for logs

• Microsoft log parser• SQL and continuous query / Complex events processing / GUI – Talend Open Studio

– Manually specified rules• A. Oliner and J. Stearley. What supercomputers say: A study of five system logs. In

Proceedings of IEEE DSN, 2007.• logsurfer, Swatch, OSSEC – an open source host intrusion detection system

– Searching the logs• Splunk

– Log visualization• Syslog watcher …

Page 52: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

52

• [1] R. S. Barga, J. Goldstein, M. Ali, and M. Hong. Consistent streaming through time: A vision for event stream processing. In Proceedings of CIDR, volume 2007, pages 363–374, 2007.

• [2] D. Borthakur. The hadoop distributed file system: Architecture and design. Hadoop Project Website, 2007.

• [3] R. Dunia and S. J. Qin. Multi-dimensional fault diagnosis using a subspace approach. In Proceedings of American Control Conference, 1997.

• [4] R. Feldman and J. Sanger. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge Univ. Press, 12 2006.

• [5] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica. Xtrace: A pervasive network tracing framework. In In NSDI, 2007.

• [6] T. G¨artner. A survey of kernels for structured data. SIGKDD Explor. Newsl.,5(1):49–58, 2003.• [7] T. G¨artner, J.W. Lloyd, and P. A. Flach. Kernels and distances for structured data. Mach.

Learn., 57(3):205–232, 2004.• [8] M. Gilfix and A. L. Couch. Peep (the network auralizer): Monitoring your network with sound.

In LISA ’00: Proceedings of the 14th USENIX conference on System administration, pages 109–118, Berkeley, CA, USA, 2000. USENIX Association.

• [9] L. Girardin and D. Brodbeck. A visual approach for monitoring logs. In LISA ’98: Proceedings of the 12th USENIX conference on System administration, pages 299–308, Berkeley, CA, USA, 1998. USENIX Association.

• [10] C. Gulcu. Short introduction to log4j, March 2002. http://logging.apache.org/log4j.• [11] J. Hellerstein, S. Ma, and C. Perng. Discovering actionable patterns in event data, 2002.

Page 53: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

53

• [12] J. E. Jackson and G. S. Mudholkar. Control procedures for residuals associated with principal component analysis. Technometrics, 21(3):341–349, 1979.

• [13] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In SIGCOMM ’04: Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications, pages 219–230, New York, NY, USA, 2004. ACM.

• [14] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. D. Kolaczyk, and N. Taft. Structural analysis of network traffic flows. In SIGMETRICS ’04/Performance’04: Proceedings of the joint international conference on Measurement and modeling of computer systems, pages 61–72, New York, NY, USA, 2004. ACM.

• [15] Y. Liang, A. Sivasubramaniam, and J. Moreira. Filtering failure logs for a bluegene/L prototype. In DSN ’05: Proceedings of the 2005 International Conference on Dependable Systems and Networks, pages 476–485, Washington, DC, USA, 2005. IEEE Computer Society.

• [16] C. Lim, N. Singh, and S. Yajnik. A log mining approach to failure analysis of enterprise telephony systems. In Proceedings of International conference on dependable systems and networks, June 2008.

• [17] S. Ma and J. L. Hellerstein. Mining partially periodic event patterns with unknown periods. In Proceedings of the 17th International Conference on Data Engineering, pages 205–214, Washington, DC, USA, 2001. IEEE Computer Society.

• [18] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler. Yale: Rapid prototyping for complex data mining tasks. In L. Ungar, M. Craven, D. Gunopulos, and T. Eliassi-Rad, editors, KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 935–940, New York, NY, USA, August 2006. ACM.

Page 54: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

54

• [19] A. Oliner and J. Stearley. What supercomputers say: A study of five system logs. In DSN ’07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 575–584, Washington, DC, USA, 2007. IEEE Computer Society.

• [20] J. E. Prewett. Analyzing cluster log files using logsurfer. In in Proceedings of the 4th Annual Conference on Linux Clusters, 2003.

• [21] R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical event prediction for proactive management in largescale computer clusters. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 426–435, New York, NY, USA, 2003. ACM.

• [22] J. Stearley. Towards informatic analysis of syslogs. In CLUSTER ’04: Proceedings of the 2004 IEEE International Conference on Cluster Computing, pages 309–318, Washington, DC, USA, 2004. IEEE Computer Society.

• [23] J. Stearley and A. J. Oliner. Bad words: Finding faults in spirit’s syslogs. In CCGRID ’08: Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pages 765–770, Washington, DC, USA, 2008. IEEE Computer Society.

• [24] T. Takada and H. Koike. Tudumi: information visualization system for monitoring and auditing computer logs. pages 570–576, 2002.

• [25] C. J. Twining and C. J. Taylor. The use of kernel principal component analysis to model data distributions. Pattern Recognition, 36(1):217–227, January 2003.

• [26] R. Vaarandi. Sec - a lightweight event correlation tool. IP Operations and Management, 2002 IEEE Workshop on, pages 111–115, 2002.

• [27] R. Vaarandi. A data clustering algorithm for mining patterns from event logs. IP Operations and Management, 2003. (IPOM 2003). 3rd IEEE Workshop on, pages 119–126, Oct. 2003.

Page 55: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

55

• [28] R. Vaarandi. A breadth-first algorithm for mining frequent patterns from event logs. In F. A. Aagesen, C. Anutariya, and V.Wuwongse, editors, INTELLCOMM, volume 3283 of Lecture Notes in Computer Science, pages 293–308. Springer, 2004.

• [29] I. H. Witten and E. Frank. Data mining: practical machine learning tools and techniques with Java implementations. Morgan Kaufmann Publishers Inc., 2000.

• [30] K. Yamanishi and Y. Maruyama. Dynamic syslog mining for network failure monitoring. In KDD ’05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 499–508, New York, NY, USA, 2005. ACM.

• [31] P. Bodik, G. Friedman, L. Biewald, H. Levine, G. Candea, K. Patel, G. Tolle, J. Hui, A. Fox, M. I. Jordan, and D. Patterson. Combining visualization and statistical analysis to improve operator confidence and efficiency for failure detection and localization. volume 0, pages 89–100, Los Alamitos, CA, USA, 2005. IEEE Computer Society.

• [32] M. Y. Chen, A. Accardi, E. Kiciman, J. Lloyd, D. Patterson, A. Fox, and E. Brewer. Path-based faliure and evolution management. In NSDI’04: Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation, pages 23–23, Berkeley, CA, USA, 2004. USENIX Association.

• [33] I. Cohen, S. Zhang, M. Goldszmidt, J. Symons, T. Kelly, and A. Fox. Capturing, indexing, clustering, and retrieving system history. SIGOPS Oper. Syst. Rev.,39(5):105–118, 2005.

• [34] E. Kiciman and B. Livshits. Ajaxscope: a platform for remotely monitoring the client-side behavior of web 2.0 applications. In SOSP ’07: Proceedings of twentyfirst ACM SIGOPS symposium on Operating systems principles, pages 17–30, New York, NY, USA, 2007. ACM.

Page 56: UC Berkeley Detecting Large-Scale System Problems by Mining Console Logs Wei Xu ( 徐葳 )* Ling Huang † Armando Fox* David Patterson* Michael Jordan* 1 *UC

56

• [35] L. Tan, D. Yuan, G. Krishna, and Y. Zhou. /*icomment: bugs or bad comments?*/. In SOSP ’07: Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles, pages 145–158, New York, NY, USA, 2007. ACM.

• [36] K. Papineni. Why inverse document frequency? In NAACL ’01: Second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies 2001, pages 1–8, Morristown, NJ, USA, 2001. Association for Computational Linguistics.

• [37] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Technical report, Ithaca, NY, USA, 1987.

• Samuel T. King, George W. Dunlap, Peter M. Chen, "Debugging operating systems with time-traveling virtual machines", Proceedings of the 2005 Annual USENIX Technical Conference , April 2005 (presentation by Sam King). Best Paper Award.

• [Dunagan09] J. Dunagan, et al. Heat-ray: Combating Identity Snowball Attacks using Machine Learning, Combinatorial Optimization and Attack Graphs, In Proc. of SOSP 2009

• [Kandula09] Detailed Diagnosis in Enterprise Networks   Srikanth Kandula (Microsoft Research), Ratul Mahajan (Microsoft Research), Patrick Verkaik (UC San Diego), Sharad agarwal (Microsoft Research), Jitu Padhye (Microsoft Research), Paramvir Bahl (Microsoft Research), In Proc. of SigComm 2009

• [Kiciman07] Emre Kiciman and Benjamin Livshits, AjaxScope: A Platform for Remotely Monitoring the Client-side Behavior of Web 2.0 Applications. In Proc. of SOSP 2007

• [King05] Samuel T. King, George W. Dunlap, Peter M. Chen, "Debugging operating systems with time-traveling virtual machines", Proceedings of the 2005 Annual USENIX Technical Conference , April 2005 (presentation by Sam King).