data mining has emerged as a critical tool for knowledge discovery in large data sets. it has been...

26
CS 790g COMPLEX NETWORKS FALL 2009 HAKAN KARDES MINING SCIENTIFIC DATA SETS USING GRAPHS

Upload: devonte-woodmancy

Post on 14-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

CS 790g COMPLEX NETWORKS

FALL 2009

HAKAN KARDES

MINING SCIENTIFIC DATA SETSUSING GRAPHS

OUTLINE

1) MINING SCIENTIFIC DATA-SETS

2) USING GRAPHS AND MINING THEM

3) PATTERN DISCOVERY IN GRAPHS

4) GRAPH BASED INDUCTION

5) REFERENCES

6) CONCLUSION

DATA MINING IN SCIENTIFIC DOMAIN Data mining has emerged as a critical tool for

knowledge discovery in large data sets.• It has been extensively used to analyze business,

financial, and textual data sets.

The success of these techniques has renewed interest in applying them to various scientific and engineering fields. • Astronomy• Life Sciences• Ecosystem Modeling• Structural Mechanics• …

CHALLENGES IN SCIENTIFIC DATA MINING

Most of existing data mining algorithms assume that the data is represented via• Transactions (set of items)

• Sequence of items or events

• Multi-dimensional vectors

• Time series

Scientific datasets with structures, layers, hierarchy, geometry, and arbitrary relations can not be accurately modeled using such frameworks. • e.g., Numerical simulations, 3D protein structures, chemical

compounds, etc.

Need algorithms that operate on scientific datasets in their native representation

HOW TO MODEL SCIENTIFIC DATA MINING

There are two basic choices.• Treat each dataset/application differently and

develop custom representations/algorithms.• Employ a new way of modeling such datasets and

develop algorithms that span across different applications!

What should be the properties of this general modeling framework?• Abstract compared with the original raw data.• Yet powerful enough to capture the important

characteristics.

Labeled directed/undirected topological/geometric graphs and hyper graphs

ROADMAP

1) MINING SCIENTIFIC DATA-SETS

2) USING GRAPHS AND MINING THEM

3) PATTERN DISCOVERY IN GRAPHS

4) GRAPH BASED INDUCTION

5) REFERENCES

6) CONCLUSION

MODELING DATA WITH GRAPHS

Graphs are suitable for capturing arbitrary relations between the various elements.

VertexElement

Element’s Attributes

Relation BetweenTwo Elements

Type Of Relation

Vertex Label

Edge Label

Edge

Data Instance Graph Instance

Relation between a Set of Elements

Hyper Edge

Provide enormous flexibility for modeling the underlying data as they allow the modeler to decide on what the elements should be and the type of relations to be

modeled

EXAMPLE PROTEIN 3D STRUCTURE

PDB; 1MWP N-Terminal Domain Of The Amyloid Precursor ProteinAlzheimer's disease amyloid A4 protein precursor

β

α

β

β

β

β

α

β

β

Backbone

Contact

GRAPH MINING

GOAL

Develop algorithms to mine and analyze graph data sets.

• Finding patterns in these graphs

• Finding groups of similar graphs (clustering)

• Building predictive models for the graphs (classification)

GRAPH MINING

APPLICATIONS

• Structural motif discovery• High-throughput screening• Protein fold recognition• VLSI reverse engineering• A lot more …

Beyond Scientific ApplicationsSemantic webMining relational profilesBehavioral modelingIntrusion detectionCitation analysis…

GRAPH MINING

Approach #1: Frequent Subgraph Mining

Find all subgraphs g within a set of graph transactions G such that

where t is the minimum supportFocus on pruning and fast, code-based graph matching

tG

gfreq

||

)(

FREQUENT SUBGRAPH MINING

Approach #1: Algorithms• Apriori-based Graph Mining (AGM)

Inokuchi, Washio & Motoda (Osaka U., Japan)• Frequent Sub-Graph discovery (FSG)

Kuramochi & Karypis (U. Minnesota)• Graph-based Substructure pattern mining (gSpan)

Yan & Han (UIUC)• Fast Frequent Subgraph Mining (FFSM), Spanning tree

based maximal graph mining (Spin) Huan, Wang & Prins (UNC Chapel Hill)

• Graph, Sequences and Tree extraction (Gaston) Kazius & Nijssen (U. Leiden, Netherlands)

OUTLINE

1) MINING SCIENTIFIC DATA-SETS

2) USING GRAPHS AND MINING THEM

3) PATTERN DISCOVERY IN GRAPHS

4) GRAPH BASED INDUCTION

5) REFERENCES

6) CONCLUSION

FINDING FREQUENT PATTERNS IN GRAPHS

A pattern is a relation between the object’s elements that is recurring over and over again.• Common structures in a family of chemical compounds

or proteins.• Similar arrangements of vortices at different “instances”

of turbulent fluid flows.• …

There are two general ways to formally define a pattern in the context of graphs

Arbitrary subgraphs (connected or not)Induced subgraphs (connected or not)

Frequent pattern discovery translates to frequent subgraph discovery…

COMPUTATIONAL CHALLENGES

Candidate generation

Candidate pruning

Frequency counting

Key to FSG’s computational efficiency

Simple operations become complicated & expensive when dealing with graphs…

VALIDATE GENERATION BASED ON CORE DETECTION

Multiple candidates for the same core!

CANDIDATE GENERATION BASED ON CORE DETECTION

First Core

Second Core

First Core Second Core

Multiple cores between two

(k-1)-subgraphs

CANONICAL LABELING

1

1

1111

11

111

1

5

4

3

2

1

0

543210

Av

Av

Bv

Bv

Bv

Bv

AABBBB

vvvvvv

1

1

1

11

111

1111

0

5

4

2

1

3

054213

Bv

Av

Av

Bv

Bv

Bv

BAABBB

vvvvvv

v0 B

v1B

v2 B

v3B

v4A

v5A

Label = “1 01 011 0001 00010”

Label = “1 11 100 1000 01000”

GRAPH CLASSIFICATION APPROACH

Discover FrequentSub-graphs

1 Select DiscriminatingFeatures

2

Learn a ClassificationModel

4 Transform Graphs

in Feature Representation

3

Graph Databases

OUTLINE

1) MINING SCIENTIFIC DATA-SETS

2) USING GRAPHS AND MINING THEM

3) PATTERN DISCOVERY IN GRAPHS

4) GRAPH BASED INDUCTION

5) REFERENCES

6) CONCLUSION

GRAPH MINING

Approach #2:• Find subgraph S within a set of one or more graphs

G that maximally compresses G

• where (G|S) is G compressed by S, i.e., instances of S in G replaced by single vertex

Focus on efficient subgraph generation and heuristic search

)|()(

)(maxarg

SGsizeSsize

GsizeS

S

GRAPH BASED INDUCTION (GBI)

THE BASIC IDEA BEHIND THE GBI

GRAPH BASED INDUCTION (GBI)

PAIRWISE CHUNKING

CONCLUSIONS

Graphs provide a powerful mechanism to represent relational and physical datasets.

Can be used as a quick prototyping tool to test out whether or not data-mining techniques can help a particular application area and problem.

Their benefits can be realized if there exists an extensive set of efficient and scalable algorithms to mine them…

REFERENCES

Takashi Matsuda, Hiroshi Motoda, Takashi Washio, Graph-based induction and its applications, Advanced Engineering Informatics, Volume 16, Issue 2, April 2002, Pages 135-143.

Michihiro Kuramochi, George Karypis, "Frequent Subgraph Discovery," Data Mining, IEEE International Conference on, pp. 313, First IEEE International Conference on Data Mining (ICDM'01), 2001.

THANK YOU FORYOUR ATTENTION

ANY QUESTIONS?