data mining has emerged as a critical tool for knowledge discovery in large data sets. it has been...
TRANSCRIPT
OUTLINE
1) MINING SCIENTIFIC DATA-SETS
2) USING GRAPHS AND MINING THEM
3) PATTERN DISCOVERY IN GRAPHS
4) GRAPH BASED INDUCTION
5) REFERENCES
6) CONCLUSION
DATA MINING IN SCIENTIFIC DOMAIN Data mining has emerged as a critical tool for
knowledge discovery in large data sets.• It has been extensively used to analyze business,
financial, and textual data sets.
The success of these techniques has renewed interest in applying them to various scientific and engineering fields. • Astronomy• Life Sciences• Ecosystem Modeling• Structural Mechanics• …
CHALLENGES IN SCIENTIFIC DATA MINING
Most of existing data mining algorithms assume that the data is represented via• Transactions (set of items)
• Sequence of items or events
• Multi-dimensional vectors
• Time series
Scientific datasets with structures, layers, hierarchy, geometry, and arbitrary relations can not be accurately modeled using such frameworks. • e.g., Numerical simulations, 3D protein structures, chemical
compounds, etc.
Need algorithms that operate on scientific datasets in their native representation
HOW TO MODEL SCIENTIFIC DATA MINING
There are two basic choices.• Treat each dataset/application differently and
develop custom representations/algorithms.• Employ a new way of modeling such datasets and
develop algorithms that span across different applications!
What should be the properties of this general modeling framework?• Abstract compared with the original raw data.• Yet powerful enough to capture the important
characteristics.
Labeled directed/undirected topological/geometric graphs and hyper graphs
ROADMAP
1) MINING SCIENTIFIC DATA-SETS
2) USING GRAPHS AND MINING THEM
3) PATTERN DISCOVERY IN GRAPHS
4) GRAPH BASED INDUCTION
5) REFERENCES
6) CONCLUSION
MODELING DATA WITH GRAPHS
Graphs are suitable for capturing arbitrary relations between the various elements.
VertexElement
Element’s Attributes
Relation BetweenTwo Elements
Type Of Relation
Vertex Label
Edge Label
Edge
Data Instance Graph Instance
Relation between a Set of Elements
Hyper Edge
Provide enormous flexibility for modeling the underlying data as they allow the modeler to decide on what the elements should be and the type of relations to be
modeled
EXAMPLE PROTEIN 3D STRUCTURE
PDB; 1MWP N-Terminal Domain Of The Amyloid Precursor ProteinAlzheimer's disease amyloid A4 protein precursor
β
α
β
β
β
β
α
β
β
Backbone
Contact
GRAPH MINING
GOAL
Develop algorithms to mine and analyze graph data sets.
• Finding patterns in these graphs
• Finding groups of similar graphs (clustering)
• Building predictive models for the graphs (classification)
GRAPH MINING
APPLICATIONS
• Structural motif discovery• High-throughput screening• Protein fold recognition• VLSI reverse engineering• A lot more …
Beyond Scientific ApplicationsSemantic webMining relational profilesBehavioral modelingIntrusion detectionCitation analysis…
GRAPH MINING
Approach #1: Frequent Subgraph Mining
Find all subgraphs g within a set of graph transactions G such that
where t is the minimum supportFocus on pruning and fast, code-based graph matching
tG
gfreq
||
)(
FREQUENT SUBGRAPH MINING
Approach #1: Algorithms• Apriori-based Graph Mining (AGM)
Inokuchi, Washio & Motoda (Osaka U., Japan)• Frequent Sub-Graph discovery (FSG)
Kuramochi & Karypis (U. Minnesota)• Graph-based Substructure pattern mining (gSpan)
Yan & Han (UIUC)• Fast Frequent Subgraph Mining (FFSM), Spanning tree
based maximal graph mining (Spin) Huan, Wang & Prins (UNC Chapel Hill)
• Graph, Sequences and Tree extraction (Gaston) Kazius & Nijssen (U. Leiden, Netherlands)
OUTLINE
1) MINING SCIENTIFIC DATA-SETS
2) USING GRAPHS AND MINING THEM
3) PATTERN DISCOVERY IN GRAPHS
4) GRAPH BASED INDUCTION
5) REFERENCES
6) CONCLUSION
FINDING FREQUENT PATTERNS IN GRAPHS
A pattern is a relation between the object’s elements that is recurring over and over again.• Common structures in a family of chemical compounds
or proteins.• Similar arrangements of vortices at different “instances”
of turbulent fluid flows.• …
There are two general ways to formally define a pattern in the context of graphs
Arbitrary subgraphs (connected or not)Induced subgraphs (connected or not)
Frequent pattern discovery translates to frequent subgraph discovery…
COMPUTATIONAL CHALLENGES
Candidate generation
Candidate pruning
Frequency counting
Key to FSG’s computational efficiency
Simple operations become complicated & expensive when dealing with graphs…
CANDIDATE GENERATION BASED ON CORE DETECTION
First Core
Second Core
First Core Second Core
Multiple cores between two
(k-1)-subgraphs
CANONICAL LABELING
1
1
1111
11
111
1
5
4
3
2
1
0
543210
Av
Av
Bv
Bv
Bv
Bv
AABBBB
vvvvvv
1
1
1
11
111
1111
0
5
4
2
1
3
054213
Bv
Av
Av
Bv
Bv
Bv
BAABBB
vvvvvv
v0 B
v1B
v2 B
v3B
v4A
v5A
Label = “1 01 011 0001 00010”
Label = “1 11 100 1000 01000”
GRAPH CLASSIFICATION APPROACH
Discover FrequentSub-graphs
1 Select DiscriminatingFeatures
2
Learn a ClassificationModel
4 Transform Graphs
in Feature Representation
3
Graph Databases
OUTLINE
1) MINING SCIENTIFIC DATA-SETS
2) USING GRAPHS AND MINING THEM
3) PATTERN DISCOVERY IN GRAPHS
4) GRAPH BASED INDUCTION
5) REFERENCES
6) CONCLUSION
GRAPH MINING
Approach #2:• Find subgraph S within a set of one or more graphs
G that maximally compresses G
• where (G|S) is G compressed by S, i.e., instances of S in G replaced by single vertex
Focus on efficient subgraph generation and heuristic search
)|()(
)(maxarg
SGsizeSsize
GsizeS
S
CONCLUSIONS
Graphs provide a powerful mechanism to represent relational and physical datasets.
Can be used as a quick prototyping tool to test out whether or not data-mining techniques can help a particular application area and problem.
Their benefits can be realized if there exists an extensive set of efficient and scalable algorithms to mine them…
REFERENCES
Takashi Matsuda, Hiroshi Motoda, Takashi Washio, Graph-based induction and its applications, Advanced Engineering Informatics, Volume 16, Issue 2, April 2002, Pages 135-143.
Michihiro Kuramochi, George Karypis, "Frequent Subgraph Discovery," Data Mining, IEEE International Conference on, pp. 313, First IEEE International Conference on Data Mining (ICDM'01), 2001.