大規模幾何データからの高速な極大部分グラフ発見 efficient maximal pattern...


Post on 13-Jan-2016


DESCRIPTION

Efficient Maximal Pattern Discovery from Massive Geometric Graphs. Hiroki Arimura (Graduate School of Information Science and Technology, Hokkaido University), Takeaki Uno (National Institute of Informatics), Shinichi Shimozono (Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology). This work is partly supported by MEXT Grant-in-Aid for Scientific Research for Specially Promoted Research on “Semi-structured Data Mining”.

TRANSCRIPT

Page 1: Efficient Maximal Pattern Discovery from Massive Geometric Graphs

Hiroki Arimura, Graduate School of Information Science and Technology, Hokkaido University

Takeaki Uno, National Institute of Informatics

Shinichi Shimozono, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology

This work is partly supported by MEXT Grant-in-Aid for Scientific Research for Specially Promoted Research on “Semi-structured Data Mining”

Page 2

Backgrounds

Rapid growth in both the amount and the variety of nonstandard datasets in scientific, spatial, and relational domains.

There is increasing demand for efficient methods to extract useful patterns and rules from weakly structured datasets.

Graph mining …

Page 3

Graph mining

Finding interesting subgraphs appearing in an input collection of labeled graphs.

One of the most promising approaches for knowledge discovery from weakly structured datasets.

The most popular approach is frequent subgraph mining [Inokuchi et al. 2000], but it often generates a huge number of redundant subgraphs, which badly degrades both efficiency and comprehensibility.

How to cope with this problem ...

Page 4

Knowledge Discovery from Geometric Data

Network data with geometric information

◦ Chemical compounds with 2D or 3D information on their atoms and edges [Kuramochi and Karypis, ICDE’02]

◦ City maps with infrastructure information: Geographic Information Systems (GIS)

◦ VLSI layouts with chips and wires

Geometric graphs ...

Page 5

Geometric matching

P matches Q iff P is geometrically isomorphic to a subgraph of Q

Defined through the invariance under a class of “rigid” geometric transformations and graph isomorphism.

[Figure: a geograph Q with vertices labeled A and edges labeled g, drawn on x-y axes (x from 1.0 to 3.0, y from 1.0 to 2.0), next to a smaller pattern P with vertices labeled A and edges labeled g.]
Page 6

Maximal pattern discovery problem

A maximal pattern is a geometric graph which is not included in any properly larger subgraph having the same set of occurrences in D.

The maximal subgraph mining problem asks to find all maximal patterns (closed patterns) appearing in a given input geometric graph D without repetition.

The set M of all maximal patterns is expected to be much smaller than the set F of all frequent patterns.

Page 7

Difficulties in maximal pattern mining

A number of efficient maximal pattern algorithms have been proposed for sets, sequences, and graphs [3, 9, 20, 22, 25].

◦ Some algorithms use explicit duplicate detection and a maximality test against the collection of already discovered patterns.

◦ These approaches require large memory and long delays, and make it difficult to use efficient search techniques such as depth-first search.

Open problem: output-polynomial time computability of the maximal pattern problem for the class of geometric graphs.

Page 8

Related works: Graph mining

Frequent subgraph mining:

◦ AGM [Inokuchi, Washio, Motoda, PKDD’00]
◦ TreeMiner [Zaki, KDD’02]
◦ Freqt [Asai et al., SDM’02]
◦ NK [Nijssen & Kok, MGTS’03]

Maximal/closed subgraph mining:

◦ CloseGraph [Yan & Han, KDD’03]
◦ CMTreeMiner [Chi, Yang, Xia, Muntz, PAKDD’04]
◦ Dryade [Termier, Rousset, Sebag, ICDM’04]
◦ CloATT [Arimura & Uno, ILP’05]

Combination with machine learning:

◦ XRules [Zaki & Aggarwal, KDD’03]
◦ Weighted Substructure Mining [Tsuda & Kudo, ICML’06]

Page 9

Related works: Maximal/Closed pattern mining

1. The first: flexible patterns

◦ Classes of “elastic” or “flexible” patterns
◦ Polynomial delay and space algorithms have been developed where a very simple “reverse search property” holds
◦ CMTreeMiner [Chi et al., PAKDD’04], BIDE [Wang & Han, ICDE’04], and MaxFlex [Arimura & Uno, LLLL’07]

2. The second: rigid patterns

◦ Deal with mining of “rigid” patterns which have …
◦ Polynomial delay and space algorithms based on the existence of least general generalization or closure-like operations
◦ LCM [Uno et al., FIMI’03, ’04, DS’04] proposes ppc-extension for maximal sets, followed by CloATT [Arimura & Uno, ILP’05] and MaxMotif [Arimura & Uno, ISAAC’05]

3. The third: others

◦ Heuristic algorithms
◦ CloseGraph [25]: frequent pattern discovery augmented with a maximality test and duplicate detection
◦ Difficult to achieve output-polynomial time computability

Page 10

Def: Enumeration Algorithms

Efficient data mining algorithm = output-polynomial time algorithm

Output-polynomial (OUT-POLY): total time is poly(Input, Output)

Polynomial-time enumeration (POLY-ENUM): amortized delay is poly(Input), or, equivalently, total time is Output · poly(Input)

Polynomial-delay (POLY-DELAY): maximum delay is poly(Input)

+ Polynomial-space (POLY-SPACE)

[Figure: timeline of an enumeration run, labeled with the input and its size, the output size M, the delay D between consecutive outputs, and the total time T.]
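The three time classes are nested, which is why polynomial delay is the strongest of the guarantees listed here. The following is a short derivation sketch, assuming N denotes the input size, M the number of outputs, D the maximum delay (counting the time before the first output and after the last one), and T the total time; the symbol N and the polynomials p and q are introduced here for illustration only.

```latex
\begin{align*}
\text{POLY-DELAY:}\quad & D \le p(N) \\
\Rightarrow\ \text{POLY-ENUM:}\quad & T \le (M+1)\,D \le (M+1)\,p(N) \\
\Rightarrow\ \text{OUT-POLY:}\quad & T \le q(N, M), \qquad q(N, M) = (M+1)\,p(N)
\end{align*}
```

The converse implications do not hold in general: an output-polynomial algorithm may spend almost all of its time before emitting the first pattern.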

Page 11

Algorithm MaxGeo

A time- and space-efficient algorithm for mining all maximal geometric subgraphs

Depth-first search over the space of all maximal geometric subgraphs

◦ To do this ...

Achieves polynomial delay and polynomial space for the first time

Page 12

We develop techniques...

A polynomial-time computable canonical code for all geometric graphs which is invariant under geometric transformations.

Characterization of M by the intersection operation (the least general generalization), and then a polynomial-time computable closure operation for geographs

The tree-shaped search route T for all maximal patterns in G

A new pattern growth technique combining reverse search and closure extension

Page 13

Main result

Theorem:

◦ Given an input geometric graph D, algorithm MaxGeo enumerates all frequent maximal patterns P in M without duplicates in O(m(m+n)||D||^2 log ||D||) = O(n^8 log n) time per pattern and in O(m) = O(n^2) space,

◦ where m is the maximum number of occurrences of a pattern other than trivial patterns, n is the number of vertices in D, and ||D|| is the number of vertices and edges in D.

Corollary:

◦ The maximal pattern enumeration problem is solvable in polynomial delay and polynomial space in the total input size.
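The step from the parameterized bound to the bound in n alone can be checked by substitution. The sketch below takes m = O(n^2) from the space bound in the theorem and additionally assumes that D is a simple graph, so that it has at most n(n-1)/2 edges; that second assumption is not stated on the slide.

```latex
\begin{align*}
m &= O(n^2), \qquad \|D\| \le n + \binom{n}{2} = O(n^2), \qquad \log \|D\| = O(\log n), \\
m\,(m+n)\,\|D\|^2 \log \|D\| &= O(n^2) \cdot O(n^2 + n) \cdot O(n^4) \cdot O(\log n) = O(n^8 \log n).
\end{align*}
```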

Page 14

Geometric graph (Geograph)

A vertex- and edge-labeled graph G = (V, E, λ; c)

◦ Having vertex labels λ(v) and edge labels λ(e)

◦ which represent geometric features and their relationships

◦ Whose vertices v have the coordinates c(v) in the 2D plane R^2

Alphabets: Σ_V = {A, B, C}, Σ_E = {a, b}

Geograph G

◦ vertex v in V: λ(v) = A, c(v) = (2.5, 1.0)

◦ edge e in E: λ(e) = g

[Figure: a geograph G with five vertices labeled A and edges labeled e, f, and g, drawn on x-y axes (x from 1.0 to 3.0, y from 1.0 to 2.0).]

Page 15

Basics in Geometry

R^2: 2-dimensional Euclidean space

◦ The set R^2 of all points p = (x, y) (x, y: real numbers)

||x||: the norm of a vector x

||x - y||: the distance between x and y

c x: the product of a scalar c and a vector x

x + y: the addition of vectors

Ax: the product of a matrix A and a 2-vector x

det(A): the determinant of matrix A

A^{-1}: the inverse of matrix A

f : R^2 → R^2: a geometric transformation

f(x) = Ax + b: an affine transformation

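Page 16

Geometric Isomorphism

\[
M_t : \begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_1 \\ t_2 \end{pmatrix}
\]

\[
R_\theta : \begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}
\]

\[
S_s : \begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} s & 0 \\ 0 & s \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}
\]

Geograph P is geometrically isomorphic to Q iff there exists some F in Tgeo such that F(P) = Q

Class Tgeo of geometric transformations: any combination F of:

◦ Translation M
◦ Rotation R
◦ Scaling S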
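As a concrete illustration of the class Tgeo, the three basic transformations and their composition can be sketched in a few lines of Python. The function names (`translate`, `rotate`, `scale`, `compose`) and the sample values are chosen here for illustration and do not come from the slides.

```python
import math

def translate(p, t):
    """Translation M_t: shift the point p by the offset t = (t1, t2)."""
    return (p[0] + t[0], p[1] + t[1])

def rotate(p, theta):
    """Rotation R_theta: rotate p counterclockwise about the origin by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def scale(p, s):
    """Scaling S_s: multiply both coordinates of p by the same factor s."""
    return (s * p[0], s * p[1])

def compose(*steps):
    """Build F in Tgeo by applying the given transformations from left to right."""
    def F(p):
        for step in steps:
            p = step(p)
        return p
    return F

# F = scale by 2, then rotate by 90 degrees, then translate by (1, 0).
F = compose(
    lambda p: scale(p, 2.0),
    lambda p: rotate(p, math.pi / 2),
    lambda p: translate(p, (1.0, 0.0)),
)
q = F((1.0, 0.0))
print(q)  # approximately (1.0, 2.0), up to floating-point rounding
```

Any such F changes all distances by the same factor, so it preserves the shape of a geograph. That is the property the matching definition relies on.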

Page 17

Geometric matching

P matches Q iff P is geometrically isomorphic to a subgraph of Q

Defined through the invariance under a class of “rigid” geometric transformations and graph isomorphism.

[Figure: a geograph Q with vertices labeled A and edges labeled g, drawn on x-y axes (x from 1.0 to 3.0, y from 1.0 to 2.0), and a pattern P mapped into Q by a geometric matching function F.]

Page 18

Geometric matching

P matches Q iff P is geometrically isomorphic to a subgraph of Q

Defined through the invariance under a class of “rigid” geometric transformations and graph isomorphism.

[Figure: a geograph Q with vertices labeled A and edges labeled g, drawn on x-y axes (x from 1.0 to 3.0, y from 1.0 to 2.0), next to a smaller pattern P with vertices labeled A and edges labeled g.]

Page 19

Geometric matching

P matches Q (P ≦ Q) iff

◦ P is geometrically isomorphic to a subgraph of Q via geometric graph isomorphism under rigid geometric transformations.

(Geo, ≦): a partial ordering over geographs

[Figure: a geograph Q with vertices labeled A and edges labeled g, drawn on x-y axes (x from 1.0 to 3.0, y from 1.0 to 2.0), next to a smaller pattern P with vertices labeled A and edges labeled g.]

Page 20

Geometric Database

Page 21

Occurrence and frequency

The location list L(P) of geograph P in the input geograph D

◦ the set of all geometric transformations that match P to D.

The frequency of P in D: freq(P) = |L(P)|

[Figure: pattern P (vertices labeled A, edges labeled g) and database D drawn on x-y axes (x from 1.0 to 3.0, y from 1.0 to 2.0); L(P) = {f1, f2, f3}, freq(P) = 3.]
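The definition of L(P) can be made concrete with a brute-force sketch. It assumes points are stored as Python complex numbers, so that every combination of translation, rotation, and scaling has the form f(z) = a·z + b with a ≠ 0, and it assumes the pattern has at least two vertices, so that f is fixed by the images of the first two. The helper names and the toy data are hypothetical, and this is not the matching procedure used by MaxGeo.

```python
from itertools import permutations

def key(z, ndigits=6):
    """Round a complex coordinate so that it can serve as a dictionary key."""
    return (round(z.real, ndigits), round(z.imag, ndigits))

def make_database(vertices, edges):
    """Index D by position. vertices: [(coord, label)]; edges: [(coord_u, coord_v, label)]."""
    d_vertices = {key(z): lab for z, lab in vertices}
    d_edges = {frozenset((key(u), key(v))): lab for u, v, lab in edges}
    return d_vertices, d_edges

def location_list(p_vertices, p_edges, database):
    """Return all transformations f(z) = a*z + b, as pairs (a, b), that match P into D."""
    d_vertices, d_edges = database
    z1, z2 = p_vertices[0][0], p_vertices[1][0]
    occurrences = []
    # f is determined by where the first two pattern vertices go, so try every ordered pair.
    for w1, w2 in permutations(d_vertices, 2):
        w1, w2 = complex(*w1), complex(*w2)
        a = (w2 - w1) / (z2 - z1)   # rotation and scaling
        b = w1 - a * z1             # translation

        def f(z, a=a, b=b):
            return a * z + b

        vertices_ok = all(d_vertices.get(key(f(z))) == lab for z, lab in p_vertices)
        edges_ok = all(
            d_edges.get(frozenset((key(f(u)), key(f(v))))) == lab
            for u, v, lab in p_edges
        )
        if vertices_ok and edges_ok:
            occurrences.append((a, b))
    return occurrences

# Toy database: an A-g-B edge appears three times in different poses.
D = make_database(
    vertices=[(0 + 0j, "A"), (1 + 0j, "B"), (2 + 0j, "A"), (2 + 1j, "B")],
    edges=[(0 + 0j, 1 + 0j, "g"), (2 + 0j, 1 + 0j, "g"), (2 + 0j, 2 + 1j, "g")],
)
P_vertices = [(0 + 0j, "A"), (1 + 0j, "B")]
P_edges = [(0 + 0j, 1 + 0j, "g")]

L = location_list(P_vertices, P_edges, D)
print(len(L))  # freq(P) = |L(P)|; prints 3
```

The sketch tries O(n^2) candidate transformations and checks each against the whole pattern. This is polynomial, but far from the bounds stated in the main theorem.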

Page 22

Equivalence of patterns

Two geographs P and Q are equivalent to each other in D if L(P) = L(Q) holds in D.

[Figure: pattern P (vertices labeled A, edges labeled g) with L(P) = {f1, f2, f3}, freq(P) = 3, and a smaller pattern Q (vertices labeled A, edges labeled g) with L(Q) = {f1, f2, f3}, freq(Q) = 3.]

Page 23

Maximal patterns

A maximal pattern

◦ A geometric graph which is not included in any properly larger subgraph w.r.t. ≦ having the same set of occurrences in D.

◦ A maximal element within the equivalence class of geographs w.r.t. location list equivalence.

Lemma 1 (unique maximal pattern): For any geometric pattern P, there exists a unique maximal pattern equivalent to P.

Proof: Take the intersection of all geographs in the equivalence class [P] = { Q in Geo : L(P) = L(Q) in D }. This is the unique maximal pattern equivalent to P. QED.

Page 24

Maximal pattern mining

A maximal pattern

◦ is a geometric graph which is not included in any properly larger subgraph having the same set of occurrences in D.

The maximal subgraph mining problem

◦ asks to find all maximal patterns (closed patterns) appearing in a given input geometric graph D without repetition

The set M of all maximal patterns

◦ is expected to be much smaller than the set F of all frequent patterns and still contains the complete information of D

(Needs revision!)

Page 25

Canonical form

Given a geograph P of size k, define the canonical code Cano(P) of P as the lexicographically smallest code C(P, N) over all numberings N, where C(P, N) is defined as follows:

◦ Determine a numbering N of all the vertices of P in 1, 2, 3, ..., k

◦ Sort the collection of all labeled vertices and labeled edges: (c(v), λ(v)) for all v in V, and (c(u), c(v), λ(u, v)) for all edges e = (u, v) in E

◦ Let C(P, N) be the resulting list, taken as the code by N

(Needs revision!)
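One way to obtain a code that is both lexicographically minimal and invariant under translation, rotation, and scaling is sketched below. It rests on an assumption that is not spelled out on the slide: instead of ranging over numberings, it ranges over ordered pairs of vertices (i, j), maps vertex i to (0, 0) and vertex j to (1, 0), and keeps the smallest sorted object list. Coordinates are assumed to be distinct and integer or rational, so the arithmetic stays exact; the names `cdiv` and `canonical_code` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def cdiv(p, q):
    """Exact complex division p / q for points given as (x, y) pairs."""
    (x, y), (u, v) = p, q
    n = u * u + v * v
    return (Fraction(x * u + y * v) / n, Fraction(y * u - x * v) / n)

def canonical_code(vertices, edges):
    """vertices: {name: ((x, y), label)}; edges: [(name_u, name_v, label)].

    Assumes at least two vertices with pairwise distinct coordinates.
    """
    best = None
    for i, j in permutations(vertices, 2):
        (xi, yi), _ = vertices[i]
        (xj, yj), _ = vertices[j]
        base = (xj - xi, yj - yi)
        # Normalize: vertex i goes to (0, 0) and vertex j goes to (1, 0).
        norm = {
            name: cdiv((x - xi, y - yi), base)
            for name, ((x, y), _) in vertices.items()
        }
        v_code = sorted((norm[name], vertices[name][1]) for name in vertices)
        e_code = sorted(
            (tuple(sorted((norm[u], norm[v]))), lab) for u, v, lab in edges
        )
        code = (tuple(v_code), tuple(e_code))
        if best is None or code < best:
            best = code
    return best

# P, and a copy Q of P rotated by 90 degrees, scaled by 3, shifted by (5, -1), and renamed.
P = {"a": ((0, 0), "A"), "b": ((1, 0), "A"), "c": ((0, 2), "B")}
P_edges = [("a", "b", "g"), ("b", "c", "g")]
Q = {"u": ((-1, -1), "B"), "v": ((5, -1), "A"), "w": ((5, 2), "A")}
Q_edges = [("w", "u", "g"), ("v", "w", "g")]
# R is the mirror image of P; reflections are not in Tgeo.
R = {"a": ((0, 0), "A"), "b": ((-1, 0), "A"), "c": ((0, 2), "B")}

print(canonical_code(P, P_edges) == canonical_code(Q, Q_edges))  # True
print(canonical_code(P, P_edges) == canonical_code(R, P_edges))  # False
```

With k vertices the sketch examines k(k-1) reference pairs and sorts once per pair, so it runs in polynomial time. The normalization cancels any map z ↦ a·z + b, which is why the code does not change under Tgeo.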

Page 26

ILP'05, Aug 2005, Hiroki Arimura, Hokkaido University

Intersection of geometric graphs

Lemma: The intersection of geographs T1 and T2 is the unique geograph Merge(T1, T2) = T whose object set is given by α(T) = α(T1) ∩ α(T2).

[Figure: diagram of T1 and T2 with their object sets α(T1) and α(T2); the overlap of the two is α(Merge(T1, T2)).]

α(G) = the object set of G, that is, the set of all labeled vertices and labeled edges in a geometric graph G
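The lemma reduces Merge to a plain set intersection once a geograph is encoded as its object set. The sketch below assumes that both geographs are given in the same coordinate frame and that each edge object records its two labeled endpoints, so that an edge can only survive the intersection together with both of its vertices. This encoding and the function names are assumptions made for illustration.

```python
def object_set(vertices, edges):
    """alpha(G) as a Python set.

    vertices: {position: label}; edges: [(position_u, position_v, label)].
    """
    objs = {("v", pos, lab) for pos, lab in vertices.items()}
    for u, v, lab in edges:
        # Record the labeled endpoints inside the edge object.
        ends = frozenset(((u, vertices[u]), (v, vertices[v])))
        objs.add(("e", ends, lab))
    return objs

def merge(alpha1, alpha2):
    """Merge(T1, T2): the geograph whose object set is the intersection."""
    return alpha1 & alpha2

# T1 and T2 agree on the left edge and differ in the label of the rightmost vertex.
shared_edges = [((0, 0), (1, 0), "g"), ((1, 0), (2, 0), "g")]
T1 = object_set({(0, 0): "A", (1, 0): "A", (2, 0): "B"}, shared_edges)
T2 = object_set({(0, 0): "A", (1, 0): "A", (2, 0): "C"}, shared_edges)

T = merge(T1, T2)
print(len(T))  # 3 objects: two A vertices and the g edge between them
```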

Page 27

Closure of geograph

The intersection Merge(P1, P2) of a pair of geographs P1 and P2

◦ The intersection of P1 and P2 as first-order (relational) structures

The closure of geograph P

◦ Closure(P) = Merge(L(P))

Theorem:

◦ P is maximal in D iff Closure(P) = P

(Needs revision!)
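The closure can be read as follows: pull the database D back through every occurrence of P and intersect the results. The sketch below makes a strong simplifying assumption so that it fits in a few lines and stays exact: only integer translations are considered, not the full class Tgeo. The object encoding, the function names, and the toy database are hypothetical.

```python
def vertex(pos, lab):
    return ("v", pos, lab)

def edge(u, v, lab):
    """u and v are (position, label) pairs; endpoints are stored in sorted order."""
    a, b = sorted([u, v])
    return ("e", a, b, lab)

def shift(objs, t):
    """Translate every object in an object set by t = (tx, ty)."""
    def mv(p):
        return (p[0] + t[0], p[1] + t[1])
    out = set()
    for obj in objs:
        if obj[0] == "v":
            out.add(vertex(mv(obj[1]), obj[2]))
        else:
            (pu, lu), (pv, lv) = obj[1], obj[2]
            out.add(edge((mv(pu), lu), (mv(pv), lv), obj[3]))
    return out

def location_list(P, D):
    """All translations t such that shift(P, t) is contained in D."""
    anchor = min(o for o in P if o[0] == "v")  # any fixed vertex of P
    result = []
    for o in D:
        if o[0] == "v" and o[2] == anchor[2]:
            t = (o[1][0] - anchor[1][0], o[1][1] - anchor[1][1])
            if shift(P, t) <= D:
                result.append(t)
    return result

def closure(P, D):
    """Closure(P): intersection of D pulled back through every occurrence of P.

    Assumes P has at least one vertex and at least one occurrence in D.
    """
    copies = [shift(D, (-tx, -ty)) for tx, ty in location_list(P, D)]
    return set.intersection(*copies)

# Database: A-g-B, then an h edge, then A-g-B again, all along the x axis.
a0, b1, a2, b3 = ((0, 0), "A"), ((1, 0), "B"), ((2, 0), "A"), ((3, 0), "B")
D = {vertex(*a0), vertex(*b1), vertex(*a2), vertex(*b3),
     edge(a0, b1, "g"), edge(b1, a2, "h"), edge(a2, b3, "g")}

P = {vertex((0, 0), "A")}   # a single A vertex
C = closure(P, D)
print(len(C))               # 3: every A has a B to its right, joined by a g edge
print(closure(C, D) == C)   # True: C is maximal, while P is not
```

Here P and C have the same two occurrences, so they are equivalent in the sense of the location list, and C is the maximal pattern of their equivalence class.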

Page 28

Tree-shaped search route for maximal patterns

The core of (the code of) a geograph P

◦ The shortest prefix core(P) of code(P) such that L(P) = L(core(P))

The parent of a maximal geograph Q

◦ Parent(Q) = Closure(the proper prefix of core(the code of Q))

Theorem: The graph Tree(Geo) = (Geo, Parent(.)) forms a spanning tree for Geo with the empty geograph as root

(Needs revision!)

Page 29

Our algorithm MaxGeo: Basic Idea

[Figure: the search tree Tree(Geo) = (Geo, Parent(.)), with an arrow labeled "Jump" leading from one maximal geograph to another.]

Depth-first search over a tree-shaped search space for all maximal geographs

Jumping from one maximal geograph to another maximal geograph
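The "jump" idea is easiest to see in the simpler setting of itemsets, where it corresponds to the ppc-extension of LCM cited earlier in the talk. The sketch below is therefore not MaxGeo: it replaces geographs by sets of integer items and the geometric closure by the intersection of the transactions containing an itemset. All names are chosen here for illustration.

```python
def closure(items, db):
    """Intersection of all transactions containing `items` (None if there is none)."""
    occ = [t for t in db if items <= t]
    return frozenset.intersection(*occ) if occ else None

def closed_sets(db, n_items):
    """Yield every closed itemset with at least one occurrence, each exactly once.

    db is a non-empty list of frozensets over the items 0 .. n_items - 1.
    """
    def dfs(P, core):
        yield P
        for e in range(core + 1, n_items):
            if e in P:
                continue
            Q = closure(P | {e}, db)   # jump directly to the next closed set
            # Prefix-preserving test: the jump must not add any item smaller than e,
            # otherwise Q belongs to a different branch of the tree.
            if Q is not None and {x for x in Q if x < e} == {x for x in P if x < e}:
                yield from dfs(Q, e)

    yield from dfs(closure(frozenset(), db), -1)

db = [frozenset({0, 1}), frozenset({0, 2}), frozenset({0, 1, 2}), frozenset({1, 2, 3})]
found = list(closed_sets(db, 4))
print(len(found))  # 9 closed itemsets, including the empty set
```

The search keeps only the current path of the tree in memory and needs no table of already discovered patterns, which is the property that the depth-first search over Tree(Geo) is meant to provide for geographs.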

Page 30

Main result

Theorem:

◦ Given an input geometric graph D, algorithm MaxGeo enumerates all frequent maximal patterns P in M without duplicates in O(m(m+n)||D||^2 log ||D||) = O(n^8 log n) time per pattern and in O(m) = O(n^2) space,

◦ where m is the maximum number of occurrences of a pattern other than trivial patterns, n is the number of vertices in D, and ||D|| is the number of vertices and edges in D.

Corollary:

◦ The maximal pattern enumeration problem is solvable in polynomial delay and polynomial space in the total input size.

Page 31

Summary: We develop techniques...

A polynomial-time computable canonical code for all geometric graphs which is invariant under geometric transformations.

Characterization of M by the intersection operation (the least general generalization), and then a polynomial-time computable closure operation for geographs

The tree-shaped search route T for all maximal patterns in G

A new pattern growth technique combining reverse search and closure extension

Page 32

Conclusion

The class of geometric graphs

Maximal pattern discovery problem

A polynomial space and polynomial delay algorithm MaxGeo

◦ Time and space complexity
◦ Techniques