NIPS2013 Reading Group: Scalable kernels for graphs with continuous attributes


TRANSCRIPT

Page 1:

Scalable kernels for graphs with continuous attributes

A. Feragen, N. Kasenburg, J. Petersen, M. de Bruijne, K. Borgwardt

Presenter: Yasuo Tabei (JST)

NIPS reading@Univ of Tokyo, Jan. 23, 2014

Page 2:

Overview of the talk

•  Proposes a kernel for graphs whose nodes carry continuous-valued attribute vectors and whose edges carry lengths – a kernel is, roughly, a similarity measure

•  GraphHopper kernel – sums node similarities along equal-length shortest paths occurring in the two graphs

–  Can be computed efficiently with a forward-backward (message-passing) scheme

•  Results: fast, with classification accuracy on par with existing kernels

Page 3:

Graph kernels

•  A subfield of machine learning concerned with measuring similarity between graphs

•  The kernel of graphs G and G' is defined as the inner product of their feature vectors φ(G), φ(G')

–  K is the similarity between G and G'

•  Applications – classification (SVM), regression, statistical testing, etc.

K(G, G') = ⟨φ(G), φ(G')⟩

Page 4:

Graph classification

+1 +1 -1

Training data

? ? ?

Test data

Page 5:

Graph kernels

•  Graph kernels count the substructures that two graphs have in common – φ(G) is a vector whose components count paths, trees, cycles, etc.

•  A kernel that counts all common subgraphs is NP-hard to compute (Gärtner et al., 2003)

•  Many graph kernels have been proposed so far (mostly for discrete labels) – random walk kernels (ICML 2003, COLT 2003), WL kernel (NIPS 2009), NH kernel (ICDM 2009)
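As a toy illustration of this idea (not code from the paper), a kernel based on an explicit subgraph-count feature map is just an inner product of count vectors. The feature map below is a hypothetical one that counts only nodes, edges and triangles:

```python
import numpy as np

def subgraph_count_features(adj):
    """Hypothetical toy feature map phi(G): counts of a few fixed substructures.

    adj: adjacency as {node: set(neighbors)} with comparable node ids.
    Components: number of nodes, number of edges, number of triangles.
    """
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2
    triangles = sum(
        1
        for u in adj
        for v in adj[u]
        for w in adj[v]
        if u < v < w and w in adj[u]
    )
    return np.array([n, m, triangles], dtype=float)

def graph_kernel(adj1, adj2):
    """K(G, G') = <phi(G), phi(G')>."""
    return float(np.dot(subgraph_count_features(adj1), subgraph_count_features(adj2)))
```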

Page 6:

Weisfeiler-Lehman procedure

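The slide illustrates the Weisfeiler-Lehman relabeling step for discretely labeled graphs. As a rough sketch of my own (not the authors' implementation), one WL iteration compresses each node's label together with the sorted multiset of its neighbors' labels into a fresh integer label:

```python
def wl_iteration(adj, labels):
    """One Weisfeiler-Lehman relabeling step.

    adj: {node: set(neighbors)}, labels: {node: hashable label}.
    Each node's new label compresses (own label, sorted neighbor labels)
    into a fresh integer.
    """
    signatures = {
        v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
        for v in adj
    }
    compression = {}   # signature -> new integer label
    new_labels = {}
    for v, sig in signatures.items():
        if sig not in compression:
            compression[sig] = len(compression)
        new_labels[v] = compression[sig]
    return new_labels
```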

Page 7:

Graphs with continuous values on edges and continuous-valued vectors on nodes

•  Graph G = (V, E) with edge lengths ℓ : E → ℝ+ and node attributes A : V → ℝ^d

[Figure: example graph whose nodes carry attribute vectors such as (1.2, 2.3, −0.1), (−0.1, 1.2, −1.3), (1.5, 1.2, −1.3), (−0.5, 2.2, 1.3) and whose edges carry lengths 1.2, 0.8, 3.5, 1.2]

•  Examples: proteins, chemical compounds, airways, etc.
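A minimal sketch of one way such an attributed graph could be represented (my own choice of data structure, not the paper's; the edge topology below is illustrative, only the attribute vectors and lengths come from the slide):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class AttributedGraph:
    """Graph with edge lengths (E -> R+) and node attribute vectors (V -> R^d)."""
    attributes: dict = field(default_factory=dict)  # node -> np.ndarray of shape (d,)
    lengths: dict = field(default_factory=dict)     # frozenset({u, v}) -> positive float

    def add_node(self, v, attr):
        self.attributes[v] = np.asarray(attr, dtype=float)

    def add_edge(self, u, v, length):
        self.lengths[frozenset((u, v))] = float(length)

# Example graph with 3-dimensional node attributes (edge structure is made up).
g = AttributedGraph()
g.add_node(0, [1.2, 2.3, -0.1])
g.add_node(1, [-0.1, 1.2, -1.3])
g.add_node(2, [1.5, 1.2, -1.3])
g.add_node(3, [-0.5, 2.2, 1.3])
g.add_edge(0, 1, 1.2)
g.add_edge(1, 2, 0.8)
g.add_edge(1, 3, 3.5)
g.add_edge(2, 3, 1.2)
```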

Page 8:

Existing graph kernels

•  Shortest path kernel (ICDM 2005) – node attribute similarity is measured with a node kernel k_n, and path similarity is measured by a kernel k_l on the shortest-path distances d_G between node pairs

•  O(n^4) time (n: number of nodes) – not practical for large graphs

K_SP(G, G') = Σ_{v,w ∈ V} Σ_{v',w' ∈ V'} k_n(v, v') · k_l(d_G(v, w), d_G'(v', w')) · k_n(w, w')

(k_n: node similarity, k_l: similarity of shortest-path distances)
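A deliberately naive sketch of this kernel (my own illustration of the O(n^2 n'^2) cost, not the authors' implementation), assuming precomputed all-pairs shortest-path distance matrices and Gaussian choices for both k_n and k_l:

```python
import numpy as np

def gaussian(x, y, gamma=1.0):
    return float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

def shortest_path_kernel(attrs1, dist1, attrs2, dist2, gamma=1.0):
    """Naive shortest path kernel K_SP.

    attrs*: lists of node attribute vectors.
    dist*:  all-pairs shortest-path distance matrices (numpy arrays).
    Both k_n (on attributes) and k_l (on scalar distances) are taken to be
    Gaussians here, which is one possible choice, not necessarily the paper's.
    """
    k = 0.0
    for v in range(len(attrs1)):
        for w in range(len(attrs1)):
            for vp in range(len(attrs2)):
                for wp in range(len(attrs2)):
                    k += (gaussian(attrs1[v], attrs2[vp], gamma)
                          * gaussian([dist1[v, w]], [dist2[vp, wp]], gamma)
                          * gaussian(attrs1[w], attrs2[wp], gamma))
    return k
```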

Page 9:

Overview of proposed method

•  Proposes a kernel for graphs whose nodes carry continuous-valued attribute vectors (the GraphHopper kernel)

•  The similarity is the sum of node similarities along all equal-length shortest paths in the two graphs – computed while "hopping" along the graphs

•  By decomposing the computation suitably, the number of shortest paths in which each node appears at each position can be obtained with a forward-backward computation

•  O(n^2(m + log n + δ^2 + d)) time – n: number of nodes, m: number of edges, δ: diameter, d: dimension of the node attributes

•  On real-world graphs the runtime behaves like O(n^2)

Page 10:

GraphHopper Kernel

•  Sum of node kernels between corresponding nodes of all equal-length shortest paths

[Figure: k_n evaluated between the nodes at matching positions of two shortest paths – the order of the nodes along the paths matters]

K(G, G') = Σ_{π ∈ F, π' ∈ F'} k_p(π, π')

k_p(π, π') = Σ_{j=1}^{|π|} k_n(π(j), π'(j)) if |π| = |π'|, and 0 otherwise

(F, F' are the families of shortest paths in G and G')
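A direct transcription of k_p as a sketch (assuming paths are given as lists of node attribute vectors, and taking a Gaussian node kernel as one possible choice of k_n):

```python
import numpy as np

def k_n(a, b, gamma=1.0):
    """Gaussian node kernel on attribute vectors (one possible choice of k_n)."""
    return float(np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def k_p(path1, path2, gamma=1.0):
    """Path kernel: sum of node kernels at matching positions along the two
    shortest paths; zero when the paths do not have the same length."""
    if len(path1) != len(path2):
        return 0.0
    return sum(k_n(a, b, gamma) for a, b in zip(path1, path2))
```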

Page 11:

From inter-graph to intra-graph computation

•  The kernel can be computed as a weighted sum of node kernels

–  w encodes the structure of the two graphs

k(G, G') = Σ_{v ∈ V, v' ∈ V'} w(v, v') · k_n(v, v')

w(v, v') = Σ_{i=1}^{δ} Σ_{j=1}^{δ} #{π | π(i) = v, |π| = j} · #{π' | π'(i) = v', |π'| = j}

(#{π | π(i) = v, |π| = j} is the number of shortest paths π of length j in G in which v appears as the i-th node)

[Figure: example for the case j = 2]

•  Once w has been computed, k can be evaluated with only n × n' node-kernel terms – see the sketch below
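A sketch of this reduction (my own illustration): collect the counts #{π | π(i) = v, |π| = j} into a per-node matrix M_v, so that w(v, v') is an entrywise product of M_v and M_{v'} summed up, and the kernel is an entrywise product of the weight matrix and the node-kernel matrix summed up:

```python
import numpy as np

def weight(Mv, Mvp):
    """w(v, v') = sum_{i,j} M_v[i, j] * M_{v'}[i, j], where
    M_v[i, j] = #{shortest paths of length j in which v is the i-th node}.
    Mv, Mvp: (delta, delta) count matrices for one node of G and one of G'."""
    return float(np.sum(Mv * Mvp))

def graphhopper_from_weights(W, Kn):
    """k(G, G') = sum_{v, v'} w(v, v') * k_n(v, v').

    W[v, v']  -- structural weights (shortest-path co-occurrence counts)
    Kn[v, v'] -- node kernel between the attribute vectors of v and v'
    Both are (n, n') arrays, so the kernel is their entrywise product, summed."""
    return float(np.sum(W * Kn))
```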

Page 12:

Decomposing the shortest paths that pass through v

•  #{π | π(i) = v, |π| = j} = Σ_{u ∈ V} #{shortest paths π of length j starting at u | π(i) = v}

•  For each source node, the shortest paths to all other nodes form a DAG

•  On this DAG the count factorizes as (number of shortest paths of length i from the source to v) × (number of shortest paths of length j − i starting at v); building such a DAG is sketched below

[Figure: the shortest-path DAG rooted at a source node, with the part before v handled by a forward computation and the part after v by a backward computation]
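A sketch of building one such shortest-path DAG with Dijkstra (my own illustration; the paper's implementation uses the BGL package instead). The DAG keeps exactly the edges (u, w) with dist(u) + length(u, w) = dist(w):

```python
import heapq

def shortest_path_dag(adj, source):
    """Build the DAG of shortest paths out of `source`.

    adj: {u: [(v, length), ...]} with positive edge lengths.
    Returns (dist, dag) where dag[u] lists the children w with
    dist[u] + length(u, w) == dist[w], i.e. edges that lie on some
    shortest path from `source` (float comparisons use a small tolerance).
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                     # stale heap entry
        for v, length in adj[u]:
            nd = d + length
            if nd < dist.get(v, float("inf")) - 1e-12:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    dag = {u: [] for u in adj}
    for u in dist:
        for v, length in adj[u]:
            if v in dist and abs(dist[u] + length - dist[v]) < 1e-12:
                dag[u].append(v)
    return dist, dag
```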

Page 13:

Forward computation on the shortest-path DAG G_v

•  For each node of the DAG rooted at v, compute the number of shortest paths from the root of each length j = 1, …, δ, stored as a count vector c(·)

1.  Initialization: the root gets c = (1, 0, …, 0); every other node gets the zero vector

2.  The root propagates its count vector, shifted by one position, to its children (the nodes at depth j = 1)

3.  Each node at depth j = 2, …, δ in turn propagates its shifted count vector c(·) to its children

[Figure: count vectors such as (0, 1), (0, 0, 1), (0, 0, 2) being accumulated along the DAG edges]

Page 14:

Backward computation on the shortest-path DAG G_v

•  For each node, compute the number of shortest paths starting at that node of each length j = 1, …, δ

1.  Initialization: every node starts with c = (1, 0, …, 0) (the single-node path)

2.  The leaves of the DAG propagate their shifted count vectors to their predecessors

3.  Proceeding backward through the DAG (j = 2, …, δ), each node propagates its shifted count vector to its predecessors

A sketch of both passes follows below.

[Figure: counts such as 1, 2, 3 accumulating from the leaves back toward the root]
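As a rough illustration of the two passes on one shortest-path DAG (not the authors' algorithm verbatim; it assumes the `dag` and `dist` structures from the earlier sketch, with path length counted in nodes, and the "shifted count vector" on the slides corresponds to the index shift i → i + 1 here):

```python
from collections import defaultdict

def forward_backward_counts(dag, dist):
    """Forward/backward message passing over one shortest-path DAG.

    forward[v][i]  : number of shortest paths from the root to v with i nodes
    backward[v][j] : number of shortest paths starting at v (inside the DAG)
                     with j nodes
    A shortest path out of the root that has v as its i-th node and i + j - 1
    nodes in total is then counted forward[v][i] * backward[v][j] times.
    """
    # Sorting by distance gives a topological order of the DAG
    # (every DAG edge goes from a closer node to a strictly farther one).
    order = sorted(dist, key=dist.get)
    root = order[0]

    forward = defaultdict(lambda: defaultdict(int))
    forward[root][1] = 1
    for u in order:                       # push counts down to the children
        for i, c in list(forward[u].items()):
            for w in dag[u]:
                forward[w][i + 1] += c

    backward = defaultdict(lambda: defaultdict(int))
    for u in reversed(order):             # pull counts up from the children
        backward[u][1] = 1                # the single-node path (u itself)
        for w in dag[u]:
            for j, c in backward[w].items():
                backward[u][j + 1] += c

    return forward, backward
```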

Page 15:

Computation time

•  O(n^2(m + log n + d + δ^2)) time
•  n: number of nodes
•  m: number of edges
•  d: dimension of the node attributes
•  δ: graph diameter

•  Behaves like O(n^2) on real-world graphs

Page 16:

Experiments

•  Datasets: ENZYMES, PROTEINS, AIRWAYS, SYNTHETIC
•  Evaluation: runtime and classification accuracy (10-fold cross validation)

•  Baselines:
–  Propagation kernel (PROP) [ECML, 12]
–  Shortest path kernel (SP) [ICDM, 05]
–  Connected subgraph matching kernel (CSM) [ICML, 12]
–  Weisfeiler-Lehman kernel (WL) [NIPS, 09]

Page 17:

Datasets

•  Similarity of the node attributes

•  ENZYMES [NAR,2004] •  PROTEINS [JMB,2003] •  AIRWAYS [IPMI,2011]

                          ENZYMES   PROTEINS   AIRWAYS   SYNTHETIC
Number of nodes              32.6       39.1       221         100
Number of edges              46.7       72.8       220         196
Graph diameter               12.8       11.6      21.1           7
Node attribute dimension       18          1         1           1
Dataset size                  600       1113      1966         300
Class size                  6×100    663/450   980/986     150/150

Table 1: Data statistics: average node and edge counts and graph diameter, dataset and class sizes.

giving total complexity O(n^2(m + log n + δ^2)) for computing M(v) for all v ∈ V using Algorithm 4. It follows that the total complexity of computing k(G, G') is

O(n^2 + n^2 d + n^2 δ^2 + n^2 δ^2 + n^2(m + log n + δ^2)) = O(n^2(m + log n + d + δ^2)).

3 Experiments

Classification experiments were made with the proposed GraphHopper kernel and several alternatives: The propagation kernel PROP [16], the connected subgraph matching kernel CSM [14] and the shortest path kernel SP [17] all use continuous-valued attributes. In addition, we benchmark against the Weisfeiler-Lehman kernel WL [2], which only uses discrete node attributes. All kernels were implemented in Matlab, except for CSM, where a Java implementation was supplied by N. Kriege. For the WL kernel, the Matlab implementation available from [20] was used. For the GraphHopper and SP kernels, shortest paths were computed using the BGL package [21] implemented in C++. The PROP kernel was implemented in two different versions, both using the total variation hash function, as the Hellinger distance is only directly applicable to positive vector-valued attributes. For PROP-diff, labels were propagated with the diffusion scheme, whereas in PROP-WL labels were first discretised via hashing and then the WL kernel [2] update was used. The bin width of the hash function was set to 10^-5 as suggested in [16]. The PROP-diff, PROP-WL and the WL kernel were each run with 10 iterations. In the CSM kernel, the clique size parameter was set to k = 5. Our kernel implementations and datasets (with the exception of AIRWAYS) can be found at http://image.diku.dk/aasa/software.php.

Classification experiments were made on four datasets: ENZYMES, PROTEINS, AIRWAYS and SYNTHETIC. ENZYMES and PROTEINS are sets of proteins from the BRENDA database [22] and the dataset of Dobson and Doig [23], respectively. Proteins are represented by graphs as follows. Nodes represent secondary structure elements (SSEs), which are connected whenever they are neighbors either in the amino acid sequence or in 3D space [24]. Each node has a discrete type attribute (helix, sheet or turn) and an attribute vector containing physical and chemical measurements including length of the SSE in Ångström (Å), distance between the Cα atom of its first and last residue in Å, its hydrophobicity, van der Waals volume, polarity and polarizability. ENZYMES comes with the task of classifying the enzymes to one out of 6 EC top-level classes, whereas PROTEINS comes with the task of classifying into enzymes and non-enzymes. AIRWAYS is a set of airway trees extracted from CT scans of human lungs [25, 26]. Each node represents an airway branch, attributed with its length. Edges represent adjacencies between airway bronchi. AIRWAYS comes with the task of classifying airways into healthy individuals and patients suffering from Chronic Obstructive Pulmonary Disease (COPD). SYNTHETIC is a set of synthetic graphs based on a random graph G with 100 nodes and 196 edges, whose nodes are endowed with normally distributed scalar attributes sampled from N(0, 1). Two classes A and B each with 150 attributed graphs were generated from G by randomly rewiring edges and permuting node attributes. Each graph in A was generated by rewiring 5 edges and permuting 10 node attributes, and each graph in B was generated by rewiring 10 edges and permuting 5 node attributes, after which noise from N(0, 0.45^2) was added to every node attribute in every graph. Detailed metrics of the datasets are found in Table 1.

Both GraphHopper, SP and CSM depend on freely selected node kernels for continuous attributes, giving modeling flexibility. For the ENZYMES, AIRWAYS and SYNTHETIC datasets, a Gaussian node kernel k_n(v, v') = exp(−γ ‖A(v) − A(v')‖^2) was used on the continuous-valued attribute, with γ = 1/d. For the PROTEINS dataset, the node kernel was a product of a Gaussian kernel with γ = 1/d and a Dirac kernel on the continuous- and discrete-valued node attributes, respectively. For the WL


k_n(v, v') = exp(−γ ‖A(v) − A(v')‖^2)
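A one-line sketch of this node kernel with γ = 1/d, as used for ENZYMES, AIRWAYS and SYNTHETIC (for PROTEINS it would additionally be multiplied by a Dirac kernel on the discrete type attribute):

```python
import numpy as np

def node_kernel(a, b):
    """Gaussian node kernel k_n(v, v') = exp(-gamma * ||A(v) - A(v')||^2)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    gamma = 1.0 / a.shape[0]  # gamma = 1/d, with d the attribute dimension
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))
```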

Page 18:

Classification accuracy and runtime

•  SP is accurate but slow
•  WL only applies to discrete labels
•  PROP hashes the attributes (LSH) and then applies existing kernels (low accuracy)

•  GraphHopper matches the accuracy of SP while being much faster

Kernel           ENZYMES             PROTEINS            AIRWAYS             SYNTHETIC
GraphHopper      69.6±1.3 (12'10'')  74.1±0.5 (2.8 h)    66.8±0.5 (1 d 7 h)  86.6±1.0 (12'10'')
PROP-diff [16]   37.2±2.2 (13'')     73.3±0.4 (26'')     63.5±0.5 (4'12'')   46.1±1.9 (1'21'')
PROP-WL [16]     48.5±1.3 (1'9'')    73.1±0.8 (2'40'')   61.5±0.6 (8'17'')   44.5±1.2 (1'52'')
SP [17]          71.0±1.3 (3 d)      75.5±0.8 (7.7 d)    OUT OF TIME         85.4±2.1 (3.4 d)
CSM [14]         69.4±0.8            OUT OF MEMORY       OUT OF MEMORY       OUT OF TIME
WL [2]           48.0±0.9 (18'')     75.6±0.5 (2'51'')   62.0±0.6 (7'43'')   43.3±2.3 (2'8'')

Table 2: Mean classification accuracies with standard deviation for all experiments, significantly best accuracies in bold. OUT OF MEMORY means that 100 GB memory was not enough. OUT OF TIME indicates that the kernel computation did not finish within 30 days. Runtimes are given in parentheses; see Section 3.1 for further runtime studies. Above, x'y'' means x minutes, y seconds.

Figure 2: Dependence of runtime on n, δ and m on synthetic and real graph datasets.

kernel, discrete node labels were used when available (in ENZYMES and PROTEINS); otherwise node degree was used as node label.

Classification was done using a support vector machine (SVM) [27]. The SVM slack parameter was trained using nested cross validation on 90% of the entire dataset, and the classifier was tested on the remaining 10%. This experiment was repeated 10 times. Mean accuracies with standard deviations are reported in Table 2. For each kernel and dataset, runtime is given in parentheses in Table 2. Runtimes for the CSM kernel are not included, as this implementation was in another language.

3.1 Runtime experiments

An empirical evaluation of the runtime dependence on the parameters n, m and δ is found in Figure 2. In the top left panel, average kernel evaluation runtime was measured on datasets of 10 random graphs with 10, 20, 30, ..., 500 nodes each, and a density of 0.4. Density is defined as m / (n(n−1)/2), i.e. the fraction of edges in the graph compared to the number of edges in the complete graph. In the top right panel, the number of nodes was kept constant (n = 100), while datasets of 10 random graphs were generated with 110, 120, ..., 500 edges each. Development of both average kernel evaluation runtime and graph diameter is shown. In the bottom panels, the relationship between runtime and graph diameter is shown on subsets of the real AIRWAYS and PROTEINS datasets. In this experiment, 100 and 200 graphs from AIRWAYS and PROTEINS, respectively, were used for each diameter.

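A rough re-creation of this evaluation protocol for a precomputed kernel matrix (my own sketch; the C grid and the 5-fold inner cross validation are assumptions, not taken from the paper):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.svm import SVC

def evaluate_precomputed_kernel(K, y, n_repeats=10, seed=0):
    """SVM evaluation on a precomputed (n x n) kernel matrix K with labels y:
    the slack parameter C is chosen by inner cross validation on 90% of the
    data, the classifier is tested on the held-out 10%, repeated n_repeats times."""
    K, y = np.asarray(K), np.asarray(y)
    accuracies = []
    outer = StratifiedShuffleSplit(n_splits=n_repeats, test_size=0.1, random_state=seed)
    for train, test in outer.split(np.zeros((len(y), 1)), y):
        # Inner 5-fold CV over C on the training part only (grid is an assumption).
        grid = GridSearchCV(SVC(kernel="precomputed"),
                            {"C": [10.0 ** e for e in range(-3, 4)]}, cv=5)
        grid.fit(K[np.ix_(train, train)], y[train])
        accuracies.append(grid.score(K[np.ix_(test, train)], y[test]))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```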

Page 19:

Runtime vs. number of nodes

[Figure 2, top left panel: average kernel evaluation runtime as a function of the number of nodes n (random graphs with 10–500 nodes, density 0.4).]

Page 20:

Runtime vs. graph diameter (AIRWAYS)

[Figure 2, bottom panel: average kernel evaluation runtime as a function of graph diameter on subsets of the AIRWAYS dataset (100 graphs per diameter).]

Page 21:

Runtime vs. graph diameter (PROTEINS)

[Figure 2, bottom panel: average kernel evaluation runtime as a function of graph diameter on subsets of the PROTEINS dataset (200 graphs per diameter).]

Page 22:

Summary

•  GraphHopper kernel for graphs with continuous-valued (vector) node attributes and continuous-valued edge lengths
•  Sums node similarities over all equal-length shortest paths in the two graphs – computed with a forward-backward scheme

•  Accuracy comparable to the shortest path kernel, at a much lower runtime