Mathematical Statistics Team (poster): stat.sys.i.kyoto-u.ac.jp/.../05/20180304e_poster.pdf



Mathematical Statistics Team

Hidetoshi Shimodaira

• Motivation: Assessing the confidence of each obtained cluster
• Consider approx. unbiased p-values as frequentist confidence measures; null hypothesis = obtained cluster is NOT true
• Example: lung data (73 tumors, 916 genes; Garber et al., 2001)

• Two asymptotic theories

Key point: Multiscale Bootstrap (Shimodaira, 2002; 2004)
✓ Low computational cost: $O(B)$
✓ The double bootstrap method has the same accuracy but high computational cost, $O(B^2)$

(Algorithm reproduced from Y. Terada and H. Shimodaira, "Selective inference via multiscale bootstrap".)

Algorithm 1: Computing approximately unbiased p-values

1: Specify several $n' \in \mathbb{N}$ values, and set $\sigma^2 = n/n'$. Set the number of bootstrap replicates $B$, say, 1000.

2: For each $n'$, perform bootstrap resampling to generate $Y^*$ for $B$ times and compute $\alpha_{\sigma^2}(H|y) = C_H/B$ and $\alpha_{\sigma^2}(S|y) = C_S/B$ by counting the frequencies $C_H = \#\{Y^* \in H\}$ and $C_S = \#\{Y^* \in S\}$. (We actually work on $X^*_{n'}$ instead of $Y^*$.) Compute $\psi_{\sigma^2}(H|y) = \sigma\,\bar{\Phi}^{-1}(\alpha_{\sigma^2}(H|y))$ and $\psi_{\sigma^2}(S|y) = \sigma\,\bar{\Phi}^{-1}(\alpha_{\sigma^2}(S|y))$.

3: Estimate parameters $\beta_H(y)$ and $\beta_S(y)$ by fitting models
$\psi_{\sigma^2}(H|y) = \varphi_H(\sigma^2|\beta_H)$ and $\psi_{\sigma^2}(S|y) = \varphi_S(\sigma^2|\beta_S)$,
respectively. The parameter estimates are denoted by $\hat{\beta}_H(y)$ and $\hat{\beta}_S(y)$. If there are several candidate models, apply the above to each and choose the best model by AIC.

4: Approximately unbiased p-values of selective inference ($p_{SI}$) and non-selective inference ($p_{AU}$) are computed by one of (A) and (B) below.

(A) Extrapolate $\psi_{\sigma^2}(H|y)$ and $\psi_{\sigma^2}(S|y)$ to $\sigma^2 = -1$ and $0$, respectively, by
$z_H = \varphi_H(-1|\hat{\beta}_H(y))$ and $z_S = \varphi_S(0|\hat{\beta}_S(y))$,
and then compute p-values by
$p_{SI}(H|S, y) = \bar{\Phi}(z_H) / \bar{\Phi}(z_H + z_S)$ and $p_{AU}(H|y) = \bar{\Phi}(z_H)$.

(B) Specify $k \in \mathbb{N}$, $\sigma^2_0, \sigma^2_{-1} > 0$ (e.g., $k = 3$ and $\sigma^2_{-1} = \sigma^2_0 = 1$). Extrapolate $\psi_{\sigma^2}(H|y)$ and $\psi_{\sigma^2}(S|y)$ to $\sigma^2 = -1$ and $0$, respectively, by
$z_{H,k} = \varphi_{H,k}(-1|\hat{\beta}_H(y), \sigma^2_{-1})$ and $z_{S,k} = \varphi_{S,k}(0|\hat{\beta}_S(y), \sigma^2_0)$,
where the Taylor polynomial approximation of $\varphi_H$ at $\tau^2 > 0$ with $k$ terms is
$\varphi_{H,k}(\sigma^2|\hat{\beta}_H(y), \tau^2) = \sum_{j=0}^{k-1} \frac{(\sigma^2 - \tau^2)^j}{j!} \left. \frac{\partial^j \varphi_H(\sigma^2|\hat{\beta}_H(y))}{\partial(\sigma^2)^j} \right|_{\sigma^2 = \tau^2}$,
and that of $\varphi_S$ is defined similarly. Then compute p-values by
$p_{SI,k}(H|S, y) = \bar{\Phi}(z_{H,k}) / \bar{\Phi}(z_{H,k} + z_{S,k})$ and $p_{AU,k}(H|y) = \bar{\Phi}(z_{H,k})$.
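To make the algorithm concrete, here is a minimal R sketch of steps 2-4(A) under simplifying assumptions: the regions H and S are given as indicator functions on resampled data sets, and the simplest linear model $\varphi(\sigma^2) = \beta_0 + \beta_1 \sigma^2$ is used for the extrapolation (real implementations such as pvclust and scaleboot fit several candidate models and select by AIC; all names below are illustrative, not an official API):

# Multiscale bootstrap with linear-model extrapolation (illustrative sketch).
# X: n x p data matrix; in_H, in_S: functions mapping a resampled data set
# to TRUE/FALSE (is the bootstrap statistic inside region H / S?).
multiscale_pvalues <- function(X, in_H, in_S, B = 1000,
                               scales = seq(0.5, 1.4, by = 0.1)) {
  n <- nrow(X)
  sigma2 <- psi_H <- psi_S <- numeric(0)
  for (r in scales) {
    nprime <- round(r * n)               # bootstrap sample size n'
    cH <- cS <- 0
    for (b in seq_len(B)) {
      Xb <- X[sample(n, nprime, replace = TRUE), , drop = FALSE]
      if (in_H(Xb)) cH <- cH + 1
      if (in_S(Xb)) cS <- cS + 1
    }
    s2 <- n / nprime                     # sigma^2 = n / n'
    # psi = sigma * inverse survival function of the bootstrap probability;
    # counts of 0 or B give infinite psi and are skipped in real implementations
    sigma2 <- c(sigma2, s2)
    psi_H <- c(psi_H, sqrt(s2) * qnorm(1 - cH / B))
    psi_S <- c(psi_S, sqrt(s2) * qnorm(1 - cS / B))
  }
  # Step 3: fit the simplest model psi(sigma^2) = b0 + b1 * sigma^2
  fit_H <- lm(psi_H ~ sigma2)
  fit_S <- lm(psi_S ~ sigma2)
  # Step 4(A): extrapolate to sigma^2 = -1 (for H) and sigma^2 = 0 (for S)
  zH <- sum(coef(fit_H) * c(1, -1))      # b0 - b1
  zS <- coef(fit_S)[1]                   # b0
  p_AU <- pnorm(zH, lower.tail = FALSE)                       # Phi-bar(zH)
  p_SI <- p_AU / pnorm(zH + zS, lower.tail = FALSE)           # ratio formula
  list(p_AU = p_AU, p_SI = unname(p_SI))
}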

[Figure: model fitting to ψ and extrapolation, yielding the selective p-value (at σ² = −1) and the non-selective p-value (at σ² = 0); for non-smooth surfaces (such as cones and polyhedra).]

The hypothesis is tested only for the clusters that appear in the obtained tree, so selective inference is needed for pvclust.

Pvclust (Suzuki and Shimodaira, 2006)
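In practice, the multiscale bootstrap for cluster analysis is available through the pvclust package on CRAN; a typical call looks like the following sketch (here lung.data stands for a 916 genes x 73 tumors matrix; the Garber et al. data set itself is not bundled with the package):

library(pvclust)  # Suzuki and Shimodaira (2006)
# Cluster the columns (tumors) with correlation distance and average linkage,
# using the multiscale bootstrap over relative sample sizes r = n'/n.
result <- pvclust(lung.data, method.dist = "correlation",
                  method.hclust = "average",
                  nboot = 1000, r = seq(0.5, 1.4, by = 0.1))
plot(result)                  # dendrogram with AU/BP values at each edge
pvrect(result, alpha = 0.95)  # highlight clusters with AU >= 0.95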

[Figure: pvclust dendrogram of the lung data. Sample labels include fetal_lung, _normal, _Adeno, _SCLC, _SCC, _LCLC, _node, and _combined; the height axis runs from 0.0 to 1.0. Each edge is annotated with three values, si (selective inference), au (approximately unbiased), and bp (bootstrap probability), together with an edge number.]

↑ Hypothesis region
Selection region ↓

Theorem (large sample theory): If the boundary surfaces of H and S are smooth and "nearly parallel", then the proposed p-value is second-order accurate, with error $O(n^{-1})$.

Theorem (nearly flat surfaces): If the boundary surfaces are "nearly flat", approaching flat but allowed to be non-smooth (such as cones and polyhedra), then the proposed p-value is justified as unbiased, with error $O(\lambda^2)$.

Algorithm: Computing approx. unbiased selective p-values

(1) Large sample theory with "nearly parallel surfaces" ($n \to \infty$):
$h(u) = h_0 + h_i u_i + h_{ij} u_i u_j + h_{ijk} u_i u_j u_k + \cdots$,
$h_0 = O(1),\ h_i = O(n^{-1/2}),\ h_{ij} = O(n^{-1/2}),\ h_{ijk} = O(n^{-1}), \ldots$
$(u, v)$ are $O(\sqrt{n})$, but $h_0 = O(n^{1/2})$, $h_i = O(1)$ are multiplied by $O(n^{-1/2})$.

(2) Asymptotic theory of "nearly flat surfaces" ($\lambda \to 0$; Shimodaira 2008):
$\sup_{u \in \mathbb{R}^m} |h(u)| = O(\lambda)$, $\int_{\mathbb{R}^m} |h(u)|\, du < \infty$, $\int_{\mathbb{R}^m} |\mathcal{F}h(\omega)|\, d\omega < \infty$.

[Figures: a smooth surface and a non-smooth surface.]

(Pham, Sheridan and Shimodaira, arXiv:1704.06017)
• What drives the growth of real-world networks across diverse fields?
• Using two interpretable and universal mechanisms: (1) Talent: the intrinsic ability of a node to attract connections (fitness $\eta_i$), and (2) Experience: the preferential attachment (PA) function $A_k$ governing the extent to which having more connections makes a node more or less attractive for forming new connections in the future.

Selective inference (Terada and Shimodaira, arXiv:1711.00949)

Key finding: Although both talent and experience contribute to the growth process, the ratios of their contributions vary greatly.

Probabilistic Multi-view Graph Embedding (PMvGE)

Theorem (universal approximation): Inner product + Neural networks ≈ Arbitrary similarity + Arbitrary transformation

(Mercer’s theorem + universal approximation of NN)

[Diagram: each data vector (in view d) is transformed by a neural network; the strength of association (≥ 0) between two data vectors is modeled through the inner product of their feature vectors.]

Advantages: (1) PMvGE with neural networks is proved to be highly expressive. (2) PMvGE approximately and non-linearly generalizes CDMCA (Shimodaira 2016, Neural Networks), which already generalizes various existing methods such as CCA, LPP, and spectral graph embedding.

(3) Likelihood-based estimation of neural networks is efficiently computed by mini-batch Stochastic Gradient Descent.
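A minimal R sketch of the PMvGE similarity model under stated assumptions: one-hidden-layer feature maps per view, and association strength modeled as the exponential of the inner product of feature vectors (layer sizes and names are illustrative, not the exact architecture of the paper):

# Illustrative one-hidden-layer feature map for a view: f(x) = W2 %*% tanh(W1 %*% x)
make_view_net <- function(p_in, p_hidden, p_out) {
  W1 <- matrix(rnorm(p_hidden * p_in, sd = 0.1), p_hidden, p_in)
  W2 <- matrix(rnorm(p_out * p_hidden, sd = 0.1), p_out, p_hidden)
  function(x) as.vector(W2 %*% tanh(W1 %*% x))
}
f1 <- make_view_net(p_in = 50, p_hidden = 32, p_out = 10)  # view 1 (e.g., words)
f2 <- make_view_net(p_in = 80, p_hidden = 32, p_out = 10)  # view 2 (e.g., images)
# Nonnegative strength of association between x (view 1) and y (view 2):
association <- function(x, y) exp(sum(f1(x) * f2(y)))  # exp(<f1(x), f2(y)>) >= 0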

Probability that node $v_i$ gets a new edge at time $t$: $P(v_i, t) = A_{k_i(t)}\,\eta_i \,/\, \sum_j A_{k_j(t)}\,\eta_j$ (experience $\times$ talent, normalized over existing nodes).
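A small base-R simulation of this growth rule (not the PAFit package API; the power-law PA function $A_k = k^\alpha$ and exponential fitness are illustrative choices):

# Grow a network where node i receives the new edge with probability
# proportional to A(k_i) * eta_i (experience x talent).
set.seed(1)
alpha <- 1.0                          # PA exponent: A_k = k^alpha (illustrative)
A <- function(k) pmax(k, 1)^alpha     # treat degree-0 nodes as k = 1
eta <- rexp(2)                        # node fitness ("talent")
deg <- c(1, 1)                        # start from one edge between nodes 1 and 2
for (t in seq_len(1000)) {
  w <- A(deg) * eta                   # attachment weights
  i <- sample(length(deg), 1, prob = w / sum(w))
  deg[i] <- deg[i] + 1                # chosen existing node gains the new edge
  deg <- c(deg, 1)                    # a new node joins with one edge
  eta <- c(eta, rexp(1))
}
# Estimating A_k and eta jointly from such temporal data is what PAFit does.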

Multimodal Eigenwords (Fukui, Oshikiri and Shimodaira, TextGraphs 2017)

[image] - "brown" + "white" = [image]

[image] - "day" + "night" = [image]

• A multimodal word embedding that jointly embeds words and corresponding visual information

• We employed Cross-Domain Matching Correlation Analysis (CDMCA; Shimodaira 2016) to extend a CCA-based word embedding (Dhillon et al. 2015) to handle complex associations

• Feature learning via graph embedding

• Feature vectors reflect both semantic and visual similarities

• The method enables multimodal vector arithmetic between images and words

[Figures: "Our proposed method" (embedding architecture) and "Multimodal vector arithmetic" (examples).]
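Once words and images share one embedding space, such arithmetic reduces to nearest-neighbor search. A sketch, where the embedding matrix emb and its row names are placeholders:

# Nearest-neighbor lookup for multimodal vector arithmetic in a shared space.
# emb: matrix with one L2-normalized embedding per row (words and images mixed);
# rownames(emb) identify the items. All names here are placeholders.
nearest <- function(emb, query, top = 5) {
  q <- query / sqrt(sum(query^2))          # normalize the query vector
  sims <- as.vector(emb %*% q)             # cosine similarity to every item
  head(rownames(emb)[order(sims, decreasing = TRUE)], top)
}
# e.g., an image embedding minus "brown" plus "white":
# query <- emb["img_brown_bear", ] - emb["brown", ] + emb["white", ]
# nearest(emb, query)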

Mathematical Statistics Team, Shimodaira Lab, Graduate School of Informatics, Kyoto University

• Inductive inference, Resampling methods, Information geometry
• Generalization error under Missing, Covariate-shift, etc.
• Multi-modal data representation, Graph Embedding, and Multivariate Analysis
• Network growth mechanism (Preferential attachment and fitness)
• Phylogenetics, Gene expression (hierarchical clustering)
• Image, word embedding (search and reasoning)

Hidetoshi Shimodaira (Team Leader @ Kyoto Univ), Kazuki Fukui (RA), Akifumi Okuno (RA), Tetsuya Hada (RA), Masaaki Inoue (RA), Geewook Kim (RA), Thong Pham (Postdoctoral Researcher @ AIP), Tomoharu Iwata (Visiting Scientist @ NTT), Yoshikazu Terada (Visiting Scientist @ Osaka Univ), Shinpei Imori (Visiting Scientist @ Hiroshima Univ), Kei Hirose (Visiting Scientist @ Kyushu Univ)

Statistics and Machine Learning: Methodology and Applications

Multi-view feature learning with many-to-many associations via neural networks for predicting new associations (Okuno, Hada and Shimodaira, arXiv:1802.04630)

PAFit: an R Package for Estimating Preferential Attachment and Node Fitness in Temporal Complex Networks

[Figure: Talent vs. Experience, comparing the relative contributions of talent (fitness) and experience (preferential attachment) in networks labeled APP and LFB.]