Page 1: Ph.D. dissertation - Integrative network analysis framework for multiple omics data using information-theoretic measure

Integrative network analysis framework for multiple omics data using information-theoretic measure

Department of Information and Computer Engineering

Hyun-hwan Jeong

May 20th, 2015

Outline

• Introduction
  • Motivation
  • Problem statement
  • Previous studies
  • Main purpose
• Proposed method
  • Mutual information
  • Outcome-guided network construction
  • Integrative network construction
  • Software (MINA)
• Experiments
  • Simulation study
  • Real data analysis
• Conclusion


INTRODUCTION

Motivation

• Interaction network
  • A representation of the entities of a large-scale complex system and their interactions
  • Gene regulatory network, protein-protein interaction network, biological pathway, etc.
• Applications
  • Protein function prediction
  • Disease gene prioritization
  • Clinical outcome prediction
  • …

[Figure: A gene-gene interaction network of ovarian cancer subtype 1 (Hofree et al., 2013)]

Problem statement (1/2)

• Construction of the interaction network from high-dimensional (P features) omics data

Gene expression data (N samples, P genes as features):

Sample ID  X1    X2    …  XP     Disease status
S001       +0.3  -2.0  …  -10.3  O
S002       -0.1  -7.0  …  -11.1  O
S003       +1.2  +3.0  …  -5.0   X
S004       +0.9  +0.5  …  -3.2   X

[Diagram: gene expression data → interaction network]

Problem statement (2/2)

• Multi-omics data integration

“Importantly, integrative interpretation of the data will help identify how the consequences of mutations vary across tissues, with important therapeutic implications” (TCGA et al., 2013)

[Figures: (Kim et al., 2015); (TCGA et al., 2013)]


Previous study (1/2)

Taxonomy of the computational network construction methods

• Interaction network construction by computational approach

Previous study (1/2) (Cont’d)

• Interaction network construction methods using mutual information measure

Previous study (2/2)

• Data integration

(Kim et al., 2015)

Main purpose (1/2)

• Outcome-guided network construction using the mutual information measure

[Figure: MYO3A and SWI5; x-axis: survival month; y-axis: survival rate]

Main purpose (2/2)

• Network analysis framework for data integration

[Diagram labels: multi-omics data with outcome; population; outcome-guided network; single profile networks; integrative network; application]

• Application
  • Module finding
  • Prediction
  • Topology analysis
  • Pathway inference
  • ...


Purpose of the study

• Outcome-guided network construction

• Integration of the outcome-guided network

• Utility of outcome-guided network and integrated network

PROPOSED METHOD

1. Mutual information
2. Outcome-guided network construction
3. Network integration
4. MINA

Mutual information

• Association measure in information theory
• Measures the linear/non-linear association between two random variables
• Widely used in GWAS to measure the strength of association between interactions of SNPs and traits (Leem et al., 2014; Hu et al., 2011)

$$I(X_1, X_2; Y) = H(X_1, X_2) + H(Y) - H(X_1, X_2, Y)$$

$(X_1, X_2)$: pair of features; $Y$: binary outcome; $H(\cdot)$: entropy of its argument
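The formula above can be evaluated directly from empirical frequencies when the features are discrete. The following is a minimal sketch under that assumption (continuous profiles would first need discretization, which is not shown); the function names `entropy` and `pair_outcome_mi` and the toy data are illustrative and are not taken from MINA.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy in bits of a list of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def pair_outcome_mi(x1, x2, y):
    """I(X1, X2; Y) = H(X1, X2) + H(Y) - H(X1, X2, Y) for discrete data."""
    h_pair = entropy(list(zip(x1, x2)))
    h_outcome = entropy(list(y))
    h_joint = entropy(list(zip(x1, x2, y)))
    return h_pair + h_outcome - h_joint


# Toy data (hypothetical): the outcome is the XOR of two binary features.
x1 = [0, 0, 1, 1, 0, 0, 1, 1]
x2 = [0, 1, 0, 1, 0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]

mi_pair = pair_outcome_mi(x1, x2, y)
# Single-feature mutual information I(X1; Y), for comparison.
mi_single = entropy(x1) + entropy(y) - entropy(list(zip(x1, y)))
print(mi_pair, mi_single)
```

In this toy case the pair carries one full bit about the outcome while a single feature carries none, which is the kind of interaction effect a pairwise measure is meant to capture.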

Outcome-guided mutual information network construction

[Figure: distribution of mutual information values; x-axis: mutual information; y-axis: # of edges; cutoff at $\theta(1+\alpha)$]

$$\theta = \max_{i \neq j} I_{\mathrm{avg}}(i, j)$$

$$I_{\mathrm{avg}}(i, j) = \frac{1}{30} \sum_{p=1}^{30} I(g_i, g_j; Y_p)$$

$$G_{\alpha}^{\mathrm{profile}} = \{(g_i, g_j) \mid g_i, g_j \in P \text{ and } I(g_i, g_j; Y) \geq \theta(1+\alpha)\}$$

$\theta$: threshold; $\alpha$: additional parameter; $g_i$: gene; $Y$: outcome; $Y_p$: p-th permuted outcome
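The thresholding rule on this slide can be sketched as follows: permute the outcome 30 times, average the mutual information of each gene pair over the permutations, take the maximum average as θ, and keep the pairs whose mutual information with the real outcome reaches θ(1+α). This is a small illustration assuming discrete features; the function names, the seed handling, and the toy data are hypothetical and do not reproduce the MINA implementation.

```python
import itertools
import math
import random
from collections import Counter


def entropy(symbols):
    """Shannon entropy in bits of a list of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information(x1, x2, y):
    """I(X1, X2; Y) = H(X1, X2) + H(Y) - H(X1, X2, Y)."""
    return (entropy(list(zip(x1, x2))) + entropy(list(y))
            - entropy(list(zip(x1, x2, y))))


def outcome_guided_network(features, outcome, alpha=0.0, n_perm=30, seed=0):
    """features: dict gene -> list of discrete values; outcome: list of labels."""
    rng = random.Random(seed)
    permuted = []
    for _ in range(n_perm):
        shuffled = list(outcome)
        rng.shuffle(shuffled)
        permuted.append(shuffled)
    pairs = list(itertools.combinations(sorted(features), 2))
    # theta: largest permutation-averaged mutual information over all pairs.
    theta = max(
        sum(mutual_information(features[a], features[b], yp) for yp in permuted) / n_perm
        for a, b in pairs
    )
    cutoff = theta * (1 + alpha)
    edges = {(a, b) for a, b in pairs
             if mutual_information(features[a], features[b], outcome) >= cutoff}
    return edges, theta


# Toy data (hypothetical): all combinations of three binary genes, five copies.
rows = list(itertools.product([0, 1], repeat=3)) * 5
features = {name: [r[k] for r in rows] for k, name in enumerate(["g1", "g2", "g3"])}
outcome = [r[0] ^ r[1] for r in rows]  # outcome depends only on the pair (g1, g2)
edges, theta = outcome_guided_network(features, outcome, alpha=0.5)
print(edges, theta)
```

Only the pair that actually determines the outcome survives the cutoff here; larger values of α make the network sparser.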


Network integration (1/3)

• Integration based on edge occurrence

• Integration using network fusion technique

Network integration (2/3)

• Integration based on edge occurrence
  • Integrated network with co-occurrence edges
  • Integrated network with one-or-more occurrence edges
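The two edge-occurrence schemes can be read as set operations on the edge sets of the single-profile networks. The sketch below assumes that "co-occurrence" means an edge is present in every profile network and that "one-or-more occurrence" means it is present in at least one; the function name and the example edges are hypothetical.

```python
def integrate_by_occurrence(networks):
    """networks: list of edge collections, one per genomic profile.

    Edges are treated as undirected by storing each one as a frozenset.
    """
    nets = [{frozenset(edge) for edge in net} for net in networks]
    co_occurrence = set.intersection(*nets)  # edges present in every network
    one_or_more = set.union(*nets)           # edges present in at least one network
    return co_occurrence, one_or_more


# Hypothetical single-profile networks over genes A-D.
cna = [("A", "B"), ("B", "C")]
mrna = [("B", "A"), ("C", "D")]
methylation = [("A", "B"), ("C", "D")]

co_edges, any_edges = integrate_by_occurrence([cna, mrna, methylation])
print(sorted(sorted(e) for e in co_edges))
print(sorted(sorted(e) for e in any_edges))
```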

Network integration (3/3)

• Integration using network fusion technique
• Similarity network fusion (Wang et al., 2014)

[Diagram: outcome-guided networks $P^{(1)}, P^{(2)}$ → fusion iterations $P_t^{(1)}, P_t^{(2)}$ → integrative network]

$$P^{(c)} = \frac{P_t^{(1)} + P_t^{(2)} + \dots + P_t^{(m)}}{m}$$

Legend: affinity matrix; kernel matrix; $P^{(c)}$: final fused network
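As a small illustration of the last step only, the fused network is the element-wise average of the m per-profile matrices after the fusion iterations. The sketch below does not implement the iterative updates of similarity network fusion; it assumes the matrices $P_t^{(k)}$ are already available as square nested lists, and the function name is hypothetical.

```python
def average_matrices(matrices):
    """Element-wise mean of m equally sized square matrices (nested lists)."""
    m = len(matrices)
    n = len(matrices[0])
    return [
        [sum(mat[i][j] for mat in matrices) / m for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical 2x2 matrices standing in for P_t^(1) and P_t^(2).
p1 = [[1.0, 0.0], [0.0, 1.0]]
p2 = [[0.0, 1.0], [1.0, 0.0]]
fused = average_matrices([p1, p2])
print(fused)
```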

Development of integrated network analysis framework

• MINA: Mutual Information-based integrative Network Analysis framework
• Easy to use
  • Preprocessing of input data
  • Construction of outcome-guided networks
  • Network integration
• Written in C++ with the OpenMP library
  • Runs in less than 3 hours for features on desktop
• Open source program
  • Publicly available at https://github.com/hhjeong/MINA


Overview of MINA

Experiments

• Simulation study
• Real data analysis
  • KARE dataset
  • TCGA dataset

Simulation study (1/2)

• Five simulated models from Ma et al. (2011)
  • Expression data with 50 genes, 200 samples
  • Generated from a multivariate normal distribution
  • Two disease statuses (affected/unaffected)
• Ground truth of the simulated models
  • Fully connected sub-network of 25 genes (scenarios 1-4)
  • Multiple sub-networks (mixed model, scenario 5)
• Performance measure (accuracy)
  • How many edges of the sub-network are correctly found?
• Comparisons with previous methods
  • ARACNE, CLR, MRNET, MRNETB

Simulation study (2/2)

• Performance assessment (accuracy)

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

$TP$: number of edges correctly found; $TN$: number of edges correctly not found
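The accuracy measure can be computed by checking every possible gene pair against the ground-truth sub-network. The following sketch assumes undirected edges and counts all unordered pairs of nodes; the function name and the four-gene example are hypothetical.

```python
import itertools


def edge_accuracy(nodes, true_edges, predicted_edges):
    """Accuracy = (TP + TN) / (TP + FP + FN + TN) over all unordered node pairs."""
    truth = {frozenset(e) for e in true_edges}
    predicted = {frozenset(e) for e in predicted_edges}
    tp = tn = fp = fn = 0
    for pair in itertools.combinations(nodes, 2):
        key = frozenset(pair)
        if key in truth and key in predicted:
            tp += 1  # edge correctly found
        elif key not in truth and key not in predicted:
            tn += 1  # edge correctly not found
        elif key in predicted:
            fp += 1  # edge reported but absent from the ground truth
        else:
            fn += 1  # ground-truth edge that was missed
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical example with four genes (six possible edges).
nodes = ["a", "b", "c", "d"]
true_edges = [("a", "b"), ("a", "c"), ("b", "c")]
predicted_edges = [("a", "b"), ("c", "a"), ("c", "d")]
accuracy = edge_accuracy(nodes, true_edges, predicted_edges)
print(accuracy)
```

Here two edges are correctly found, two non-edges are correctly left out, one edge is spurious, and one is missed, so the accuracy is 4/6.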


Real data analysis (1): Gastritis dataset in KARE project

• Korean Association REsource project
• 185,426 single nucleotide polymorphisms (SNPs)
  • 3 types of genotype: AA, Aa and aa
• 3,770 samples for gastritis
  • Affected/unaffected by the disease history of the samples

Mapping the SNP-SNP associations onto gene-gene interactions

KARE dataset - significant pairs

• Approximately 2-4% of all possible pairs in the chromosomes were significant.

KARE dataset – network topology

Network centralization: 0.953
Clustering coefficient: 0.848

[Figure panels: Chromosome 18; Chromosome 20]

KARE dataset - Functionality assessment

• 18 enriched GO terms detected in the study

Chromosome  Category       Term                                                   Fold enrichment  FDR
1           GOTERM_CC_FAT  GO:0005886~plasma membrane                             1.35             1.93E-02
1           GOTERM_CC_FAT  GO:0044459~plasma membrane part                        1.51             1.97E-02
2           GOTERM_MF_FAT  GO:0004908~interleukin-1 receptor activity             34.22            8.88E-03
7           GOTERM_CC_FAT  GO:0042995~cell projection                             2.67             2.47E-02
12          GOTERM_CC_FAT  GO:0005626~insoluble fraction                          2.40             2.98E-02
12          GOTERM_BP_FAT  GO:0006811~ion transport                               2.65             3.23E-02
16          GOTERM_MF_FAT  GO:0005509~calcium ion binding                         2.80             5.14E-03
17          GOTERM_MF_FAT  GO:0003774~motor activity                              5.88             2.38E-02
19          GOTERM_BP_FAT  GO:0006350~transcription                               2.70             5.40E-14
19          GOTERM_BP_FAT  GO:0051252~regulation of RNA metabolic process         2.79             1.61E-12
19          GOTERM_BP_FAT  GO:0045449~regulation of transcription                 2.35             5.17E-12
19          GOTERM_BP_FAT  GO:0006355~regulation of transcription, DNA-dependent  2.77             8.93E-12
19          GOTERM_MF_FAT  GO:0003677~DNA binding                                 2.14             1.27E-07
19          GOTERM_MF_FAT  GO:0008270~zinc ion binding                            2.10             7.07E-07
19          GOTERM_MF_FAT  GO:0046914~transition metal ion binding                1.92             2.99E-06
19          GOTERM_MF_FAT  GO:0046872~metal ion binding                           1.64             1.98E-05
19          GOTERM_MF_FAT  GO:0043169~cation binding                              1.62             3.31E-05
19          GOTERM_MF_FAT  GO:0043167~ion binding                                 1.60             7.36E-05


Real data analysis (2) – Ovarian cancer dataset in TCGA

• Measurements for 10,022 genes of 340 cancer patients at three different genomic levels
• Survival month classification:
  • Short-term (<36 months), long-term (otherwise)

Genomic profile  Platform                              Data type
CNA              Affymetrix SNP 6                      Discrete (by GISTIC 2.0)
mRNA             Agilent microarray                    Continuous
Methylation      Illumina Infinium HumanMethylation27  Continuous

TCGA dataset - Distribution of mutual information for different genomic levels

• Cumulative distribution of mutual information values for each genomic level.

[Figure: x-axis: mutual information; y-axis: $\log_{10}|E|$]

TCGA dataset - Significance for survivability of the association

• Kaplan-Meier estimator used to verify the statistical significance of the associations
• Effects of the extracted associations show higher significance than the effects of single genes.

[Figure: x-axis: $\alpha$; y-axis: $-\log_{10}(p\text{-value})$]
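For readers unfamiliar with the Kaplan-Meier estimator, the product-limit form $S(t) = \prod_{t_i \le t} (1 - d_i / n_i)$ can be sketched in a few lines, where $d_i$ is the number of deaths at time $t_i$ and $n_i$ the number of patients still at risk. This sketch only estimates a survival curve; the significance test applied to the curves is not specified on the slide and is not shown here. The function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed death.

    times: follow-up durations (e.g. survival months)
    events: 1 if the death was observed, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Hypothetical follow-up data for six patients.
months = [5, 8, 12, 12, 20, 30]
observed = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(months, observed)
print(curve)
```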

TCGA dataset – survival analysis with the outcome-guided network

Network-based Cox regression (Zhang et al., 2013)
Prediction of survivability

[Figure: applying outcome-guided network; x-axis: survival month; y-axis: survival rate]

TCGA dataset - Network-based Cox regression

• Comparison of prediction power between the association network and the interaction network for each profile

[Figure: y-axis: mean(timeAUC); x-axis: CNA, mRNA, methylation]

TCGA dataset – network topology (integration based on edge occurrence)

• The integrated network construction scheme shows a greatly enhanced level of scale-freeness.

$R^2$ values for the networks: 0.745, 0.749, 0.842, 0.950

$R^2$: coefficient of determination (model-fitting index)
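One common way to obtain a model-fitting index for scale-freeness is to fit a straight line to the degree distribution on log-log axes and report the coefficient of determination. The slide does not state how its values were computed, so the sketch below is an assumption: an ordinary least-squares fit of log10(number of nodes) against log10(degree). The function name and the small example graph are hypothetical.

```python
import math
from collections import Counter


def power_law_r2(edges):
    """Fit log10(count) = intercept + slope * log10(degree); return (slope, R^2)."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    freq = Counter(degree.values())  # degree k -> number of nodes with degree k
    xs = [math.log10(k) for k in freq]
    ys = [math.log10(c) for c in freq.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1 - ss_res / ss_tot


# Hypothetical graph: one node of degree 4, two of degree 2, four of degree 1.
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d"), ("a", "e"), ("b", "f")]
slope, r2 = power_law_r2(edges)
print(slope, r2)
```

The example graph follows an exact inverse power law, so the fit is perfect; real networks need at least two distinct degree values for the fit to be defined and typically give a value below 1.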

TCGA dataset - Functionality assessment (integration based on edge occurrence)

• Comparison of enriched GO terms between the one-or-more occurrence network and the single genomic level networks

[Figure: three panels, each labeled $I_{\exists}$]

TCGA dataset – Spectral clustering (integration using network fusion technique)

[Diagram: integrative network → spectral clustering → common modules]

TCGA dataset – Spectral clustering (integration using network fusion technique)

• Enrichment test for co-expression terms in MSigDB

Cluster number  Number of genes  Representative enrichment terms
1               3,041            Genes whose expression in suboptimally debulked ovarian tumors is associated with survival prognosis.
2               1,675            Genes up-regulated in epithelial ovarian cancer (EOC) biopsies: invasive (TOV) vs low malignant potential (LMP) tumors.
3               2,607            Genes up-regulated in SKOV3ip1 cells (ovarian cancer) upon knockdown of EZH2 by RNAi.
4               2,699            Genes down-regulated in SKOC-3 cells (ovarian cancer) after YB-1 (YBX1) knockdown by RNAi.

Conclusion

• Contributions
  • Outcome-guided network construction
    • Shows better statistical significance & biological functionality detection
    • Improves survivability prediction power
  • Integration of the outcome-guided network
    • Greatly enhances the biological significance
  • Development of software
• Future works
  • Outcome prediction with the integrative network
  • Software improvement
  • Application to other domains


PUBLICATIONS


Publications (1/2)

1. Hyun-hwan Jeong, So Yeon Kim, Kyubum Wee and Kyung-Ah Sohn. “Investigating the Utility of Clinical Outcome-guided Mutual Information Network in Network-based Cox Regression”. BMC Systems Biology (2015).

2. Hyun-hwan Jeong and Kyung-Ah Sohn. “Relevance epistasis network of gastritis for intra-chromosomes in the KARE cohort study”. Genomics & Informatics (2014).

3. Sangseob Leem, Hyun-hwan Jeong, Jungseob Lee, Kyubum Wee, Kyung-Ah Sohn. “Fast detection of high-order epistatic interactions in genome-wide association studies using information theoretic measure”. Computational Biology and Chemistry (2014).

4. Kyung-Ah Sohn, Joshua Ho, Djordje Djordjevic, Hyun-hwan Jeong, Peter Park, Ju Han Kim. “hiHMM: Bayesian non-parametric joint inference of chromatin state maps”. Bioinformatics (advance access).


Publications (2/2)

5. Hyun-hwan Jeong, Sangseob Leem, Kyubum Wee and Kyung-Ah Sohn. “Integrative network analysis for survival-associated gene-gene interactions across multiple genomic profiles in ovarian cancer”. Journal of Ovarian Research (in revision)

6. Sangseob Leem, Hyun-hwan Jeong, Jungseob Lee, Kyubum Wee, Kyung-Ah Sohn. “MIBE: a software package for fast detection and interpretation of high-order epistatic interactions in genome-wide association study”. BioData Mining (submitted)

7. Jaeyeon Lee, Ho-min Park, Hyun-hwan Jeong, Kyung-Ah Sohn. “RecPAL: Recommending Problems of Adequate Level for Personalized Learning”. Expert Systems (submitted)


Conference presentations

1. Hyun-hwan Jeong, Garam Lee, Kyung-Ah Sohn. “Integrative analysis for outcome-guided gene networks from multiple omics profiles”. ISMB 2015 (Poster, to be presented).

2. Hyun-hwan Jeong, So Yeon Kim, Kyubum Wee and Kyung-Ah Sohn. “Investigating the Utility of Clinical Outcome-guided Mutual Information Network in Network-based Cox Regression”. APBC 2015 (Oral).

3. Hyun-hwan Jeong, So Yeon Kim, Kyubum Wee, Kyung-Ah Sohn. “Outcome-guided mutual information networks for investigating gene-gene interaction effects on clinical outcomes”. ISB/TBC 2014 (Poster).

4. Hyun-hwan Jeong, Kyubum Wee, Kyung-Ah Sohn. “Detection of pair-wise genomic interactions associated with clinical outcome in ovarian cancer patients using information theoretic measure”. APBC 2014 (Poster).

5. Hyun-Hwan Jeong, Sangseob Leem, Kyubum Wee. “High-order epistatic interaction detection using clique finding algorithm in genome-wide association studies”. TBC/ISCB 2013 (Poster).


References (1/2)

• Hofree et al., “Network-based stratification of tumor mutations”, Nature Methods 2013, 10:1108-1115.

• TCGA et al., “The Cancer Genome Atlas Pan-Cancer analysis project”, Nature Genetics 2013, 45:1113-1120.

• Kim et al., “Methods of integrating data to uncover genotype-phenotype interactions”, Nature Reviews Genetics 2015, 16:85-97.

• Ma et al., “COSINE: Condition-Specific sub-network identification using a global optimization method”, Bioinformatics 2011, 27(9):1290-1298.

• Wang et al., “Similarity network fusion for aggregating data types on a genomic scale”, Nature Methods 2014, 11:333-337.


References (2/2)

• Zhang et al., “Network-based survival analysis reveals subnetwork signatures for predicting outcomes of ovarian cancer treatment”, PLoS Computational Biology 2013, 9(3):e1002975.

• Butte AJ, Kohane IS, “Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements”, Pacific Symp Biocomput 2000:418–429.

• Margolin et al., “ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context”, BMC Bioinformatics 2006, 7 Suppl 1:S7.

• Meyer et al., “MINET: A R/Bioconductor package for inferring large transcriptional networks using mutual information”, BMC Bioinformatics 2008, 9:461.

Thank you