ph.d. dissertation - integrative network analysis framework for multiple omics data using...

Post on 05-Aug-2015


Integrative network analysis framework for multiple omics data using information-theoretic measure

Department of Information and Computer Engineering

Hyun-hwan Jeong

May 20th, 2015

Outline

• Introduction
  • Motivation
  • Problem statement
  • Previous studies
  • Main purpose

• Proposed method
  • Mutual information
  • Outcome-guided network construction
  • Integrative network construction
  • Software (MINA)

• Experiments
  • Simulation study
  • Real data analysis

• Conclusion


INTRODUCTION

Motivation

• Interaction network
  • A representation of the entities of a large-scale complex system and the interactions among them
  • Examples: gene regulatory network, protein-protein interaction network, biological pathway, etc.

• Application
  • Protein function prediction
  • Disease gene prioritization
  • Clinical outcome prediction
  • …

A gene-gene interaction network of ovarian cancer subtype 1 (Hofree et al. 2013)

Problem statement (1/2)

• Construction of the interaction network from high-dimensional (P features) omics data

Gene expression data (N samples, P genes/features):

Sample ID   X1     X2     …   XP      Disease status
S001        +0.3   -2.0   …   -10.3   O
S002        -0.1   -7.0   …   -11.1   O
S003        +1.2   +3.0   …   -5.0    X
S004        +0.9   +0.5   …   -3.2    X

Gene expression data → Interaction network

Problem statement (2/2)

• Multi-omics data integration

“Importantly, integrative interpretation of the data will help identify how the consequences of mutations vary across tissues, with important therapeutic implications” (TCGA et al., 2013)

Figures: (Kim et al., 2015), (TCGA et al., 2013)

Previous study (1/2)

• Interaction network construction by computational approach

Figure: Taxonomy of the computational network construction methods

Previous study (1/2) (Cont’d)

• Interaction network construction methods using the mutual information measure

Previous study (2/2)

• Data integration (Kim et al., 2015)

Main Purpose (1/2)

• Outcome-guided network construction using the mutual information measure

Figure: survival rate (y-axis) versus survival month (x-axis) for the gene pair MYO3A and SWI5

Main Purpose (2/2)

• Network analysis framework for data integration

Figure (workflow): Multi-omics data with outcome (population) → Outcome-guided network (single profile networks) → Integrative network → Application

• Application
  • Module finding
  • Prediction
  • Topology analysis
  • Pathway inference
  • ...


Purpose of the study

• Outcome-guided network construction

• Integration of the outcome-guided network

• Utility of outcome-guided network and integrated network

PROPOSED METHOD

1. Mutual information
2. Outcome-guided network construction
3. Network integration
4. MINA

Mutual information

• Association measure in information theory
• Measures the linear/non-linear association between two random variables
• Widely used in GWAS to measure the strength of association between interactions of SNPs and traits (Leem et al., 2014; Hu et al., 2011)

$$I(X_1, X_2; Y) = H(X_1, X_2) + H(Y) - H(X_1, X_2, Y)$$

$(X_1, X_2)$: pair of features; $Y$: binary outcome; $H(\cdot)$: entropy of its argument
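The identity I(X1, X2; Y) = H(X1, X2) + H(Y) - H(X1, X2, Y) can be evaluated directly from counts when the features are discrete. The following is a minimal sketch, not the MINA implementation. It assumes already-discretized features (continuous profiles would need binning first) and uses plug-in entropy estimates in bits. The function names and the toy data are illustrative only.

```python
import math
from collections import Counter


def entropy(symbols):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def pair_outcome_mi(x1, x2, y):
    """I(X1, X2; Y) = H(X1, X2) + H(Y) - H(X1, X2, Y) for discrete data."""
    pair = list(zip(x1, x2))        # joint symbol for the feature pair
    triple = list(zip(x1, x2, y))   # joint symbol for pair and outcome
    return entropy(pair) + entropy(list(y)) - entropy(triple)


# Toy XOR-like data: neither feature alone is informative about y,
# but the pair determines y completely.
x1 = [0, 0, 1, 1, 0, 0, 1, 1]
x2 = [0, 1, 0, 1, 0, 1, 0, 1]
y = [0, 1, 1, 0, 0, 1, 1, 0]

mi_pair = pair_outcome_mi(x1, x2, y)
mi_single = entropy(x1) + entropy(y) - entropy(list(zip(x1, y)))
print(mi_pair, mi_single)
```

The toy data illustrates why a pairwise measure is useful: the association is only visible when both features are considered together.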

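Outcome-guided mutual information network construction

Figure: distribution of mutual information values (x-axis: mutual information; y-axis: # of edges), with the cut-offs $\theta$ and $\theta(1+\alpha)$ marked

$$\theta = \max_{i \neq j} I_{\mathrm{avg}}(i, j)$$

$$I_{\mathrm{avg}}(i, j) = \frac{1}{30} \sum_{p=1}^{30} I(g_i, g_j; Y_p)$$

$$G_{\alpha}^{\mathrm{profile}} = \{ (g_i, g_j) \mid g_i, g_j \in P \text{ and } I(g_i, g_j; Y) \geq \theta(1+\alpha) \}$$

$\theta$: threshold; $\alpha$: additional parameter; $g$: gene; $Y$: outcome; $Y_p$: p-th permuted outcome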
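The thresholding scheme on this slide can be sketched as follows. This is an illustrative sketch under stated assumptions, not the MINA code: features are discrete, the threshold is the largest permutation-averaged mutual information over all gene pairs (30 permuted outcomes), and an edge is kept when its mutual information with the real outcome reaches the threshold times (1 + alpha). The function names, gene names, and toy data are hypothetical.

```python
import itertools
import math
import random
from collections import Counter


def entropy(symbols):
    """Plug-in Shannon entropy (bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mi(a, b, y):
    """I(a, b; y) = H(a, b) + H(y) - H(a, b, y)."""
    return (entropy(list(zip(a, b))) + entropy(list(y))
            - entropy(list(zip(a, b, y))))


def outcome_guided_edges(data, y, alpha=0.0, n_perm=30, seed=0):
    """data: dict gene -> list of discrete values; y: list of outcomes."""
    rng = random.Random(seed)
    permuted = []
    for _ in range(n_perm):
        yp = list(y)
        rng.shuffle(yp)          # p-th permuted outcome Y_p
        permuted.append(yp)

    pairs = list(itertools.combinations(sorted(data), 2))
    # I_avg(i, j): mean MI of each pair against the permuted outcomes
    i_avg = {
        (gi, gj): sum(mi(data[gi], data[gj], yp) for yp in permuted) / n_perm
        for gi, gj in pairs
    }
    theta = max(i_avg.values())
    cutoff = theta * (1 + alpha)
    edges = {(gi, gj) for gi, gj in pairs
             if mi(data[gi], data[gj], y) >= cutoff}
    return edges, theta


# Toy data: the outcome is the XOR of g1 and g2; g3 is unrelated.
g1 = [0] * 8 + [1] * 8
g2 = ([0] * 4 + [1] * 4) * 2
g3 = [0, 1] * 8
outcome = [a ^ b for a, b in zip(g1, g2)]

edges, theta = outcome_guided_edges({"g1": g1, "g2": g2, "g3": g3}, outcome)
print(edges, theta)
```

Raising alpha makes the cut-off stricter and therefore yields a sparser network.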


Network integration (1/3)

• Integration based on edge occurrence

• Integration using network fusion technique

Network integration (2/3)

• Integration based on edge occurrence
  • Co-occurrence network: integrated network with co-occurrence edges
  • One-or-more-occurrence network: integrated network with one-or-more occurrence edges
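Edge-occurrence integration can be sketched with plain set operations. The sketch assumes that "co-occurrence" means an edge appears in every single-profile network (set intersection) and that "one-or-more occurrence" means it appears in at least one (set union). The slides do not spell out the exact definition, so treat both readings as assumptions. The function names and gene names are made up for illustration.

```python
def normalize(edges):
    """Treat edges as undirected: store each one as a sorted gene pair."""
    return {tuple(sorted(edge)) for edge in edges}


def integrate_by_occurrence(networks):
    """networks: list of edge collections, one per genomic profile."""
    edge_sets = [normalize(network) for network in networks]
    co_occurrence = set.intersection(*edge_sets)   # edge in every profile
    one_or_more = set.union(*edge_sets)            # edge in at least one
    return co_occurrence, one_or_more


# Hypothetical single-profile networks (gene names are placeholders).
cna = [("A", "B"), ("B", "C")]
mrna = [("B", "A"), ("C", "D")]
methylation = [("A", "B"), ("B", "C"), ("C", "D")]

co_net, any_net = integrate_by_occurrence([cna, mrna, methylation])
print(co_net, any_net)
```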

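Network integration (3/3)

• Integration using network fusion technique: similarity network fusion (Wang et al., 2014)

Figure: Outcome-guided networks → Fusion iterations → Integrative network

$$P^{(c)} = \frac{P_t^{(1)} + P_t^{(2)} + \dots + P_t^{(m)}}{m}$$

$P^{(v)}$: affinity matrix of network $v$ ($P_t^{(v)}$ after $t$ fusion iterations); $P^{(c)}$: final fused network. Each network also has a kernel matrix that drives the fusion iterations.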
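The fusion step can be illustrated with a simplified sketch in the spirit of similarity network fusion (Wang et al., 2014). It is not the reference implementation. It assumes dense, strictly positive, symmetric edge-weight matrices, uses plain row normalization for the affinity matrices, and builds each kernel matrix from the k strongest neighbors per node. The names `knn_kernel` and `fuse` and the random inputs are illustrative only.

```python
import numpy as np


def row_normalize(m):
    """Scale every row so that it sums to one."""
    return m / m.sum(axis=1, keepdims=True)


def knn_kernel(w, k):
    """Keep each row's k largest off-diagonal weights, then row-normalize."""
    w = np.asarray(w, dtype=float).copy()
    np.fill_diagonal(w, 0.0)
    kernel = np.zeros_like(w)
    idx = np.argsort(-w, axis=1)[:, :k]          # k strongest neighbors
    rows = np.arange(w.shape[0])[:, None]
    kernel[rows, idx] = w[rows, idx]
    return row_normalize(kernel)


def fuse(weights, k=2, iterations=10):
    """Fuse m >= 2 weight matrices; returns the average of the final states."""
    p = [row_normalize(np.asarray(w, dtype=float)) for w in weights]
    s = [knn_kernel(w, k) for w in weights]
    m = len(weights)
    for _ in range(iterations):
        updated = []
        for v in range(m):
            # Average state of all other networks, diffused through kernel v
            others = sum(p[u] for u in range(m) if u != v) / (m - 1)
            updated.append(row_normalize(s[v] @ others @ s[v].T))
        p = updated
    return sum(p) / m                            # P^(c): mean of P_t^(v)


rng = np.random.default_rng(0)
w_list = []
for _ in range(3):
    a = rng.random((5, 5)) + 0.1
    w_list.append((a + a.T) / 2)                 # symmetric, positive weights

fused = fuse(w_list, k=2, iterations=5)
print(fused.round(3))
```

With `iterations=0` the function reduces to the plain average of the normalized input networks, which makes the effect of the fusion iterations easy to compare.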

Development of integrated network analysis framework

• MINA: Mutual Information-based integrative Network Analysis framework
• Easy to use
  • Preprocessing of input data
  • Construction of outcome-guided networks
  • Network integration
• Written in C++ with the OpenMP library
  • Runs in less than 3 hours for … features on a desktop
• Open source program
  • Publicly available at https://github.com/hhjeong/MINA


Overview of MINA

Experiments

• Simulation study
• Real data analysis
  • KARE dataset
  • TCGA dataset

Simulation study (1/2)

• Five simulated models from Ma et al. (2011)
  • Expression data with 50 genes, 200 samples
  • Generated from a multivariate normal distribution
  • Two disease statuses (affected/unaffected)
• Ground truth of the simulated models
  • Fully connected sub-network of 25 genes (scenarios 1–4)
  • Multiple sub-networks (mixed model, scenario 5)
• Performance measure (accuracy)
  • How many edges of the sub-network are correctly found?
• Comparisons with previous methods
  • ARACNE, CLR, MRNET, MRNETB

Simulation study (2/2)

• Performance assessment (accuracy)

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

$TP$: number of edges correctly found; $TN$: number of edges correctly not found
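The accuracy measure can be computed over all possible gene pairs. In this minimal sketch, every unordered pair of genes counts as one classification decision (edge or no edge), so TP + FP + FN + TN equals the number of gene pairs. The function name and the four-gene example are illustrative and are not taken from the simulation study.

```python
from itertools import combinations


def edge_accuracy(genes, true_edges, predicted_edges):
    """Accuracy = (TP + TN) / (TP + FP + FN + TN) over all gene pairs."""
    true_set = {frozenset(e) for e in true_edges}
    pred_set = {frozenset(e) for e in predicted_edges}
    all_pairs = {frozenset(p) for p in combinations(genes, 2)}

    tp = len(true_set & pred_set)                  # edges correctly found
    tn = len(all_pairs - true_set - pred_set)      # edges correctly not found
    return (tp + tn) / len(all_pairs)


genes = ["A", "B", "C", "D"]
truth = [("A", "B"), ("A", "C"), ("B", "C")]
predicted = [("A", "B"), ("C", "A"), ("C", "D")]

acc = edge_accuracy(genes, truth, predicted)
print(acc)
```

In this example there are six gene pairs, two true positives and two true negatives, so four of the six decisions are correct.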


Real data analysis (1): Gastritis dataset in KARE project

• Korean Association REsource project
• 185,426 single nucleotide polymorphisms (SNPs)
  • 3 types of genotype: AA, Aa and aa
• 3,770 samples for gastritis
  • Affected/unaffected by the disease history of the samples

Figure: Mapping the SNP-SNP associations onto the gene-gene interaction

KARE dataset - significant pairs

• Approximately 2–4% of all possible pairs in the chromosomes were significant.

KARE dataset – network topology

Network centralization: 0.953
Clustering coefficient: 0.848

Figure panels: Chromosome 18, Chromosome 20

KARE dataset - Functionality assessment

• 18 enriched GO terms detected in the study

Chromosome  Category       Term                                                   Fold enrichment  FDR
1           GOTERM_CC_FAT  GO:0005886~plasma membrane                             1.35             1.93E-02
1           GOTERM_CC_FAT  GO:0044459~plasma membrane part                        1.51             1.97E-02
2           GOTERM_MF_FAT  GO:0004908~interleukin-1 receptor activity             34.22            8.88E-03
7           GOTERM_CC_FAT  GO:0042995~cell projection                             2.67             2.47E-02
12          GOTERM_CC_FAT  GO:0005626~insoluble fraction                          2.40             2.98E-02
12          GOTERM_BP_FAT  GO:0006811~ion transport                               2.65             3.23E-02
16          GOTERM_MF_FAT  GO:0005509~calcium ion binding                         2.80             5.14E-03
17          GOTERM_MF_FAT  GO:0003774~motor activity                              5.88             2.38E-02
19          GOTERM_BP_FAT  GO:0006350~transcription                               2.70             5.40E-14
19          GOTERM_BP_FAT  GO:0051252~regulation of RNA metabolic process         2.79             1.61E-12
19          GOTERM_BP_FAT  GO:0045449~regulation of transcription                 2.35             5.17E-12
19          GOTERM_BP_FAT  GO:0006355~regulation of transcription, DNA-dependent  2.77             8.93E-12
19          GOTERM_MF_FAT  GO:0003677~DNA binding                                 2.14             1.27E-07
19          GOTERM_MF_FAT  GO:0008270~zinc ion binding                            2.10             7.07E-07
19          GOTERM_MF_FAT  GO:0046914~transition metal ion binding                1.92             2.99E-06
19          GOTERM_MF_FAT  GO:0046872~metal ion binding                           1.64             1.98E-05
19          GOTERM_MF_FAT  GO:0043169~cation binding                              1.62             3.31E-05
19          GOTERM_MF_FAT  GO:0043167~ion binding                                 1.60             7.36E-05


Real data analysis (2) – Ovarian cancer dataset in TCGA

• Measurements for 10,022 genes of 340 cancer patients at three different genomic levels
• Survival month classification:
  • Short-term (< 36 months), long-term (otherwise)

Genomic profile  Platform                              Data type
CNA              Affymetrix SNP 6                      Discrete (by GISTIC 2.0)
mRNA             Agilent microarray                    Continuous
Methylation      Illumina Infinium HumanMethylation27  Continuous

TCGA dataset - Distribution of mutual information for different genomic levels

• Cumulative distribution of the mutual information value for each genomic level.

Figure: x-axis: mutual information; y-axis: $\log_{10}|E|$

TCGA dataset - Significance for survivability of the association

• Kaplan-Meier estimator used to verify the statistical significance of the associations
• Effects of the extracted associations (with … and …) show higher significance than the effects of single genes.

Figure: x-axis: $\alpha$; y-axis: $-\log_{10}(p\text{-value})$
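For reference, the Kaplan-Meier estimator multiplies, at each observed event time, the fraction of at-risk patients who survive that time. The sketch below shows only the estimator, not the significance test built on it. It assumes follow-up times in months and an event flag (1 = death observed, 0 = censored). The function name and the five-patient example are made up for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times: follow-up times; events: 1 = event observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends exactly at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed      # events and censored cases leave the risk set
    return curve


months = [5, 10, 10, 20, 30]
observed = [1, 1, 0, 1, 0]
curve = kaplan_meier(months, observed)
print(curve)
```

Comparing such curves between patient groups (for example, groups defined by a gene pair) is what a significance test such as the log-rank test would then assess.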

TCGA dataset – survival analysis with the outcome-guided network

Network-based Cox regression (Zhang et al., 2013)
Prediction of survivability

Figure: survival rate (y-axis) versus survival month (x-axis), applying the outcome-guided network

TCGA dataset - Network-based Cox-regression

• Comparison of prediction power between the association network and the interaction network for each profile

Figure: y-axis: mean(timeAUC); x-axis: CNA, mRNA, methylation

TCGA dataset – network topology (Integration based on edge occurrence)

• The integrated network construction scheme shows a greatly enhanced level of scale-freeness.

Figure: $R^2$ values shown for the networks: 0.745, 0.749, 0.842, 0.950

$R^2$: coefficient of determination (model-fitting index)
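One common way to quantify scale-freeness is to fit a straight line to the degree distribution on log-log axes and report the coefficient of determination of that fit. The slides do not state how the index was computed, so the sketch below is an assumption about the procedure, not a reproduction of it. The function name and the toy degree sequence are illustrative.

```python
import math
from collections import Counter


def power_law_r2(degrees):
    """R^2 of a least-squares line fitted to log10 P(k) versus log10 k."""
    counts = Counter(d for d in degrees if d > 0)
    n = sum(counts.values())
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c / n) for c in counts.values()]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy degree sequence whose counts follow k^-2 exactly (64, 16, 4, 1 nodes).
toy_degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
r2 = power_law_r2(toy_degrees)
print(r2)
```

A value close to 1 indicates that the degree distribution is well described by a power law. The sketch needs at least two distinct degrees with differing frequencies to be defined.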

TCGA dataset - Functionality assessment (Integration based on edge occurrence)

• Comparison of enriched GO terms between the one-or-more occurrence network and the single genomic level networks

Figure: three panels, each labeled I∃

TCGA dataset – Spectral clustering (Integration using network fusion technique)

Figure: Integrative network → Spectral clustering → Common modules

TCGA dataset – Spectral clustering (Integration using network fusion technique)

• Enrichment test for co-expression terms in MSigDB

Cluster number  Number of genes  Represented enrichment terms
1               3,041            Genes whose expression in suboptimally debulked ovarian tumors is associated with survival prognosis.
2               1,675            Genes up-regulated in epithelial ovarian cancer (EOC) biopsies: invasive (TOV) vs low malignant potential (LMP) tumors.
3               2,607            Genes up-regulated in SKOV3ip1 cells (ovarian cancer) upon knockdown of EZH2 by RNAi.
4               2,699            Genes down-regulated in SKOC-3 cells (ovarian cancer) after YB-1 (YBX1) knockdown by RNAi.

Conclusion

• Contributions
  • Outcome-guided network construction
    • Shows better statistical significance & biological functionality detection
    • Improves survivability prediction power
  • Integration of the outcome-guided network
    • Greatly enhances the biological significance
  • Development of software

• Future works
  • Outcome prediction with the integrative network
  • Software improvement
  • Application to other domains


PUBLICATIONS

Publications (1/2)

1. Hyun-hwan Jeong, So Yeon Kim, Kyubum Wee and Kyung-Ah Sohn. “Investigating the Utility of Clinical Outcome-guided Mutual Information Network in Network-based Cox Regression”. BMC Systems Biology (2015).

2. Hyun-hwan Jeong and Kyung-Ah Sohn. “Relevance epistasis network of gastritis for intra-chromosomes in the KARE cohort study”. Genomics & Informatics (2014).

3. Sangseob Leem, Hyun-hwan Jeong, Jungseob Lee, Kyubum Wee, Kyung-Ah Sohn. “Fast detection of high-order epistatic interactions in genome-wide association studies using information theoretic measure”. Computational Biology and Chemistry (2014).

4. Kyung-Ah Sohn, Joshua Ho, Djordje Djordjevic, Hyun-hwan Jeong, Peter Park, Ju Han Kim. “hiHMM: Bayesian non-parametric joint inference of chromatin state maps”. Bioinformatics (advance access).

Publications (2/2)

5. Hyun-hwan Jeong, Sangseob Leem, Kyubum Wee and Kyung-Ah Sohn. “Integrative network analysis for survival-associated gene-gene interactions across multiple genomic profiles in ovarian cancer”. Journal of Ovarian Research (in revision).

6. Sangseob Leem, Hyun-hwan Jeong, Jungseob Lee, Kyubum Wee, Kyung-Ah Sohn. “MIBE: a software package for fast detection and interpretation of high-order epistatic interactions in genome-wide association study”. BioData Mining (submitted).

7. Jaeyeon Lee, Ho-min Park, Hyun-hwan Jeong, Kyung-Ah Sohn. “RecPAL: Recommending Problems of Adequate Level for Personalized Learning”. Expert Systems (submitted).

Conference presentations

1. Hyun-hwan Jeong, Garam Lee, Kyung-Ah Sohn. “Integrative analysis for outcome-guided gene networks from multiple omics profiles”. ISMB 2015 (Poster, to be presented).

2. Hyun-hwan Jeong, So Yeon Kim, Kyubum Wee and Kyung-Ah Sohn. “Investigating the Utility of Clinical Outcome-guided Mutual Information Network in Network-based Cox Regression”. APBC 2015 (Oral).

3. Hyun-hwan Jeong, So Yeon Kim, Kyubum Wee, Kyung-Ah Sohn. “Outcome-guided mutual information networks for investigating gene-gene interaction effects on clinical outcomes”. ISB/TBC 2014 (Poster).

4. Hyun-hwan Jeong, Kyubum Wee, Kyung-Ah Sohn. “Detection of pair-wise genomic interactions associated with clinical outcome in ovarian cancer patients using information theoretic measure”. APBC 2014 (Poster).

5. Hyun-Hwan Jeong, Sangseob Leem, Kyubum Wee. “High-order epistatic interaction detection using clique finding algorithm in genome-wide association studies”. TBC/ISCB 2013 (Poster).

References (1/2)

• Hofree et al., “Network-based stratification of tumor mutations”, Nature Methods 2013, 10:1108-1115.

• TCGA et al., “The Cancer Genome Atlas Pan-Cancer analysis project”, Nature Genetics 2013, 45:1113-1120.

• Kim et al., “Methods of integrating data to uncover genotype-phenotype interactions”, Nature Reviews Genetics 2015, 16:85-97.

• Ma et al., “COSINE: Condition-Specific sub-network identification using a global optimization method”, Bioinformatics 2011, 27(9):1290-1298.

• Wang et al., “Similarity network fusion for aggregating data types on a genomic scale”, Nature Methods 2014, 11:333-337.

References (2/2)

• Zhang et al., “Network-based survival analysis reveals subnetwork signatures for predicting outcomes of ovarian cancer treatment”, PLoS Computational Biology 2013, 9(3):e1002975.

• Butte AJ, Kohane IS, “Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements”, Pacific Symposium on Biocomputing 2000:418-429.

• Margolin et al., “ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context”, BMC Bioinformatics 2006, 7(Suppl 1):S7.

• Meyer et al., “MINET: A R/Bioconductor package for inferring large transcriptional networks using mutual information”, BMC Bioinformatics 2008, 9:461.

Thank you