
Microarrays: algorithms for knowledge discovery in oncology and molecular biology

Frank De Smet

Katholieke Universiteit Leuven

Faculteit Toegepaste Wetenschappen

Departement Elektrotechniek (ESAT)

Promotor: Prof. dr. ir. B. De Moor


PhD defense Frank De Smet May 28, 2004 2

Overview

• Introduction: basic concepts of microarray data

• Feature extraction:
  – Univariate analysis
  – Multivariate analysis: PCA

• Classification

• Clustering

• Conclusions and future research

Introduction

Feature extraction

Classification

Clustering

Conclusions


Transcription - Translation


Microarrays

[Figure: cDNA-microarray experiment in six numbered steps. Labels: DNA clones, cDNA microarray, mRNA reference, mRNA test (tumour), green and red labelling, DB.]


Importance

• Clinical (oncology)
  – Clinical management of cancer is in many cases empirical, and not all clinically relevant information can be extracted from the data that physicians have access to
  – Fundamental mechanisms behind carcinogenesis are not always taken into account
  But:
  – Expression patterns measured with microarrays in malignant cells reflect the phenotype of the tumour

• Molecular biology
  – Study of the expression behaviour of genes can help to determine their biological role or function


Data-mining framework

[Figure: data-mining framework. An expression matrix of genes by patients (classes 1 and 2) is reduced to features; a classifier predicts class <1> or <2> for patients whose class is unknown (?); a cluster algorithm groups patients, or gene expression profiles, into Cluster 1 and Cluster 2.]


Expression matrix

[Figure: expression matrix, with microarray experiments (Condition 1 and Condition 2, or a time series) along one axis and gene expression profiles along the other.]


Univariate analysis in microarray data

• Expression patterns measured under two different conditions

• Selection of the individual genes with the highest differential expression: p-values

• Rejection level α
  – p ≤ α: gene is declared differentially expressed
  – p > α: gene is declared not differentially expressed
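The decision rule can be sketched in a few lines of Python. This is a minimal illustration, not the thesis code: the function name `declare_differential`, the gene names and the p-values are made up, and the rejection level is called `alpha` on the assumption that this is the symbol used on the slide.

```python
def declare_differential(p_values, alpha):
    """Split genes into declared positives (p <= alpha) and negatives (p > alpha).

    p_values: dict mapping a gene name to its p-value (hypothetical input format).
    Returns two lists of gene names, each sorted by increasing p-value.
    """
    ranked = sorted(p_values, key=p_values.get)  # p_1 <= p_2 <= ... <= p_n
    positives = [g for g in ranked if p_values[g] <= alpha]
    negatives = [g for g in ranked if p_values[g] > alpha]
    return positives, negatives


# Made-up p-values for five genes.
example = {"geneA": 0.001, "geneB": 0.20, "geneC": 0.04, "geneD": 0.75, "geneE": 0.03}
pos, neg = declare_differential(example, alpha=0.05)
print(pos)  # ['geneA', 'geneE', 'geneC']
print(neg)  # ['geneB', 'geneD']
```

How the p-values themselves are obtained (which test statistic) is not specified here and is left outside the sketch.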

[Figure: genes measured under Condition 1 and Condition 2 are ranked by p-value, pi (i = 1,...,n; p1 < p2 < ... < pn), and split into Positive and Negative.]


Multiple testing


• Overlap of the p-values of the genes with and without actual differential expression: Type I and II errors

• In literature: control of the Type I error: too conservative for microarray data

• Here: balance of Type I and II error

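| | Actually differentially expressed: YES | Actually differentially expressed: NO | Total |
|---|---|---|---|
| Declared differentially expressed: YES (p ≤ α) | TP | FP (Type I error) | Pos |
| Declared differentially expressed: NO (p > α) | FN (Type II error) | TN | Neg |
| Total | n1 | n0 | |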


Estimation of Type I and II error

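[Figure: two p-value distributions, each split by the rejection level into declared positives (TP, FP) and declared negatives (FN, TN).]

• No real differential expression (randomised data set): uniform distribution

• Non-accidental differential expression: superposition of two distributions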


Calculations


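1. Estimation of n1 and n0

$$V_i = \frac{i - n \cdot p_i}{1 - p_i}$$

2. Estimation of TPi, TNi, FPi and FNi

| | Actually differentially expressed: YES | Actually differentially expressed: NO | Total |
|---|---|---|---|
| Declared differentially expressed: YES (p ≤ p_i) | $TP_i = i - p_i \cdot n_0$ | $FP_i = p_i \cdot n_0$ | $Pos_i = i$ |
| Declared differentially expressed: NO (p > p_i) | $FN_i = n_1 - i + p_i \cdot n_0$ | $TN_i = (1 - p_i) \cdot n_0$ | $Neg_i = n - i$ |
| Total | $n_1$ | $n_0$ | |

3. Estimation of sensitivity and specificity

$$SENS_i = \frac{TP_i}{TP_i + FN_i} = \frac{TP_i}{n_1} \qquad SPEC_i = \frac{TN_i}{TN_i + FP_i} = \frac{TN_i}{n_0}$$

4. ROC curve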
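The estimates of steps 2 and 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis implementation: the function name `roc_points` and the example numbers are made up, n0 is taken as already estimated, and the raw estimates are not clipped to their valid ranges.

```python
def roc_points(sorted_p, n0):
    """Estimate TP, FP, FN, TN, sensitivity and specificity at each rejection level p_i.

    sorted_p: p-values in increasing order (p_1 <= ... <= p_n).
    n0: estimated number of genes without differential expression (0 < n0 < n).
    """
    n = len(sorted_p)
    n1 = n - n0
    rows = []
    for i, p in enumerate(sorted_p, start=1):
        fp = p * n0          # null p-values are uniform, so about p_i * n0 fall below p_i
        tp = i - fp          # the remaining declared positives
        tn = (1 - p) * n0
        fn = n1 - tp         # equals n1 - i + p_i * n0
        rows.append({"i": i, "TP": tp, "FP": fp, "FN": fn, "TN": tn,
                     "SENS": tp / n1, "SPEC": tn / n0})
    return rows


# Made-up example: 10 sorted p-values, of which 6 genes are assumed to be null.
rows = roc_points([0.01, 0.02, 0.05, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 1.0], n0=6)
for row in rows:
    print(row["i"], round(row["SENS"], 3), round(row["SPEC"], 3))
```

By construction TP_i + FN_i = n1 and FP_i + TN_i = n0 at every rejection level, which is what reduces the sensitivity and specificity to TP_i / n1 and TN_i / n0.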


ROC curve

• Optimal balance between Type I and II errors

• Area under the curve
  – Quantifies how well the genes whose expression is and is not affected by the difference between conditions can be discriminated using their p-values
  – Quality measure for microarray data
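A short Python sketch of both quantities, given sensitivity and specificity at a set of rejection levels. Two assumptions are made here and are not confirmed by the slide: the area is computed with the trapezoidal rule, and the "optimal balance" is taken to be the rejection level that maximises sensitivity + specificity. The function name and the numbers are hypothetical.

```python
def auc_and_optimum(points):
    """points: list of (rejection_level, sensitivity, specificity) tuples.

    Returns the trapezoidal area under the ROC curve and the point that
    maximises sensitivity + specificity.
    """
    # ROC curve: x = 1 - specificity, y = sensitivity, anchored at (0, 0) and (1, 1).
    curve = sorted([(0.0, 0.0)]
                   + [(1 - spec, sens) for _, sens, spec in points]
                   + [(1.0, 1.0)])
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(curve, curve[1:]))
    best = max(points, key=lambda pt: pt[1] + pt[2])
    return auc, best


# Made-up operating points: (rejection level, SENS, SPEC).
auc, best = auc_and_optimum([(0.01, 0.5, 0.95), (0.1, 0.8, 0.8), (0.5, 0.95, 0.4)])
print(round(auc, 4), best)
```

An area close to 1 means the p-values separate the two groups of genes well; an area near 0.5 means they carry almost no information.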


Example: Acute leukemia

| | Golub et al. (ALL-AML) | Armstrong et al. (ALL-AML) |
|---|---|---|
| n | 7129 | 12582 |
| n0 | 3876 | 3084 |
| n1 | 3253 | 9498 |
| AUC (%) | 91.39 | 95.13 |
| α_opt | 0.18 (= p_3429) | 0.11 (= p_8633) |
| SENS_opt (%) | 84.03 | 87.26 |
| SPEC_opt (%) | 82.06 | 88.56 |
| SENS_opt + SPEC_opt (%) | 166.09 | 175.82 |


Multivariate analysis in microarray data: Principal Component Analysis

[Figure: unsupervised PCA plots (PC1 and PC2) for two data sets: acute leukemia (ALL - AML) and breast cancer (degree of differentiation).]


Classification

[Figure: unsupervised and supervised results for two data sets: acute leukemia (ALL - AML) and breast cancer (degree of differentiation).]


Clustering: gene expression profiles

• Importance
  – Identification of groups of coexpressed genes
  – Coexpressed genes have a higher probability of having similar biological functions: e.g., they might interact with the same transcription factors (coregulation)

• First-generation algorithms: disadvantages
  – Parameter fine-tuning
  – Assign each profile to a cluster
  – Computational complexity


Quality-based clustering (Heyer et al.)

Algorithm produces clusters with
– a quality guarantee (fixed and user-defined threshold for diameter D)
– a maximum number of profiles

[Figure: candidate clusters of diameter D.]

Candidate cluster 1: 3 profiles

...

Candidate cluster 5: 6 profiles

...

Candidate cluster 17: 2 profiles
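The candidate-cluster idea can be sketched as a greedy procedure: grow one candidate cluster around every profile without exceeding the diameter D, keep the largest candidate, remove its profiles and repeat. The Python below is a simplified illustration, not the code of Heyer et al.: the names `qt_clust` and `diameter_with` are made up, and plain Euclidean distance is used for brevity, which is an assumption about the distance measure.

```python
import math


def diameter_with(cluster, candidate, points):
    """Largest distance between the candidate and any current member of the cluster."""
    return max(math.dist(points[candidate], points[m]) for m in cluster)


def qt_clust(points, max_diameter, min_size=2):
    """Greedy quality-threshold clustering (sketch). Returns clusters as lists of indices."""
    remaining = set(range(len(points)))
    clusters = []
    while remaining:
        best = []
        for seed in sorted(remaining):
            cluster = [seed]
            pool = remaining - {seed}
            while pool:
                # Add the profile that keeps the cluster diameter smallest.
                nxt = min(sorted(pool), key=lambda c: diameter_with(cluster, c, points))
                if diameter_with(cluster, nxt, points) > max_diameter:
                    break
                cluster.append(nxt)
                pool.remove(nxt)
            if len(cluster) > len(best):
                best = cluster
        if len(best) < min_size:
            break  # leftover profiles stay unclustered
        clusters.append(sorted(best))
        remaining -= set(best)
    return clusters


# Made-up 2-D "profiles": two tight groups and one outlier.
profiles = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (9, 0)]
print(qt_clust(profiles, max_diameter=0.5))  # [[0, 1, 2], [3, 4]]
```

The outlier (index 5) is left unclustered, and growing a candidate around every remaining profile in each round is what makes the approach expensive on large data sets.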


Still some disadvantages!


Adaptive quality-based clustering (AQBC)

• A heuristic iterative two-step approach

– Step 1: Quality-based approach:

Find a cluster center in an area of the data set where the density of expression profiles, within a sphere with preliminary radius, is locally maximal

– Step 2: Adaptive approach:

Re-estimation of the radius
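Step 1 can be pictured as moving a sphere of fixed radius until its centre coincides with the mean of the profiles it contains. The Python sketch below uses a mean-shift-style iteration, which is an assumption made for illustration: how the thesis actually locates the centre (for example, how the preliminary radius is chosen or varied) is not given here. The function name and the data are made up.

```python
import math


def locate_center(points, start, radius, max_iter=100, tol=1e-9):
    """Move a sphere of fixed radius toward a local density maximum (sketch)."""
    center = list(start)
    for _ in range(max_iter):
        inside = [p for p in points if math.dist(p, center) <= radius]
        if not inside:
            return None  # empty sphere: no centre can be found from this start
        new_center = [sum(coord) / len(inside) for coord in zip(*inside)]
        shift = math.dist(new_center, center)
        center = new_center
        if shift < tol:
            break  # the centre no longer moves
    return center


# Made-up 2-D "profiles": a dense group near the origin and one distant profile.
profiles = [(0, 0), (0.2, 0), (0, 0.2), (0.2, 0.2), (3, 3)]
print(locate_center(profiles, start=(0.5, 0.5), radius=1.0))
```

In the two-step scheme, the radius re-estimated in Step 2 would then replace the preliminary radius for the next pass.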


Step 1: Localization of a cluster center

[Figure: localization of a cluster center within a sphere of radius R.]


Step 2: Re-calculation of the radius


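$$p(r) = P_C \cdot p(r \mid C) + P_B \cdot p(r \mid B)$$

$$P(C \mid R_{new}) = \frac{P_C \cdot p(R_{new} \mid C)}{P_C \cdot p(R_{new} \mid C) + P_B \cdot p(R_{new} \mid B)} = S$$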
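Reading C as the cluster and B as the background (an interpretation of the subscripts, not spelled out on the slide), the distances r follow the mixture p(r) = P_C·p(r|C) + P_B·p(r|B), and the new radius is the distance at which the posterior probability of belonging to the cluster equals the significance level S. The Python sketch below solves that condition by bisection. The two densities, the priors and all names are hypothetical stand-ins chosen only to make the example run; they are not the distributions used in the thesis.

```python
import math


def posterior_cluster(r, p_c, p_b, dens_c, dens_b):
    """P(C | r) for the two-component mixture p(r) = P_C p(r|C) + P_B p(r|B)."""
    num = p_c * dens_c(r)
    return num / (num + p_b * dens_b(r))


def radius_for_significance(s, p_c, p_b, dens_c, dens_b, lo=0.0, hi=50.0, steps=200):
    """Bisection for R_new with P(C | R_new) = s, assuming the posterior decreases on [lo, hi]."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if posterior_cluster(mid, p_c, p_b, dens_c, dens_b) > s:
            lo = mid  # still clearly inside the cluster: the radius can grow
        else:
            hi = mid
    return (lo + hi) / 2


def cluster_density(r, sigma=1.0):
    """Hypothetical cluster density: half-normal in the distance r."""
    return math.sqrt(2 / math.pi) / sigma * math.exp(-r * r / (2 * sigma * sigma))


def background_density(r):
    """Hypothetical background density: uniform on [0, 50]."""
    return 0.02


r_new = radius_for_significance(0.95, p_c=0.5, p_b=0.5,
                                dens_c=cluster_density, dens_b=background_density)
print(round(r_new, 3))
```

A higher significance level S gives a smaller radius, because only profiles that almost certainly belong to the cluster are kept.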


Comparison

| | AQBC | QT_Clust (Heyer et al.) |
|---|---|---|
| User-defined parameters | 1. Data set; 2. Significance level S; 3. Minimum number of genes | 1. Data set; 2. Radius R or diameter D; 3. Minimum number of genes |
| Quality measure | Significance level S: statistical parameter | Radius or diameter: arbitrary parameter |
| Cluster radius R | Automatically calculated for each cluster separately, not constant | Constant and user-defined |
| Computational complexity | ~ O(n · e · VC) | ~ O(n² · e · VC) |
| Number of clusters | Not predefined | Not predefined |
| Inclusion of all genes in clusters | No | No |


Validation


| Cluster number (AQBC) | Cluster number (K-means) | MIPS functional category | P-value (-log10), AQBC | P-value (-log10), K-means |
|---|---|---|---|---|
| 1 | 1 | ribosomal proteins | 80 | 54 |
| | | organisation of cytoplasm | 77 | 39 |
| | | protein synthesis | 74 | NR |
| | | cellular organisation | 34 | NR |
| | | translation | 9 | NR |
| | | organisation of chromosome structure | 1 | 4 |
| 2 | 4 | mitochondrial organization | 18 | 10 |
| | | energy | 8 | NR |
| | | proteolysis | 7 | NR |
| | | respiration | 6 | 5 |
| | | ribosomal proteins | 4 | NR |
| | | protein synthesis | 4 | NR |
| | | protein destination | 4 | NR |
| 5 | 2 | DNA synthesis and replication | 18 | 16 |
| | | cell growth, cell division, DNA synthesis | 17 | NR |
| | | recombination and DNA repair | 8 | 5 |
| | | nuclear organization | 8 | 4 |
| | | cell-cycle control and mitosis | 7 | 8 |
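The slide does not say how these P-values were computed. A common choice for scoring the enrichment of a functional category in a cluster is the hypergeometric tail probability, sketched below in Python as an assumption rather than as the method used here. The function name and the numbers in the example are made up.

```python
import math


def enrichment_neg_log10_p(cluster_size, hits_in_cluster, category_size, genome_size):
    """-log10 of the hypergeometric tail P(X >= hits_in_cluster).

    X is the number of genes from a functional category (category_size genes out of
    genome_size) found in a cluster of cluster_size genes drawn at random.
    """
    total = math.comb(genome_size, cluster_size)
    upper = min(cluster_size, category_size)
    tail = sum(math.comb(category_size, k)
               * math.comb(genome_size - category_size, cluster_size - k)
               for k in range(hits_in_cluster, upper + 1))
    return -math.log10(tail / total)


# Made-up numbers: 20 genes in total, 5 in the category, a cluster of 4 containing 3 of them.
score = enrichment_neg_log10_p(cluster_size=4, hits_in_cluster=3,
                               category_size=5, genome_size=20)
print(round(score, 3))
```

On this scale a value of 80 corresponds to a P-value of 10^-80, so larger numbers mean stronger enrichment.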


Availability

[Figure: number of hits over time, Jan-01 to Apr-04 (y-axis: 0 to 350).]

Himanen et al. (2004) Transcript profiling of early lateral root initiation. Proc Natl Acad Sci, 101, 5146-5151.


Conclusions

Data-mining framework for microarray data

• Feature extraction
  – Univariate analysis
    • Estimation of n1 and n0
    • ROC curves: optimal balance between Type I and II error + quality measure
  – Multivariate analysis: PCA

• Classification: FDA and LS-SVM

• Clustering
  – Microarray experiments
  – Gene expression profiles: AQBC

[Figure labels: Clinical data; PC1 and PC2.]


Selected publications

• De Smet, F., Marchal, K., Timmerman, D., Vergote, I., De Moor, B. and Moreau, Y. (2001) Gebruik van microroosters in de klinische oncologie [Use of microarrays in clinical oncology], Tijdschr voor Geneeskunde, 57, 1225-1236.

• De Smet, F., Mathys, J., Marchal, K., Thijs, G., De Moor, B. and Moreau, Y. (2002) Adaptive quality-based clustering of gene expression profiles. Bioinformatics, 18, 735-746.

• Moreau, Y., De Smet, F., Thijs, G., Marchal, K. and De Moor, B. (2002) Functional bioinformatics of microarray data: from expression to regulation. Proceedings of the IEEE, 90, 1722-1743.

• De Smet, F., Moreau, Y., Timmerman, D., Vergote, I. and De Moor, B. (2004) Balancing false positives and false negatives for the detection of differential expression in malignancies. Br J Cancer, submitted.

• Epstein, E., Skoog, L., Isberg, P.E., De Smet, F., De Moor, B., Olofsson, P.A., Gudmundsson, S. and Valentin, L. (2002) An algorithm including results of gray-scale and power Doppler ultrasound examination to predict endometrial malignancy in women with postmenopausal bleeding. Ultrasound Obstet Gynecol, 20, 370-376.


Future research

• Specific
  – Ovarian cancer: transcriptomics
    • Prediction of chemosensitivity in stage III
    • Prediction of recurrence in stage I
  – Endometriosis: proteomics and transcriptomics
    • Detection of endometriosis
    • Prediction of relapse after surgery

• General
  – Microarrays: number of patients - validation - standardization
  – Proteomics
  – Combination and comparison of microarray, proteomic and clinical data
