
Lecture 7: DNA chips and clustering (4,6/12/12)
Computational Genomics
Prof. Ron Shamir & Prof. Roded Sharan, School of Computer Science, Tel Aviv University


Page 1

Lecture 7: DNA chips and clustering (4,6/12/12)


Computational Genomics Prof. Ron Shamir & Prof. Roded Sharan School of Computer Science, Tel Aviv University

Page 2

CG 1

Clustering gene expression data

Page 3

CG 2

How Gene Expression Data Looks

[Figure: the “Raw Data” matrix of expression levels; rows = genes, columns = conditions]

Entries of the Raw Data matrix:
• Ratio values
• Absolute values
• …

• Row = gene’s expression pattern / fingerprint vector

• Column = experiment/condition’s profile

Page 4

CG 3

Data Preprocessing

[Figure: the “Raw Data” matrix of expression levels; rows = genes, columns = conditions]

• Input: real-valued raw data matrix.

• Compute the similarity matrix (cosine angle / correlation / …)

• Alternatively: distances

[Figure: heatmap of the resulting similarity matrix]

From the Raw Data matrix we compute the similarity matrix S. Sij reflects the similarity of the expression patterns of gene i and gene j.
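As an illustration, a minimal sketch of this step (NumPy assumed; the shapes and names are ours, not the lecture's):

```python
import numpy as np

def similarity_matrix(raw, method="correlation"):
    """Similarity matrix S from the raw data matrix
    (rows = genes, columns = conditions)."""
    if method == "correlation":
        return np.corrcoef(raw)            # S_ij = Pearson correlation
    if method == "cosine":
        unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)
        return unit @ unit.T               # S_ij = cosine of the angle
    raise ValueError(method)

raw = np.random.rand(60, 13)               # 60 genes, 13 conditions
S = similarity_matrix(raw)                 # S[i, j]: similarity of genes i, j
D = 1.0 - S                                # alternatively: distances
```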

Page 5

CG 4

DNA chips: Applications

• Deducing functions of unknown genes (similar expression pattern ⇒ similar function)
• Deciphering regulatory mechanisms (co-expression ⇒ co-regulation)
• Identifying disease profiles
• Drug development
• …

Analysis requires clustering of genes/conditions.

Page 6

CG 5

Clustering: Objective

Group elements (genes) into clusters satisfying:

• Homogeneity: elements inside a cluster are highly similar to each other.

• Separation: elements from different clusters have low similarity to each other.

• Unsupervised.
• Most formulations are NP-hard.

Page 7

CG 6

The Clustering Bazaar

Page 8

CG 7

Hierarchical clustering

Page 9

CG 8

An Alternative View

Instead of a partition into clusters, form a tree hierarchy of the input elements satisfying:

• More similar elements are placed closer along the tree.

• Or: tree distances reflect the distances between elements.

Page 10

CG 9

Hierarchical Representation

[Figure: a tree and the corresponding dendrogram over elements 1, 3, 4, 2, with merge heights 2.8, 4.5, and 5.0]

Dendrogram: a rooted tree, usually binary; all leaf-root distances are equal. Ordinates reflect the (average) distances between the corresponding subtrees.

Page 11

CG 10

Hierarchical Clustering: Average Linkage (Sokal & Michener '58, Lance & Williams '67)

• Input: distance matrix $(D_{ij})$.

• Iterative algorithm; initially each element is a cluster. Let $n_r$ denote the size of cluster r.

– Find the minimum element $D_{rs}$ in D; merge clusters r, s.

– Delete elements r, s; add a new element t with $D_{it} = D_{ti} = \frac{n_r}{n_r+n_s} D_{ir} + \frac{n_s}{n_r+n_s} D_{is}$.

– Repeat. (A sketch in code follows.)
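A minimal sketch of this procedure (a naive O(n³) version for clarity; names are ours):

```python
import numpy as np

def average_linkage(D):
    """Agglomerative average-linkage clustering on a distance matrix D.
    Returns the list of merges as (cluster_r, cluster_s, distance)."""
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)            # ignore self-distances
    size = {i: 1 for i in range(len(D))}   # n_r: cluster sizes
    active = set(size)                     # clusters still alive
    merges = []
    while len(active) > 1:
        # find the minimum element D_rs among active clusters
        idx = sorted(active)
        sub = D[np.ix_(idx, idx)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        r, s = idx[a], idx[b]
        merges.append((r, s, D[r, s]))
        # merged cluster t reuses slot r; apply the weighted-average rule
        nr, ns = size[r], size[s]
        for i in active - {r, s}:
            D[r, i] = D[i, r] = (nr * D[i, r] + ns * D[i, s]) / (nr + ns)
        size[r] = nr + ns
        active.remove(s)
    return merges
```

In practice one would use SciPy's scipy.cluster.hierarchy.linkage(X, method='average'), which implements the same update far more efficiently.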

Page 12

CG 11

Average Linkage (cont.)

• Claim: Drs is the average distance between elements in r and s.

• Proof by induction…

• Claim: the successive merge distances $D_{rs}$ can only increase (each new distance $D_{it}$ is a weighted average of $D_{ir}$ and $D_{is}$, both at least the current minimum $D_{rs}$).

Page 13

CG 12

A General Framework (Lance & Williams '67)

• Find the minimum element $D_{rs}$; merge clusters r, s.

• Delete elements r, s; add a new element t with

$D_{it} = D_{ti} = \alpha_r D_{ir} + \alpha_s D_{is} + \gamma\,|D_{ir} - D_{is}|$

• Single linkage: $D_{it} = \min\{D_{ir}, D_{is}\}$

• Complete linkage: $D_{it} = \max\{D_{ir}, D_{is}\}$

• Note: there is an analogous formulation in terms of a similarity matrix (rather than distance). (See the sketch below.)
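The rules above differ only in the update coefficients; a small sketch, using the standard textbook coefficient values (which the slide does not spell out):

```python
# Lance-Williams update: D_it = a_r*D_ir + a_s*D_is + g*|D_ir - D_is|
COEFFS = {
    # (a_r, a_s, g) as functions of the merged cluster sizes n_r, n_s
    "single":   lambda nr, ns: (0.5, 0.5, -0.5),   # gives min{D_ir, D_is}
    "complete": lambda nr, ns: (0.5, 0.5, +0.5),   # gives max{D_ir, D_is}
    "average":  lambda nr, ns: (nr / (nr + ns), ns / (nr + ns), 0.0),
}

def update(D_ir, D_is, nr, ns, method="average"):
    a_r, a_s, g = COEFFS[method](nr, ns)
    return a_r * D_ir + a_s * D_is + g * abs(D_ir - D_is)
```

Plugging these coefficients into the update reproduces exactly the min, max, and weighted-average rules above.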

Page 14

CG 13

Hierarchical clustering of GE data (Eisen et al., PNAS 1998)

• Growth response: starved human fibroblast cells, added serum.

• Monitored 8600 genes over 13 time points.

• $t_{ij}$: fluorescence level of gene i in condition j; $r_{ij}$: same for the reference (time = 0).

• $s_{ij} = \log(t_{ij}/r_{ij})$

• $S_{kl} = \frac{\sum_j s_{kj}\, s_{lj}}{\|s_k\|\,\|s_l\|}$ (cosine of the angle between the patterns of genes k and l)

• Applied the average linkage method.

• Ordered leaves by increasing element weight: average expression level, time of maximal induction, or other criteria. (A sketch of this pipeline follows.)
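A minimal sketch of this pipeline on synthetic data (NumPy/SciPy assumed; the matrix shapes and names are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import average
from scipy.spatial.distance import squareform

t = np.random.rand(100, 13) + 0.1        # t_ij: fluorescence, 100 genes x 13 time points
r = t[:, [0]]                            # r_ij: reference level (time = 0)
s = np.log(t / r)                        # s_ij = log(t_ij / r_ij)

unit = s / np.linalg.norm(s, axis=1, keepdims=True)
S = unit @ unit.T                        # S_kl: cosine of the angle between genes k, l
D = np.clip(1.0 - S, 0.0, None)          # turn similarities into distances
np.fill_diagonal(D, 0.0)
Z = average(squareform(D, checks=False)) # average-linkage dendrogram (merge list)
```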

Page 15

CG 14

Page 16

CG 15

“Eisengrams” for the same data randomly permuted within rows (1), columns (2), and both (3)

Page 17

CG 16

Comments

• Distinct measurements of the same genes cluster together

• Genes of similar function cluster together

• Many cluster-function specific insights

• Interpretation is a REAL biological challenge

Page 18

CG 17

More on hierarchical methods

• Agglomerative vs. the “more natural” divisive.
• Advantages:
  – Gives a single coherent global picture
  – Intuitive for biologists (from phylogeny)
• Disadvantages:
  – No single partition; no specific clusters
  – Forces all elements to fit a tree hierarchy

Page 19

CG 18

Non-Hierarchical Clustering

Page 20

CG 19

K-means (Lloyd '57, MacQueen '67)

• Input: a vector $v_i$ for each element i; #clusters = k.

• Define the centroid $c_p$ of a cluster $C_p$ as its average vector.

• Goal: minimize $\sum_{p=1}^{k} \sum_{i \in C_p} d(v_i, c_p)$

• Objective = homogeneity only (k is fixed).

• NP-hard already for k = 2.

Page 21

CG 20

K-means alg.

• Initialize an arbitrary partition P into k clusters.

• Repeat the following till convergence:

– Update centroids (max c, P fixed)

– Assign each point to its closest centroid (max P, c fixed)

• Can be shown to have polynomial expected running time under various assumptions on the data distribution.

• A variant: perform a single best modification in each step (the one that decreases the score the most).

A sketch of the basic algorithm appears below.
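A minimal sketch of the basic algorithm (Lloyd's iteration; the empty-cluster re-seeding is our own convention, not from the slides):

```python
import numpy as np

def kmeans(v, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate centroid updates and
    re-assignments until the partition stops changing."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(v))       # arbitrary initial partition
    for _ in range(n_iter):
        # update centroids: c_p = mean of cluster C_p (partition fixed)
        c = np.array([v[labels == p].mean(axis=0) if np.any(labels == p)
                      else v[rng.integers(len(v))] for p in range(k)])
        # assign each point to its closest centroid (centroids fixed)
        dists = ((v[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        new = dists.argmin(axis=1)
        if np.array_equal(new, labels):          # converged
            break
        labels = new
    return labels, c

v = np.random.rand(200, 5)
labels, centroids = kmeans(v, k=3)
```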

Page 22

CG 21

Page 23

CG 22

Page 24

CG 23

A Soft Version

• Based on a probabilistic model of the data as coming from a mixture of Gaussians:

$P(z_i = j) = \pi_j, \qquad P(x_i \mid z_i = j) \sim N(\mu_j, \sigma^2 I)$

so that, up to normalization constants,

$L(\theta) = \prod_i \sum_j \pi_j \exp\!\left(-\frac{d^2(x_i, \mu_j)}{2\sigma^2}\right)$

• Goal: evaluate the parameters θ (assume σ is known).

• Method: apply EM to maximize the likelihood of the data.

Page 25

CG 24

EM, soft version

• Iteratively, compute a soft assignment and use it to derive expectations of π, μ:

$w_{ij} = \frac{\pi_j \exp\!\left(-d^2(x_i,\mu_j)/2\sigma^2\right)}{\sum_{j'} \pi_{j'} \exp\!\left(-d^2(x_i,\mu_{j'})/2\sigma^2\right)}, \qquad \pi_j \leftarrow \frac{1}{n}\sum_i w_{ij}, \qquad \mu_j \leftarrow \frac{\sum_i w_{ij}\, x_i}{\sum_i w_{ij}}$
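A minimal sketch of these updates (isotropic Gaussians with known σ; initialization and names are ours):

```python
import numpy as np

def soft_kmeans_em(x, k, sigma=1.0, n_iter=50, seed=0):
    """EM for a mixture of k isotropic Gaussians with known sigma:
    the E-step computes soft assignments w_ij, the M-step updates pi, mu."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    pi = np.full(k, 1.0 / k)
    mu = x[rng.choice(n, k, replace=False)]      # init centers at data points
    for _ in range(n_iter):
        # E-step: w_ij proportional to pi_j * exp(-d^2(x_i, mu_j) / (2 sigma^2))
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        w = pi * np.exp(-d2 / (2 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: expected-count updates of pi and mu
        pi = w.mean(axis=0)
        mu = (w.T @ x) / w.sum(axis=0)[:, None]
    return pi, mu, w

x = np.random.rand(300, 2)
pi, mu, w = soft_kmeans_em(x, k=3)
```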

Page 26

CG 25

Soft vs. hard k-means

Soft EM optimizes: $\log P(x\mid\theta) = \log \sum_z P(x,z\mid\theta)$. Hard EM optimizes: $\max_z \log P(x,z\mid\theta)$.

If we use uniform mixture probabilities, then k-means is an application of hard EM, since

$\log P(x,z\mid\theta) = C - \frac{1}{2\sigma^2}\sum_i d^2(x_i, \mu_{z(i)})$

so maximizing over z and the centroids is exactly minimizing the k-means objective.

Page 27

CG 26

Expectation-Maximization & Baum-Welch

Page 28

CG

The probabilistic setting

Input: data x coming from a probabilistic model with hidden information y.

Goal: learn the model's parameters θ so that the likelihood of the data is maximized.

Example: a mixture of two Gaussians:

$P(y_i = 1) = p_1; \qquad P(y_i = 2) = p_2 = 1 - p_1$

$P(x_i \mid y_i = j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x_i - \mu_j)^2}{2\sigma^2}\right)$

Page 29

CG

The likelihood function

$P(y_i = 1) = p_1; \qquad P(y_i = 2) = p_2 = 1 - p_1$

$P(x_i \mid y_i = j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x_i - \mu_j)^2}{2\sigma^2}\right)$

$L(\theta) = P(x \mid \theta) = \prod_i \sum_j P(x_i, y_i = j \mid \theta)$

$\log L(\theta) = \sum_i \log P(x_i \mid \theta) = \sum_i \log \sum_j \frac{p_j}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x_i - \mu_j)^2}{2\sigma^2}\right)$

Page 30

CG

The EM algorithm

Goal: max $\log P(x\mid\theta) = \log \sum_y P(x,y\mid\theta)$

Assume we have a model θ^t which we wish to improve.

Note: $P(x\mid\theta) = P(x,y\mid\theta)\,/\,P(y\mid x,\theta)$, so taking logs and averaging over $P(y\mid x,\theta^t)$ (which sums to 1 over y):

$\sum_y P(y\mid x,\theta^t)\log P(x\mid\theta) = \sum_y P(y\mid x,\theta^t)\log P(x,y\mid\theta) - \sum_y P(y\mid x,\theta^t)\log P(y\mid x,\theta)$

$\log P(x\mid\theta) = \sum_y P(y\mid x,\theta^t)\log P(x,y\mid\theta) - \sum_y P(y\mid x,\theta^t)\log P(y\mid x,\theta)$

Subtracting the same identity at $\theta = \theta^t$:

$\log P(x\mid\theta) - \log P(x\mid\theta^t) = Q(\theta\mid\theta^t) - Q(\theta^t\mid\theta^t) + \sum_y P(y\mid x,\theta^t)\log\frac{P(y\mid x,\theta^t)}{P(y\mid x,\theta)}$

$Q(\theta^t\mid\theta^t)$ is a constant (it does not depend on θ), and the last sum is a relative entropy, hence ≥ 0. So any θ that increases Q increases the likelihood.

Page 31

CG

The EM algorithm (cont.)

Main component:

$Q(\theta\mid\theta^t) = \sum_y P(y\mid x,\theta^t)\,\log P(x,y\mid\theta)$

This is the expectation of $\log P(x,y\mid\theta)$ over the distribution of y given by the current parameters θ^t.

The algorithm:

• E-step: calculate the Q function $Q(\theta\mid\theta^t)$.

• M-step: maximize $Q(\theta\mid\theta^t)$ with respect to θ.

• Stopping criterion: improvement in the log likelihood ≤ ε. (A generic sketch follows.)
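A generic sketch of the E/M loop with this stopping criterion (the e_step, m_step, and log_likelihood callbacks are hypothetical placeholders to be supplied per model):

```python
def em(x, theta0, e_step, m_step, log_likelihood, eps=1e-6, max_iter=1000):
    """Generic EM loop: alternate E- and M-steps until the
    improvement in log likelihood drops below eps."""
    theta = theta0
    ll = log_likelihood(x, theta)
    for _ in range(max_iter):
        posterior = e_step(x, theta)     # E-step: P(y | x, theta^t)
        theta = m_step(x, posterior)     # M-step: argmax_theta Q(theta | theta^t)
        new_ll = log_likelihood(x, theta)
        if new_ll - ll <= eps:           # stopping criterion
            break
        ll = new_ll
    return theta
```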

Page 32

CG 31

Application to the mixture model

$Q(\theta\mid\theta^t) = \sum_y P(y\mid x,\theta^t)\,\log P(x,y\mid\theta)$

Encode the hidden assignments by indicators: $y_{ij} = 1$ if $y_i = j$, and $y_{ij} = 0$ otherwise. Then

$P(x,y\mid\theta) = \prod_i P(x_i, y_i\mid\theta) = \prod_i \prod_j P(x_i, y_i = j\mid\theta)^{\,y_{ij}}$

$\log P(x,y\mid\theta) = \sum_i \sum_j y_{ij}\,\log P(x_i, y_i = j\mid\theta)$

$Q(\theta\mid\theta^t) = \sum_y P(y\mid x,\theta^t) \sum_i \sum_j y_{ij}\,\log P(x_i, y_i = j\mid\theta) = \sum_i \sum_j P(y_{ij} = 1\mid x,\theta^t)\,\log P(x_i, y_i = j\mid\theta)$

Page 33

CG 32

Application (cont.)

$Q(\theta\mid\theta^t) = \sum_i \sum_j P(y_{ij} = 1\mid x,\theta^t)\,\log P(x_i, y_i = j\mid\theta)$

Define the soft assignment

$w_{ij}^t := P(y_{ij} = 1\mid x,\theta^t) = \frac{P(x_i, y_i = j\mid\theta^t)}{\sum_{j'} P(x_i, y_i = j'\mid\theta^t)}$

Then

$Q(\theta\mid\theta^t) = \sum_i \sum_j w_{ij}^t \left(\log p_j - \log\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{(x_i - \mu_j)^2}{2\sigma^2}\right)$
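A minimal sketch instantiating these formulas for the two-Gaussian example (σ known and shared; maximizing Q over p and μ gives the usual weighted updates; initialization and names are ours):

```python
import numpy as np

def em_two_gaussians(x, sigma=1.0, n_iter=100, seed=0):
    """EM for a 1-D mixture of two Gaussians with known, shared sigma."""
    rng = np.random.default_rng(seed)
    p = np.array([0.5, 0.5])                 # mixture probabilities p_1, p_2
    mu = rng.choice(x, 2, replace=False)     # initial means mu_1, mu_2
    for _ in range(n_iter):
        # E-step: w_ij = P(x_i, y_i=j | theta^t) / sum_j' P(x_i, y_i=j' | theta^t)
        joint = p * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2))
        w = joint / joint.sum(axis=1, keepdims=True)
        # M-step: maximizing Q yields weighted-average updates
        p = w.mean(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    return p, mu

x = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 1, 700)])
p, mu = em_two_gaussians(x)   # p close to [0.3, 0.7], mu close to [-2, 3]
```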

Page 34

CG

Baum-Welch: EM for HMM

Here the hidden information is the state path, y = π, i.e. the log likelihood is

$\log P(x\mid\theta) = \log \sum_\pi P(x,\pi\mid\theta)$

and the Q function is:

$Q(\theta\mid\theta^t) = \sum_\pi P(\pi\mid x,\theta^t)\,\log P(x,\pi\mid\theta)$

Page 35

CG

Baum-Welch (cont.)

$P(x,\pi\mid\theta) = \prod_{k=1}^{M} \prod_{b} [e_k(b)]^{E_k(b,\pi)} \prod_{k=1}^{M} \prod_{l=1}^{M} a_{kl}^{\,A_{kl}(\pi)}$

where $e_k(b)$ is the emission probability of character b in state k, $a_{kl}$ is the transition probability from state k to state l, $E_k(b,\pi)$ is the number of times we saw b emitted from k along π, and $A_{kl}(\pi)$ is the number of k→l transitions along π.

Page 36

CG

Baum-Welch (cont.)

$Q(\theta\mid\theta^t) = \sum_\pi P(\pi\mid x,\theta^t) \left[\sum_{k=1}^{M}\sum_b E_k(b,\pi)\log e_k(b) + \sum_{k=1}^{M}\sum_{l=1}^{M} A_{kl}(\pi)\log a_{kl}\right]$

$= \sum_{k=1}^{M}\sum_b \left[\sum_\pi P(\pi\mid x,\theta^t)\,E_k(b,\pi)\right]\log e_k(b) + \sum_{k=1}^{M}\sum_{l=1}^{M} \left[\sum_\pi P(\pi\mid x,\theta^t)\,A_{kl}(\pi)\right]\log a_{kl}$

Each bracketed sum has the form value × probability, i.e. an expectation. Denote these expected counts

$E_k(b) := \sum_\pi P(\pi\mid x,\theta^t)\,E_k(b,\pi), \qquad A_{kl} := \sum_\pi P(\pi\mid x,\theta^t)\,A_{kl}(\pi)$

Page 37

CG

Baum-Welch (cont.)

• So we want to find a set of parameters θ^{t+1} that maximizes:

$\sum_{k=1}^{M}\sum_b E_k(b)\log e_k(b) + \sum_{k=1}^{M}\sum_{l=1}^{M} A_{kl}\log a_{kl}$

• For maximization, select:

$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$

• $E_k(b)$ and $A_{kl}$ can be computed using the forward/backward variables $f_k(i)$, $b_k(i)$:

$P(\pi_i = k,\ \pi_{i+1} = l \mid x, \theta^t) = \frac{1}{P(x)}\, f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$

$A_{kl} = \frac{1}{P(x)} \sum_i f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1); \quad \text{similarly,} \quad E_k(b) = \frac{1}{P(x)} \sum_{\{i\,:\,x_i = b\}} f_k(i)\, b_k(i)$

(A sketch of these E-step computations follows.)
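A hedged sketch of these E-step computations (unscaled forward/backward, so suitable only for short sequences; observations are integer-coded; all names are ours):

```python
import numpy as np

def forward_backward(x, a, e, start):
    """Forward/backward for an HMM with M states.
    x: observation indices, a: MxM transitions, e: MxB emissions,
    start: initial state distribution. (Scaling omitted for brevity.)"""
    n, M = len(x), len(start)
    f = np.zeros((n, M)); b = np.zeros((n, M))
    f[0] = start * e[:, x[0]]
    for i in range(1, n):
        f[i] = (f[i - 1] @ a) * e[:, x[i]]
    b[-1] = 1.0
    for i in range(n - 2, -1, -1):
        b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
    return f, b, f[-1].sum()             # P(x) = sum_k f_k(n)

def expected_counts(x, a, e, start):
    """E-step: expected transition counts A_kl and emission counts E_k(b)."""
    f, b, px = forward_backward(x, a, e, start)
    M, B = e.shape
    A = np.zeros((M, M)); E = np.zeros((M, B))
    for i in range(len(x) - 1):
        # A_kl += f_k(i) * a_kl * e_l(x_{i+1}) * b_l(i+1) / P(x)
        A += np.outer(f[i], e[:, x[i + 1]] * b[i + 1]) * a / px
    for i in range(len(x)):
        # E_k(x_i) += f_k(i) * b_k(i) / P(x)
        E[:, x[i]] += f[i] * b[i] / px
    return A, E

# M-step: a_kl = A_kl / sum_l' A_kl'; e_k(b) = E_k(b) / sum_b' E_k(b').
# Example (2 states, binary alphabet):
a = np.array([[0.9, 0.1], [0.2, 0.8]]); e = np.array([[0.5, 0.5], [0.1, 0.9]])
A, E = expected_counts([0, 1, 1, 0], a, e, np.array([0.5, 0.5]))
```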

Page 38

CG

Relative entropy is positive

Using $\log t \le t - 1$:

$\sum_i P(x_i)\log\frac{P(x_i)}{Q(x_i)} = -\sum_i P(x_i)\log\frac{Q(x_i)}{P(x_i)} \ge -\sum_i P(x_i)\left(\frac{Q(x_i)}{P(x_i)} - 1\right) = \sum_i P(x_i) - \sum_i Q(x_i) = 0$

Page 39

CG

Baum-Welch: EM for HMM

Maximize:

$\sum_{k=1}^{M}\sum_b E_k(b)\log e_k(b) + \sum_{k=1}^{M}\sum_{l=1}^{M} A_{kl}\log a_{kl}$

The chosen parameters (shown below for the transition terms; emissions are analogous) are

$a_{kl}^{chosen} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$

Difference between the chosen set and some other set:

$\sum_{k=1}^{M}\sum_{l=1}^{M} A_{kl}\log a_{kl}^{chosen} - \sum_{k=1}^{M}\sum_{l=1}^{M} A_{kl}\log a_{kl}^{other} = \sum_{k=1}^{M}\sum_{l=1}^{M} A_{kl}\log\frac{a_{kl}^{chosen}}{a_{kl}^{other}}$

Multiplying and dividing by the same factor $\sum_{k'} A_{kk'}$ (always positive) and using $A_{kl} = \left(\sum_{k'} A_{kk'}\right) a_{kl}^{chosen}$:

$= \sum_{k=1}^{M}\left(\sum_{k'} A_{kk'}\right)\sum_{l=1}^{M} a_{kl}^{chosen}\log\frac{a_{kl}^{chosen}}{a_{kl}^{other}} \ge 0$

since each inner sum is a relative entropy, which is non-negative. Hence the chosen parameters maximize the expression.