TRANSCRIPT
Lecture 7: DNA chips and clustering (4, 6/12/12)
Computational Genomics Prof. Ron Shamir & Prof. Roded Sharan School of Computer Science, Tel Aviv University
CG 2
How Gene Expression Data Looks
[Figure: "Raw Data" matrix of expression levels — rows = genes, columns = conditions]
Entries of the Raw Data matrix:
• Ratio values
• Absolute values
• …
• Row = gene’s expression pattern / fingerprint vector
• Column = experiment/condition’s profile
CG 3
Data Preprocessing
[Figure: "Raw Data" matrix of expression levels — rows = genes, columns = conditions]
•Input: Real-valued raw data matrix.
•Compute the similarity matrix (cosine angle/correlation/…)
• Alternatively – distances
From the Raw Data matrix we compute the similarity matrix S. Sij reflects the similarity of the expression patterns of gene i and gene j.
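As a concrete (illustrative) example of this step, here is a minimal Python sketch that computes a gene-by-gene similarity matrix from a raw expression matrix, using the cosine-angle and correlation measures mentioned above; the matrix X and its dimensions are invented for the example.

```python
import numpy as np

# Toy raw data: rows = genes, columns = conditions (values invented for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 13))              # 6 genes measured under 13 conditions

# Cosine-angle similarity: S[i, j] = <x_i, x_j> / (||x_i|| * ||x_j||)
norms = np.linalg.norm(X, axis=1, keepdims=True)
S_cosine = (X @ X.T) / (norms * norms.T)

# Pearson correlation is another common choice of similarity.
S_corr = np.corrcoef(X)

print(S_cosine.shape)                     # (6, 6): similarity of gene i and gene j
```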
CG 4
DNA chips: Applications
• Deducing functions of unknown genes (similar expression pattern → similar function)
• Deciphering regulatory mechanisms (co-expression → co-regulation)
• Identifying disease profiles
• Drug development
• …
Analysis requires clustering of genes/conditions.
CG 5
Clustering: Objective
Group elements (genes) into clusters satisfying:
• Homogeneity: Elements inside a cluster are highly similar to each other.
• Separation: Elements from different clusters have low similarity to each other.
• Unsupervised.
• Most formulations are NP-hard.
CG 8
An Alternative View
Instead of a partition into clusters, form a tree hierarchy of the input elements satisfying:
• More similar elements are placed closer along the tree.
• Or: tree distances reflect the distances between elements.
CG 9
Hierarchical Representation
[Figure: dendrogram over elements 1, 3, 4, 2 with merge heights 2.8, 4.5, 5.0]
Dendrogram: a rooted tree, usually binary; all leaf–root distances are equal. Ordinates reflect the (average) distances between the corresponding subtrees.
CG 10
Hierarchical Clustering: Average Linkage (Sokal & Michener '58, Lance & Williams '67)
• Input: Distance matrix (Dij)
• Iterative algorithm. Initially each element is a cluster; nr denotes the size of cluster r.
– Find the minimal element Drs in D; merge clusters r, s
– Delete elements r, s; add a new element t with $D_{it}=D_{ti}=\frac{n_r}{n_r+n_s}D_{ir}+\frac{n_s}{n_r+n_s}D_{is}$
– Repeat until a single cluster remains (a code sketch follows below)
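A minimal Python sketch of this iterative procedure (my own illustration of the update rule above, not code from the lecture); the distance matrix is kept as a dictionary keyed by cluster ids, and the toy distances at the bottom are invented.

```python
import itertools

def average_linkage(D, n):
    """D: dict {(i, j): distance} for 0 <= i < j < n. Returns the merge history."""
    size = {i: 1 for i in range(n)}                      # n_r = size of cluster r
    active, next_id, merges = set(range(n)), n, []
    dist = lambda a, b: D[(a, b)] if (a, b) in D else D[(b, a)]
    while len(active) > 1:
        # Find the minimal element D_rs among active clusters and merge r, s.
        r, s = min(itertools.combinations(active, 2), key=lambda p: dist(*p))
        merges.append((r, s, dist(r, s)))
        t = next_id; next_id += 1
        for i in active - {r, s}:
            # D_it = n_r/(n_r+n_s) * D_ir + n_s/(n_r+n_s) * D_is
            w = size[r] + size[s]
            D[(i, t)] = size[r] / w * dist(i, r) + size[s] / w * dist(i, s)
        size[t] = size[r] + size[s]
        active -= {r, s}; active.add(t)
    return merges

# Three elements with invented pairwise distances.
print(average_linkage({(0, 1): 2.0, (0, 2): 5.0, (1, 2): 4.0}, 3))
# -> [(0, 1, 2.0), (2, 3, 4.5)]   (cluster 3 is the merge of 0 and 1)
```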
CG 11
Average Linkage (cont.)
• Claim: Drs is the average distance between elements in r and s.
• Proof by induction…
• Claim: Drs can only increase.
CG 12
A General Framework (Lance & Williams '67)
• Find the minimal element Drs; merge clusters r, s
• Delete elements r, s; add a new element t with $D_{it}=D_{ti}=\alpha_r D_{ir}+\alpha_s D_{is}+\gamma\,|D_{ir}-D_{is}|$
• Single-linkage: Dit=min{Dir,Dis}
• Complete-linkage: Dit=max{Dir,Dis}
• Note: an analogous formulation exists in terms of a similarity matrix (rather than distance).
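These linkage variants are available in standard libraries; a short SciPy usage sketch (the data matrix here is invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(1).normal(size=(20, 13))    # toy: 20 genes x 13 conditions

# Members of the Lance-Williams family:
Z_avg = linkage(X, method='average', metric='cosine')
Z_single = linkage(X, method='single', metric='euclidean')
Z_complete = linkage(X, method='complete', metric='euclidean')

# Cut the average-linkage dendrogram into 4 flat clusters.
labels = fcluster(Z_avg, t=4, criterion='maxclust')
print(labels)
```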
CG 13
Hierarchical clustering of GE data (Eisen et al., PNAS 1998)
• Growth response: Starved human fibroblast cells, added serum
• Monitored 8600 genes over 13 time-points
• tij - fluorescence level of gene i in condition j; rij – same for reference (time=0).
• sij= log(tij/rij)
• $S_{kl}=\frac{\sum_j s_{kj}\,s_{lj}}{\|s_k\|\,\|s_l\|}$ (cosine of the angle)
• Applied average linkage method
• Ordered leaves by increasing element weight: average expression level, time of maximal induction, or other criteria
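A rough sketch of this pipeline in Python (not the authors' code; the fluorescence arrays below are invented), combining the log-ratio transform with average linkage on cosine distances:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(4)
T = rng.uniform(100, 1000, size=(50, 13))     # toy fluorescence levels t_ij
R = rng.uniform(100, 1000, size=(50, 13))     # toy reference levels r_ij (time = 0)

S = np.log(T / R)                             # s_ij = log(t_ij / r_ij)
Z = linkage(S, method='average', metric='cosine')   # cosine distance = 1 - cosine similarity
order = leaves_list(Z)                        # leaf order of the resulting dendrogram
print(order[:10])
```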
CG 14
CG 15
“Eisengrams” for the same data randomly permuted within rows (1), columns (2), and both (3)
CG 16
Comments
• Distinct measurements of same genes cluster together
• Genes of similar function cluster together
• Many cluster-function specific insights
• Interpretation is a REAL biological challenge
CG 17
More on hierarchical methods
• Agglomerative vs. the “more natural” divisive.
• Advantages:
  – Gives a single coherent global picture
  – Intuitive for biologists (from phylogeny)
• Disadvantages:
  – No single partition; no specific clusters
  – Forces all elements to fit a tree hierarchy
CG 19
K-means (Lloyd’ 57, Macqueen ’67)
• Input: vector vi for each element i; #clusters=k
• Define a centroid cp of a cluster Cp as its average vector.
• Goal: minimize $\sum_{p}\sum_{i\in C_p} d(v_i,c_p)$
• Objective = homogeneity only (k fixed)
• NP-hard already for k=2.
CG 20
K-means alg.
• Initialize an arbitrary partition P into k clusters.
• Repeat the following till convergence:
– Update centroids (max c, P fixed)
– Assign each point to its closest centroid (max P, c fixed)
• Can be shown to have poly expected time under various assumptions on data distribution.
• A variant: perform a single best modification (that decreases the score the most).
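A minimal Python sketch of this alternating iteration (illustrative only; the toy data and function name are my own):

```python
import numpy as np

def kmeans(V, k, n_iter=100, seed=0):
    """Lloyd's iteration: alternate closest-centroid assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    C = V[rng.choice(len(V), size=k, replace=False)]      # arbitrary initial centroids
    for _ in range(n_iter):
        # Assign each point to its closest centroid (optimize P, c fixed).
        d = np.linalg.norm(V[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update centroids as cluster averages (optimize c, P fixed).
        new_C = np.array([V[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        if np.allclose(new_C, C):
            break
        C = new_C
    return labels, C

V = np.random.default_rng(2).normal(size=(100, 13))       # toy gene vectors
labels, centroids = kmeans(V, k=3)
print(np.bincount(labels))
```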
CG 21
CG 22
CG 23
A Soft Version
• Based on a probabilistic model of the data as coming from a mixture of Gaussians:
• Goal: evaluate the parameters θ (assume σ is known).
• Method: apply EM to maximize the likelihood of data.
$P(z_i=j)=\pi_j$
$P(x_i\mid z_i=j)\sim N(\mu_j,\ \sigma^2 I)$
$L(\theta)\propto\prod_i\sum_j \pi_j\exp\!\left(-\frac{d(x_i,\mu_j)^2}{2\sigma^2}\right)$
CG 24
EM, soft version
• Iteratively, compute the soft assignments $w_{ij}=P(z_i=j\mid x_i,\theta^t)$ and use them to derive expectations of π, μ:
$$\pi_j\leftarrow\frac{1}{n}\sum_i w_{ij},\qquad \mu_j\leftarrow\frac{\sum_i w_{ij}\,x_i}{\sum_i w_{ij}}$$
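A short Python sketch of this soft EM loop for a mixture of spherical Gaussians with known, shared σ (my own illustration, consistent with the updates above; the data and parameter choices are invented):

```python
import numpy as np

def soft_em(X, k, sigma=1.0, n_iter=100, seed=0):
    """EM for a mixture of k spherical Gaussians with known, shared sigma."""
    rng = np.random.default_rng(seed)
    n = len(X)
    pi = np.full(k, 1.0 / k)                              # mixture probabilities
    mu = X[rng.choice(n, size=k, replace=False)]          # initial means
    for _ in range(n_iter):
        # E-step: w_ij proportional to pi_j * exp(-d(x_i, mu_j)^2 / (2 sigma^2))
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        W = pi * np.exp(-d2 / (2 * sigma ** 2))
        W /= W.sum(axis=1, keepdims=True)
        # M-step: expected-count updates of pi and mu
        pi = W.sum(axis=0) / n
        mu = (W.T @ X) / W.sum(axis=0)[:, None]
    return pi, mu, W

X = np.random.default_rng(3).normal(size=(200, 2))        # toy 2-D data
pi, mu, W = soft_em(X, k=2)
print(pi, mu)
```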
CG 25
Soft vs. hard k-means
Soft EM optimizes: $\log P(x\mid\theta)$ (marginalizing over the hidden assignments).
Hard EM optimizes: $\sum_i\log P(x_i,z(i)\mid\theta)$
If we use uniform mixture probabilities, then k-means is an application of hard EM, since $-\log P(x_i,z(i)\mid\theta)\propto d(x_i,\mu_{z(i)})^2$ (up to an additive constant).
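To spell out the last step (a standard one-line argument, not written explicitly on the slide): with $P(x_i\mid z_i=j)\sim N(\mu_j,\sigma^2 I)$,
$$-\log P(x_i,z(i)\mid\theta)=-\log\pi_{z(i)}+\mathrm{const}+\frac{d(x_i,\mu_{z(i)})^2}{2\sigma^2},$$
so with uniform π and fixed σ, maximizing $\sum_i\log P(x_i,z(i)\mid\theta)$ is equivalent to minimizing $\sum_i d(x_i,\mu_{z(i)})^2$, the k-means objective.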
CG
The probabilistic setting
Input: data x coming from a probabilistic model with hidden information y.
Goal: learn the model's parameters so that the likelihood of the data is maximized.
Example: a mixture of two Gaussians:
$P(y_i=1)=p_1;\qquad P(y_i=2)=p_2=1-p_1$
$P(x_i\mid y_i=j)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x_i-\mu_j)^2}{2\sigma^2}\right)$
CG
The likelihood function
$P(y_i=1)=p_1;\qquad P(y_i=2)=p_2=1-p_1$
$P(x_i\mid y_i=j)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x_i-\mu_j)^2}{2\sigma^2}\right)$
$L(\theta)=\prod_i P(x_i\mid\theta)=\prod_i\sum_j P(x_i,y_i=j\mid\theta)$
$\log L(\theta)=\sum_i\log\sum_j p_j\,\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x_i-\mu_j)^2}{2\sigma^2}\right)$
CG
The EM algorithm
Goal: max log P(x|θ) = log (Σy P(x,y|θ))
Assume we have a model θt which we wish to improve.
Note: P(x|θ) = P(x,y|θ) / P(y|x,θ)
Multiplying both sides of $\log P(x\mid\theta)=\log P(x,y\mid\theta)-\log P(y\mid x,\theta)$ by $P(y\mid x,\theta^t)$ and summing over y:
$$\sum_y P(y\mid x,\theta^t)\log P(x\mid\theta)=\sum_y P(y\mid x,\theta^t)\log P(x,y\mid\theta)-\sum_y P(y\mid x,\theta^t)\log P(y\mid x,\theta)$$
Since $\sum_y P(y\mid x,\theta^t)=1$, the left-hand side is just $\log P(x\mid\theta)$:
$$\log P(x\mid\theta)=\sum_y P(y\mid x,\theta^t)\log P(x,y\mid\theta)-\sum_y P(y\mid x,\theta^t)\log P(y\mid x,\theta)$$
Subtracting the same identity evaluated at $\theta=\theta^t$:
$$\log P(x\mid\theta)-\log P(x\mid\theta^t)=Q(\theta\mid\theta^t)-Q(\theta^t\mid\theta^t)+\sum_y P(y\mid x,\theta^t)\log\frac{P(y\mid x,\theta^t)}{P(y\mid x,\theta)}$$
Here $Q(\theta^t\mid\theta^t)$ is a constant, and the last sum is a relative entropy, hence ≥ 0.
CG
The EM algorithm (cont.)
Main component:
$$Q(\theta\mid\theta^t)=\sum_y P(y\mid x,\theta^t)\log P(x,y\mid\theta)$$
is the expectation of log P(x,y|θ) over the distribution of y given by the current parameters θt.
The algorithm:
• E-step: Calculate the Q function
• M-step: Maximize Q(θ|θt) with respect to θ
• Stopping criterion: improvement in log likelihood ≤ ε
CG 31
Application to the mixture model
$$Q(\theta\mid\theta^t)=\sum_y P(y\mid x,\theta^t)\log P(x,y\mid\theta)$$
Write the complete-data likelihood with indicator variables $y_{ij}$, where $y_{ij}=1$ if $y_i=j$ and $y_{ij}=0$ otherwise:
$$P(x,y\mid\theta)=\prod_i P(x_i,y_i\mid\theta)=\prod_i\prod_j P(x_i,y_i=j\mid\theta)^{y_{ij}}$$
$$\log P(x,y\mid\theta)=\sum_i\sum_j y_{ij}\log P(x_i,y_i=j\mid\theta)$$
$$Q(\theta\mid\theta^t)=\sum_y P(y\mid x,\theta^t)\sum_i\sum_j y_{ij}\log P(x_i,y_i=j\mid\theta)=\sum_i\sum_j\Big(\sum_y P(y\mid x,\theta^t)\,y_{ij}\Big)\log P(x_i,y_i=j\mid\theta)$$
CG 32
Application (cont.)
$$Q(\theta\mid\theta^t)=\sum_i\sum_j P(y_{ij}=1\mid x,\theta^t)\log P(x_i,y_i=j\mid\theta)$$
$$w_{ij}^t:=P(y_{ij}=1\mid x_i,\theta^t)=\frac{P(x_i,y_i=j\mid\theta^t)}{\sum_{j'}P(x_i,y_i=j'\mid\theta^t)}$$
$$Q(\theta\mid\theta^t)=\sum_i\sum_j w_{ij}^t\left[\log p_j-\log\!\big(\sqrt{2\pi}\,\sigma\big)-\frac{(x_i-\mu_j)^2}{2\sigma^2}\right]$$
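The M-step maximization of this Q (the step the slides leave implicit) is standard; a short derivation sketch: maximizing over p (with $\sum_j p_j=1$, via a Lagrange multiplier) and setting $\partial Q/\partial\mu_j=0$ gives
$$p_j^{t+1}=\frac{\sum_i w_{ij}^t}{\sum_{i,j'} w_{ij'}^t}=\frac{1}{n}\sum_i w_{ij}^t,\qquad \mu_j^{t+1}=\frac{\sum_i w_{ij}^t\,x_i}{\sum_i w_{ij}^t}.$$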
CG
Baum-Welch: EM for HMM
Here the hidden data y is the state path π, i.e. the log likelihood is
$$\log P(x\mid\theta)=\log\sum_\pi P(x,\pi\mid\theta)$$
and the Q function is:
$$Q(\theta\mid\theta^t)=\sum_\pi P(\pi\mid x,\theta^t)\log P(x,\pi\mid\theta)$$
CG
Baum-Welch (cont.)
$$P(x,\pi\mid\theta)=\prod_{k=1}^{M}\prod_{b}\big[e_k(b)\big]^{E_k(b,\pi)}\;\prod_{k=1}^{M}\prod_{l=1}^{M}a_{kl}^{A_{kl}(\pi)}$$
where $e_k(b)$ is the emission probability of character b in state k, $a_{kl}$ is the transition probability from state k to state l, $E_k(b,\pi)$ is the number of times we saw b emitted from k along π, and $A_{kl}(\pi)$ is the number of k→l transitions along π.
CG
Baum-Welch (cont.)
$$Q(\theta\mid\theta^t)=\sum_\pi P(\pi\mid x,\theta^t)\left[\sum_{k=1}^{M}\sum_b E_k(b,\pi)\log e_k(b)+\sum_{k=1}^{M}\sum_{l=1}^{M}A_{kl}(\pi)\log a_{kl}\right]$$
$$=\sum_{k=1}^{M}\sum_b\Big(\sum_\pi P(\pi\mid x,\theta^t)E_k(b,\pi)\Big)\log e_k(b)+\sum_{k=1}^{M}\sum_{l=1}^{M}\Big(\sum_\pi P(\pi\mid x,\theta^t)A_{kl}(\pi)\Big)\log a_{kl}$$
Denote the expected counts (value × probability, summed over paths) by $E_k(b):=\sum_\pi P(\pi\mid x,\theta^t)E_k(b,\pi)$ and $A_{kl}:=\sum_\pi P(\pi\mid x,\theta^t)A_{kl}(\pi)$.
CG
Baum-Welch (cont.)
• So we want to find a set of parameters θt+1 that maximizes:
$$\sum_{k=1}^{M}\sum_b E_k(b)\log e_k(b)+\sum_{k=1}^{M}\sum_{l=1}^{M}A_{kl}\log a_{kl}$$
• For maximization, select:
$$a_{ij}=\frac{A_{ij}}{\sum_k A_{ik}},\qquad e_k(b)=\frac{E_k(b)}{\sum_{b'}E_k(b')}$$
• Ek(b), Akl can be computed using forward/backward:
$$P(\pi_i=k,\ \pi_{i+1}=l\mid x,\theta^t)=\frac{1}{P(x)}\,f_k(i)\,a_{kl}\,e_l(x_{i+1})\,b_l(i+1)$$
$$A_{kl}=\frac{1}{P(x)}\sum_i f_k(i)\,a_{kl}\,e_l(x_{i+1})\,b_l(i+1);\qquad E_k(b)=\frac{1}{P(x)}\sum_{\{i\,:\,x_i=b\}}f_k(i)\,b_k(i)$$
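A brief Python sketch of these expected-count updates, assuming the forward matrix f, backward matrix b, and the total likelihood P(x) have already been computed by the standard forward/backward recursions; all array names and shapes are my own assumptions, not code from the lecture:

```python
import numpy as np

def baum_welch_counts(x, f, b, a, e, Px):
    """Expected counts A[k, l] and E[k][c] and the re-estimated parameters.
    f, b: forward/backward matrices of shape (M states, n positions);
    a: transition matrix (M x M); e: list of dicts, e[k][c] = emission prob;
    Px: total likelihood P(x)."""
    M, n = f.shape
    A = np.zeros((M, M))
    for i in range(n - 1):
        for k in range(M):
            for l in range(M):
                # A_kl = (1/P(x)) * sum_i f_k(i) * a_kl * e_l(x_{i+1}) * b_l(i+1)
                A[k, l] += f[k, i] * a[k, l] * e[l][x[i + 1]] * b[l, i + 1]
    A /= Px
    E = [{} for _ in range(M)]
    for k in range(M):
        for i, c in enumerate(x):
            # E_k(c) = (1/P(x)) * sum over positions i with x_i = c of f_k(i) * b_k(i)
            E[k][c] = E[k].get(c, 0.0) + f[k, i] * b[k, i] / Px
    # Re-estimate: a_kl = A_kl / sum_l' A_kl', e_k(c) = E_k(c) / sum_c' E_k(c')
    a_new = A / A.sum(axis=1, keepdims=True)
    e_new = [{c: v / sum(E[k].values()) for c, v in E[k].items()} for k in range(M)]
    return A, E, a_new, e_new
```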
CG
Relative entropy is positive
log(x)≤x-1
$$\sum_i P(x_i)\log\frac{Q(x_i)}{P(x_i)}\ \le\ \sum_i P(x_i)\left(\frac{Q(x_i)}{P(x_i)}-1\right)=\sum_i Q(x_i)-\sum_i P(x_i)=0,$$
hence $\sum_i P(x_i)\log\frac{P(x_i)}{Q(x_i)}\ \ge\ 0$.
CG
Baum-Welch: EM for HMM
Maximize:
$$\sum_{k=1}^{M}\sum_b E_k(b)\log e_k(b)+\sum_{k=1}^{M}\sum_{l=1}^{M}A_{kl}\log a_{kl}$$
The chosen parameters are
$$a_{ij}=\frac{A_{ij}}{\sum_k A_{ik}}\ \text{(denote these as }a_{ij}^{chosen}\text{)},\qquad e_k(b)=\frac{E_k(b)}{\sum_{b'}E_k(b')}$$
Difference between the chosen set and some other set (shown for the transition term; multiply and divide by the same factor $\sum_{l'}A_{kl'}$):
$$\sum_{k=1}^{M}\sum_{l=1}^{M}A_{kl}\log a_{kl}^{chosen}-\sum_{k=1}^{M}\sum_{l=1}^{M}A_{kl}\log a_{kl}^{other}=\sum_{k=1}^{M}\Big(\sum_{l'}A_{kl'}\Big)\sum_{l=1}^{M}a_{kl}^{chosen}\log\frac{a_{kl}^{chosen}}{a_{kl}^{other}}\ \ge\ 0$$
since each inner sum is a relative entropy, which is always non-negative (by the previous slide).