TRANSCRIPT
Lecture 7: DNA chips and clustering (4, 6/12/12)
Computational Genomics Prof. Ron Shamir & Prof. Roded Sharan School of Computer Science, Tel Aviv University
CG 2
How Gene Expression Data Looks
[Figure: "Raw Data" matrix of expression levels — rows = genes, columns = conditions]
Entries of the Raw Data matrix:
• Ratio values
• Absolute values
• …
• Row = gene’s expression pattern / fingerprint vector
• Column = experiment/condition’s profile
CG 3
Data Preprocessing
[Figure: "Raw Data" matrix of expression levels — rows = genes, columns = conditions]
•Input: Real-valued raw data matrix.
•Compute the similarity matrix (cosine angle/correlation/…)
• Alternatively – distances
From the Raw Data matrix we compute the similarity matrix S. Sij reflects the similarity of the expression patterns of gene i and gene j.
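As a concrete (illustrative) example of this step, here is a minimal Python sketch that computes a gene-by-gene similarity matrix from a raw expression matrix, using the cosine-angle and correlation measures mentioned above; the matrix X and its dimensions are invented for the example.

```python
import numpy as np

# Toy raw data: rows = genes, columns = conditions (values invented for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 13))              # 6 genes measured under 13 conditions

# Cosine-angle similarity: S[i, j] = <x_i, x_j> / (||x_i|| * ||x_j||)
norms = np.linalg.norm(X, axis=1, keepdims=True)
S_cosine = (X @ X.T) / (norms * norms.T)

# Pearson correlation is another common choice of similarity.
S_corr = np.corrcoef(X)

print(S_cosine.shape)                     # (6, 6): similarity of gene i and gene j
```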
CG 4
DNA chips: Applications
• Deducing functions of unknown genes (similar expression pattern → similar function)
• Deciphering regulatory mechanisms (co-expression → co-regulation)
• Identifying disease profiles
• Drug development
• …
Analysis requires clustering of genes/conditions.
CG 5
Clustering: Objective
Group elements (genes) into clusters satisfying:
• Homogeneity: Elements inside a cluster are highly similar to each other.
• Separation: Elements from different clusters have low similarity to each other.
• Unsupervised.
• Most formulations are NP-hard.
CG 8
An Alternative View
Instead of a partition into clusters, form a tree hierarchy of the input elements satisfying:
• More similar elements are placed closer along the tree.
• Or: tree distances reflect the distances between elements.
CG 9
Hierarchical Representation
[Figure: dendrogram over elements 1, 3, 4, 2 with merge heights 2.8, 4.5, 5.0]
Dendrogram: a rooted tree, usually binary; all leaf–root distances are equal. Ordinates reflect the (average) distances between the corresponding subtrees.
CG 10
Hierarchical Clustering: Average Linkage (Sokal & Michener '58, Lance & Williams '67)
• Input: Distance matrix (Dij)
• Iterative algorithm. Initially each element is a cluster; nr denotes the size of cluster r.
– Find the minimal element Drs in D; merge clusters r, s
– Delete elements r, s; add a new element t with $D_{it}=D_{ti}=\frac{n_r}{n_r+n_s}D_{ir}+\frac{n_s}{n_r+n_s}D_{is}$
– Repeat until a single cluster remains (a code sketch follows below)
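A minimal Python sketch of this iterative procedure (my own illustration of the update rule above, not code from the lecture); the distance matrix is kept as a dictionary keyed by cluster ids, and the toy distances at the bottom are invented.

```python
import itertools

def average_linkage(D, n):
    """D: dict {(i, j): distance} for 0 <= i < j < n. Returns the merge history."""
    size = {i: 1 for i in range(n)}                      # n_r = size of cluster r
    active, next_id, merges = set(range(n)), n, []
    dist = lambda a, b: D[(a, b)] if (a, b) in D else D[(b, a)]
    while len(active) > 1:
        # Find the minimal element D_rs among active clusters and merge r, s.
        r, s = min(itertools.combinations(active, 2), key=lambda p: dist(*p))
        merges.append((r, s, dist(r, s)))
        t = next_id; next_id += 1
        for i in active - {r, s}:
            # D_it = n_r/(n_r+n_s) * D_ir + n_s/(n_r+n_s) * D_is
            w = size[r] + size[s]
            D[(i, t)] = size[r] / w * dist(i, r) + size[s] / w * dist(i, s)
        size[t] = size[r] + size[s]
        active -= {r, s}; active.add(t)
    return merges

# Three elements with invented pairwise distances.
print(average_linkage({(0, 1): 2.0, (0, 2): 5.0, (1, 2): 4.0}, 3))
# -> [(0, 1, 2.0), (2, 3, 4.5)]   (cluster 3 is the merge of 0 and 1)
```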
CG 11
Average Linkage (cont.)
• Claim: Drs is the average distance between elements in r and s.
• Proof by induction…
• Claim: Drs can only increase.
CG 12
A General Framework (Lance & Williams '67)
• Find the minimal element Drs; merge clusters r, s
• Delete elements r, s; add a new element t with $D_{it}=D_{ti}=\alpha_r D_{ir}+\alpha_s D_{is}+\gamma\,|D_{ir}-D_{is}|$
• Single-linkage: Dit=min{Dir,Dis}
• Complete-linkage: Dit=max{Dir,Dis}
• Note: an analogous formulation exists in terms of a similarity matrix (rather than distance).
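These linkage variants are available in standard libraries; a short SciPy usage sketch (the data matrix here is invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(1).normal(size=(20, 13))    # toy: 20 genes x 13 conditions

# Members of the Lance-Williams family:
Z_avg = linkage(X, method='average', metric='cosine')
Z_single = linkage(X, method='single', metric='euclidean')
Z_complete = linkage(X, method='complete', metric='euclidean')

# Cut the average-linkage dendrogram into 4 flat clusters.
labels = fcluster(Z_avg, t=4, criterion='maxclust')
print(labels)
```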
CG 13
Hierarchical clustering of GE data (Eisen et al., PNAS 1998)
• Growth response: Starved human fibroblast cells, added serum
• Monitored 8600 genes over 13 time-points
• tij - fluorescence level of gene i in condition j; rij – same for reference (time=0).
• sij= log(tij/rij)
• $S_{kl}=\frac{\sum_j s_{kj}\,s_{lj}}{\|s_k\|\,\|s_l\|}$ (cosine of the angle)
• Applied average linkage method
• Ordered leaves by increasing element weight: average expression level, time of maximal induction, or other criteria
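A rough sketch of this pipeline in Python (not the authors' code; the fluorescence arrays below are invented), combining the log-ratio transform with average linkage on cosine distances:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(4)
T = rng.uniform(100, 1000, size=(50, 13))     # toy fluorescence levels t_ij
R = rng.uniform(100, 1000, size=(50, 13))     # toy reference levels r_ij (time = 0)

S = np.log(T / R)                             # s_ij = log(t_ij / r_ij)
Z = linkage(S, method='average', metric='cosine')   # cosine distance = 1 - cosine similarity
order = leaves_list(Z)                        # leaf order of the resulting dendrogram
print(order[:10])
```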
CG 14
CG 15
“Eisengrams” for the same data randomly permuted within rows (1), columns (2), and both (3)
CG 16
Comments
• Distinct measurements of same genes cluster together
• Genes of similar function cluster together
• Many cluster-function specific insights
• Interpretation is a REAL biological challenge
CG 17
More on hierarchical methods
• Agglomerative vs. the “more natural” divisive.
• Advantages:
  – Gives a single coherent global picture
  – Intuitive for biologists (from phylogeny)
• Disadvantages:
  – No single partition; no specific clusters
  – Forces all elements to fit a tree hierarchy
CG 19
K-means (Lloyd’ 57, Macqueen ’67)
• Input: vector vi for each element i; #clusters=k
• Define a centroid cp of a cluster Cp as its average vector.
• Goal: minimize $\sum_{p}\sum_{i\in C_p} d(v_i,c_p)$
• Objective = homogeneity only (k fixed)
• NP-hard already for k=2.
CG 20
K-means alg.
• Initialize an arbitrary partition P into k clusters.
• Repeat the following till convergence:
– Update centroids (max c, P fixed)
– Assign each point to its closest centroid (max P, c fixed)
• Can be shown to have poly expected time under various assumptions on data distribution.
• A variant: perform a single best modification (that decreases the score the most).
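A minimal Python sketch of this alternating iteration (illustrative only; the toy data and function name are my own):

```python
import numpy as np

def kmeans(V, k, n_iter=100, seed=0):
    """Lloyd's iteration: alternate closest-centroid assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    C = V[rng.choice(len(V), size=k, replace=False)]      # arbitrary initial centroids
    for _ in range(n_iter):
        # Assign each point to its closest centroid (optimize P, c fixed).
        d = np.linalg.norm(V[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update centroids as cluster averages (optimize c, P fixed).
        new_C = np.array([V[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        if np.allclose(new_C, C):
            break
        C = new_C
    return labels, C

V = np.random.default_rng(2).normal(size=(100, 13))       # toy gene vectors
labels, centroids = kmeans(V, k=3)
print(np.bincount(labels))
```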
CG 21
CG 22
CG 23
A Soft Version
• Based on a probabilistic model of the data as coming from a mixture of Gaussians:
• Goal: evaluate the parameters θ (assume σ is known).
• Method: apply EM to maximize the likelihood of data.
$P(z_i=j)=\pi_j$
$P(x_i\mid z_i=j)\sim N(\mu_j,\ \sigma^2 I)$
$L(\theta)\propto\prod_i\sum_j \pi_j\exp\!\left(-\frac{d(x_i,\mu_j)^2}{2\sigma^2}\right)$
CG 24
EM, soft version
• Iteratively, compute the soft assignments $w_{ij}=P(z_i=j\mid x_i,\theta^t)$ and use them to derive expectations of π, μ:
$$\pi_j\leftarrow\frac{1}{n}\sum_i w_{ij},\qquad \mu_j\leftarrow\frac{\sum_i w_{ij}\,x_i}{\sum_i w_{ij}}$$
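A short Python sketch of this soft EM loop for a mixture of spherical Gaussians with known, shared σ (my own illustration, consistent with the updates above; the data and parameter choices are invented):

```python
import numpy as np

def soft_em(X, k, sigma=1.0, n_iter=100, seed=0):
    """EM for a mixture of k spherical Gaussians with known, shared sigma."""
    rng = np.random.default_rng(seed)
    n = len(X)
    pi = np.full(k, 1.0 / k)                              # mixture probabilities
    mu = X[rng.choice(n, size=k, replace=False)]          # initial means
    for _ in range(n_iter):
        # E-step: w_ij proportional to pi_j * exp(-d(x_i, mu_j)^2 / (2 sigma^2))
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        W = pi * np.exp(-d2 / (2 * sigma ** 2))
        W /= W.sum(axis=1, keepdims=True)
        # M-step: expected-count updates of pi and mu
        pi = W.sum(axis=0) / n
        mu = (W.T @ X) / W.sum(axis=0)[:, None]
    return pi, mu, W

X = np.random.default_rng(3).normal(size=(200, 2))        # toy 2-D data
pi, mu, W = soft_em(X, k=2)
print(pi, mu)
```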
CG 25
Soft vs. hard k-means
Soft EM optimizes: $\log P(x\mid\theta)$ (marginalizing over the hidden assignments).
Hard EM optimizes: $\sum_i\log P(x_i,z(i)\mid\theta)$
If we use uniform mixture probabilities, then k-means is an application of hard EM, since $-\log P(x_i,z(i)\mid\theta)\propto d(x_i,\mu_{z(i)})^2$ (up to an additive constant).
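To spell out the last step (a standard one-line argument, not written explicitly on the slide): with $P(x_i\mid z_i=j)\sim N(\mu_j,\sigma^2 I)$,
$$-\log P(x_i,z(i)\mid\theta)=-\log\pi_{z(i)}+\mathrm{const}+\frac{d(x_i,\mu_{z(i)})^2}{2\sigma^2},$$
so with uniform π and fixed σ, maximizing $\sum_i\log P(x_i,z(i)\mid\theta)$ is equivalent to minimizing $\sum_i d(x_i,\mu_{z(i)})^2$, the k-means objective.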
CG
The probabilistic setting
Input: data x coming from a probabilistic model with hidden information y.
Goal: learn the model's parameters so that the likelihood of the data is maximized.
Example: a mixture of two Gaussians:
$P(y_i=1)=p_1;\qquad P(y_i=2)=p_2=1-p_1$
$P(x_i\mid y_i=j)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x_i-\mu_j)^2}{2\sigma^2}\right)$
CG
The likelihood function
$P(y_i=1)=p_1;\qquad P(y_i=2)=p_2=1-p_1$
$P(x_i\mid y_i=j)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x_i-\mu_j)^2}{2\sigma^2}\right)$
$L(\theta)=\prod_i P(x_i\mid\theta)=\prod_i\sum_j P(x_i,y_i=j\mid\theta)$
$\log L(\theta)=\sum_i\log\sum_j p_j\,\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x_i-\mu_j)^2}{2\sigma^2}\right)$
CG
The EM algorithm
Goal: max log P(x|θ) = log (Σy P(x,y|θ))
Assume we have a model θt which we wish to improve.
Note: P(x|θ) = P(x,y|θ) / P(y|x,θ)
Multiplying both sides of $\log P(x\mid\theta)=\log P(x,y\mid\theta)-\log P(y\mid x,\theta)$ by $P(y\mid x,\theta^t)$ and summing over y:
$$\sum_y P(y\mid x,\theta^t)\log P(x\mid\theta)=\sum_y P(y\mid x,\theta^t)\log P(x,y\mid\theta)-\sum_y P(y\mid x,\theta^t)\log P(y\mid x,\theta)$$
Since $\sum_y P(y\mid x,\theta^t)=1$, the left-hand side is just $\log P(x\mid\theta)$:
$$\log P(x\mid\theta)=\sum_y P(y\mid x,\theta^t)\log P(x,y\mid\theta)-\sum_y P(y\mid x,\theta^t)\log P(y\mid x,\theta)$$
Subtracting the same identity evaluated at $\theta=\theta^t$:
$$\log P(x\mid\theta)-\log P(x\mid\theta^t)=Q(\theta\mid\theta^t)-Q(\theta^t\mid\theta^t)+\sum_y P(y\mid x,\theta^t)\log\frac{P(y\mid x,\theta^t)}{P(y\mid x,\theta)}$$
Here $Q(\theta^t\mid\theta^t)$ is a constant, and the last sum is a relative entropy, hence ≥ 0.
CG
The EM algorithm (cont.)
Main component:
$$Q(\theta\mid\theta^t)=\sum_y P(y\mid x,\theta^t)\log P(x,y\mid\theta)$$
is the expectation of log P(x,y|θ) over the distribution of y given by the current parameters θt.
The algorithm:
• E-step: Calculate the Q function
• M-step: Maximize Q(θ|θt) with respect to θ
• Stopping criterion: improvement in log likelihood ≤ ε
CG 31
Application to the mixture model
$$Q(\theta\mid\theta^t)=\sum_y P(y\mid x,\theta^t)\log P(x,y\mid\theta)$$
Write the complete-data likelihood with indicator variables $y_{ij}$, where $y_{ij}=1$ if $y_i=j$ and $y_{ij}=0$ otherwise:
$$P(x,y\mid\theta)=\prod_i P(x_i,y_i\mid\theta)=\prod_i\prod_j P(x_i,y_i=j\mid\theta)^{y_{ij}}$$
$$\log P(x,y\mid\theta)=\sum_i\sum_j y_{ij}\log P(x_i,y_i=j\mid\theta)$$
$$Q(\theta\mid\theta^t)=\sum_y P(y\mid x,\theta^t)\sum_i\sum_j y_{ij}\log P(x_i,y_i=j\mid\theta)=\sum_i\sum_j\Big(\sum_y P(y\mid x,\theta^t)\,y_{ij}\Big)\log P(x_i,y_i=j\mid\theta)$$
CG 32
Application (cont.)
$$Q(\theta\mid\theta^t)=\sum_i\sum_j P(y_{ij}=1\mid x,\theta^t)\log P(x_i,y_i=j\mid\theta)$$
$$w_{ij}^t:=P(y_{ij}=1\mid x_i,\theta^t)=\frac{P(x_i,y_i=j\mid\theta^t)}{\sum_{j'}P(x_i,y_i=j'\mid\theta^t)}$$
$$Q(\theta\mid\theta^t)=\sum_i\sum_j w_{ij}^t\left[\log p_j-\log\!\big(\sqrt{2\pi}\,\sigma\big)-\frac{(x_i-\mu_j)^2}{2\sigma^2}\right]$$
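The M-step maximization of this Q (the step the slides leave implicit) is standard; a short derivation sketch: maximizing over p (with $\sum_j p_j=1$, via a Lagrange multiplier) and setting $\partial Q/\partial\mu_j=0$ gives
$$p_j^{t+1}=\frac{\sum_i w_{ij}^t}{\sum_{i,j'} w_{ij'}^t}=\frac{1}{n}\sum_i w_{ij}^t,\qquad \mu_j^{t+1}=\frac{\sum_i w_{ij}^t\,x_i}{\sum_i w_{ij}^t}.$$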
CG
Baum-Welch: EM for HMM
Here the hidden data y is the state path π, i.e. the log likelihood is
$$\log P(x\mid\theta)=\log\sum_\pi P(x,\pi\mid\theta)$$
and the Q function is:
$$Q(\theta\mid\theta^t)=\sum_\pi P(\pi\mid x,\theta^t)\log P(x,\pi\mid\theta)$$
CG
Baum-Welch (cont.)
$$P(x,\pi\mid\theta)=\prod_{k=1}^{M}\prod_{b}\big[e_k(b)\big]^{E_k(b,\pi)}\;\prod_{k=1}^{M}\prod_{l=1}^{M}a_{kl}^{A_{kl}(\pi)}$$
where $e_k(b)$ is the emission probability of character b in state k, $a_{kl}$ is the transition probability from state k to state l, $E_k(b,\pi)$ is the number of times we saw b emitted from k along π, and $A_{kl}(\pi)$ is the number of k→l transitions along π.
CG
Baum-Welch (cont.)
$$Q(\theta\mid\theta^t)=\sum_\pi P(\pi\mid x,\theta^t)\left[\sum_{k=1}^{M}\sum_b E_k(b,\pi)\log e_k(b)+\sum_{k=1}^{M}\sum_{l=1}^{M}A_{kl}(\pi)\log a_{kl}\right]$$
$$=\sum_{k=1}^{M}\sum_b\Big(\sum_\pi P(\pi\mid x,\theta^t)E_k(b,\pi)\Big)\log e_k(b)+\sum_{k=1}^{M}\sum_{l=1}^{M}\Big(\sum_\pi P(\pi\mid x,\theta^t)A_{kl}(\pi)\Big)\log a_{kl}$$
Denote the expected counts (value × probability, summed over paths) by $E_k(b):=\sum_\pi P(\pi\mid x,\theta^t)E_k(b,\pi)$ and $A_{kl}:=\sum_\pi P(\pi\mid x,\theta^t)A_{kl}(\pi)$.
CG
Baum-Welch (cont.)
• So we want to find a set of parameters θt+1 that maximizes:
$$\sum_{k=1}^{M}\sum_b E_k(b)\log e_k(b)+\sum_{k=1}^{M}\sum_{l=1}^{M}A_{kl}\log a_{kl}$$
• For maximization, select:
$$a_{ij}=\frac{A_{ij}}{\sum_k A_{ik}},\qquad e_k(b)=\frac{E_k(b)}{\sum_{b'}E_k(b')}$$
• Ek(b), Akl can be computed using forward/backward:
$$P(\pi_i=k,\ \pi_{i+1}=l\mid x,\theta^t)=\frac{1}{P(x)}\,f_k(i)\,a_{kl}\,e_l(x_{i+1})\,b_l(i+1)$$
$$A_{kl}=\frac{1}{P(x)}\sum_i f_k(i)\,a_{kl}\,e_l(x_{i+1})\,b_l(i+1);\qquad E_k(b)=\frac{1}{P(x)}\sum_{\{i\,:\,x_i=b\}}f_k(i)\,b_k(i)$$
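A brief Python sketch of these expected-count updates, assuming the forward matrix f, backward matrix b, and the total likelihood P(x) have already been computed by the standard forward/backward recursions; all array names and shapes are my own assumptions, not code from the lecture:

```python
import numpy as np

def baum_welch_counts(x, f, b, a, e, Px):
    """Expected counts A[k, l] and E[k][c] and the re-estimated parameters.
    f, b: forward/backward matrices of shape (M states, n positions);
    a: transition matrix (M x M); e: list of dicts, e[k][c] = emission prob;
    Px: total likelihood P(x)."""
    M, n = f.shape
    A = np.zeros((M, M))
    for i in range(n - 1):
        for k in range(M):
            for l in range(M):
                # A_kl = (1/P(x)) * sum_i f_k(i) * a_kl * e_l(x_{i+1}) * b_l(i+1)
                A[k, l] += f[k, i] * a[k, l] * e[l][x[i + 1]] * b[l, i + 1]
    A /= Px
    E = [{} for _ in range(M)]
    for k in range(M):
        for i, c in enumerate(x):
            # E_k(c) = (1/P(x)) * sum over positions i with x_i = c of f_k(i) * b_k(i)
            E[k][c] = E[k].get(c, 0.0) + f[k, i] * b[k, i] / Px
    # Re-estimate: a_kl = A_kl / sum_l' A_kl', e_k(c) = E_k(c) / sum_c' E_k(c')
    a_new = A / A.sum(axis=1, keepdims=True)
    e_new = [{c: v / sum(E[k].values()) for c, v in E[k].items()} for k in range(M)]
    return A, E, a_new, e_new
```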
CG
Relative entropy is positive
log(x)≤x-1
$$\sum_i P(x_i)\log\frac{Q(x_i)}{P(x_i)}\ \le\ \sum_i P(x_i)\left(\frac{Q(x_i)}{P(x_i)}-1\right)=\sum_i Q(x_i)-\sum_i P(x_i)=0,$$
hence $\sum_i P(x_i)\log\frac{P(x_i)}{Q(x_i)}\ \ge\ 0$.
CG
Baum-Welch: EM for HMM
Maximize:
$$\sum_{k=1}^{M}\sum_b E_k(b)\log e_k(b)+\sum_{k=1}^{M}\sum_{l=1}^{M}A_{kl}\log a_{kl}$$
The chosen parameters are
$$a_{ij}=\frac{A_{ij}}{\sum_k A_{ik}}\ \text{(denote these as }a_{ij}^{chosen}\text{)},\qquad e_k(b)=\frac{E_k(b)}{\sum_{b'}E_k(b')}$$
Difference between the chosen set and some other set (shown for the transition term; multiply and divide by the same factor $\sum_{l'}A_{kl'}$):
$$\sum_{k=1}^{M}\sum_{l=1}^{M}A_{kl}\log a_{kl}^{chosen}-\sum_{k=1}^{M}\sum_{l=1}^{M}A_{kl}\log a_{kl}^{other}=\sum_{k=1}^{M}\Big(\sum_{l'}A_{kl'}\Big)\sum_{l=1}^{M}a_{kl}^{chosen}\log\frac{a_{kl}^{chosen}}{a_{kl}^{other}}\ \ge\ 0$$
since each inner sum is a relative entropy, which is always non-negative (by the previous slide).