Project in Advanced Programming (236512): Optimal Distance Functions for Reconstructing Evolutionary Trees, Spring Semester 2010
http://webcourse.cs.technion.ac.il/236512/
Name            Room       Phone   Email
Shlomo Moran    Taub 639   4363    moran@cs
Daniel Doerr    Taub 534   4319    ddoerr@cs
Project topic: the effect of distance functions on the reconstruction of evolutionary trees
After the announcements, today we will give a short crash course on:
1. Evolutionary trees: definitions and DNA-based models
2. Distance-based methods for building evolutionary trees
3. Distance functions for evolutionary models
4. Estimating distances between species from the differences between their DNA sequences
After the crash course you will still need some supplementary material later in the semester.
The projects will be presented during the crash course.
Administration
Prerequisites: Algorithms 1, Probability. Recommended (but not required): Algorithms in Computational Biology.
As a rule, projects are done in pairs. Let us know your pairing by email within a week.
Choosing a project: there will be two main directions. The initial phase is similar in both; the focus on a specific direction comes later (in about a month).
(From here on, the slides are in English.)
Crash course on evolutionary distances
The Phylogenetic Reconstruction Problem
[Figure: an evolution tree whose nodes are labeled with 7-letter DNA sequences such as AATCCTG, ATAGCTG and AATGGGC.]
Evolution is modeled by DNA sequences which evolve along an Evolution Tree (Phylogeny).
(All our sequences are DNA sequences, consisting of {A,G,C,T}.)
[Figure: the ten sequences AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACCGTTG, TCTGGGA, TCCGGAA, AGCCGTG and GGGGATT.]
Phylogenetic Reconstruction
A : AATGGGC
B : AATCCTG
C : ATAGCTG
D : GAACGTA
E : AAACCGA
F : GGGGATT
G : TCTGGGA
H : TCCGGAA
I : AGCCGTG
J : ACCGTTG
Goal: reconstruct the ‘true’ tree as accurately as possible
[Figure: two drawings of the reconstructed tree on the leaves A-J, one of them marked with a (root).]
Three Methods of Tree Construction
Parsimony – A tree with minimum number of mutations.
Maximum likelihood – Finding the “most probable” tree.
Distance – A weighted tree that realizes the distances between the species.
Distance Based Reconstruction: Exact vs. approximate distances
[Figure: an edge-weighted ‘true’ tree T on the taxa A-G, and the tree reconstructed from its pairwise distances.]
Exact distances: $D_T = \{d_T(i,j)\}_{i,j \in S}$; reconstruction in $O(n^2)$.
Approximate distances: $\hat{D} = \{\hat{d}(i,j)\}_{i,j \in S}$, where $\hat{d}(i,j) = d_T(i,j) + \alpha(i,j)$ and $\alpha$ is the noise.
Major problem: sensitivity to noise.
The Algorithmic Aspect
[Figure: the edge-weighted ‘true’ tree T with its exact distances $D_T = \{d_T(i,j)\}_{i,j \in S}$; reconstruction in $O(n^2)$.]
Many algorithms can reconstruct a weighted tree from the exact distances. In this project we will use the Saitou & Nei Neighbor Joining algorithm, or simply the “NJ algorithm”.
The Distance Estimation Aspect
Evolutionary distances:
- How are they defined?
- How are they extracted from the DNA sequences?
We’ll show this on a specific model: the Kimura 2 Parameter (K2P) model.
[Figure: the exact distances $D_T = \{d_T(i,j)\}_{i,j \in S}$ versus the estimated distances $\hat{D} = \{\hat{d}(i,j)\}_{i,j \in S}$, where $\hat{d}(i,j) = d_T(i,j) + \alpha(i,j)$ with noise $\alpha$.]
The Kimura 2 Parameter (K2P) model [Kimura80]: each edge (u, v) corresponds to a “Rate Matrix”
[Figure: transitions occur within {A, G} and within {C, T}; transversions occur between the two sets.]
Transitions/Transversions ratio: $R = \alpha / 2\beta$
K2P generic rate matrix:
      A   G   C   T
  A   -   α   β   β
  G   α   -   β   β
  C   β   β   -   α
  T   β   β   α   -
K2P standard distance: $\Delta_{total}$ = total substitution rate
The total substitution rate of a K2P rate matrix $R$ is $\Delta_{total}(R) = \alpha + 2\beta$.
This is the expected number of mutations per site. It is an additive distance.
[Figure: a path u - v - w whose edges have total rates $\alpha + 2\beta$ and $\alpha' + 2\beta'$; the rate of the whole path is their sum, $(\alpha + \alpha') + 2(\beta + \beta')$.]
The distance $\Delta_{total}(R_{uv}) = d_{K2P}(u,v)$ is estimated from the aligned sequences:
u  A A C A … G T C T T C G A G G C C C
v  A G C A … G C C T A T G C G A C C T
The K2P total rate “distance correction” procedure yields $\hat{d}_{K2P}(u,v) = \hat{\alpha} + 2\hat{\beta}$.
Since mutations may overwrite each other, this is a “noisy” process.
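The slides do not spell out the correction procedure. A minimal sketch, assuming it is Kimura's closed-form estimate computed from the observed fraction of transitions $P$ and transversions $Q$, namely $-\frac{1}{2}\ln(1 - 2P - Q) - \frac{1}{4}\ln(1 - 2Q)$. The function name `k2p_distance` and the choice to return infinity on saturated input are illustrative assumptions, not part of the course material:

```python
import math

PURINES = {"A", "G"}  # the pyrimidines are C and T


def k2p_distance(u: str, v: str) -> float:
    """Estimate the K2P total rate (expected substitutions per site)
    between two aligned DNA sequences of equal length."""
    if len(u) != len(v) or not u:
        raise ValueError("sequences must be aligned and non-empty")
    transitions = transversions = 0
    for a, b in zip(u, v):
        if a == b:
            continue
        # A change inside {A,G} or inside {C,T} is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(u)
    q = transversions / len(u)
    x, y = 1 - 2 * p - q, 1 - 2 * q
    if x <= 0 or y <= 0:
        return math.inf  # saturated: the logarithm is undefined
    return -0.5 * math.log(x) - 0.25 * math.log(y)
```

For short sequences the estimate is very noisy, which is exactly the effect the following slides measure.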
A basic question: How good is a reconstruction method which uses K2P distances?
The performance of tree reconstruction methods is often tested on quartets, which are trees with 4 taxa. A quartet contains a single internal edge, which defines the quartet-split.
[Figure: a quartet with leaves A, B on one side and C, D on the other, joined by an internal edge of weight $w_{sep}$.]
A correct reconstruction of the quartet requires finding the true quartet-split.
There are 3 possible splits: AB|CD, AC|BD and AD|BC.
Distance methods reconstruct the true split by the 4-point condition:
$d_{K2P}(A,B) + d_{K2P}(C,D) + 2w_{sep} = d_{K2P}(A,C) + d_{K2P}(B,D) = d_{K2P}(A,D) + d_{K2P}(B,C)$
The 4-point condition for noisy distances is:
$\hat{d}_{K2P}(A,B) + \hat{d}_{K2P}(C,D) < \min\{\hat{d}_{K2P}(A,C) + \hat{d}_{K2P}(B,D),\ \hat{d}_{K2P}(A,D) + \hat{d}_{K2P}(B,C)\}$
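The 4-point condition can be sketched as picking the pairing with the smallest sum of distances. The function name `resolve_split` and the dictionary-based distance representation are assumptions made for this illustration:

```python
def resolve_split(d):
    """Return the quartet split chosen by the 4-point condition.

    d maps frozenset({x, y}) to the (possibly noisy) distance between
    the taxa x and y, for the four taxa 'A', 'B', 'C', 'D'.
    """
    def dist(x, y):
        return d[frozenset((x, y))]

    sums = {
        "AB|CD": dist("A", "B") + dist("C", "D"),
        "AC|BD": dist("A", "C") + dist("B", "D"),
        "AD|BC": dist("A", "D") + dist("B", "C"),
    }
    # The split whose two within-pair distances sum to the minimum wins.
    return min(sums, key=sums.get)
```

With exact tree distances the minimum is unique whenever $w_{sep} > 0$; with noisy distances the wrong pairing may win, which counts as a failure in the test described next.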
We evaluate the accuracy of the K2P distance estimation by the Split Resolution Test:
[Figure: a model quartet with leaves A, B, C, D, rooted on its internal edge. Each of the four leaf edges has length 10t and each half of the internal edge has length t; every edge carries a K2P rate matrix.]
t is “evolutionary time”.
The diameter of the quartet is 22t.
Phase A: simulate evolution
[Figure: a sequence (CCCGGAGCTTCTG…ACAA) at the root is evolved along the edges of the model quartet down to the leaves A, B, C, D.]
Phase B: reconstruct the split by the 4p condition
[Figure: the simulated sequences $s_A, s_B, s_C, s_D$ at the leaves A, B, C, D.]
Compute distances between sequences: $\hat{D}(i,j) = \hat{d}_{K2P}(s_i, s_j)$.
Apply the 4p condition. Was the correct split found?
Repeat this process 10,000 times and count the number of failures.
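Phases A and B can be sketched together as follows. This is a minimal illustration under stated assumptions, not the course's reference simulation: the sequence length `n_sites`, the default number of trials, the seeding, and all function names are hypothetical, and an edge of total rate $\ell$ is simulated in one step using the K2P substitution probabilities implied by $\alpha + 2\beta = \ell$ and $R = \alpha / 2\beta$.

```python
import math
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def edge_probs(length, R):
    """(transition prob., prob. of each single transversion) on an edge
    with alpha + 2*beta == length and R == alpha / (2*beta)."""
    beta = length / (2 * R + 2)
    alpha = 2 * R * beta
    lam = math.exp(-4 * beta)
    mu = math.exp(-2 * alpha - 2 * beta)
    return (1 + lam - 2 * mu) / 4, (1 - lam) / 4


def evolve(seq, length, R, rng):
    """Phase A: evolve a sequence along one edge, site by site."""
    p_ts, p_tv = edge_probs(length, R)
    out = []
    for base in seq:
        r = rng.random()
        if r < p_ts:
            out.append(TRANSITION[base])
        elif r < p_ts + 2 * p_tv:
            out.append(rng.choice(TRANSVERSIONS[base]))
        else:
            out.append(base)
    return "".join(out)


def k2p_distance(u, v):
    """Kimura's closed-form estimate; infinity when saturated."""
    ts = sum(a != b and TRANSITION[a] == b for a, b in zip(u, v))
    tv = sum(a != b and TRANSITION[a] != b for a, b in zip(u, v))
    x, y = 1 - (2 * ts + tv) / len(u), 1 - 2 * tv / len(u)
    if x <= 0 or y <= 0:
        return math.inf
    return -0.5 * math.log(x) - 0.25 * math.log(y)


def split_found(diameter, R, n_sites, rng):
    """One run: simulate the model quartet, then apply the 4p condition."""
    t = diameter / 22  # leaf edges are 10t, each half of the internal edge is t
    root = "".join(rng.choice("ACGT") for _ in range(n_sites))
    left, right = evolve(root, t, R, rng), evolve(root, t, R, rng)
    s = {"A": evolve(left, 10 * t, R, rng), "B": evolve(left, 10 * t, R, rng),
         "C": evolve(right, 10 * t, R, rng), "D": evolve(right, 10 * t, R, rng)}

    def d(x, y):
        return k2p_distance(s[x], s[y])

    # Phase B: the true split is AB|CD.
    return d("A", "B") + d("C", "D") < min(d("A", "C") + d("B", "D"),
                                           d("A", "D") + d("B", "C"))


def failure_rate(diameter, R=10, n_sites=500, trials=1000, seed=0):
    """Fraction of runs in which the 4p condition misses the true split."""
    rng = random.Random(seed)
    failures = sum(not split_found(diameter, R, n_sites, rng)
                   for _ in range(trials))
    return failures / trials
```

Calling `failure_rate(d)` for a range of diameters `d` gives one curve of the kind plotted on the following slides; the slides use 10,000 repetitions per diameter.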
The split resolution test was applied on the model quartet with various diameters.
[Figure: copies of the model quartet (leaf edges 10t, internal half-edges t) for a range of values of t.]
For each diameter, mark the fraction (percentage) of the simulations in which the 4p condition failed (next slide).
Performance of K2P distances in resolving quartets, small diameters: 0.01-0.2
[Plot: performance of the K2P standard distance method in resolving quartets, R=10. x-axis: quartet diameter (total rate between furthest leaves), 0 to 0.2. y-axis: fraction of failures out of 10000 experiments. Inset: the template quartet.]
Performance for larger diameters
[Plot: performance of the K2P standard distance method in resolving quartets, for quartet ratio 0.1, R=10. x-axis: quartet diameter (= mutation rate between furthest leaves), 0.2 to 2. y-axis: fraction of failures out of 10000 simulations. Annotation: “site saturation”.]
When β < α, we can postpone the “site saturation” effect. For this, use another distance function for the same model, $\Delta_{tv}$, which counts only transversions.
[Figure: the two classes {A, G} and {C, T} are collapsed into the two states {0} and {1}; transitions (rate α) stay inside a state, transversions (rate β) move between the states.]
This is the CFN model [Cavender78, Farris73, Neyman71].
Apply the same split resolution test on the transversions only distance:
u  A A C A … G T C T T C G A G G C C C
v  A G C A … G C C T A T G C G A C C T
The transversions only “distance correction” procedure yields $\hat{d}_{tr}(u,v)$, which depends on $\hat{\beta}$ alone.
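A minimal sketch of a transversions-only estimate. It assumes the distance is scaled as the total transversion rate $2\beta$, which follows from $1 - 2Q = e^{-4\beta}$ for the observed transversion fraction $Q$; the slide's own normalization is not recoverable, so treat the scaling and the name `transversion_distance` as assumptions:

```python
import math

PURINES = {"A", "G"}


def transversion_distance(u: str, v: str) -> float:
    """Estimate 2*beta from the fraction of sites whose two bases lie
    in different classes ({A,G} versus {C,T}); transitions are ignored."""
    if len(u) != len(v) or not u:
        raise ValueError("sequences must be aligned and non-empty")
    q = sum((a in PURINES) != (b in PURINES) for a, b in zip(u, v)) / len(u)
    y = 1 - 2 * q
    if y <= 0:
        return math.inf  # saturated
    return -0.5 * math.log(y)
```

Because transitions are ignored, this estimate saturates later than the total K2P rate when β < α, at the price of using fewer observed substitutions on short distances.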
Transversions only performs better on large rates, worse on small rates
[Plot: performance of distance methods in resolving quartets, R=10. x-axis: quartet diameter, 0 to 2. y-axis: fraction of failures out of 10000 experiments. Curves: “Transversions only” and “total K2P rate”.]
Conclusion: Distance based reconstruction methods should be adaptive:
Find a distance function $d$ which is good for the input, $\hat{D}(u,v) = \hat{d}_D(u,v)$.
Project goal: Evaluate the performance of distance functions in reconstructing phylogenies.
1st step in finding good distance functions (for the K2P model):
Characterize the available distance functions. Ideally, we would like to use the K2P distance associated with the rate matrix of each edge, but...
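Rate matrices are hard to observe, hence we use Substitution matrices
u  A A C A … G T C T T C G A G G C C C
v  A G C A … G C C T A T G C G A C C T
The sequences evolve by a rate matrix $R_{uv}$ with unknown model parameters α, β. What we work with instead is a stochastic substitution matrix $P_{uv}$ (rows and columns in the order A, G, C, T):
$$P_{uv} = \begin{pmatrix} 1 - p_\alpha - 2p_\beta & p_\alpha & p_\beta & p_\beta \\ p_\alpha & 1 - p_\alpha - 2p_\beta & p_\beta & p_\beta \\ p_\beta & p_\beta & 1 - p_\alpha - 2p_\beta & p_\alpha \\ p_\beta & p_\beta & p_\alpha & 1 - p_\alpha - 2p_\beta \end{pmatrix}$$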
Substitution matrices are extended to paths:
[Figure: a path u - v - w with matrices $P_{uv}$ and $P_{vw}$ on its edges.]
$P_{uw} = P_{uv} P_{vw}$
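A small sketch of the path rule, assuming the A, G, C, T ordering of rows and columns and plain nested lists as matrices. The helper names `k2p_matrix`, `matmul` and `is_k2p` are illustrative:

```python
def k2p_matrix(p_ts, p_tv):
    """K2P substitution matrix in the order A, G, C, T, where p_ts is the
    transition probability and p_tv the probability of each transversion."""
    s = 1 - p_ts - 2 * p_tv
    return [[s, p_ts, p_tv, p_tv],
            [p_ts, s, p_tv, p_tv],
            [p_tv, p_tv, s, p_ts],
            [p_tv, p_tv, p_ts, s]]


def matmul(P, Q):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def is_k2p(P, tol=1e-12):
    """Check that P has the K2P pattern produced by k2p_matrix."""
    expected = k2p_matrix(P[0][1], P[0][2])
    return all(abs(P[i][j] - expected[i][j]) <= tol
               for i in range(4) for j in range(4))


# Substitution matrix of the path u - v - w.
P_uv = k2p_matrix(0.10, 0.05)
P_vw = k2p_matrix(0.20, 0.02)
P_uw = matmul(P_uv, P_vw)
```

The product of two K2P matrices again has the K2P pattern, so the path u - w can be treated like a single edge.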
Substitution matrices are converted to distances by a Substitution Rate (SR) function Δ
[Figure: a path u - v - w with matrices $P_{uv}$ and $P_{vw}$ on its edges.]
An SR function needs to satisfy the following for all substitution matrices P, Q in K2P:
1. Δ(PQ) = Δ(P) + Δ(Q) (additivity)
2. Δ(P) > 0 (positivity)
To define SR functions which are additive, Δ(PQ) = Δ(P) + Δ(Q), we use some linear algebra.
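Lemma: There is a matrix $U$ which diagonalizes each K2P Substitution Matrix $P$:
$$U^{-1} P U = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \lambda_P & 0 & 0 \\ 0 & 0 & \mu_P & 0 \\ 0 & 0 & 0 & \mu_P \end{pmatrix}$$
Where (rows and columns of $P$ in the order A, G, C, T):
$$P = \begin{pmatrix} 1 - p_\alpha - 2p_\beta & p_\alpha & p_\beta & p_\beta \\ p_\alpha & 1 - p_\alpha - 2p_\beta & p_\beta & p_\beta \\ p_\beta & p_\beta & 1 - p_\alpha - 2p_\beta & p_\alpha \\ p_\beta & p_\beta & p_\alpha & 1 - p_\alpha - 2p_\beta \end{pmatrix}$$
$\lambda_P = 1 - 4p_\beta = e^{-4\beta}$
$\mu_P = 1 - 2p_\alpha - 2p_\beta = e^{-2\alpha - 2\beta}$
$0 < \lambda_P < 1$, $0 < \mu_P < 1$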
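The lemma can be checked numerically. The sketch below assumes one possible choice for the columns of $U$ (the slides do not give $U$ explicitly) and verifies that each column is an eigenvector of $P$ with the stated eigenvalue; all names are illustrative:

```python
def k2p_matrix(p_ts, p_tv):
    """K2P substitution matrix in the order A, G, C, T."""
    s = 1 - p_ts - 2 * p_tv
    return [[s, p_ts, p_tv, p_tv],
            [p_ts, s, p_tv, p_tv],
            [p_tv, p_tv, s, p_ts],
            [p_tv, p_tv, p_ts, s]]


def eigenvalues(p_ts, p_tv):
    """(lambda_P, mu_P) as stated in the lemma."""
    return 1 - 4 * p_tv, 1 - 2 * p_ts - 2 * p_tv


def apply(P, v):
    """Matrix-vector product P v."""
    return [sum(P[i][j] * v[j] for j in range(4)) for i in range(4)]


# Columns of one possible U (an assumption): the same vectors work for
# every K2P matrix, which is what makes a single U diagonalize them all.
EIGENVECTORS = [(1, 1, 1, 1), (1, 1, -1, -1), (1, -1, 0, 0), (0, 0, 1, -1)]


def check_lemma(p_ts, p_tv, tol=1e-12):
    """True if P v == eigenvalue * v for each column v of U."""
    P = k2p_matrix(p_ts, p_tv)
    lam, mu = eigenvalues(p_ts, p_tv)
    for value, vec in zip((1.0, lam, mu, mu), EIGENVECTORS):
        image = apply(P, vec)
        if any(abs(x - value * y) > tol for x, y in zip(image, vec)):
            return False
    return True
```

The eigenvectors do not depend on $p_\alpha$ or $p_\beta$, so all K2P substitution matrices share them.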
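Let $P, Q$ be two matrices in K2P. Then:
$U^{-1} P U = \mathrm{diag}(1, \lambda_P, \mu_P, \mu_P)$
$U^{-1} Q U = \mathrm{diag}(1, \lambda_Q, \mu_Q, \mu_Q)$
$U^{-1} PQ U = (U^{-1} P U)(U^{-1} Q U) = \mathrm{diag}(1, \lambda_P \lambda_Q, \mu_P \mu_Q, \mu_P \mu_Q)$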
Hence, the functions $D_\lambda(P) = -\ln(\lambda_P)$ and $D_\mu(P) = -\ln(\mu_P)$ are additive distance functions for the K2P model.
Proof: $D_\lambda(PQ) = -\ln(\lambda_P \lambda_Q) = -\ln(\lambda_P) - \ln(\lambda_Q) = D_\lambda(P) + D_\lambda(Q)$, and the same for $D_\mu(P) = -\ln(\mu_P)$.
Moreover, each positive linear combination of $D_\lambda$ and $D_\mu$ is an additive distance function:
[Figure: a path u - v - w with matrices $P_{uv}$ and $P_{vw}$ on its edges.]
$D(u,v) = -c_1 \ln(\lambda_{P_{uv}}) - c_2 \ln(\mu_{P_{uv}})$
$D(v,w) = -c_1 \ln(\lambda_{P_{vw}}) - c_2 \ln(\mu_{P_{vw}})$
$D(u,w) = -c_1 [\ln(\lambda_{P_{uv}}) + \ln(\lambda_{P_{vw}})] - c_2 [\ln(\mu_{P_{uv}}) + \ln(\mu_{P_{vw}})] = D(u,v) + D(v,w)$
Our goal: given a set of input sequences, select $D$ which guarantees the best reconstruction of the true tree.
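A sketch of this family of distance functions, assuming the A, G, C, T matrix ordering and reading $p_\alpha$, $p_\beta$ directly off the first row of the matrix. The names `distance`, `c1` and `c2` are illustrative:

```python
import math


def k2p_matrix(p_ts, p_tv):
    """K2P substitution matrix in the order A, G, C, T."""
    s = 1 - p_ts - 2 * p_tv
    return [[s, p_ts, p_tv, p_tv],
            [p_ts, s, p_tv, p_tv],
            [p_tv, p_tv, s, p_ts],
            [p_tv, p_tv, p_ts, s]]


def matmul(P, Q):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def distance(P, c1, c2):
    """D(P) = -c1*ln(lambda_P) - c2*ln(mu_P) for a K2P matrix P."""
    p_ts, p_tv = P[0][1], P[0][2]
    lam = 1 - 4 * p_tv
    mu = 1 - 2 * p_ts - 2 * p_tv
    return -c1 * math.log(lam) - c2 * math.log(mu)


# Additivity along the path u - v - w for an arbitrary positive (c1, c2).
P_uv = k2p_matrix(0.10, 0.05)
P_vw = k2p_matrix(0.20, 0.02)
P_uw = matmul(P_uv, P_vw)
gap = distance(P_uw, 0.3, 0.7) - distance(P_uv, 0.3, 0.7) - distance(P_vw, 0.3, 0.7)
```

Since $\lambda_P = e^{-4\beta}$ and $\mu_P = e^{-2\alpha - 2\beta}$, the choice $c_1 = 1/4$, $c_2 = 1/2$ gives the total rate $\alpha + 2\beta$, and $c_1 = 1/2$, $c_2 = 0$ (a boundary case with a zero coefficient) gives the transversion rate $2\beta$.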
The approximate distance function is defined by the observable noisy version of the substitution matrices, $\hat{P}_{uv}$ and $\hat{P}_{vw}$ in place of $P_{uv}$ and $P_{vw}$.
[Figure: a path u - v - w with the sequences ACGGTCA, ACGGATA and GGGGATT at its nodes.]
We would like to use functions which minimize the influence of the “noise” on the reconstruction. Such a function can be defined and computed analytically for a single distance. Computing it for even small trees looks hard.
Summary
• We have infinitely many additive distance functions for the K2P model.
• Which one should we use for the given input DNA sequences?
• If we have the exact substitution matrices for all pairs of taxa, then all functions are equally good.
• But we have only finite sequences, whose alignments provide only estimations of the true substitution matrices.
3 phases of the project
• Phase 1: Distance functions on simulated quartets: 1 month
• Phase 2: Distance functions on larger simulated trees: (1+) month
• Phase 3: Extensions to real data and/or different models: 1 month
Phases 2 and 3 are flexible.
Phase I: Quartets (~one month)
• Study the relevant info in “Towards Optimal....”: http://webcourse.cs.technion.ac.il/236512/Spring2010/ho/WCFiles/optimal_distance_functions.pdf
• Write a program (in MATLAB or C..) which computes optimal distance functions as in the above paper.
• Repeat the “quartet resolution test” given in this presentation, and extend it to include optimal distance functions.
• Feel free to modify the simulation as you see fit.
Phase II: Reconstructing Larger Trees using the Neighbor Joining Algorithm
1. Study the Neighbor Joining algorithm.
2. Study the Newick tree representation and the Robinson-Foulds measure.
3. Make similar tests, but this time on larger trees.
4. An implementation of NJ, and “Tree Templates”, can be downloaded from the web.
More information will be given later, either via the course site or in a meeting.
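For orientation while studying the algorithm, here is a compact sketch of the standard Neighbor Joining procedure. It is not the downloadable implementation mentioned above; the edge-list output format, the internal node names `u0`, `u1`, ... and the dictionary-based distance input are assumptions of this illustration (input labels are assumed not to collide with the internal names):

```python
def neighbor_joining(names, d):
    """Build an unrooted tree from pairwise distances.

    names: list of leaf labels.
    d: dict mapping frozenset({x, y}) to the distance between x and y.
    Returns a list of edges (node, node, length).
    """
    d = dict(d)
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[x] is the sum of distances from x to all other active nodes.
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x)
             for x in nodes}
        # Join the pair minimizing (n-2)*d(i,j) - r[i] - r[j].
        i, j = min(((x, y) for a, x in enumerate(nodes) for y in nodes[a + 1:]),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        li = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        u = f"u{counter}"
        counter += 1
        edges += [(u, i, li), (u, j, dij - li)]
        # Distances from the new internal node to the remaining nodes.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (d[frozenset((i, k))]
                                        + d[frozenset((j, k))] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges
```

On exact (additive) distances the procedure recovers the tree and its edge weights; on estimated distances the result depends on the distance function, which is what Phase II measures.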
Phase III: Trees from Real Data
1. Get homologous DNA sequences from existing databases.
2. Align the sequences using public domain software.
3. Select appropriate distance functions, and use them to estimate distances between the aligned sequences.
4. Use the various distance functions to reconstruct the trees, and compare their performance.