236503 Project in Programming
A Comparative Study of Evolutionary Tree Reconstruction: Existing Algorithms versus Integer Programming
Spring 2013
Lecturer: Shlomo Moran
External supervisor: Yossi Shiloach
Website: http://webcourse.cs.technion.ac.il/236503/
2
Evolution
Evolution of new organisms is driven by:
Diversity: different individuals carry different variants of the same basic blueprint.
Mutations: the DNA sequence can be changed by single-base changes, deletion/insertion of DNA segments, etc.
Selection bias.
3
The Phylogenetic Reconstruction Problem
MPI, June 2012
4
Evolution is modeled by a tree.
(Species are represented by their DNA sequences, over the alphabet {A, G, C, T}.)
[Figure: a tree whose nodes are labeled with DNA sequences: AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACGGTCA, ACGGATA, ACGGGTA, ACCCGTG, ACCGTTG, TCTGGTA, TCTGGGA, TCCGGAA, AGCCGTG, GGGGATT, AAAGTCA, AAAGGCG, AAACACA, AAAGCTG.]
5
Phylogenetic Reconstruction
[Figure: only the sequences of the present-day species are given: AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACCGTTG, TCTGGGA, TCCGGAA, AGCCGTG, GGGGATT.]
6
Phylogenetic Reconstruction
B : AATCCTG
C : ATAGCTG
A : AATGGGC
D : GAACGTA
E : AAACCGA
J : ACCGTTG
G : TCTGGGA
H : TCCGGAA
I : AGCCGTG
F : GGGGATT
Goal: reconstruct the 'true' tree as accurately as possible.
Distance methods: use "evolutionary distances" between sequences.
[Figure: the sequences A-J are reconstructed into a rooted tree whose leaves are A-J.]
7
Reconstructing a weighted tree from exact interleaf distances
Exact (additive) distances between leaves: $D = \{d(u,v)\}_{u,v \in S}$
Reconstruction algorithm (linear-time)
[Figure: an unknown edge-weighted tree on leaves A-G (edge weights 5, 6, 0.4, 6, 3, 0.32, 2, 4, 5) and the reconstructed tree, which has the same leaves and the same edge weights.]
8
Formal statement of the problem for exact distances
Input: an n×n distance matrix D = (d(i,j)) such that d(i,i) = 0 and, for i ≠ j, d(i,j) > 0 and d(i,j) = d(j,i). For all i, j, k it holds that d(i,k) ≤ d(i,j) + d(j,k).
Output: if the distances can be realized by a weighted tree (i.e., the distances are additive), return that tree.
Otherwise, return nothing.
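The input conditions can be checked directly before any reconstruction is attempted. The following is a minimal sketch, assuming the matrix is given as a list of lists; the function name `is_metric` and the tolerance `tol` are illustrative choices, not part of the course material.

```python
from itertools import permutations

def is_metric(d, tol=1e-12):
    """Check the input conditions: zero diagonal, positivity, symmetry, triangle inequality."""
    n = len(d)
    for i in range(n):
        if d[i][i] != 0:
            return False
        for j in range(n):
            if i != j and (d[i][j] <= 0 or abs(d[i][j] - d[j][i]) > tol):
                return False
    # d(i,k) <= d(i,j) + d(j,k) for all triples of distinct indices
    return all(d[i][k] <= d[i][j] + d[j][k] + tol
               for i, j, k in permutations(range(n), 3))

good = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
bad = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]   # violates 5 <= 1 + 1
print(is_metric(good), is_metric(bad))
```

Passing this check does not make the distances additive; it only validates the input.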
Distance-based reconstruction methods (since the 60's) operate on a distance matrix $\hat{D} = \big(\hat{d}(u,v)\big)$.
10
Solution for 3 objects
For n = 3: each distance metric can be realized by a (unique) tree with one internal node v, joined to the leaves i, j, k by edges of weights a, b, c:
$d(i,j) = a + b, \quad d(i,k) = a + c, \quad d(j,k) = b + c$
      i     j     k
i     0    a+b   a+c
j           0    b+c
k                 0
Distance metrics on 4 objects may not be realizable by a tree.
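Solving the three equations for a, b, c gives the edge weights explicitly. A small sketch follows; the function name `star_tree_weights` is hypothetical.

```python
def star_tree_weights(d_ij, d_ik, d_jk):
    """Edge weights (a, b, c) of the one-internal-node tree on leaves i, j, k."""
    a = (d_ij + d_ik - d_jk) / 2  # weight of the edge from i to v
    b = (d_ij + d_jk - d_ik) / 2  # weight of the edge from j to v
    c = (d_ik + d_jk - d_ij) / 2  # weight of the edge from k to v
    return a, b, c

a, b, c = star_tree_weights(5.0, 6.0, 7.0)
print(a, b, c)  # a + b = 5, a + c = 6, b + c = 7
```

The triangle inequality guarantees that none of the three weights is negative.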
11
The Four Points Condition
Definition: A distance metric on n objects satisfies the four points condition iff any subset of four objects can be labeled i,j,k,l so that:
d(i,k) + d(j,l) = d(i,l) +d(k,j) ≥ d(i,j) + d(k,l)
[Figure: a tree on the four leaves i, j, k, l.]
Theorem: A distance metric is additive iff it satisfies the four points condition
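The condition can be tested by brute force over all quartets: among the three pairwise sums, the two largest must be equal. This is a minimal sketch, assuming a list-of-lists matrix; the function name and the tolerance are illustrative.

```python
from itertools import combinations

def satisfies_four_point(d, tol=1e-9):
    """True iff every 4 objects satisfy the four points condition."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:  # the two largest sums must coincide
            return False
    return True

# Distances of a tree: leaves 0,1 hang off x (weights 1, 2), leaves 2,3 off y (3, 4), edge x-y = 5.
tree_like = [[0, 3, 9, 10], [3, 0, 10, 11], [9, 10, 0, 7], [10, 11, 7, 0]]
# Four points on a cycle with unit sides: a metric, but not additive.
cycle = [[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]]
print(satisfies_four_point(tree_like), satisfies_four_point(cycle))
```

The check takes O(n⁴) time, so it is useful for validating small test inputs rather than as a reconstruction step.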
12
Neighbor Joining
Let i, j be neighboring leaves in a tree, let v be their parent, and let k be any other leaf.
The formula
$d(k,v) = \frac{1}{2}\,[\,d(k,i) + d(k,j) - d(i,j)\,]$
shows that we can compute the distances of v to all other leaves.
[Figure: leaves i and j attached to their parent v; the path from v to another leaf k has length d(k,v).]
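The formula follows from additivity, since every path from k to i or to j passes through v:

```latex
\begin{align*}
d(k,i) &= d(k,v) + d(v,i) \\
d(k,j) &= d(k,v) + d(v,j) \\
d(i,j) &= d(i,v) + d(v,j) \\
\Rightarrow\quad d(k,i) + d(k,j) - d(i,j) &= 2\,d(k,v)
\end{align*}
```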
13
Reconstructing trees by Neighbor Joining Algorithms
This suggests the following method to construct a tree from a distance matrix:
1. Find neighboring leaves i, j in the tree.
2. Replace i, j by their parent v and recursively construct a tree T for the smaller set.
3. Add i, j as children of v in T.
15
Neighbor Finding: Saitou & Nei method
Definitions
For a leaf i, let $r_i = \sum_{u \text{ is a leaf}} d(i,u)$.
For leaves i, j: $Q(i,j) = (n-2)\,d(i,j) - (r_i + r_j)$
Theorem (Saitou & Nei): Assume all internal edge weights are positive. If Q(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.
16
S&N Neighbor Joining Algorithm
If n = 3, return the tree on these three vertices.
Compute Q(i,j) for all i, j.
Choose i, j such that Q(i,j) is minimal.
Create a new vertex v, and set
$d(i,v) = \frac{1}{2}\,[\,d(i,j) + d(i,r) - d(j,r)\,]$ (for some third vertex r)
$d(j,v) = d(i,j) - d(i,v)$ (d(i,v) or d(j,v) could be 0)
for each vertex k, $d(v,k) = \frac{1}{2}\,[\,d(i,k) + d(j,k) - d(i,j)\,]$
Remove i, j, and add v to the set of objects.
Recursively construct a tree on the smaller set, then add i, j as children of v, at distances d(i,v) and d(j,v).
[Figure: leaves i and j joined to the new vertex v; d(k,v) is the distance from v to another vertex k.]
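The steps above can be sketched in Python as follows. This is an illustrative, iterative version under stated assumptions: distances are passed as a dict of dicts, new internal vertices are given hypothetical names `v0`, `v1`, ..., the third vertex r is simply the first remaining vertex, and at least three leaves are given. It is not the PHYLIP implementation.

```python
def neighbor_joining(names, d):
    """Return the weighted edges (u, v, w) of the tree built by Neighbor Joining."""
    d = {x: dict(row) for x, row in d.items()}  # work on a copy
    nodes, edges, next_id = list(names), [], 0
    while len(nodes) > 3:
        n = len(nodes)
        r = {x: sum(d[x][y] for y in nodes if y != x) for x in nodes}
        pairs = [(x, y) for a, x in enumerate(nodes) for y in nodes[a + 1:]]
        # choose the pair minimizing Q(i,j) = (n-2) d(i,j) - (r_i + r_j)
        i, j = min(pairs, key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        others = [k for k in nodes if k != i and k != j]
        third = others[0]  # "some third vertex r" in the formula for d(i,v)
        v = f"v{next_id}"
        next_id += 1
        d_iv = 0.5 * (d[i][j] + d[i][third] - d[j][third])
        edges += [(i, v, d_iv), (j, v, d[i][j] - d_iv)]
        d[v] = {}
        for k in others:  # distances from the new vertex to all remaining vertices
            d[v][k] = d[k][v] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = others + [v]
    x, y, z = nodes  # base case: three vertices around one internal node
    c = f"v{next_id}"
    edges.append((x, c, 0.5 * (d[x][y] + d[x][z] - d[y][z])))
    edges.append((y, c, 0.5 * (d[x][y] + d[y][z] - d[x][z])))
    edges.append((z, c, 0.5 * (d[x][z] + d[y][z] - d[x][y])))
    return edges

# Additive distances of a tree: A, B hang off x (weights 1, 2), C, D off y (3, 4), edge x-y = 5.
pair_dist = {("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
             ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}
dist = {x: {} for x in "ABCD"}
for (x, y), w in pair_dist.items():
    dist[x][y] = dist[y][x] = w
nj_edges = neighbor_joining(list("ABCD"), dist)
print(nj_edges)
```

On exact additive distances any choice of the third vertex gives the same d(i,v); on noisy distances the choice matters, and averaging over all remaining vertices is a common alternative.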
17
Complexity of S&N Neighbor Joining Algorithm
Initialization: Θ(n²) to compute $r_i$ and Q(i,j) for all i, j ∈ L.
Each iteration:
O(n²) to find the minimal Q(i,j).
O(n) to compute {D(v,k) : k ∈ L} for the new node v, and to update the matrix.
O(n²) to update the values Q(i,j).
Total of O(n³), since there are O(n) iterations.
18
Needed: additive distances between DNA sequences
Additive evolutionary distance: the number of substitutions which occurred during the sequence evolution.
[Figure: substitution histories at three sites (site 1, site 2, site 3), with the number of substitutions at each site.]
Some substitutions are hidden, due to overwriting. Therefore, the exact number of substitutions is usually larger than the number of observed changes.
20
Edge weight = expected number of substitutions per site
u: A A C A … G T C T T C G A G G C C C
v: A G C A … G C C T A T G C G A C C T
Substitutions at each site: 0 1 0 0 … 0 2 0 0 1 1 0 1 2 1 0 0 1
Number of substitutions per site: 0.321
21
When the exact number of substitutions between any two sequences is known, any algorithm which reconstructs trees from exact distances returns the correct evolutionary tree.
Interleaf distances: sum of edge weights
[Figure: a tree with leaves u and v; edge labels 0.5, 0.42, 0.3; d(u,v) = 1.12]
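Interleaf distances can be computed from a weighted tree by summing edge weights along the path between each pair of leaves. The sketch below assumes the tree is given as a list of (u, v, w) edges; the function name and the example tree are hypothetical.

```python
from collections import defaultdict

def leaf_distances(edges):
    """All-pairs leaf distances in a weighted tree given as (u, v, w) edges."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    leaves = [x for x in adj if len(adj[x]) == 1]
    dist = {}
    for src in leaves:
        seen = {src: 0.0}   # distance from src to every visited vertex
        stack = [src]
        while stack:        # depth-first traversal; paths in a tree are unique
            x = stack.pop()
            for y, w in adj[x]:
                if y not in seen:
                    seen[y] = seen[x] + w
                    stack.append(y)
        dist[src] = {t: seen[t] for t in leaves}
    return dist

tree = [("A", "x", 1), ("B", "x", 2), ("x", "y", 5), ("C", "y", 3), ("D", "y", 4)]
exact = leaf_distances(tree)
print(exact["A"]["C"], exact["C"]["D"])
```

Each traversal costs O(n), so the full matrix for a tree with n leaves is produced in O(n²) time.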
22
The expected number of substitutions is estimated from the observed number of substitutions.
What we see is only the observed number of substitutions between pairs of leaf sequences.
23
The estimation is based on a Substitution Model
The simplest model: the Jukes-Cantor model. On each tree edge e, each letter is mutated to any other letter at the same rate $r_e$. The length of an edge is the expected number of mutations per site, i.e. $t = 3r_e$.
[Figure: an edge of length t between u and v.]
Substitution rates on an edge (r = $r_e$):
      A    G    C    T
A     -    r    r    r
G     r    -    r    r
C     r    r    -    r
T     r    r    r    -
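Evolution along a single edge under this model can be simulated site by site. The sketch below relies on the standard Jukes-Cantor result that, on an edge of length t, a site ends up different from its starting letter with probability 3/4 · (1 − e^(−4t/3)), each of the three other letters being equally likely. The function name `evolve_jc` and its parameters are illustrative.

```python
import math
import random

def evolve_jc(seq, t, rng):
    """Evolve a DNA string along an edge of length t (expected substitutions per site)."""
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))  # P(final letter != initial letter)
    out = []
    for ch in seq:
        if rng.random() < p_diff:
            out.append(rng.choice([c for c in "ACGT" if c != ch]))
        else:
            out.append(ch)
    return "".join(out)

rng = random.Random(2013)
parent = "".join(rng.choice("ACGT") for _ in range(30))
child = evolve_jc(parent, 0.3, rng)
print(parent)
print(child)
```

Simulating a whole tree amounts to drawing a random root sequence and applying this step along every edge, from the root towards the leaves.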
27
The expected number of substitutions is estimated from the observed changes by a correction formula.
u: A A C A … G T C T T C G A G G C C C
v: A G C A … G C C T A T G C G A C C T
$\hat{d}(u,v) = \hat{t}$
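For the Jukes-Cantor model, the standard correction formula is d̂ = −(3/4) · ln(1 − (4/3) · p), where p is the observed fraction of sites at which the two sequences differ. A minimal sketch follows; the function names and the example sequences are hypothetical, and returning infinity for saturated pairs is one possible convention.

```python
import math

def p_distance(u, v):
    """Observed fraction of sites at which two aligned sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(u, v)) / len(u)

def jc_distance(u, v):
    """Jukes-Cantor estimate of the expected number of substitutions per site."""
    p = p_distance(u, v)
    if p >= 0.75:
        return math.inf  # saturated: the formula is undefined for p >= 3/4
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

print(p_distance("AAAA", "AAAC"), jc_distance("AAAA", "AAAC"))
```

The estimate is always at least the observed fraction p, reflecting the hidden substitutions lost to overwriting.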
29
Reconstruction from estimated distances:
Exact (additive) distances between species: $D = \{d(u,v)\}_{u,v \in S}$
Distance estimation, assuming a DNA substitution model, gives the estimated distances: $\hat{D} = \{\hat{d}(u,v)\}_{u,v \in S}$
Reconstruction from $\hat{D}$ gives the reconstructed tree.
[Figure: the edge-weighted 'true' tree on leaves A-G (edge weights 5, 6, 0.4, 6, 3, 0.32, 2, 4, 5) and the reconstructed tree on the same leaves.]
Challenge: minimize reconstruction errors
30
Correct and incorrect reconstruction of edges
[Figure: the edge-weighted 'true' tree T and the reconstructed tree T', both on leaves A-G.]
Each (internal) edge defines a split of the leaves:
The edge {ABC | DEFG} is correctly reconstructed.
The edge {ABCD | EFG} is a false negative.
The edge {AC | BDEFG} is a false positive.
31
Robinson Foulds Distance
$\text{Robinson Foulds distance} = \dfrac{\text{false positives} + \text{false negatives}}{\text{total number of internal edges}}$
[Figure: the edge-weighted 'true' tree T and the reconstructed tree T', both on leaves A-G.]
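The Robinson Foulds distance can be computed by comparing the sets of splits of the two trees. The sketch below assumes each tree is given by the splits of its internal edges, one side per split, and it assumes that the denominator counts the internal edges of both trees; the slide does not spell this out. The example uses only the three splits named above, since the complete trees are not recoverable from the transcript.

```python
def canonical(side, leaves):
    """Represent a split by the side that does not contain the smallest leaf."""
    side = frozenset(side)
    return frozenset(leaves) - side if min(leaves) in side else side

def rf_distance(true_splits, rec_splits, leaves):
    """Return (false positives, false negatives, normalized Robinson Foulds distance)."""
    t = {canonical(s, leaves) for s in true_splits}
    r = {canonical(s, leaves) for s in rec_splits}
    false_neg = len(t - r)  # splits of the true tree missing from the reconstruction
    false_pos = len(r - t)  # splits of the reconstruction absent from the true tree
    return false_pos, false_neg, (false_pos + false_neg) / (len(t) + len(r))

leaves = "ABCDEFG"
true_splits = ["ABC", "ABCD"]   # {ABC | DEFG}, {ABCD | EFG}
rec_splits = ["DEFG", "AC"]     # {ABC | DEFG} given by its other side, {AC | BDEFG}
fp, fn, rf = rf_distance(true_splits, rec_splits, leaves)
print(fp, fn, rf)
```

Canonicalizing each split makes the comparison independent of which side of an edge is written down.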
32
Formal statement of the problem for estimated distances
Input: an n×n distance matrix whose entries are estimates of tree (additive) distances.
Output: a tree with a small Robinson Foulds distance from the true tree.
33
Project's Goal
Practice the current algorithm (NJ) for phylogenetic reconstruction by distance methods.
Simulate the evolution of DNA sequences, and generate evolutionary distances.
Study a new method for tree reconstruction, based on mixed integer programming with CPLEX.
Compare the accuracy of this new method with that of Neighbor Joining.
You should use the PHYLIP phylogenetic package for most of the required tasks:
http://evolution.genetics.washington.edu/phylip.html
Time Line
35
The schedule spans weeks 1-14 and runs along two tracks, algorithmic and mathematical programming.
Algorithmic track:
Construct the distance matrix of a given tree (10-20 leaves). Generate noisy distance matrices from exact ones.
Reconstruct trees from exact distances using the NJ algorithm. Repeat this for noisy distances, and compute the RF (Robinson-Foulds) distance of the true tree from the reconstructed trees.
Construct an evolutionary-distance matrix by simulating evolution using the Jukes-Cantor model, and then estimating the distances between the sequences using DNADIST.
Mathematical programming track:
General introduction to mathematical programming.
Study the reconstruction model, and use it to reconstruct trees from accurate distances.
Reconstruct trees from noisy distances.
Both tracks:
Reconstruct trees from evolutionary distances using both reconstruction methods. Adjust the integer programming model to improve accuracy.
Prepare the final report and presentation.
37
Grading Scheme
10% - work plan
60% - final report + submitted code. Rough distribution of this grade:
    40% - meeting project requirements
    10% - code organization and documentation
    10% - innovation and creativity
30% - final presentation