Phylogeny • Definition and Assumptions • Input data for computing phylogenies • Character-based Approaches • Distance-based Approaches

Post on 19-Dec-2015


Page 1: Phylogeny Definition and Assumptions Input data for computing phylogenies Character-based Approaches Distance-based Approaches

Phylogeny

• Definition and Assumptions

• Input data for computing phylogenies

• Character-based Approaches

• Distance-based Approaches



Definition

• Assumption
  – All organisms on Earth have a common ancestor
  – This implies that any set of species is related.
• Phylogeny
  – The relationship between any set of species.
• Phylogenetic tree
  – Usually, the relationship can be represented by a tree, which is called a phylogenetic tree
    • This is not always true


Example

[Figure: phylogenetic tree drawn against a time axis, with leaves giant panda, lesser panda, moose, goshawk, vulture, duck, and alligator.]


Phylogenetic Inference

Based loosely on paper from Hillis, Allard, and Miyamoto 1993

• Sequence Data (Input)
• Align Sequences (Last week)
• Assess phylogenetic signal (Skip)
• Choose character or distance approach (Assumptions)
• Choose distance measure (Assumptions)
• Choose optimality criterion
• Choose algorithm (Assumptions)
• Apply algorithm to calculate tree(s)
• Test reliability (Skip)


Comments

• Most work focuses on binary or bifurcating trees
• Nodes correspond to organisms at a bifurcation or splitting event
• Edges represent time/evolutionary distance between the ancestor/descendant nodes
• Existing organisms are always placed at the leaves
  – The organisms corresponding to an internal node may be identical to an organism at a leaf


What algorithms do

• Root location
  – Some algorithms attempt to recreate the topology of the tree with a root
  – Many create unrooted topologies
• Edge lengths
  – Some algorithms attempt to estimate edge lengths (evolutionary divergence)
  – Others focus only on topology


Key Point

• Almost every step of the process involves assumptions

• It is important to understand these assumptions

• I’ll try to highlight some of them along with the main algorithmic ideas


Phylogeny

• Definition and Motivation

• Input data for computing phylogenies

• Character-based Approaches

• Distance-based Approaches


Input Data

• Two main types
  – Distance data
    • Estimate of “distance” between all pairs of organisms
  – “Character” data
    • A set of features with a defined set of feature values
    • A feature value for each organism


Distance Data

• Distances ideally should reflect the amount of time since the organisms had a common ancestor
  – This is typically not true
  – We’ll talk more about distance data when we get to algorithms that work with distance data


Character Data

• Historically
  – morphological (form and structure) data
    • e.g., vertebrate versus invertebrate
• Currently
  – Gene sequence data
    • DNA sequence of a gene
    • Amino acid sequence of a specific protein
    • Rarely an entire genome


Alignments and Sequence Data

• When working with sequence data, current techniques ignore order
  – One sequence per organism
  – Perform a multiple sequence alignment
  – Each position is now treated independently of others
  – In many cases, screening is performed to select “most informative” positions


Phylogeny

• Definition and Motivation
• Input data for computing phylogenies
• Character-based Approaches
  – Maximum Parsimony definition
    • Heuristics
    • Upper bound on maximum parsimony
  – Maximum likelihood
• Distance-based Approaches


Maximum Parsimony

• Assumption
  – We have correctly aligned sequence data, so we don’t have to worry about insertions/deletions
• Goal
  – Find a phylogenetic tree that explains the observed sequences with a minimal number of substitutions


Example

• Aligned input
  – AAAG
  – AAAC
  – AGGG
  – AGGT
• Screened input (position 1 is identical in all organisms, so it is dropped)
  – AAG
  – AAC
  – GGG
  – GGT


Possible trees

[Figure: candidate trees over the leaves AAG, AAC, GGG, and GGT, with inferred sequences at the internal nodes and the number of substitutions on each edge shown in parentheses.]


Brute Force Algorithm

• Generate all possible trees for the given number of organisms
  – Suppose there are n taxa.
    • How many binary rooted trees are possible?
    • How many binary unrooted trees are possible?
• For each possible tree, consider all possible assignments of the n taxa to the leaves of the tree
  – How many possible assignments are there?
• For each tree and assignment, calculate the best possible assignment of characters to the internal nodes of the tree and calculate the resulting score
  – Each position can be calculated independently
• Save the best-scoring trees (and potentially assignments)
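To get a feel for the size of the search space, the sketch below evaluates the standard counts for binary trees with n labeled leaves: (2n-5)!! unrooted and (2n-3)!! rooted. These counts already combine tree shape with the assignment of taxa to leaves, so they are not the two separate factors the slide asks about. The function names are illustrative only.

```python
def double_factorial(k):
    """Product k * (k-2) * (k-4) * ... down to 1 or 2; returns 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_binary_trees(n):
    """Unrooted binary trees on n >= 3 labeled leaves: (2n-5)!!"""
    return double_factorial(2 * n - 5)


def rooted_binary_trees(n):
    """Rooted binary trees on n >= 2 labeled leaves: (2n-3)!!"""
    return double_factorial(2 * n - 3)


for n in (4, 10, 20):
    print(n, unrooted_binary_trees(n), rooted_binary_trees(n))
```

Even at n = 10 there are already millions of candidate trees, which is why exhaustive search stops being practical so quickly.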


Computing cost

• Treat each character independently
• Bottom up processing
  – post-order traversal of the tree
• Data needed
  – At each node v, store a set of possible values R(v) such that any one of these would be minimal cost
  – Global variable C for cost initialized to 0


Computing internal nodes and cost

• At leaf node v: R(v) = the value of the taxon at v
• Internal node v with children w and x:
  – If R(w) intersect R(x) is not empty, R(v) = R(w) intersect R(x)
  – Otherwise, R(v) = R(w) union R(x) and increment C by 1
• Traceback:
  – At root r, choose any value in R(r)
  – At node v, choose the value at the parent if it is in R(v). Else choose anything in R(v)
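The bottom-up pass above can be sketched as follows. This is a minimal illustration, not the lecture's code: it returns the cost alongside R(v) instead of updating a global C, it omits the traceback, and the tree encoding (a leaf is a string, an internal node is a pair) and the names `fitch` and `parsimony_score` are assumptions made for the example.

```python
def fitch(tree):
    """Return (R(v), cost) for one character on a rooted binary tree.

    A tree is either a leaf value (a one-character string) or a pair
    (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_set, left_cost = fitch(tree[0])
    right_set, right_cost = fitch(tree[1])
    common = left_set & right_set
    if common:
        # Children can agree on a value: no substitution is charged here.
        return common, left_cost + right_cost
    # Children cannot agree: keep both options and charge one substitution.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, sequences):
    """Sum the per-position cost; tree leaves are taxon names."""
    def relabel(node, pos):
        if isinstance(node, str):
            return sequences[node][pos]
        return (relabel(node[0], pos), relabel(node[1], pos))

    length = len(next(iter(sequences.values())))
    return sum(fitch(relabel(tree, pos))[1] for pos in range(length))


# Screened input from the earlier example, with hypothetical taxon names.
seqs = {"w": "AAG", "x": "AAC", "y": "GGG", "z": "GGT"}
print(parsimony_score((("w", "x"), ("y", "z")), seqs))
print(parsimony_score((("w", "y"), ("x", "z")), seqs))
```

Grouping the two AA* sequences together and the two GG* sequences together gives the lower score of the two trees tried here.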


Example

[Figure sequence, one frame per slide: the bottom-up pass on a tree with leaf values A, B, B, A. The cost starts at C = 0 and rises to C = 1 and then C = 2 as the internal sets {A,B}, {A}, and {A,B} are computed. The traceback frames then replace each set with a single value, and edges that carry a substitution are marked (1).]


Running time

• Brute force
  – (Number of trees) * (Number of assignments) * (cost to compute internal nodes)
  – Very large
• Is there a better algorithm?
• Yes, but the problem is NP-hard
  – This means that the best known solution for computing a phylogenetic tree of n taxa has a worst-case running time that is not polynomial in n
  – In practice, this means computing the optimal phylogenetic tree is extremely time-consuming for relatively small numbers of taxa
    • (17 was limit according to a paper in 1997)


Comments

• Weighted parsimony
  – The basic approach can be extended to allow for non-equal substitution probabilities
  – For example, replacing an A with a G may be more or less costly than replacing an A with a T
  – Basic procedure outline is the same, but now we must consider all possible character values at each internal node
• Root of tree
  – We can search for unrooted trees as root values will be identical to one of its children in all cases (assuming triangle inequality on costs in the weighted parsimony case)
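One common way to "consider all possible character values at each internal node" is a dynamic program usually attributed to Sankoff; the slides do not name it, so treat the sketch below as one possible realization. The tree encoding (leaf = string, internal node = pair), the function names, and the example cost function (transitions cost 1, transversions cost 2) are all assumptions for illustration.

```python
INF = float("inf")


def sankoff(tree, cost, alphabet="ACGT"):
    """Return {a: minimum cost of this subtree if its root is assigned a}."""
    if isinstance(tree, str):
        return {a: (0 if a == tree else INF) for a in alphabet}
    left = sankoff(tree[0], cost, alphabet)
    right = sankoff(tree[1], cost, alphabet)
    return {
        a: min(cost(a, b) + left[b] for b in alphabet)
        + min(cost(a, b) + right[b] for b in alphabet)
        for a in alphabet
    }


def weighted_parsimony(tree, cost, alphabet="ACGT"):
    """Best score over all choices of the root value."""
    return min(sankoff(tree, cost, alphabet).values())


def unit_cost(a, b):
    """Every substitution costs 1 (reduces to unweighted parsimony)."""
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Hypothetical scheme: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    purines = {"A", "G"}
    return 1 if (a in purines) == (b in purines) else 2


print(weighted_parsimony((("G", "C"), ("G", "T")), unit_cost))
print(weighted_parsimony(("A", "C"), ts_tv_cost))
```

With `unit_cost` the score agrees with the unweighted set-based procedure; the extra work is the minimization over every value at every internal node.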


Heuristics

• Heuristics with non-optimal guarantees
  – Stochastic local search
    • Start with a tree and an assignment
    • Stochastically search through space of all possible trees by making local changes and retaining a change if there is improvement
  – Incremental addition of taxa
    • Start with tree for any three taxa
    • Incrementally add a new taxon at best possible edge
    • Different orderings lead to different final trees
• Branch and bound
  – Search through all possible trees and assignments but keep track of current best and eliminate possibilities as soon as they provably cannot be optimal


Upper bound on parsimony

• Assumption
  – Triangle inequality in scoring function
    • S(i,j) + S(j,k) >= S(i,k)
• Definition
  – Given a set of species S
  – Let G(S) be the weighted complete graph
    • nodes represent species in S
    • edges represent distance between two species
• Theorem
  – Any minimum spanning tree on G(S) has total length at most twice that of the most parsimonious tree of the species in S
  – Minimum weight spanning trees can be computed efficiently
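As a small numerical check of the theorem, the sketch below builds G(S) with Hamming distance as the edge weight (an assumption; the slide leaves the scoring function S abstract) and computes the minimum spanning tree weight with Prim's algorithm. It assumes the sequences are distinct and of equal length.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mst_weight(seqs):
    """Weight of a minimum spanning tree of the complete graph on seqs (Prim)."""
    # best[s] = cheapest edge from s to the part of the tree built so far
    best = {s: hamming(seqs[0], s) for s in seqs[1:]}
    total = 0
    while best:
        nxt = min(best, key=best.get)
        total += best.pop(nxt)
        for s in best:
            best[s] = min(best[s], hamming(nxt, s))
    return total


print(mst_weight(["AAG", "AAC", "GGG", "GGT"]))
```

For these four sequences any tree needs at least one change at each of the first two positions and two at the third, so the most parsimonious score is 4, and the spanning tree weight stays within twice that value as the theorem requires.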


Proof

Suppose the above is a most parsimonious tree T* for the set of species represented by the green nodes at the leaves


Double edges on graph

Parsimony weight is now twice optimal value


Create Eulerian Tour

An Eulerian tour traverses each edge exactly once and is guaranteed to exist once we double the edges. The cost of traversing all edges is exactly twice that of the optimal tree T*.

[Figure: the doubled tree, with the edges of the tour numbered 1 through 20 in traversal order.]


Focus on green nodes

[Figure: the doubled tree with leaves A through F and tour edges numbered 1 through 20.]

• A to B: Edges 4-5
• B to C: Edges 6-9
• C to D: Edges 10-11
• D to E: Edges 12-16
• E to F: Edges 17-18
• F to A: Edges 19-20 and 1-3


Tour in graph G(S)

[Figure: the doubled tree with tour edges numbered 1 through 20, and below it the corresponding tour through the leaves A, B, C, D, E, F drawn in G(S).]

S(A,B) <= distance on edge 4 + distance on edge 5


Final result

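[Figure: the doubled tree with tour edges numbered 1 through 20, and below it the path through the leaves A, B, C, D, E, F in G(S).]

By the triangle inequality, each hop between consecutive leaves costs no more than the tour edges it replaces, so the weight of all edges on the path <= twice the weight of T*. This path is one possible spanning tree of G(S), and a minimum spanning tree weighs no more than it. Therefore, the result follows.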


Comments about Parsimony

• Assumptions
  – Sequence data has limited homoplasy
  – Substitution scheme encodes assumptions about evolutionary process
    • (example: 3rd codon substitution frequencies higher than at other positions)
  – Minimum number of changes is best explanation
• Other comments
  – There are probabilistic models where parsimony will converge on the wrong tree even given infinite data
  – Differing rates of evolution in different parts of the tree can cause problems


Maximum Likelihood

• Assumption
  – We have correctly aligned sequence data, so we don’t have to worry about insertions/deletions
  – We have a model of evolution
• Goal
  – Find a phylogenetic tree that would have the highest probability (subject to our model of evolution) of generating the observed sequences


Formalizing max likelihood

• We want to find a phylogenetic tree that maximizes P(data | tree)
• Data
  – set of n aligned sequences $s = s_1, s_2, \ldots, s_n$
• Tree
  – Topology T with n leaves
  – set of edge lengths $t = t_1, t_2, \ldots, t_{2n-2}$
    • There are 2n-2 edges in a rooted binary tree with n leaves
• We want to find (T, t) such that P(s | (T, t)) is maximized


Example

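• Let P(x | y, t) denote the probability that ancestral sequence y evolves into sequence x along an edge of length t
  – Assume t is proportional to mutation rate * evolutionary time
• For the four-leaf tree in the figure, sum over all ancestral sequences x5, x6, x7:

$$P(s \mid (T,t)) = \sum_{x_5, x_6, x_7} p(x_7)\, P(x_5 \mid x_7, t_5)\, P(x_6 \mid x_7, t_6)\, P(s_1 \mid x_5, t_1)\, P(s_2 \mid x_5, t_2)\, P(s_3 \mid x_6, t_3)\, P(s_4 \mid x_6, t_4)$$

[Figure: rooted tree with root x7. Edges t5 and t6 lead to internal nodes x5 and x6. Leaves s1 and s2 hang below x5 on edges t1 and t2; leaves s3 and s4 hang below x6 on edges t3 and t4.]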


Models of Evolution

• What should P(x|y, t) be?
• Two assumptions of commonly used models
  – Each site evolves independently
  – There are only substitutions, no insertions/deletions
• $P(x \mid y, t) = \prod_{i=1}^{m} P(x(i) \mid y(i), t)$
  – m is sequence length


Jukes-Cantor Model [1969]

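• What should P(x(i)|y(i), t) be?
• Jukes-Cantor Model [1969]
  – parameter α

$$
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & r_t & s_t & s_t & s_t \\
C & s_t & r_t & s_t & s_t \\
G & s_t & s_t & r_t & s_t \\
T & s_t & s_t & s_t & r_t
\end{array}
$$

$r_t = \tfrac{1}{4}\,(1 + 3e^{-4\alpha t})$

$s_t = \tfrac{1}{4}\,(1 - e^{-4\alpha t})$

Limit values when t = 0 or t = infinity?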
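The limits asked about above can be checked numerically. The sketch below assumes the rate parameter that the transcript lost is the usual α of the Jukes-Cantor model, and the function names are illustrative. It also checks that multiplying the matrices for two time intervals gives the matrix for the combined interval.

```python
import math


def jukes_cantor(t, alpha=1.0):
    """4x4 Jukes-Cantor substitution matrix for an edge of length t."""
    decay = math.exp(-4 * alpha * t)
    r = 0.25 * (1 + 3 * decay)  # probability of ending on the same residue
    s = 0.25 * (1 - decay)      # probability of each specific other residue
    return [[r if i == j else s for j in range(4)] for i in range(4)]


def matmul(X, Y):
    """Plain 4x4 matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


print(jukes_cantor(0.0)[0])     # t = 0: no time to change, identity row
print(jukes_cantor(1000.0)[0])  # t large: every residue equally likely

combined = matmul(jukes_cantor(0.3), jukes_cantor(0.5))
direct = jukes_cantor(0.8)
print(max(abs(combined[i][j] - direct[i][j])
          for i in range(4) for j in range(4)))
```

At t = 0 the matrix is the identity, and as t grows every entry tends to 1/4, so the ancestral residue is eventually forgotten.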


Kimura Model [1980]

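• What should P(x(i)|y(i), t) be?
• Kimura Model [1980]
  – parameters α, β

$$
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & r_t & s_t & u_t & s_t \\
C & s_t & r_t & s_t & u_t \\
G & u_t & s_t & r_t & s_t \\
T & s_t & u_t & s_t & r_t
\end{array}
$$

$s_t = \tfrac{1}{4}\,(1 - e^{-4\beta t})$

$u_t = \tfrac{1}{4}\,(1 + e^{-4\beta t} - 2e^{-2(\alpha+\beta)t})$

$r_t = 1 - 2s_t - u_t$

Limit values when t = 0 or t = infinity?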


Properties of Models of Evolution

• Assumptions
  – Substitution process is Markovian and stationary
    • probabilities do not change over time
    • length of time interval is all that matters
• Substitution matrix is multiplicative
  – Matrix(t) * Matrix(s) = Matrix(t+s)
  – $\sum_b P(a \mid b, t)\, P(b \mid c, s) = P(a \mid c, s+t)$


Computing Likelihood

• P(Lv|a) = probability of all leaves below node v having their values, given that the residue at node v is a
• Recursive algorithm for computing P(Lv|a)
  – Base case: v is a leaf node
    • P(Lv|a) = 1 if a is the value of residue at that leaf, 0 otherwise
  – Recursive case: v is an internal node
    • Compute P(Lu|x), P(Lw|x) for all x at daughter nodes u and w
      – 2|Σ| values total, where Σ is the residue alphabet
    • Set $P(L_v \mid a) = \sum_{x,y} P(x \mid a, t_u)\, P(L_u \mid x)\, P(y \mid a, t_w)\, P(L_w \mid y)$
      – |Σ|^2 distinct products to compute
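A minimal sketch of this recursion for a single site follows. It makes several assumptions not fixed by the slide: Jukes-Cantor edge probabilities with rate 1, a uniform prior of 1/4 on the root residue, and a tree encoding in which a leaf is a residue string and an internal node is a tuple (u, t_u, w, t_w). The double sum over x and y is computed as a product of two single sums, which gives the same value.

```python
import math

ALPHABET = "ACGT"


def jc_prob(x, a, t, alpha=1.0):
    """Jukes-Cantor probability that residue a becomes x over edge length t."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * decay) if x == a else 0.25 * (1 - decay)


def cond_lik(node):
    """Return {a: P(L_v | a)} for the subtree rooted at node."""
    if isinstance(node, str):
        # Base case: the leaf residue is observed.
        return {a: 1.0 if a == node else 0.0 for a in ALPHABET}
    u, t_u, w, t_w = node
    lik_u, lik_w = cond_lik(u), cond_lik(w)
    return {
        a: sum(jc_prob(x, a, t_u) * lik_u[x] for x in ALPHABET)
        * sum(jc_prob(y, a, t_w) * lik_w[y] for y in ALPHABET)
        for a in ALPHABET
    }


def site_likelihood(tree):
    """P(leaf residues | tree), assuming a uniform prior on the root residue."""
    return sum(0.25 * p for p in cond_lik(tree).values())


tree = (("A", 0.1, "A", 0.2), 0.3, ("G", 0.1, "A", 0.4), 0.5)
print(site_likelihood(tree))
```

Under the site-independence assumption, the likelihood of a whole alignment is the product of `site_likelihood` over its columns.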


Brute Force Algorithm

• Generate all possible topologies for the given number of organisms
  – For each possible tree, consider all possible assignments of the n taxa to the leaves of the tree
• Compute likelihood of tree topology generating data
  – For each tree and assignment, consider all possible interior tree node assignments
  – Generate likelihood for topology as a function of edge length variables
  – Solve equations to determine best edge lengths for given topology
• Save the tree that generates the observed data with the highest probability
• More complex than computing maximum parsimony


Comments about Max Likelihood

• Accuracy of tree is obviously highly dependent on the accuracy of the model of evolution that is assumed

• If substitution matrices are multiplicative and a “reversibility” constraint holds, then max likelihood cannot predict position of root

• Extremely slow in the general case for even relatively small numbers of taxa (depending on the model of evolution assumed)


Posterior distribution

• Max likelihood:
  – Finds the phylogenetic tree that maximizes P(data | tree)
• Posterior distribution is even better:
  – Find the phylogenetic tree that maximizes P(tree | data)
• Bayes Theorem
  – P(tree | data) = [P(data | tree) P(tree)] / P(data)
• If we know the prior distribution P(tree), then we can use sampling techniques to estimate the posterior distribution P(tree | data)
  – There are ways to finesse not knowing P(data)


Phylogeny

• Definition and Motivation
• Input data for computing phylogenies
• Character-based Approaches
• Distance-based Approaches
  – Data assumptions
    • Molecular clock and ultrametric properties
      – Simple clustering algorithms
    • Additivity properties
      – Neighbor joining


Distance data

• For each pair of taxa, we will have a single number representing the “distance” between these two organisms
• Question:
  – What do we expect this distance data to look like?
• Desired answers
  – Ultrametric data
  – Additive data


Time scale: Origin to now

• Ideally, we would know the exact time when all divergence events occurred

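[Figure: tree on a time axis running from the origin (time 0) to now (time 13). The four taxa share a common ancestor at time 2. Goshawk and Alligator diverge at time 7, Panda and Moose diverge at time 8, and all four leaves sit at time 13.]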


Time Scale: Now to common ancestor

• We won’t know the ancestor (or time 0), but we can hope to infer the following tree

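[Figure: the same tree measured backward from now. Panda and Moose join 5 time units ago, Goshawk and Alligator join 6 time units ago, and the two pairs join 11 time units ago. The part of the tree above that point is marked "?".]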


Ultrametric Data

• Data which can help us reproduce the (now to common ancestor) tree is termed ultrametric data

• What should the ultrametric data be in this case?
  – D(Panda, Moose) = 10
  – D(Goshawk, Alligator) = 12
  – D(Panda, Goshawk) = D(Panda, Alligator) = D(Moose, Goshawk) = D(Moose, Alligator) = 22

[Figure: the now-to-common-ancestor tree, with Panda and Moose joining at 5, Goshawk and Alligator joining at 6, and the two pairs joining at 11.]


Ultrametric Matrix

                 Panda   Moose   Goshawk   Alligator
A (Panda)          0      10       22         22
B (Moose)         10       0       22         22
C (Goshawk)       22      22        0         12
D (Alligator)     22      22       12          0

Comment: Need to be careful about what distances mean


Comments about ultrametric data

• Assumes a molecular clock theory
  – Divergence on all paths in the tree occurs at the same rate
  – This is typically NOT a valid assumption
• Tests for ultrametric data
  – A symmetric distance matrix D defines an ultrametric distance iff for every three indices i, j, and k, the maximum of the three distances D(i,j), D(i,k), D(j,k) is NOT unique (that is, the two largest are equal).
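The three-point test can be sketched directly from its statement. The function name and the tolerance parameter are assumptions for the example; the first matrix is the Panda, Moose, Goshawk, Alligator matrix from the earlier slide, and the second is a made-up matrix that fails the test.

```python
from itertools import combinations


def is_ultrametric(D, tol=1e-9):
    """True if, for every triple, the two largest pairwise distances are equal."""
    for i, j, k in combinations(range(len(D)), 3):
        smallest, middle, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if largest - middle > tol:
            return False  # the maximum is unique for this triple
    return True


clock_like = [
    [0, 10, 22, 22],
    [10, 0, 22, 22],
    [22, 22, 0, 12],
    [22, 22, 12, 0],
]
not_clock_like = [
    [0, 11, 25, 25],
    [11, 0, 22, 22],
    [25, 22, 0, 12],
    [25, 22, 12, 0],
]
print(is_ultrametric(clock_like), is_ultrametric(not_clock_like))
```

The second matrix fails on its first triple, whose distances 11, 25, and 22 have a unique maximum.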


Illustration of Test Condition

       A    B    C    D
A      0   10   22   22
B     10    0   22   22
C     22   22    0   12
D     22   22   12    0

Triple {A, B, C}: the distances are D(A,B) = 10, D(A,C) = 22, D(B,C) = 22, so the maximum 22 occurs twice.


Illustration of Test Condition

       A    B    C    D
A      0   10   22   22
B     10    0   22   22
C     22   22    0   12
D     22   22   12    0

Triple {A, C, D}: the distances are D(A,C) = 22, D(A,D) = 22, D(C,D) = 12, so the maximum 22 occurs twice.


Algorithms for ultrametric data

• Unweighted Pair Group Method Using Arithmetic Averages (UPGMA)
  – [ Sokal & Michener 1958 ]
• Clustering method
• Key idea
  – Distance between cluster Ci and Cj is the average distance between all pairs of sequences in the clusters


UPGMA

• Initialization
  – Initially each sequence is a cluster
  – Each leaf of T is a sequence at height 0
• Iteration
  – Find two closest clusters i and j
  – Combine to form a new cluster k
  – Create a new node in T with height D(i,j)/2 to correspond to cluster k
• Final cluster denotes root of tree
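The steps above can be sketched as follows. The output encoding (a nested tuple `(left, right, height)`) and the function name are assumptions for illustration. The distance from a merged cluster to any other cluster is updated as a size-weighted mean, which keeps it equal to the average over all pairs of sequences in the two clusters.

```python
def upgma(names, D):
    """Return the UPGMA tree as nested tuples (left, right, height)."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, leaf count)
    dist = {(i, j): D[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Find the two closest clusters; keys are always (smaller id, larger id).
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        del dist[(i, j)]
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        for m in clusters:
            d_im = dist.pop((min(i, m), max(i, m)))
            d_jm = dist.pop((min(j, m), max(j, m)))
            # Average over all pairs of sequences between the merged cluster and m.
            dist[(m, next_id)] = (size_i * d_im + size_j * d_jm) / (size_i + size_j)
        clusters[next_id] = ((tree_i, tree_j, d_ij / 2), size_i + size_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


names = ["Panda", "Moose", "Goshawk", "Alligator"]
D = [
    [0, 10, 22, 22],
    [10, 0, 22, 22],
    [22, 22, 0, 12],
    [22, 22, 12, 0],
]
print(upgma(names, D))
```

On the ultrametric matrix from the earlier slide this recovers the heights 5, 6, and 11 shown in the now-to-common-ancestor tree.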


Varying rates of evolution

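• Constant rate of divergence
• Non-constant rate of divergence

[Figure: the time-scale tree (origin at 0, splits at times 2, 7, and 8, leaves Goshawk, Alligator, Panda, and Moose at time 13) annotated with two sets of edge lengths in parentheses. Constant rate: (6) and (5) on the internal edges and (5), (5), (6), (6) on the leaf edges. Non-constant rate: (9) and (3) on the internal edges and (7), (4), (6), (8) on the leaf edges.]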


Additive Tree

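• Edge-weighted tree
• Distance between organisms at any two leaves is the sum of the edge lengths on the path between them

[Figure: additive tree with leaves Panda, Moose, Goshawk, and Alligator. The internal edges have lengths 9 and 3; the leaf edges have lengths 7 (Panda), 4 (Moose), 6 (Goshawk), and 6 (Alligator).]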


Illustration

• D(Panda, Goshawk) must equal 7 + 9 + 3 + 6 = 25
• D(Moose, Goshawk) must equal 4 + 9 + 3 + 6 = 22

[Figure: the additive tree with leaves Panda, Moose, Goshawk, and Alligator, internal edges 9 and 3, and leaf edges 7, 4, 6, 6.]


Comments about additive data

• Does not assume molecular clock theory
  – Divergence on different edges in tree can be at different rates
• Test for additive data: 4 point condition
  – A symmetric distance matrix D defines an additive distance iff for every four indices i, j, k, and l, two of the sums
    • D(i,j) + D(k,l), D(i,k) + D(j,l), D(i,l) + D(j,k)
  – are equal and not smaller than the third
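The four-point condition translates into a short check. The function name and tolerance are assumptions for the example. The first matrix holds the path lengths of the Panda, Moose, Goshawk, Alligator tree from the additive-tree slides (leaf edges 7, 4, 6, 6 and internal edges 9 and 3); the second is a made-up matrix that violates the condition.

```python
from itertools import combinations


def is_additive(D, tol=1e-9):
    """True if every quartet satisfies the four-point condition."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted((D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]))
        if sums[2] - sums[1] > tol:
            return False  # the largest sum is unique for this quartet
    return True


tree_distances = [
    [0, 11, 25, 25],
    [11, 0, 22, 22],
    [25, 22, 0, 12],
    [25, 22, 12, 0],
]
arbitrary = [
    [0, 1, 2, 5],
    [1, 0, 3, 1],
    [2, 3, 0, 1],
    [5, 1, 1, 0],
]
print(is_additive(tree_distances), is_additive(arbitrary))
```

For the tree distances the three sums are 23, 47, and 47, so the two largest agree even though the matrix is not ultrametric.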


Intuition

• D(1,2) + D(3,4), D(1,3) + D(2,4), D(1,4) + D(2,3)

[Figure: four copies of the quartet tree on leaves 1, 2, 3, 4.]


Algorithms for additive data

• Neighbor-Joining
  – [Saitou & Nei, 1987] and [Studier and Keppler, 1988]
• Key idea
  – Find a pair of neighboring leaves
    • Not necessarily the closest two taxa to each other

[Figure: quartet tree on leaves 1, 2, 3, 4.]


Finding neighboring leaves

• Definitions
  – Let L be the set of leaf nodes
  – Define $r(i) = \frac{1}{|L| - 2} \sum_{k \in L} D(i,k)$
    • Roughly average distance to all other active nodes
  – Define d(i,j) = D(i,j) - (r(i) + r(j))
    • Real distance minus averaged distances to all other active nodes; this negates the effect of long edges as in the example
• Claim
  – Pair of leaves for which d(i,j) is minimal are neighboring leaves


Proof

• Let i and j be leaves with minimal d(i,j)
• Suppose they are not neighbors
  – There must be at least two nodes on the path connecting them
  – Label two closest nodes to i and j nodes k and l

[Figure: the path between leaves i and j, passing through internal nodes k and l.]


Proof

• Define Lk and Ll to be the leaves that are “away” from nodes i and j

[Figure: the path between i and j through k and l. The leaves hanging off k are in Lk and the leaves hanging off l are in Ll.]


Proof

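• Assume there is a pair of neighboring nodes m and n with parent p in Lk

[Figure: the path between i and j through k and l, with neighbors m and n joined at parent p in the subtree hanging off k.]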


Proof

• For all y in Lk except m and n
  D(i,y) + D(j,y) - D(m,y) - D(n,y) = D(i,j) + 2 D(k,y) - 2 D(p,y) - D(m,n)

[Figure: the path between i and j through k and l, with m, n, and their parent p, and a leaf y in Lk.]


Proof

• For all z in Ll
  D(i,z) + D(j,z) - D(m,z) - D(n,z) = D(i,j) - D(m,n) - 2D(p,k) - 2D(l,k)

[Figure: the path between i and j through k and l, with m, n, and their parent p, and a leaf z in Ll.]


Proof

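• For all z “in between” Ll and Lk
  D(i,z) + D(j,z) - D(m,z) - D(n,z) = D(i,j) - D(m,n) - 2D(p,k) - 2D(k,w)

[Figure: the path between i and j through k and l, with m, n, and their parent p, and a leaf z attached to the path at node w.]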


Bounding d(i,j) - d(m,n)

• d(i,j) - d(m,n)
  – = D(i,j) - D(m,n) - r(i) - r(j) + r(m) + r(n)
  – $= D(i,j) - D(m,n) - \frac{1}{N-2} \sum_{\text{leaves } u} \left[ D(i,u) + D(j,u) - D(m,u) - D(n,u) \right]$
• Applying the previous 3 inequalities and reorganizing terms, we get
  – d(i,j) - d(m,n) > 2D(p,k)(|Ll| - |Lk|)/(N-2)
  – Since d(i,j) is minimal, this leads to the fact that |Lk| > |Ll|
  – Symmetrical argument gives us |Ll| > |Lk|
• Contradiction implies i and j must be neighbors


Bounding d(i,j) - d(m,n)

• $\sum_{\text{leaves } u} \left[ D(i,u) + D(j,u) - D(m,u) - D(n,u) \right]$
• Previous 3 inequalities
  – For all y in Lk except m and n
    D(i,y) + D(j,y) - D(m,y) - D(n,y) = D(i,j) + 2 D(k,y) - 2 D(p,y) - D(m,n)
  – For all z in Ll
    D(i,z) + D(j,z) - D(m,z) - D(n,z) = D(i,j) - D(m,n) - 2D(p,k) - 2D(l,k)
  – For all z “in between” Ll and Lk
    D(i,z) + D(j,z) - D(m,z) - D(n,z) = D(i,j) - D(m,n) - 2D(p,k) - 2D(k,w)
  – For y = m or n
    D(i,y) + D(j,y) - D(m,y) - D(n,y) = D(i,j) + 2 D(k,y) - D(m,n) - 2 D(p,y) + 2 D(p,y)
  – For y = i or j
    D(i,y) + D(j,y) - D(m,y) - D(n,y) = D(i,j) - 2 D(p,y) - D(m,n) - 2 D(k,y) + 2 D(k,y)
• This simplifies the sum to be
  – $(N-2)\,(D(i,j) - D(m,n)) + \sum_{y \in L_k} (2D(k,y) - 2D(p,y)) + \sum_{z \in L_l} (-2D(l,k) - 2D(p,k)) - C$
    • C term is for in between nodes including i and j


Bounding d(i,j) - d(m,n)

• $d(i,j) - d(m,n) = D(i,j) - D(m,n) - \frac{1}{N-2} \sum_{\text{leaves } u} \left[ D(i,u) + D(j,u) - D(m,u) - D(n,u) \right]$
• $\sum_{\text{leaves } u} \left[ D(i,u) + D(j,u) - D(m,u) - D(n,u) \right] = (N-2)\,(D(i,j) - D(m,n)) + \sum_{y \in L_k} (2D(k,y) - 2D(p,y)) + \sum_{z \in L_l} (-2D(l,k) - 2D(p,k)) - C$
• $d(i,j) - d(m,n) = \frac{1}{N-2} \left[ \sum_{y \in L_k} (2D(p,y) - 2D(k,y)) + \sum_{z \in L_l} (2D(l,k) + 2D(p,k)) + C \right]$


Bounding d(i,j) - d(m,n)

• $d(i,j) - d(m,n) = \frac{1}{N-2} \left[ \sum_{y \in L_k} (2D(p,y) - 2D(k,y)) + \sum_{z \in L_l} (2D(l,k) + 2D(p,k)) + C \right]$
• Observe
  – D(p,y) + D(p,k) > D(k,y) which implies that
  – D(p,y) - D(k,y) > - D(p,k)
• This implies that
  – d(i,j) - d(m,n) > 1/(N-2) 2 D(p,k)[|Ll| - |Lk|] + positive term
• Contradiction
  – The minimality of d(i,j) implies then that the rhs of the above inequality must be nonpositive
  – This implies that |Lk| > |Ll|
  – Symmetry of argument gives us the reverse inequality, which is a contradiction for the case where both Lk and Ll have > 1 leaf node


Neighbor-Joining

• Initialization
  – Define T to be the set of leaf nodes, one per sequence
  – Make list L of active nodes = T
• Iteration
  – Find two nodes i and j where d(i,j) is minimal
  – Combine to form a new node k and
    • set D(k,m) = 1/2(D(i,m) + D(j,m) - D(i,j)) for all m in L
  – Add k to T with edges of length
    • D(i,k) = 1/2(D(i,j) + r(i) - r(j)) and D(j,k) = D(i,j) - D(i,k)
  – Remove i and j from L and add node k
• Comments
  – There is no explicit root node
  – Can be applied with non-additive data and some edge lengths may be negative in this case
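The iteration above can be sketched as follows. Two details are assumptions rather than part of the slide: the loop stops when two active nodes remain and joins them with a single edge of length D(i,j), and the internal node names `node0`, `node1`, ... are invented for the output. The input is the additive matrix implied by the Panda, Moose, Goshawk, Alligator tree from the additive-tree slides.

```python
from itertools import combinations


def neighbor_joining(names, D):
    """Return the unrooted tree as a list of edges (node, node, length)."""
    dist = {a: {b: D[i][j] for j, b in enumerate(names) if j != i}
            for i, a in enumerate(names)}
    active = list(names)
    edges = []
    counter = 0
    while len(active) > 2:
        n = len(active)
        # r(i): summed distance to the other active nodes, divided by |L| - 2
        r = {a: sum(dist[a][b] for b in active if b != a) / (n - 2) for a in active}
        # Pick the pair minimizing d(i,j) = D(i,j) - (r(i) + r(j))
        i, j = min(combinations(active, 2),
                   key=lambda p: dist[p[0]][p[1]] - r[p[0]] - r[p[1]])
        k = f"node{counter}"
        counter += 1
        d_ik = 0.5 * (dist[i][j] + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, dist[i][j] - d_ik))
        active = [a for a in active if a not in (i, j)]
        dist[k] = {}
        for m in active:
            d_km = 0.5 * (dist[i][m] + dist[j][m] - dist[i][j])
            dist[k][m] = dist[m][k] = d_km
        active.append(k)
    a, b = active
    edges.append((a, b, dist[a][b]))  # join the last two active nodes
    return edges


names = ["Panda", "Moose", "Goshawk", "Alligator"]
D = [
    [0, 11, 25, 25],
    [11, 0, 22, 22],
    [25, 22, 0, 12],
    [25, 22, 12, 0],
]
for edge in neighbor_joining(names, D):
    print(edge)
```

Because the result is unrooted, the two internal edges of length 9 and 3 from the rooted drawing appear as one edge of length 12 between the two internal nodes.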


Distances

• Weaknesses of Hamming or Edit Distance– Length

– Homoplasy