phylogenetic trees (2) lecture 13

Phylogenetic Trees (2), Lecture 13. Based on: Durbin et al. 7.4, Gusfield 17.1-17.3, Setubal & Meidanis 6.1

Upload: topaz

Post on 06-Jan-2016


Page 1: Phylogenetic Trees (2) Lecture 13

Phylogenetic Trees (2), Lecture 13

Based on: Durbin et al 7.4, Gusfield 17.1-17.3, Setubal&Meidanis 6.1

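Lecturer's note (2004-5, 2005-6): I changed the presentation of the algorithms, a little in "maximum parsimony" and also in "perfect phylogeny". I progressed slowly, with many questions. Up to the break: slide 20 (the middle of Sankoff). I finished with the proof of the easy direction of the algorithm for binary perfect phylogeny, slide 39.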
Page 2: Phylogenetic Trees (2) Lecture 13

2

Character-based methods for constructing phylogenies

In this approach, trees are constructed by comparing the characters of the corresponding species. Characters may be morphological (e.g. teeth structure) or molecular (e.g. nucleotides in homologous DNA sequences). One common approach is Maximum Parsimony.

Common assumptions:

• Independence of characters (no interactions).

• The best tree is the one in which the fewest changes take place.

Page 3: Phylogenetic Trees (2) Lecture 13

3

Character based methods: Input data

species  C1 C2 C3 C4 … Cm
dog      A A C A G G T C T T C G A G G C C C
horse    A A C A G G C C T A T G A G A C C C
frog     A A C A G G T C T T T G A G T C C C
human    A A C A G G T C T T T G A T G A C C
pig      A A C A G T T C T T C G A T G G C C
         * * * * *     * *     * *       * *

(* marks a column in which all five species agree.)

• Each character (column) is processed independently.

• The green character (column 14) separates human and pig from frog, horse and dog.

• The red character (column 11) separates dog and pig from frog, horse and human.

• We seek a tree that best explains all characters simultaneously.
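As a rough illustration of processing each column independently, the sketch below groups the species by the state they show in one column and lists the columns in which all species agree. The function names `column_split` and `conserved_columns` are illustrative, not part of the lecture; the sequences are copied from the table.

```python
# Species rows copied from the input table (18 characters each).
SEQS = {
    "dog":   "AACAGGTCTTCGAGGCCC",
    "horse": "AACAGGCCTATGAGACCC",
    "frog":  "AACAGGTCTTTGAGTCCC",
    "human": "AACAGGTCTTTGATGACC",
    "pig":   "AACAGTTCTTCGATGGCC",
}


def column_split(seqs, col):
    """Group species by the state they show in column `col` (1-based)."""
    groups = {}
    for name, seq in seqs.items():
        groups.setdefault(seq[col - 1], []).append(name)
    return groups


def conserved_columns(seqs):
    """Return the 1-based columns in which all species share one state."""
    length = len(next(iter(seqs.values())))
    return [c for c in range(1, length + 1)
            if len(column_split(seqs, c)) == 1]


print(column_split(SEQS, 14))   # the split induced by column 14
print(column_split(SEQS, 11))   # the split induced by column 11
print(conserved_columns(SEQS))
```

Columns that induce conflicting splits (such as 11 and 14 here) cannot both be explained by a single change on the same tree, which is why a tree is sought that explains all characters as well as possible.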

Page 4: Phylogenetic Trees (2) Lecture 13

4

1. Maximum Parsimony

A Character-based method

Input:

h sequences (one per species), all of length k.

Goal:

Find a tree with the input sequences at its leaves,

and an assignment of sequences to internal nodes,

such that the total number of substitutions is minimized.

Page 5: Phylogenetic Trees (2) Lecture 13

5

Example

Input: four nucleotide sequences, AAG, AAA, GGA and AGA, taken from four species.

By the parsimony principle, we seek a tree that minimizes the total number of substitutions of symbols between species and their ancestors in the phylogenetic tree. Here is one possible tree.

[Figure: a tree with leaves AGA, AAA, GGA, AAG; both internal nodes and the root are labeled AAA; the edges carrying changes are marked 2, 1, 1.]

Total #substitutions = 4

Page 6: Phylogenetic Trees (2) Lecture 13

6

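Example Continued

There are many possible trees and assignments for this input. For example:

[Figure, left tree: leaves AGA, GGA, AAA, AAG; internal nodes AAA and AGA; root AAA; edge changes 1, 1, 1.] Total #substitutions = 3

[Figure, right tree: leaves GGA, AAA, AGA, AAG; internal nodes AAA and AAA; root AAA; edge changes 1, 1, 2.] Total #substitutions = 4

The left tree is preferred over the right tree.

The total number of changes is called the parsimony score.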
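The parsimony score of a fully labeled tree is just the sum of Hamming distances along its edges. Here is a minimal sketch; the node names and the exact pairing of leaves under each internal node are assumptions made for illustration (chosen so that the totals match the two totals stated in the example), and `parsimony_score` is a hypothetical helper name.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def parsimony_score(edges, label):
    """Sum of substitutions over all (parent, child) edges of a labeled tree."""
    return sum(hamming(label[u], label[v]) for u, v in edges)


# Assumed topology: a root with two internal children u and w, two leaves each.
EDGES = [("root", "u"), ("root", "w"),
         ("u", "l1"), ("u", "l2"), ("w", "l3"), ("w", "l4")]

# Assumed labeling giving a total of 3.
left = {"root": "AAA", "u": "AGA", "w": "AAA",
        "l1": "AGA", "l2": "GGA", "l3": "AAA", "l4": "AAG"}

# Assumed labeling giving a total of 4.
right = {"root": "AAA", "u": "AAA", "w": "AAA",
         "l1": "GGA", "l2": "AAA", "l3": "AGA", "l4": "AAG"}

print(parsimony_score(EDGES, left), parsimony_score(EDGES, right))
```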

Page 7: Phylogenetic Trees (2) Lecture 13

7

Example With One Letter Sequences

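Suppose we have five species, three of which have ‘C’ and two ‘T’ at a specified position.

A minimal tree has only one evolutionary change:

[Figure: a tree whose five leaves are labeled C, C, C, T, T, with internal nodes labeled C or T so that exactly one edge joins a C to a T.]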

Page 8: Phylogenetic Trees (2) Lecture 13

8

Parsimony Based Reconstruction

Two separate components:

1. A procedure to find the minimum number of changes needed to explain the data for a given tree topology, where species are assigned to leaves.

2. A search through the space of trees.

We will see efficient algorithms for (1). Component (2) is hard.

Page 9: Phylogenetic Trees (2) Lecture 13

9

Example of Input for a Given Tree

A (Aardvark): CAGGTA
B (Bison):    CAGACA
C (Chimp):    CGGGTA
D (Dog):      TGCACT
E (Elephant): TGCGTA

The tree and the assignment of strings to the leaves are given, and we need only assign strings to the internal vertices.

Page 10: Phylogenetic Trees (2) Lecture 13

10

Fitch Algorithm: Maximum Parsimony for a Given Tree

Input: A rooted binary tree with characters at the leaves.

Output: A most parsimonious assignment of states to the internal vertices.

Work on each position independently. Make one pass from the leaves to the root, and another pass from the root to the leaves.

[Figure: example tree with leaves A, A, C, T, A; the internal vertices carry the candidate states A, A/C, A/T and A.]

Page 11: Phylogenetic Trees (2) Lecture 13

11

Fitch’s Algorithm, in More Detail

1. Traverse the tree from the leaves to the root, fixing a set of possible states (e.g. nucleotides) for each internal vertex.

2. Traverse the tree from the root to the leaves, picking a unique state for each internal vertex.

Page 12: Phylogenetic Trees (2) Lecture 13

12

Fitch’s Algorithm – Phase 1

Do a post-order (from leaves to root) traversal of the tree, and assign to each vertex a set of possible states. Each leaf has a unique possible state, given by the input.

The set of possible states $R_i$ of an internal node $i$ with children $j$ and $k$ is given by:

$$R_i = \begin{cases} R_j \cap R_k & \text{if } R_j \cap R_k \neq \emptyset \\ R_j \cup R_k & \text{otherwise} \end{cases}$$
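Phase 1 can be sketched as a short recursion. The tree encoding is an assumption made for brevity: a leaf is a one-character string, and an internal node is a pair `(left, right)`. The function name `fitch_phase1` is illustrative. It returns the state set at the root of the given (sub)tree together with the number of union operations performed.

```python
def fitch_phase1(tree):
    """Return (state set R at the root of `tree`, number of unions used).

    A leaf is a single-character string; an internal node is a 2-tuple
    (left_subtree, right_subtree).
    """
    if isinstance(tree, str):          # leaf: its set is its given state
        return {tree}, 0
    left, right = tree
    r_j, unions_j = fitch_phase1(left)
    r_k, unions_k = fitch_phase1(right)
    common = r_j & r_k
    if common:                         # non-empty intersection: keep it
        return common, unions_j + unions_k
    return r_j | r_k, unions_j + unions_k + 1   # otherwise take the union


# Five species, three with C and two with T at the position considered.
example = ((("C", "C"), "C"), ("T", "T"))
print(fitch_phase1(example))
```

On this tree a single union is performed (at the root), in line with the one-change tree of the one-letter example.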

Page 13: Phylogenetic Trees (2) Lecture 13

13

Fitch’s Algorithm – Phase 1

Claim (to be proved soon):# of substitutions in optimal solution = # of union operations

[Figure: example tree annotated with the state sets computed in Phase 1.]

Page 14: Phylogenetic Trees (2) Lecture 13

14

Fitch’s Algorithm – Phase 2

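Do a pre-order (from root to leaves) traversal of the tree. The root is assigned an arbitrary state from its set.

Select the state $r_j$ of an internal node $j$ with parent $i$ as follows:

$$r_j = \begin{cases} r_i & \text{if } r_i \in R_j \\ \text{an arbitrary state in } R_j & \text{otherwise} \end{cases}$$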
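Both phases together can be sketched as follows for a single character. The dictionary-based tree encoding, the vertex names, and the use of `min()` to make the "arbitrary" choices deterministic are assumptions of this sketch, not part of the algorithm as stated.

```python
def fitch(children, leaf_state, root):
    """Assign one state per vertex for a single character.

    children:   dict mapping each internal vertex to its pair of children
    leaf_state: dict mapping each leaf to its given state
    """
    R = {}  # Phase 1 sets

    def up(i):  # post-order pass (leaves to root)
        if i not in children:
            R[i] = {leaf_state[i]}
            return
        j, k = children[i]
        up(j)
        up(k)
        common = R[j] & R[k]
        R[i] = common if common else R[j] | R[k]

    up(root)
    # "Arbitrary" choices are made deterministic here by taking min().
    state = {root: min(R[root])}

    def down(i):  # pre-order pass (root to leaves)
        for j in children.get(i, ()):
            state[j] = state[i] if state[i] in R[j] else min(R[j])
            down(j)

    down(root)
    return state


def count_changes(children, state):
    """Number of edges whose endpoints received different states."""
    return sum(state[i] != state[j] for i in children for j in children[i])


# Hypothetical four-leaf tree: r -> (x, y), x -> (a, b), y -> (c, d).
CHILDREN = {"r": ("x", "y"), "x": ("a", "b"), "y": ("c", "d")}
LEAVES = {"a": "C", "b": "T", "c": "T", "d": "G"}
state = fitch(CHILDREN, LEAVES, "r")
print(state, count_changes(CHILDREN, state))
```

Here Phase 1 performs two unions (at x and at y) and the final assignment has two changes.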

Page 15: Phylogenetic Trees (2) Lecture 13

15

Fitch’s Algorithm – Phase 2

[Figure: the same example tree, with the state selected in Phase 2 at each vertex.]

The algorithm could also select C as the assignment to the root. All other assignments are unique.

Complexity: O(nk), where n is the number of leaves and k is the number of states. For m characters the complexity is O(nmk).

Page 16: Phylogenetic Trees (2) Lecture 13

16

Proof of Fitch’s Algorithm

We’ll show that Fitch minimizes the parsimony score at every character.

Definitions:

For a leaf-labeled tree T, let T* be an optimal assignment of labels to the internal nodes of T, and let T*(v) be the label that T* assigns to the internal node v.

Let Tv be the subtree rooted at v.

Page 17: Phylogenetic Trees (2) Lecture 13

17

Claim: The first phase of Fitch keeps at v the set of states S(v) such that $s \in S(v)$ iff there exists an optimal assignment Tv* with Tv*(v) = s.

Proof: By induction on the tree height h. Basis: h=1.

I. If both children have the same state: zero changes.

II. Otherwise: exactly one change.

[Figure: case I, two leaves A and A whose parent gets the set {A}; case II, two leaves A and B whose parent gets the set {A, B}.]

Subtle point: an optimal labeling of the whole tree may label a non-root vertex v with a state that is not in S(v) (in which case the labeling it induces on the subtree Tv is not optimal).
Page 18: Phylogenetic Trees (2) Lecture 13

18

• Induction step: Assume correctness for height k; we prove it for height k+1. Let p1 and p2 be the optimal costs of the subtrees rooted at v’s children.

• If the intersection of the children’s lists is not empty, then the optimal score is p1+p2, and it is achieved by labeling v with any member of the intersection, and only in this way.

• Otherwise, the optimal score is p1+p2+1, and it is achieved by labeling v with any member of the union of the lists, and only in this way.

[Figure: children lists {A,B} and {C,D} give the parent list {A,B,C,D}; children lists {A,B} and {B,C} give the parent list {B}.]

Note: in the second case there may be an optimal labeling in which the label of a child u of v is not in S(u).
Page 19: Phylogenetic Trees (2) Lecture 13

19

Generalization: Weighted Parsimony (Sankoff’s algorithm)

Weighted parsimony score: each change is weighted by a score c(a,b). The weighted parsimony score reduces to the parsimony score when c(a,a)=0 and c(a,b)=1 for all b other than a.

Page 20: Phylogenetic Trees (2) Lecture 13

20

Weighted Parsimony on a Given Tree

Each position is independent and is computed by itself.

Use dynamic programming. If i is a node with children j and k, then

$$S(i,a) = \min_{b}\bigl(S(j,b)+c(a,b)\bigr) + \min_{b'}\bigl(S(k,b')+c(a,b')\bigr)$$

where S(j,b) is the optimal score of the subtree rooted at j when j has the character b.

[Figure: node i with children j and k, annotated with S(i,a), S(j,b) and S(k,b’).]

Page 21: Phylogenetic Trees (2) Lecture 13

21

Evaluating Parsimony Scores

Dynamic programming on a given tree.

Initialization: For each leaf i, set S(i,a) = 0 if i is labeled by a; otherwise S(i,a) = ∞.

Iteration: If i is a node with children j and k, then

$$S(i,a) = \min_{x}\bigl(S(j,x)+c(a,x)\bigr) + \min_{y}\bigl(S(k,y)+c(a,y)\bigr)$$

Termination: The cost of the tree is $\min_{x} S(r,x)$, where r is the root.

Comment:

To reconstruct an optimal assignment, we need to keep at each node i, and for each character a, the two characters x and y that minimize the cost when i has character a.
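The recurrence can be sketched directly as a post-order pass. The dictionary tree encoding and the two cost functions below are assumptions for illustration (the transition/transversion weights are hypothetical, not taken from the lecture); the traceback mentioned in the comment above is omitted to keep the sketch short.

```python
import math


def sankoff(children, leaf_state, root, states, cost):
    """Return (cost of the tree, table S) for one character.

    S[i][a] is the optimal score of the subtree rooted at i when i has state a.
    """
    S = {}

    def up(i):
        if i not in children:  # initialization at the leaves
            S[i] = {a: (0 if a == leaf_state[i] else math.inf) for a in states}
            return
        j, k = children[i]
        up(j)
        up(k)
        S[i] = {a: min(S[j][x] + cost(a, x) for x in states)
                   + min(S[k][y] + cost(a, y) for y in states)
                for a in states}

    up(root)
    return min(S[root].values()), S


def unit_cost(a, b):
    """c(a,a)=0 and c(a,b)=1: ordinary parsimony."""
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Hypothetical weights: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    purines = {"A", "G"}
    return 1 if (a in purines) == (b in purines) else 2


CHILDREN = {"r": ("x", "y"), "x": ("a", "b"), "y": ("c", "d")}
LEAVES = {"a": "C", "b": "T", "c": "T", "d": "G"}
unit_score, _ = sankoff(CHILDREN, LEAVES, "r", "ACGT", unit_cost)
weighted_score, _ = sankoff(CHILDREN, LEAVES, "r", "ACGT", ts_tv_cost)
print(unit_score, weighted_score)
```

With unit costs the result coincides with the ordinary parsimony score of the same tree.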

Page 22: Phylogenetic Trees (2) Lecture 13

22

Cost of Evaluating Parsimony for binary trees

For a tree with n nodes and a single character with k values, the complexity is $O(nk^2)$. When there are m such characters, it is $O(nmk^2)$.

Page 23: Phylogenetic Trees (2) Lecture 13

23

2. Finding the right tree: The Perfect Phylogeny Problem

Recall the general problem:

Input: A set of species, specified by strings of characters.

Output: A tree T, and an assignment of the species to the leaves of T, with minimum parsimony score.

A restricted variant of this problem is the Perfect Phylogeny problem.

The algorithms of Fitch and Sankoff assume that the tree is known. Finding the optimal tree is harder.

Page 24: Phylogenetic Trees (2) Lecture 13

24

2. The Perfect Phylogeny Problem

Basic assumption for the perfect phylogeny problem:

A character is a significant property, which distinguishes between species (e.g. dental structure).

Hence, characters in evolutionary trees should be “Homoplasy free”, as we define next.

Page 25: Phylogenetic Trees (2) Lecture 13

25

Homoplasy-free characters 1

Characters in phylogenetic trees should avoid:

Reversal transitions

A species regains a state its direct ancestor has lost.

Well-known reversals: teeth in birds, legs in snakes.

(Experiment reported in Science 80: producing teeth in chickens.)
Page 26: Phylogenetic Trees (2) Lecture 13

26

Homoplasy-free characters 2

…and also avoid convergence transitions

Two species possess the same state while their least common ancestor possesses a different state.

Well-known example of convergence: the marsupials.

Page 27: Phylogenetic Trees (2) Lecture 13

27

[Figure note: the mammals on the right are marsupials. First all the marsupials split off, and afterwards they converged to various traits similar to those of "regular" mammals.]
Page 28: Phylogenetic Trees (2) Lecture 13

28

Characters as Colorings

A coloring of a tree T=(V,E) is a mapping $C: V \to$ [set of colors].

A partial coloring of T is a mapping defined on a subset of the vertices $U \subseteq V$:

$C: U \to$ [set of colors]

Page 29: Phylogenetic Trees (2) Lecture 13

29

Characters as Colorings (2)

Each character defines a (partial) coloring of the corresponding phylogenetic tree:

Species ≡ Vertices

States ≡ Colors

Page 30: Phylogenetic Trees (2) Lecture 13

30

Convex Colorings (and Characters)

Let T=(V,E) be a colored tree, and let d be a color. The d-carrier is the minimal subtree of T containing all the vertices colored d.

Definition: A (partial or total) coloring C of a tree is convex iff the d-carriers of distinct colors d are pairwise disjoint.
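The definition can be checked mechanically. In the sketch below the tree is given as an undirected adjacency dictionary and the (partial) coloring as a dictionary from colored vertices to colors; both encodings and the function names are assumptions for illustration. The d-carrier is obtained by repeatedly pruning leaves that are not colored d.

```python
def carrier(adj, targets):
    """Vertices of the minimal subtree of the tree `adj` containing `targets`."""
    alive = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    # Start from current leaves (or isolated vertices) that are not targets.
    stack = [v for v in adj if degree[v] <= 1 and v not in targets]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue
        alive.discard(v)
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] <= 1 and u not in targets:
                    stack.append(u)
    return alive


def is_convex(adj, coloring):
    """True iff the carriers of distinct colors are pairwise disjoint."""
    covered = set()
    for d in set(coloring.values()):
        c = carrier(adj, {v for v, col in coloring.items() if col == d})
        if covered & c:
            return False
        covered |= c
    return True


# A path 1 - 2 - 3 - 4 used as a small test tree.
PATH = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(is_convex(PATH, {1: "R", 3: "R", 4: "B"}))  # partial coloring
print(is_convex(PATH, {1: "R", 2: "B", 3: "R"}))  # B lies inside the R-carrier
```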

Page 31: Phylogenetic Trees (2) Lecture 13

31

Convexity $\Leftrightarrow$ Homoplasy Freedom

A character is homoplasy free (avoids reversal and convergence transitions)

$\Leftrightarrow$

the corresponding (partial) coloring is convex.

Page 32: Phylogenetic Trees (2) Lecture 13

32

The Perfect Phylogeny Problem

Input: A set of species, and many characters.

Question: Is there a tree T containing the species as vertices, in which all the characters (colorings) are convex?

Page 33: Phylogenetic Trees (2) Lecture 13

33

The Perfect Phylogeny Problem (pure graph theoretic setting)

Input: Partial colorings (C1,…,Ck) of a set of vertices U (in the example: 3 total colorings, left, center and right, each by two colors).

Problem: Is there a tree T=(V,E) s.t. $U \subseteq V$ and, for i=1,…,k, Ci is a convex (partial) coloring of T?

[Figure: three colorings (left, center, right) of the same set of vertices, each using the two colors R and B.]

The problem is NP-hard in general, and in P for some special cases. Next we show a polynomial time algorithm for the case of binary characters.

Page 34: Phylogenetic Trees (2) Lecture 13

34

Perfect Phylogeny for directed binary characters

Input: A matrix whose rows correspond to objects (species) and whose columns correspond to characters.

Each character has two states: 0 (absent) or 1 (present). WLOG, for each character there is a species which possesses it.

Question: Is there a perfect phylogeny tree for the given species in which all the characters have value 0 at some specified internal vertex (the root)?

   C1 C2 C3 C4 C5
A  1  1  0  0  0
B  0  0  1  0  0
C  1  1  0  0  1
D  0  0  1  1  0
E  0  1  0  0  0

[Figure: a tree whose root is labeled (00000), with the species A (11000), B (00100), C (11001), D (00110) and E (01000) as vertices.]

Note: see the changes made to this topic in Lecture 14 (Jan 06).
Page 35: Phylogenetic Trees (2) Lecture 13

35

Perfect Phylogeny for directed binary characters

By definition, for each character C there is one edge on which it is converted from 0 to 1. In the tree below, the edge on which character C2 is converted to 1 is marked. The resulting tree is convex for this character.

   C2
A  1
B  0
C  1
D  0
E  1

[Figure: the tree on A, B, C, D, E with the edge of C2 marked; the vertices below that edge have C2 = 1 and all other vertices have C2 = 0.]

Page 36: Phylogenetic Trees (2) Lecture 13

36

Directed Perfect Phylogeny for a 0-1 Matrix

Proof of the observation (sketch): we need to show that

[I and II hold] $\Leftrightarrow$ [each character is convex on T].

[I and II hold] $\Leftrightarrow$ for each character C there is one edge on which it is converted from 0 to 1 $\Leftrightarrow$ the species possessing each character C induce a connected subtree of T.

   C2
A  1
B  0
C  1
D  0
E  1

[Figure: the same tree, with the edge on which character C2 is converted to 1 marked.]

Page 37: Phylogenetic Trees (2) Lecture 13

37

The directed, binary Perfect Phylogeny Problem

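A tree is a directed perfect phylogeny for a given 0-1 matrix M iff we can map each character to an edge s.t. the edge labeled Ci represents the change of character Ci’s state from 0 to 1. Below we show such a tree for the given matrix:

   C1 C2 C3 C4 C5
A  1  1  0  0  0
B  0  0  1  0  0
C  1  1  0  0  1
D  0  0  1  1  0
E  0  1  0  0  0

[Figure: the tree for this matrix, with each character labeling one edge. The edge C2 leads to the subtree containing A, C and E; inside it, the edge C1 leads to A and C, and the edge C5 leads to C. The edge C3 leads to the subtree containing B and D; inside it, the edge C4 leads to D.]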

Page 38: Phylogenetic Trees (2) Lecture 13

38

Efficient algorithm for the Binary Perfect Phylogeny Problem

Definition: Given a 0-1 matrix M, let $O_k = \{j : M_{jk} = 1\}$, i.e. $O_k$ is the set of objects that have character Ck.

Theorem: M has a directed perfect phylogenetic tree iff the sets $\{O_i\}$ are laminar, i.e. for all i, j, either $O_i$ and $O_j$ are disjoint, or one includes the other.

Laminar:

   C1 C2 C3 C4 C5
A  1  1  0  0  0
B  0  0  1  0  0
C  1  1  0  0  1
D  0  0  1  1  0
E  0  1  0  0  0

Not laminar ($O_2$ = {A,C,E} and $O_5$ = {B,C,E} overlap, but neither includes the other):

   C1 C2 C3 C4 C5
A  1  1  0  0  0
B  0  0  1  0  1
C  1  1  0  0  1
D  0  0  1  1  0
E  0  1  0  0  1
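The laminarity condition of the theorem is easy to test by comparing all pairs of character sets. The sketch below uses the two example matrices; the dictionary encoding of the matrix and the function names are assumptions for illustration.

```python
def character_sets(matrix):
    """Return the list [O_1, ..., O_m] for a matrix given as {object: row}."""
    m = len(next(iter(matrix.values())))
    return [frozenset(obj for obj, row in matrix.items() if row[k] == 1)
            for k in range(m)]


def is_laminar(sets):
    """True iff every two sets are disjoint or one includes the other."""
    return all(a.isdisjoint(b) or a <= b or b <= a
               for i, a in enumerate(sets) for b in sets[i + 1:])


M_LAMINAR = {
    "A": [1, 1, 0, 0, 0],
    "B": [0, 0, 1, 0, 0],
    "C": [1, 1, 0, 0, 1],
    "D": [0, 0, 1, 1, 0],
    "E": [0, 1, 0, 0, 0],
}
M_NOT_LAMINAR = {
    "A": [1, 1, 0, 0, 0],
    "B": [0, 0, 1, 0, 1],
    "C": [1, 1, 0, 0, 1],
    "D": [0, 0, 1, 1, 0],
    "E": [0, 1, 0, 0, 1],
}
print(is_laminar(character_sets(M_LAMINAR)))
print(is_laminar(character_sets(M_NOT_LAMINAR)))
```

This pairwise test takes time quadratic in the number of characters (times the cost of a set comparison), which is enough to show that the binary case is polynomial.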

Page 39: Phylogenetic Trees (2) Lecture 13

39

Proof

($\Rightarrow$): Assume M has a directed perfect phylogeny, and let i, j be given.

Consider the edges labeled Ci and Cj.

Case 1: There is a root-to-leaf path containing both edges. Then one of Oi, Oj includes the other (C2 and C1 in the tree below).

Case 2: Not case 1. Then Oi and Oj are disjoint (C2 and C3).

[Figure: the tree for the matrix M, with edges labeled C1 through C5.]

Page 40: Phylogenetic Trees (2) Lecture 13

40

Proof (cont.)

($\Leftarrow$): Assume that for all i, j, either Oi and Oj are disjoint, or one includes the other. We prove by induction on the number of characters that M has a directed perfect phylogenetic tree.

Basis: one character. Then there are at most two distinct objects, one with and one without this character.

   C1
A  1
B  0

[Figure: a tree in which the edge labeled C1 leads to A, while B is not below that edge.]

Page 41: Phylogenetic Trees (2) Lecture 13

41

Proof (cont.): Induction step: Assume correctness for n-1 characters, and consider a matrix with n characters (non-zero columns). WLOG assume that O1 is not properly contained in Oj for j > 1.

Let S1 be the set of objects j for which Mj1 = 1 (that is, S1 = O1), and let S2 be the remaining objects. Then for each character C, either all the objects possessing C are contained in S1, or all of them are contained in S2 (prove!).

By induction there are trees T1 and T2 for S1 and S2. Combining them as below gives the desired tree.

   C1 C2 C3 C4 C5
A  1  1  0  0  0
B  0  0  1  0  0
C  1  1  0  0  1
D  0  0  1  1  0
E  1  0  0  0  0

S1 = {A,C,E}, S2 = {B,D}

[Figure: T1 and T2 combined into one tree, with T1 hanging below the edge labeled C1.]
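The constructive direction of the proof suggests a simple way to build the tree. The sketch below is not the recursion of the proof itself but an equivalent containment-based construction, stated under assumptions: every column is non-zero, object names do not clash with the generated node names `C1`, `C2`, …, and the node named `Ck` stands for the lower endpoint of the edge labeled Ck. Each character node is attached below its smallest superset, and each object below the smallest set containing it.

```python
def build_tree(matrix):
    """Return a child -> parent dict for a directed perfect phylogeny,
    or None if the character sets are not laminar."""
    names = list(matrix)
    m = len(matrix[names[0]])
    O = [frozenset(o for o in names if matrix[o][k] == 1) for k in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            if not (O[i].isdisjoint(O[j]) or O[i] <= O[j] or O[j] <= O[i]):
                return None
    # Larger sets first, so every superset of O[k] precedes it in `order`.
    order = sorted(range(m), key=lambda k: len(O[k]), reverse=True)
    parent = {}
    for pos, k in enumerate(order):
        supersets = [q for q in order[:pos] if O[k] <= O[q]]
        parent[f"C{k + 1}"] = f"C{supersets[-1] + 1}" if supersets else "root"
    for o in names:
        own = [k for k in order if o in O[k]]
        parent[o] = f"C{own[-1] + 1}" if own else "root"
    return parent


def row_from_tree(parent, obj, m):
    """Recover an object's 0-1 row by walking from it up to the root."""
    row = [0] * m
    v = parent[obj]
    while v != "root":
        row[int(v[1:]) - 1] = 1
        v = parent[v]
    return row


M = {
    "A": [1, 1, 0, 0, 0],
    "B": [0, 0, 1, 0, 0],
    "C": [1, 1, 0, 0, 1],
    "D": [0, 0, 1, 1, 0],
    "E": [1, 0, 0, 0, 0],
}
tree = build_tree(M)
print(tree)
```

Reading the rows back with `row_from_tree` is a convenient sanity check that each character changes from 0 to 1 on exactly one edge.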