Post on 15-Jan-2016

Greedy method for inferring tandem duplication history
Louxin Zhang, Bin Ma, Lusheng Wang and Ying Xu. BIOINFORMATICS 2003

References:
1. Elemento, O. (2002) Reconstructing the duplication history of tandemly repeated genes. Mol. Biol. Evol.
2. Tang, M., Waterman, M. (2001) Zinc finger gene clusters and tandem gene duplication. RECOMB

Reporters: r92922054 李明翰, b88506020 黃寶萱, b90902020 蔡明潔

Outline

Duplication model

Constructing duplication model from phylogeny
    Double duplication model
    Arbitrary duplication model

Discussion

Duplication

A duplication replaces a stretch of DNA containing several repeats with two identical and adjacent copies of itself.

If the stretch contains k repeats, the duplication is called a k-duplication.
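As a minimal sketch of this definition, assuming the repeats are held in a Python list (the name `tandem_duplicate` and its parameters are illustrative, not taken from the paper):

```python
def tandem_duplicate(repeats, start, k):
    """Apply a k-duplication to the k repeats beginning at index `start`."""
    if k < 1 or start < 0 or start + k > len(repeats):
        raise ValueError("the duplicated stretch must lie inside the sequence")
    stretch = repeats[start:start + k]
    # The stretch is replaced by two identical, adjacent copies of itself.
    return repeats[:start] + stretch + stretch + repeats[start + k:]


# A 2-duplication of the stretch (B, C) in the sequence A B C D.
print(tandem_duplicate(["A", "B", "C", "D"], 1, 2))
```

A 1-duplication copies a single repeat in place; larger k copies a whole run of adjacent repeats at once.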

DM (duplication model)

A duplication model M for a family of tandemly repeated sequences is a directed graph.

A duplication model contains nodes, edges and blocks.

Phylogeny & DM

Node & Edge

A node in DM represents a repeat.

A directed edge (u, v) indicates that v is a child of u, and hence that u is an ancestor of v.

Nodes are of three kinds: root, leaf and internal node.

Block

A block in DM represents a duplication.

Each internal node appears in a unique block.

No node in a block is an ancestor of another node in the same block.

We draw a block representing a k-duplication only when k ≥ 2.

Block (Cont.)

lc(v) denotes the left child of v and rc(v) the right child of v.

If the block corresponds to a k-duplication, then it contains k nodes v1, v2, …, vk from left to right.

Their children then appear from left to right in the order

lc(v1), lc(v2), …, lc(vk), rc(v1), rc(v2), …, rc(vk)

Cont.

Hence, for any i and j with 1 ≤ i < j ≤ k, the edges (vi, rc(vi)) and (vj, lc(vj)) cross each other.

The left-to-right order of leaves in the model is identical to the order of the sequences on a chromosome.
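The crossing property can be checked with a short sketch. It assumes the block nodes v1, …, vk sit at positions 1, …, k and their children at positions 1, …, 2k in the order lc(v1), …, lc(vk), rc(v1), …, rc(vk). The helper names are illustrative only.

```python
def child_positions(k):
    """Left-to-right positions of the children of a k-duplication block."""
    lc_pos = {i: i for i in range(1, k + 1)}        # lc(v_i) is the i-th child
    rc_pos = {i: k + i for i in range(1, k + 1)}    # rc(v_i) is the (k+i)-th child
    return lc_pos, rc_pos


def crossing_pairs(k):
    """Pairs (i, j), i < j, whose edges (v_i, rc(v_i)) and (v_j, lc(v_j)) cross."""
    lc_pos, rc_pos = child_positions(k)
    # v_i is drawn left of v_j, so the two edges cross exactly when
    # rc(v_i) ends up to the right of lc(v_j).
    return [(i, j)
            for i in range(1, k + 1)
            for j in range(i + 1, k + 1)
            if rc_pos[i] > lc_pos[j]]


print(crossing_pairs(3))
```

Every pair i < j is reported, because rc(vi) is at position k + i, which is always greater than j, the position of lc(vj).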

Example

For the block [v1, v3, v4], the children appear in the order lc(v1), lc(v3), lc(v4), rc(v1), rc(v3), rc(v4).

An ordered phylogenetic tree for the sequences {1, 2, …, n} is a rooted phylogeny whose leaves are listed from left to right in increasing order.

LEMMA 1:

l*c(u) and r*c(u) denote the leftmost and the rightmost leaf in the subtree TM(u) rooted at u, respectively.

For each internal node u in TM,

r*c(u) > r*c(lc(u)) and l*c(u) < l*c(rc(u)).

l*c(u) and r*c(u) are the smallest and the biggest labels in the subtree TM(u), respectively.
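The property in Lemma 1 can be tested directly on a small tree. The sketch below assumes a dictionary encoding (internal node -> (left child, right child), leaves as integers) and illustrative helper names. It is a quadratic-time check for clarity, not the linear-time procedure of the paper.

```python
def leaves(tree, u):
    """All leaf labels in the subtree rooted at u."""
    if u not in tree:                 # leaves are not keys of the dictionary
        return [u]
    lc, rc = tree[u]
    return leaves(tree, lc) + leaves(tree, rc)


def leftmost_leaf(tree, u):
    while u in tree:
        u = tree[u][0]                # keep following left children
    return u


def rightmost_leaf(tree, u):
    while u in tree:
        u = tree[u][1]                # keep following right children
    return u


def satisfies_lemma1(tree):
    """True if, at every internal node, the leftmost leaf is the smallest
    label and the rightmost leaf is the biggest label of the subtree."""
    return all(
        leftmost_leaf(tree, u) == min(leaves(tree, u))
        and rightmost_leaf(tree, u) == max(leaves(tree, u))
        for u in tree
    )


# Phylogeny of a single 2-duplication [a, b]: leaves end up in order 1 2 3 4.
model_tree = {"r": ("a", "b"), "a": (1, 3), "b": (2, 4)}
# Same tree with the children of the root swapped: leaf 2 becomes leftmost.
swapped = {"r": ("b", "a"), "a": (1, 3), "b": (2, 4)}
print(satisfies_lemma1(model_tree), satisfies_lemma1(swapped))
```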

Constructing a duplication model from a phylogeny

Features:

A duplication model M has a unique associated phylogeny TM.

A phylogeny is not necessarily associated with a duplication model.

Problem: reconstruct the duplication model M in linear time

Input: a phylogeny T

Output: the duplication model M associated with T, if one exists

Problem (Cont.)

Note: To represent a duplication model, we only need to list all non-single duplication blocks on the associated phylogeny.

[v1, v3, v4] [v5, v6] [v7, v8]
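To see why the block list is enough, the sketch below replays the duplications on a tree and returns the resulting leaf order. The tree shape is an assumption: it is my reading of the 10-leaf example in these slides (reconstructed from the node labels and index pairs, since the figure itself is not in the transcript). The function name `leaf_order` and the dictionary encoding are illustrative. This is a verification aid, not the linear-time algorithm of the paper.

```python
def leaf_order(tree, root, blocks):
    """Replay the model top-down and return the final left-to-right sequence.

    tree   : internal node -> (left child, right child); leaves are ints
    blocks : non-single blocks, each listed from left to right
    """
    block_of = {v: tuple(block) for block in blocks for v in block}
    seq = [root]
    while True:
        for pos, v in enumerate(seq):
            if v not in tree:
                continue                          # leaves are never expanded
            block = block_of.get(v, (v,))         # unlisted nodes are single duplications
            if tuple(seq[pos:pos + len(block)]) == block:
                lefts = [tree[x][0] for x in block]
                rights = [tree[x][1] for x in block]
                # lc(v1), ..., lc(vk) followed by rc(v1), ..., rc(vk)
                seq[pos:pos + len(block)] = lefts + rights
                break
        else:
            return seq                            # nothing left that can be expanded


# Assumed shape of the 10-leaf example tree.
TREE = {
    "r": ("v1", "v2"), "v1": (1, 6), "v2": ("v3", "v4"),
    "v3": ("v5", "v7"), "v4": ("v6", "v8"),
    "v5": (2, 4), "v6": (3, 5), "v7": (7, 9), "v8": (8, 10),
}
BLOCKS = [["v1", "v3", "v4"], ["v5", "v6"], ["v7", "v8"]]
print(leaf_order(TREE, "r", BLOCKS))
```

With the three blocks the leaves come out in chromosome order 1, …, 10. With an empty block list the same tree yields a scrambled order, which is what distinguishes a plain phylogeny from a duplication model. If a listed block can never be applied, the returned sequence still contains internal nodes.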

Double duplication models

Given a phylogeny T on the sequence family F = {1, 2, …, n}, associate a pair (Lv, Rv) of indices with each node v in T as follows:

1. For the i-th leaf node: (Lv, Rv) = (i, i)

2. For an internal node: (Lv, Rv) = (l*c(v), r*c(v))

[Figure: example phylogeny T on 10 leaves, each node labelled with its pair (Lv, Rv): r (1,10), v1 (1,6), v2 (2,10), v3 (2,9), v4 (3,10), v5 (2,4), v6 (3,5), v7 (7,9), v8 (8,10); leaf i has the pair (i, i) for i = 1, …, 10.]

Computing (Lv, Rv) bottom-up

L_v = min {L_{lc(v)}, L_{rc(v)}}

R_v = max {R_{lc(v)}, R_{rc(v)}}

The pairs are computed recursively, bottom-up. Since T contains 2n − 1 nodes, this takes linear time.
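The recurrences can be sketched as an iterative post-order traversal that visits each of the 2n − 1 nodes once. The tree below is an assumed reconstruction of the 10-leaf example (its shape is inferred from the index pairs quoted in these slides). The function name `index_pairs` is illustrative.

```python
def index_pairs(tree, root):
    """Return {node: (L_v, R_v)} for every node of the tree.

    tree : internal node -> (left child, right child); leaves are ints
    """
    pairs = {}
    stack = [(root, False)]
    while stack:
        v, children_done = stack.pop()
        if v not in tree:
            pairs[v] = (v, v)                     # i-th leaf: (i, i)
        elif not children_done:
            stack.append((v, True))               # revisit v after its children
            stack.extend((child, False) for child in tree[v])
        else:
            lc, rc = tree[v]
            pairs[v] = (min(pairs[lc][0], pairs[rc][0]),
                        max(pairs[lc][1], pairs[rc][1]))
    return pairs


# Assumed shape of the 10-leaf example tree.
TREE = {
    "r": ("v1", "v2"), "v1": (1, 6), "v2": ("v3", "v4"),
    "v3": ("v5", "v7"), "v4": ("v6", "v8"),
    "v5": (2, 4), "v6": (3, 5), "v7": (7, 9), "v8": (8, 10),
}
pairs = index_pairs(TREE, "r")
print(pairs["v1"], pairs["v3"], pairs["v4"])
```

An explicit stack is used instead of recursion so that deep, path-like trees do not hit Python's recursion limit.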

Constructing DDM from phylogeny

Double duplication model (DDM): a duplication model in which every duplication is a 1-duplication or a 2-duplication.

By Lemma 1, the leftmost and rightmost leaves in T are 1 and n, respectively.

Where is leaf 2 located? Leaf 2 must be right next to leaf 1 in the DDM.

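Let the leftmost path from the root r to leaf 1 be

v_0 = r, v_1, v_2, …, v_{p−1}, v_p = 1        (1)

and let v_i be the node on this path whose right subtree contains leaf 2. The leftmost path from rc(v_i) to leaf 2 is

u_1 = rc(v_i), u_2, …, u_{q−1}, u_q = 2, where q ≥ p − i        (2)

LEMMA 2. M must contain p − i − 1 double duplications

[v_{i+1}, u_{j_1}], [v_{i+2}, u_{j_2}], …, [v_{p−1}, u_{j_{p−i−1}}]

In the example shown on the slide: i = 2, p = 5, q = 6.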

LEMMA 2. (Cont.)

Since j_{p−i−1} ≤ q − 1, it follows that q ≥ p − i.

PROOF. If v_{i+k} does not belong to a double duplication block in M, the leaf labeled 2 cannot be placed before the leftmost leaf in the subtree rooted at rc(v_{i+k}), contradicting the fact that 2 is right next to 1 in M. Hence, v_{i+k} must appear in a double duplication block for each k, 1 ≤ k ≤ p − i − 1. This finishes the proof.

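Note:

R_{u_1} > R_{u_2} > … > R_{u_{q−1}} > R_{u_q} and R_{v_{i+1}} > R_{v_{i+2}} > … > R_{v_{p−1}}.

For each block [v_{i+k}, u_{j_k}], the value R_{v_{i+k}} lies between R_{u_{j_k}} and R_{u_{j_k+1}}.

We can therefore determine all the u_{j_k}'s in p − i + q ≤ 2q comparisons.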
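Because both R sequences are decreasing, the partners u_{j_k} can be found with a single merge-like scan. The sketch below is one possible reading of that step, not code from the paper. The function name, the list encoding and the `None` return for "no valid pairing" are assumptions.

```python
def match_double_blocks(r_v, r_u):
    """Find j_1 < j_2 < ... with R_{u_{j_k}} > R_{v_{i+k}} > R_{u_{j_k + 1}}.

    r_v : R values of v_{i+1}, ..., v_{p-1} (decreasing)
    r_u : R values of u_1, ..., u_q (decreasing)
    Returns the 1-based indices j_k, or None if some v has no partner.
    """
    js = []
    j = 0                                   # 0-based position on the u path
    for rv in r_v:
        # Walk down the u path while the next node still has a larger R value.
        while j + 1 < len(r_u) and r_u[j + 1] > rv:
            j += 1
        if j + 1 >= len(r_u) or not (r_u[j] > rv > r_u[j + 1]):
            return None                     # no edge of the u path can host this v
        js.append(j + 1)                    # report the 1-based index j_k
        j += 1                              # each u node may be used only once
    return js


# Assumed values from the 10-leaf example: the v path contributes R_{v1} = 6,
# the u path (v2, v3, v5, leaf 2) has R values 10, 9, 4, 2.
print(match_double_blocks([6], [10, 9, 4, 2]))
```

The pointer `j` only moves forward, so the scan performs at most len(r_v) + len(r_u) steps, in line with the bound of p − i + q comparisons.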

After all the duplication blocks [v_{i+k}, u_{j_k}] are placed on T, leaf 2 should be right next to leaf 1.

Derive a rooted binary tree T'' from the subtree T(u_1) by

inserting a new node v_k in the edge (u_{j_k}, u_{j_k+1}) for each 1 ≤ k ≤ p − i − 1, and

assigning the subtree T(rc(v_{i+k})) rooted at rc(v_{i+k}) as the right subtree of v_k.

Note: the left child of v_k is now u_{j_k+1} in T''.

Then form the new phylogeny T' from T by replacing the subtree T(v_i) with T''.
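The tree surgery can be sketched on a dictionary-encoded tree. Everything here is an illustrative assumption: the function name, the primed names given to inserted nodes, the 10-leaf tree (reconstructed from the index pairs quoted in these slides), and the reading that leaf 1 and the consumed path nodes simply drop out once T(v_i) is replaced by T''.

```python
def splice_right_subtrees(tree, v_path, i, u_path, js):
    """Build T' from T for one step and return (new_tree, new_root).

    tree   : internal node -> (left child, right child); leaves are ints
    v_path : [v_0, ..., v_p], the leftmost path ending at leaf 1
    u_path : [u_1, ..., u_q], the leftmost path from rc(v_i) to leaf 2
    js     : [j_1, ..., j_{p-i-1}], 1-based indices into u_path
    """
    new_tree = {node: list(children) for node, children in tree.items()}
    for k, j in enumerate(js, start=1):
        v = v_path[i + k]
        upper, lower = u_path[j - 1], u_path[j]     # the edge (u_{j_k}, u_{j_k+1})
        inserted = f"{v}'"                          # hypothetical name for the new node
        new_tree[inserted] = [lower, tree[v][1]]    # right subtree is T(rc(v_{i+k}))
        new_tree[upper][0] = inserted               # new node sits on the edge
        del new_tree[v]                             # v_{i+k} has been consumed
    del new_tree[v_path[i]]                         # T(v_i) is replaced by T''
    if i == 0:
        return new_tree, u_path[0]                  # u_1 becomes the new root
    new_tree[v_path[i - 1]][0] = u_path[0]          # hang T'' below v_{i-1}
    return new_tree, v_path[0]


# Assumed shape of the 10-leaf example tree.
TREE = {
    "r": ("v1", "v2"), "v1": (1, 6), "v2": ("v3", "v4"),
    "v3": ("v5", "v7"), "v4": ("v6", "v8"),
    "v5": (2, 4), "v6": (3, 5), "v7": (7, 9), "v8": (8, 10),
}
new_tree, new_root = splice_right_subtrees(
    TREE, ["r", "v1", 1], 0, ["v2", "v3", "v5", 2], [2])
print(new_root, new_tree["v3"], new_tree["v1'"])
```

In this example the block [v1, v3] moves leaf 6, the right subtree of v1, onto the edge (v3, v5). The leftmost leaf of the new tree is 2, so the next step can treat leaf 2 the way this step treated leaf 1.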

Linear time (Analysis)

Since the comparisons made in different recursive steps can be charged to disjoint left paths in the input tree T, the whole algorithm takes at most 2 × 2n comparisons to determine all the duplication blocks. Hence it is a linear-time algorithm.

Each internal node is compared once while it lies on the path of length q (the path next to the leftmost path) and once while it lies on the path of length p (the leftmost path), and each comparison uses its pair (Lv, Rv). Therefore at most 2 × 2n comparisons are needed.

Arbitrary duplication models

Now, we generalize the above algorithm to arbitrary duplication models.

Again, we assume the leftmost paths leading to leaf 1 and leaf 2 in T are given in (1) and (2) respectively.

Observation:

Assume a phylogeny T is associated with a duplication model M. Then there exist p − i − 1 double duplication blocks [v_{i+k}, u_{j_k}] (1 ≤ k ≤ p − i − 1) such that, after these duplications are placed in T, leaf 2 is right next to leaf 1. However, these double duplication blocks may not be in M.

Recall that there are two types of nodes on the leftmost path of T’. Some nodes are original ones in the input tree T ; some are inserted due to duplication blocks we have examined so far.

To extend the existing duplication blocks to larger ones, we associate a flag with each original node on the leftmost path of T', which indicates whether the node is in an existing duplication block or not.

Let x be an original node on the leftmost path P of T' that appears in a duplication block [x1, x2, …, xt, x] of size t + 1 found so far. Then there are t inserted nodes x'i right below x on the path P, which correspond to xi for 1 ≤ i ≤ t.

To determine whether [x1, x2, …, xt, x] can be extended to a larger duplication block in the model with which the original tree T is associated, we need to consider x and all the x'i (1 ≤ i ≤ t) simultaneously.

For this purpose, we introduce the concept of hyper-double (duplication) blocks.

We say that x and y form a hyper-double block [x, y] in T' if the following three conditions hold:

(i) x is a node in some non-single duplication block [x1, x2, …, xt, x] that we have obtained so far;

(ii) neither x nor y is an ancestor of the other;

(iii) the block [x1, x2, …, xt, x] can be extended to a block [x1, x2, …, xt, x, y] of size t + 2 in the original tree T.

Hence, when we place a hyper-double block [x, y] in the current tree T', the edge (y, lc(y)) crosses not only the edge (x, rc(x)), but also the edges (x'i, rc(x'i)), 1 ≤ i ≤ t.

So a phylogeny T is associated with a duplication model if and only if:

(i) there exist p − i − 1 double duplication blocks [v_{i+k}, u_{j_k}] (1 ≤ k ≤ p − i − 1) in T such that, after these duplication blocks are placed in T, leaf 2 is right next to leaf 1, and

(ii) the tree T' constructed above is associated with a 'duplication model' once hyper-double duplication blocks are allowed.

To make the algorithm run in linear time, we refine the algorithm in two aspects.

First, we assign a pair (R'x, R''x) of indices to each node x on the leftmost path of T in each recursive step: if x is in a duplication block [x1, x2, …, xt, x] at the current stage, we set R'x = R_{x1} and R''x = R_x, which are defined in Section 2.2.1. Since R'x < R_{xi} < R''x for 2 ≤ i ≤ t, only R'x and R''x need to be examined to determine whether x is in a hyper-double block in the next step.

Secondly, if the duplication block [x1, x2, …, xt, x] is extended into a larger hyper-double block [x1, x2, …, xt, x, y] in a step, the binary tree T' for the next step is constructed by inserting the right subtrees of the xi's and of x into the edge between y and its left child lc(y).

To do these insertions, we point the left child of x1 to lc(y), and then point the left child of y to x.

In this way, we are able to insert all the subtrees in only two pointer operations.

DS: [v1,v2][v3,v5][v8,v6]

DS: [v1,v2][v3,v5,v4][v8,v6]

DS: [v1,v2][v3,v5,v4,v7][v8,v6]