A simple algorithm to infer gene duplication and speciation events on a gene tree

Algorithms in Bioinformatics: Final Report

Students: 陳智豪, 王秀綾, 王緯誠, 江志民, 侯藹玲

Date: January 21

Venue: Institute of Information Science, Academia Sinica

C. M. Zmasek and S. R. Eddy (2001). Bioinformatics, 17(9): 821--828.

INTRODUCTION

The various genome projects are currently producing an enormous amount of sequence data.

Function prediction is complicated by the fact that many proteins belong to large superfamilies that consist of subfamilies with different biological functions.

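Superfamily and subfamily

A superfamily is delimited by origin: even when the amino acid sequence similarity between different proteins is not high, similar structure and function suggest that they may share a common evolutionary origin, and they are then regarded as members of the same superfamily.

A subfamily is classified by evolutionary relatedness: proteins whose amino acid sequences are more than 30% identical are usually regarded as clearly related in evolution and as members of the same subfamily.

Note that high amino acid sequence identity does not necessarily imply similar structure and/or function.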

Methods for automated sequence function prediction

Pairwise sequence similarity such as BLAST (basic local alignment search tool).

Analyses using profile search algorithms such as HMMER.

Protein family databases such as Pfam and InterPro.

What is “phylogenomics”?

Phylogenomics uses the multiple alignment as input for a phylogenetic tree analysis and infers a more specific function from the placement of the new sequence in the tree of known sequences.

Why use phylogenomics?

In many cases, the identification of homologs is not sufficient to make specific functional predictions, because not all homologs have the same function.


Paralogous vs orthologous

Two genes are said to be paralogous if they are derived from a duplication event, but orthologous if they are derived from a speciation event.


Gene trees vs species tree

COG database

Although the COG method is clearly a major advance in identifying orthologous groups of genes, it is limited in its power because clustering is a way of classifying levels of similarity and is not an accurate method of inferring evolutionary relationships.

Phylogenomics and COG database

Phylogenomics: uses a multiple alignment of gene sequences to find the position of a gene in the evolutionary tree.

COG database: classifies gene sequences by sequence similarity, so evolutionary relationships are inferred only indirectly.

Algorithm

A simple algorithm to infer gene duplication and speciation on a gene tree

Report: Wang Wei-Cheng

Gene duplication can be trivially inferred when a species contains two or more homologs belonging to the same gene family (Fig. 1, G1).

[Figure 1, gene tree G1: Human, Nematode and Yeast genes in the a, b and r subfamilies; both duplication nodes are the trivial case]

Due to gene loss or incomplete sampling of genes in partially sequenced genomes, not all duplications are detectable by simple redundancy in a gene tree (Fig. 1, G2).

[Figure 1, gene tree G2: leaves Human a, Nematode r, Nematode b and Yeast a; one duplication node is the trivial case, the other is the nontrivial case]

Notation:

G: gene tree

S: species tree

For any gene g in G, γ(g) is the subtree of G rooted at g.

For any species s in S, σ(s) is the subtree of S rooted at s.

[Figure: trees G and S with the subtrees γ(g) and σ(s) marked]

Definition 1: M is the mapping function from G to S…

Definition 1: G and S are rooted binary trees (the gene tree and the species tree).

1. For every g ∈ G, let γ(g) be the set of species that occur in the subtree of G rooted at g.
2. For every s ∈ S, let σ(s) be the set of species that occur in the subtree of S rooted at s.
3. For every g ∈ G, M(g) = x ∈ S, where x is the smallest (lowest) node of S satisfying γ(g) ⊆ σ(x) = σ(M(g)).

[Figure: node g in G and node x = M(g) in S, with γ(g) ⊆ σ(x)]

•Example of γ(g) and σ(s):

[Figure: gene tree G1 with internal nodes g1, g2, g3, and species tree S over Human, Nematode and Yeast with internal nodes s1, s2]

γ(g2) = {Nematode, Human}, σ(s2) = {Human, Nematode}

γ(g1) = {Nematode, Human, Yeast}, σ(s1) = {Human, Nematode, Yeast}

•Definition 2: Let g1 and g2 be the children of g. Then g is a duplication if and only if M(g) = M(g1) or M(g) = M(g2).

[Figure: gene tree G1 and species tree S (Human, Nematode, Yeast), showing g, g1, g2 and their images M(g), M(g1), M(g2) at a duplication node]

•If all M(g) are known, the task takes linear time O(n): one traversal of the tree with n genes.
•Page first implemented M() by brute force: find all γ(g) and σ(s), then compare them. This takes O(n^3).
•To speed this up, observe that M(g) = LCA(M(g1), M(g2)).

•Input: rooted binary gene tree G; rooted binary species tree S of all species in G.
•Output: G with “duplication” or “speciation” assigned to each of its internal nodes.
•Initialization:
1. Number the nodes of S in preorder traversal.
2. For each external node g of G, set M(g) to the number of the external node in S with the matching species name.
•Recursion: visit each internal node g of G in postorder traversal, where g1 and g2 are the children of g:
set (a, b) = (M(g1), M(g2))
while (a != b) { if (a > b) a = parent(a) else b = parent(b) }
set M(g) = a
if (M(g) == M(g1)) or (M(g) == M(g2)) g is a duplication else g is a speciation
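The initialization and recursion steps above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' Java implementation: the `Node` class, the `build` helper and the nested-tuple tree notation are assumptions made for the example, and gene-tree leaves are assumed to be labelled directly with their species name.

```python
class Node:
    """Tree node; a leaf carries a species name and has no children."""
    def __init__(self, name="", children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self
        self.number = 0   # preorder number (species-tree nodes)
        self.m = None     # M(g): a species-tree node (gene-tree nodes)
        self.event = ""   # "duplication" or "speciation"


def build(spec):
    """Hypothetical helper: build a tree from nested tuples; a string is a leaf."""
    if isinstance(spec, str):
        return Node(name=spec)
    return Node(children=[build(part) for part in spec])


def sdi(gene_root, species_root):
    # Initialization 1: number the nodes of S in preorder.
    leaf_by_name = {}
    stack, counter = [species_root], 0
    while stack:
        s = stack.pop()
        counter += 1
        s.number = counter
        if not s.children:
            leaf_by_name[s.name] = s
        stack.extend(reversed(s.children))  # leftmost child is numbered next

    def visit(g):  # postorder: children are handled before g itself
        if not g.children:
            # Initialization 2: map an external node of G by species name.
            g.m = leaf_by_name[g.name]
            return
        for child in g.children:
            visit(child)
        a, b = g.children[0].m, g.children[1].m
        while a is not b:  # climb from the node with the larger number
            if a.number > b.number:
                a = a.parent
            else:
                b = b.parent
        g.m = a
        duplicated = any(g.m is child.m for child in g.children)
        g.event = "duplication" if duplicated else "speciation"

    visit(gene_root)
    return gene_root


# Example trees: G = (((A, C), B), D), S = (((A, B), C), D).
gene = sdi(build(((("A", "C"), "B"), "D")), build(((("A", "B"), "C"), "D")))
g1 = gene
g2 = g1.children[0]
g3 = g2.children[0]
for label, g in (("g1", g1), ("g2", g2), ("g3", g3)):
    print(label, "M =", g.m.number, g.event)
```

The loop relies on the preorder numbering: a node always has a larger number than its ancestors, so the node with the larger number can never be an ancestor of the other and is the one that must climb.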

•Input: gene tree G1 = (((A, C), B), D) with internal nodes g1 (root), g2 and g3; species tree S1 = (((A, B), C), D) with internal nodes s1 (root), s2 and s3.

•Output: G1 with “duplication” or “speciation” assigned to each of its internal nodes.

Initialization:

1. Number the nodes of S1 in preorder traversal: s1 = 1, s2 = 2, s3 = 3, A = 4, B = 5, C = 6, D = 7.

2. For each external node g of G1, set M(g) to the number of the external node in S1 with the matching species name: M(A) = 4, M(C) = 6, M(B) = 5, M(D) = 7.

Find LCA: the first internal node in postorder is g = g3, with children A and C.

a = M(A) = 4, b = M(C) = 6

(4 != 6) and (4 < 6), so b = parent(6) = 2

(4 != 2) and (4 > 2), so a = parent(4) = 3

(3 != 2) and (3 > 2), so a = parent(3) = 2

(2 == 2), so set M(g3) = a = 2

M(g3) != M(A) and M(g3) != M(C), so g3 is a speciation. Set the speciation tag.

Find LCA: the next node is g = g2, with children g3 and B.

a = M(g3) = 2, b = M(B) = 5

(2 != 5) and (2 < 5), so b = parent(5) = 3

(2 != 3) and (2 < 3), so b = parent(3) = 2

(2 == 2), so set M(g2) = a = 2

M(g2) == M(g3) and M(g2) != M(B), so g2 is a duplication. Set the duplication tag.

Find LCA: the next node is g = g1, the root of G1. By the same algorithm (a = M(g2) = 2, b = M(D) = 7, ending with M(g1) = 1), g1 is a speciation.

Result: g3 <speciation>, g2 <duplication>, g1 <speciation>.

Algorithm complexity analysis and implementation

A simple algorithm to infer gene duplication and speciation on a gene tree

Report: Wang Shiou-Ling

Complexity Analysis

Input size: n, the number of genes; the species tree has O(n) nodes.

Space complexity: O(n)

Assumption: all trees are binary trees.

Storing 2 trees (gene tree and species tree): each has internal nodes + leaves = (n-1) + n = 2n-1 nodes, so the space complexity is O(4n) = O(n).

Storing auxiliary variables (a, b): constant.

Complexity Analysis

Time complexity: given M, the traversal takes only O(n), but what about calculating M?

Brute force, overall time complexity O(n^3):

for node g = 1:n
for node s = 1:n
if γ(g) is in σ(s) then
assign M(g) = s (keeping the lowest such s)

The two nested loops run O(n^2) times, and each containment test costs up to O(n).
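The brute-force mapping can be sketched as below, as a minimal illustration of comparing γ(g) with every σ(s). The nested-tuple tree notation and the helper names `leaf_set`, `nodes` and `brute_force_mapping` are assumptions made for this example; no attempt is made to cache the species sets, so the sketch is deliberately as slow as the slide suggests.

```python
def leaf_set(node):
    """Species names at or below a node: γ(g) for G, σ(s) for S."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def nodes(tree):
    """Yield every node of a nested-tuple tree, root first."""
    yield tree
    if not isinstance(tree, str):
        for child in tree:
            yield from nodes(child)


def brute_force_mapping(gene_tree, species_tree):
    """Map each gene-tree node to the lowest species-tree node s with γ(g) ⊆ σ(s)."""
    mapping = {}
    for g in nodes(gene_tree):
        gamma = leaf_set(g)
        best, best_size = None, None
        for s in nodes(species_tree):
            sigma = leaf_set(s)
            # Keep the candidate with the smallest species set.
            if gamma <= sigma and (best_size is None or len(sigma) < best_size):
                best, best_size = s, len(sigma)
        mapping[g] = best
    return mapping


G = ((("A", "C"), "B"), "D")
S = ((("A", "B"), "C"), "D")
M = brute_force_mapping(G, S)
print(M[("A", "C")])  # the species-tree node that M assigns to the (A, C) gene node
```

All species-tree nodes whose set contains γ(g) lie on one path to the root, so the one with the smallest set is the lowest node, which is what Definition 1 asks for.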

Complexity Analysis

LCA algorithm: the time complexity is reduced to O(n^2).

Initialization: O(n)

Initializing M for the leaves: O(n), using a hash table to look up species names.

Initializing S: O(n).

Recursion: O(n^2)

Best case: O(n)

Worst case: O(n log n) for a balanced S, O(n^2) for an unbalanced S
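To make the difference between balanced and unbalanced species trees concrete, the sketch below counts how many `parent` steps one LCA search takes. The helpers `preorder_parents`, `lca_steps`, `caterpillar` and `balanced` and the leaf names are assumptions made for the example; only the climbing loop itself comes from the recursion step of the algorithm.

```python
def preorder_parents(tree):
    """Number a nested-tuple tree in preorder (from 1); return (parent map, leaf numbers)."""
    parent, leaves = {}, []
    counter = 0

    def walk(node, up):
        nonlocal counter
        counter += 1
        me = counter
        parent[me] = up
        if isinstance(node, str):
            leaves.append(me)
        else:
            for child in node:
                walk(child, me)

    walk(tree, None)
    return parent, leaves


def lca_steps(parent, a, b):
    """Run the climbing loop; return (LCA number, number of parent steps)."""
    steps = 0
    while a != b:
        if a > b:
            a = parent[a]
        else:
            b = parent[b]
        steps += 1
    return a, steps


def caterpillar(n):
    """Maximally unbalanced tree with n leaves: ((((s0, s1), s2), s3), ...)."""
    tree = "s0"
    for i in range(1, n):
        tree = (tree, f"s{i}")
    return tree


def balanced(names):
    """Balanced tree over a list of leaf names."""
    if len(names) == 1:
        return names[0]
    mid = len(names) // 2
    return (balanced(names[:mid]), balanced(names[mid:]))


n = 64
for label, tree in (("unbalanced", caterpillar(n)),
                    ("balanced", balanced([f"s{i}" for i in range(n)]))):
    parent, leaves = preorder_parents(tree)
    lca, steps = lca_steps(parent, leaves[0], leaves[-1])
    print(label, "LCA =", lca, "steps =", steps)
```

With 64 leaves, the search between the first and last leaf takes 64 steps on the unbalanced tree and 12 on the balanced one, matching the O(n) versus O(log n) cost per internal node.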

Case Study

Case A: O(1) for each M(g) assignment

[Figure: species tree S with internal nodes 1, 2, 3 and leaves 4, 5, 6, 7; gene tree G with internal nodes g1, g2, g3 over the same leaves]

M(g3) assignment:

Start: M(A) = 4, M(B) = 5, so a = 4, b = 5
While:
1st: a = 4, b = 5; b > a, so b = 3
2nd: a = 4, b = 3; a > b, so a = 3
a = 3, b = 3: break! M(g3) = 3
M(g3) ≠ M(A) and M(g3) ≠ M(B): speciation

M(g2) assignment:

Start: M(g3) = 3, M(C) = 6, so a = 3, b = 6
While:
1st: a = 3, b = 6; b > a, so b = 2
2nd: a = 3, b = 2; a > b, so a = 2
a = 2, b = 2: break! M(g2) = 2
M(g2) ≠ M(g3) and M(g2) ≠ M(C): speciation

The same holds for g1.

Case A: another G for which M is found in O(1)

Definition: topology of G and S

[Figure: the same species tree S and a second gene tree G over the leaves 4, 5, 6, 7]

g = g3

Start: M(A) = 4, M(B) = 5, so a = 4, b = 5
While:
1st: a = 4, b = 5; b > a, so b = 3
2nd: a = 4, b = 3; a > b, so a = 3
a = 3, b = 3: break! M(g3) = 3
M(g3) ≠ M(A) and M(g3) ≠ M(B): speciation

The same holds for g2 and g1.

Case Study

Case B: O(log n) for each M(g) assignment

Balanced S

[Figure: balanced species tree S with internal nodes 1, 2, 5 and leaves 3, 4, 6, 7; gene tree G with internal nodes g1, g2, g3]

M(g3) assignment:

Start: M(A) = 3, M(C) = 6, so a = 3, b = 6
While:
1st: a = 3, b = 6; b > a, so b = 5
2nd: a = 3, b = 5; b > a, so b = 1
3rd: a = 3, b = 1; a > b, so a = 2
4th: a = 2, b = 1; a > b, so a = 1
a = 1, b = 1: break! M(g3) = 1
M(g3) ≠ M(A) and M(g3) ≠ M(C): speciation

M(g2) assignment:

Start: M(g3) = 1, M(B) = 4, so a = 1, b = 4
While:
1st: a = 1, b = 4; b > a, so b = 2
2nd: a = 1, b = 2; b > a, so b = 1
a = 1, b = 1: break! M(g2) = 1
M(g2) = M(g3): duplication

The same holds for g1.

Case Study

Case C: O(n) for each M(g) assignment

•Unbalanced S

Observation: every gene in G has to climb up to the root, so the time complexity of each assignment is O(n).

[Figure: unbalanced species tree S with internal nodes 1, 2, 3 and leaves 4, 5, 6, 7; gene tree G with internal nodes g1, g2, g3 over the same leaves]

M(g3) assignment:

Start: M(D) = 7, M(C) = 6, so a = 7, b = 6
While:
1st: a = 7, b = 6; a > b, so a = 1
2nd: a = 1, b = 6; b > a, so b = 2
3rd: a = 1, b = 2; b > a, so b = 1
a = 1, b = 1: break! M(g3) = 1
M(g3) ≠ M(D) and M(g3) ≠ M(C): speciation

M(g2) assignment:

Start: M(g3) = 1, M(B) = 5, so a = 1, b = 5
While:
1st: a = 1, b = 5; b > a, so b = 3
2nd: a = 1, b = 3; b > a, so b = 2
3rd: a = 1, b = 2; b > a, so b = 1
a = 1, b = 1: break! M(g2) = 1
M(g2) = M(g3): duplication

M(g1) assignment:

Start: M(g2) = 1, M(A) = 4, so a = 1, b = 4
While:
1st: a = 1, b = 4; b > a, so b = 3
2nd: a = 1, b = 3; b > a, so b = 2
3rd: a = 1, b = 2; b > a, so b = 1
a = 1, b = 1: break! M(g1) = 1
M(g1) = M(g2): duplication

Improvement

Tracing parent

[Figure: gene tree G (g1, g2, g3; leaves 7, 6, 5, 4) and unbalanced species tree S (internal nodes 1, 2, 3; leaves 4, 5, 6, 7)]

Little trick: there is no crossing mapping.

If one of the children maps to the root…

Mapping during initialization

LCA table (the entry in row i, column j is the LCA of nodes i and j of S):

     2  3  4  5  6  7
1    1  1  1  1  1  1
2       2  2  2  2  1
3          3  3  2  1
4             3  2  1
5                2  1
6                   1
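An LCA table like the one on this slide can be filled in once during initialization and then used for constant-time lookups. The sketch below does this for the unbalanced example tree; the `PARENT` map is read off the numbering used in the worked example, and the function names are assumptions. It is a plain table built with the climbing loop, not the Schieber and Vishkin method.

```python
# Parent of each preorder-numbered node of S = (((A, B), C), D):
# internal nodes 1, 2, 3 and leaves A = 4, B = 5, C = 6, D = 7.
PARENT = {1: None, 2: 1, 3: 2, 4: 3, 5: 3, 6: 2, 7: 1}


def lca(parent, a, b):
    """Climb from the larger-numbered node until both meet."""
    while a != b:
        if a > b:
            a = parent[a]
        else:
            b = parent[b]
    return a


def lca_table(parent):
    """Precompute the LCA of every pair (a, b) with a < b."""
    numbers = sorted(parent)
    return {(a, b): lca(parent, a, b)
            for i, a in enumerate(numbers)
            for b in numbers[i + 1:]}


TABLE = lca_table(PARENT)

# Print the upper-triangular table, one row per node.
for a in range(1, 7):
    row = [str(TABLE[(a, b)]) for b in range(a + 1, 8)]
    print(a, ":", " ".join(row))
```

The table has one entry per pair of nodes, so it trades O(n^2) space and preprocessing time for O(1) lookups during the recursion.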

Improvement

Preprocessing

Find LCA in O(1):

Schieber and Vishkin / JáJá: by direct arithmetic.

Preprocessing in O(n).

Calculating M in O(nα(n,n)):

α(n,n): inverse of the Ackermann function

Eulenstein’s algorithm: uses a data structure similar to a disjoint-set forest.

Implementation

A Tree Viewer (ATV)

[Screenshot of ATV: duplications and speciations are marked on the tree; the numbers shown are bootstrap values and EC numbers]

Implementation

Material:

gene tree: fibrinogen beta and gamma chain

Pfam AC: PF00147

species tree: the Tree of Life project

Run

Implementation

Both algorithms were implemented in Java:

SDI (Speciation vs Duplication Inference)

Eulenstein’s algorithm

Preprocessing: deleting external nodes in S that have no genes in G

Timings reported: average of three runs on a single processor 500MHz P-III system running Red Hat Linux 6.0 and Sun Microsystems’ Java 1.2 SDK for Linux

Results – Synthetic Data

Synthetic data sets exercise the worst-case behavior.

Synthesized gene trees with n genes

M(g) for every internal node would map to the root of the corresponding species tree with n species.

The situation in Fig. 3B and 3C.

Page 77: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Balanced S: O(n log n)

Unbalanced S: O(n²)

worst-case behavior
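The gap between the two bounds can be illustrated by counting parent-tracing steps for a single LCA query that ends at the root. The sketch below is an assumption-laden illustration, not the benchmark from the paper: the tree builders and the function lca_steps are hypothetical names, nodes are numbered in preorder, and the balanced builder assumes the number of leaves is a power of two. When every internal gene-tree node maps to the root, roughly n such queries are needed, so a per-query cost that grows like log n gives O(n log n) and one that grows like n gives O(n²).

```python
def balanced_tree(n_leaves):
    """Parent map of a balanced binary tree, nodes numbered in preorder."""
    parent, counter = {}, [0]

    def build(size, par):
        counter[0] += 1
        node = counter[0]
        parent[node] = par
        if size > 1:
            build(size // 2, node)
            build(size - size // 2, node)

    build(n_leaves, None)
    return parent


def caterpillar_tree(n_leaves):
    """Parent map of a maximally unbalanced tree, numbered in preorder."""
    parent = {1: None}
    for i in range(2, n_leaves):          # chain of internal nodes 1 .. n-1
        parent[i] = i - 1
    parent[n_leaves] = n_leaves - 1       # the two deepest leaves
    parent[n_leaves + 1] = n_leaves - 1
    for k in range(1, n_leaves - 1):      # one extra leaf per internal node
        parent[n_leaves + 1 + k] = n_leaves - 1 - k
    return parent


def lca_steps(parent, a, b):
    """Return (LCA, number of parent-tracing steps)."""
    steps = 0
    while a != b:
        if a > b:
            a = parent[a]
        else:
            b = parent[b]
        steps += 1
    return a, steps


if __name__ == "__main__":
    n = 8
    print(lca_steps(balanced_tree(n), 4, 2 * n - 1))     # deepest-left vs last leaf
    print(lca_steps(caterpillar_tree(n), n, 2 * n - 1))  # deepest vs shallowest leaf
```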

Page 78: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

[Plot: synthetic data with balanced S, using the SDI algorithm]

[Plot: synthetic data with unbalanced S, using the SDI algorithm]

[Plot: synthetic data with balanced/unbalanced S, using Eulenstein’s algorithm]

Page 79: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Results – Synthetic Data

For a balanced species tree (Fig. 3B), both algorithms have running times that scale nearly linearly in tree size, consistent with O(n log n).

For a maximally unbalanced species tree (Fig. 3C), we confirm the worst-case O(n²) behavior of our algorithm, SDI.

Above about n = 550 genes and species, our implementation of Eulenstein’s algorithm outperforms SDI.

If only the calculation of M(g) is compared (excluding all preprocessing and initialization steps), Eulenstein’s algorithm outperforms SDI for n larger than about 200 taxa.

Page 80: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Results – Real Data

2478 multiple sequence alignments were taken from the ‘full’ alignments (as opposed to the smaller ‘seed’ alignments) in the protein family database Pfam (release 5.5; Bateman et al., 2000).

Sequences were removed from the alignments if they were:

not originating from the curated SWISS-PROT database (Bairoch and Apweiler, 2000)

not from species in our species tree (see below)

Alignments with fewer than four or more than 1000 sequences were discarded,

leaving 1750 alignments
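The filtering steps above can be sketched as follows. The record layout (a dict with "id", "species" and "swissprot" fields) and the function name prune_and_filter are hypothetical, chosen only for illustration; the size limits of four and 1000 sequences are the ones stated on the slide.

```python
def prune_and_filter(alignments, species_in_tree, min_seqs=4, max_seqs=1000):
    """Prune sequences, then discard alignments outside the size limits.

    alignments maps a family name to a list of sequence records; each
    record is a dict with 'id', 'species' and 'swissprot' (bool) keys.
    """
    kept = {}
    for family, records in alignments.items():
        pruned = [
            rec for rec in records
            if rec["swissprot"] and rec["species"] in species_in_tree
        ]
        if min_seqs <= len(pruned) <= max_seqs:
            kept[family] = pruned
    return kept


if __name__ == "__main__":
    def rec(i, species, sp=True):
        return {"id": f"s{i}", "species": species, "swissprot": sp}

    toy = {
        "famA": [rec(1, "HUMAN"), rec(2, "MOUSE"), rec(3, "RAT"), rec(4, "YEAST")],
        "famB": [rec(5, "HUMAN"), rec(6, "MOUSE"), rec(7, "RAT", sp=False), rec(8, "YEAST")],
    }
    result = prune_and_filter(toy, {"HUMAN", "MOUSE", "RAT", "YEAST"})
    print(sorted(result))
```

Note that the size check is applied after pruning, so an alignment can be discarded because too few of its sequences survive.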

Page 81: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Results – Real Data

Columns containing one or more gap symbols were removed from the alignment if the resulting alignment after this filtering was at least 100 amino acids in length.

Construct the Gene Tree

Pairwise distances were calculated based on the Dayhoff PAM matrix, using the program PROTDIST from Felsenstein’s PHYLIP (1993).

A neighbor-joining tree was constructed using the program NEIGHBOR from PHYLIP.

Rooting: midpoint rooting method (Swofford et al., 1996)
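The gap-column rule from this slide can be sketched as below. This is a minimal illustration under stated assumptions: the alignment is a list of equal-length strings, the gap symbol is "-" (real alignment files may use other symbols), and the function name strip_gap_columns is hypothetical. The 100-residue threshold is the one given above.

```python
def strip_gap_columns(alignment, min_length=100, gap="-"):
    """Remove every column containing a gap, but only if at least
    min_length columns remain; otherwise return the alignment unchanged."""
    columns = [col for col in zip(*alignment) if gap not in col]
    if len(columns) < min_length:
        return list(alignment)
    # Transpose the surviving columns back into rows.
    return ["".join(row) for row in zip(*columns)]


if __name__ == "__main__":
    toy = ["AC-D", "A-CD"]
    print(strip_gap_columns(toy, min_length=2))  # small threshold for the toy case
    print(strip_gap_columns(toy, min_length=3))
```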

Page 82: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Results – Real Data

Construct the Species Tree

A single master species tree was compiled manually, containing 200 of the most commonly encountered species in Pfam.

The topology of this species tree is based on the taxonomy database at NCBI (http://www.ncbi.nlm.nih.gov/Taxonomy/tax.html/) and the Tree of Life project (Maddison and Maddison, at http://phylogeny.arizona.edu/tree/phylogeny.html).

This tree is available at http://www.genetics.wustl.edu/eddy/forester/

Page 83: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

[Plot: real data, using Eulenstein’s algorithm]

[Plot: real data, using the SDI algorithm]

Page 84: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Results – Real Data

The average-case behavior of the SDI algorithm on real data sets is approximately O(n).

Worst case is not realized.

Page 85: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Analysis example

The fibrinogen beta and gamma chain Pfam family is presented in figure 5.

The fibrinogen sequence family contains fibrinogen alpha, beta and gamma chains (sequences with FIBA, FIBB, FIBG prefixes).

Page 86: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲
Page 87: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Analysis example

Each chain type appears on the tree as a paralogous subtree.

A special case is FIBH_HUMAN (fibrinogen gamma-B chain):

It appears to be the result of alternative splicing of the human gamma chain gene.

The fibrinogen family also contains various proteins probably involved in adhesion, such as tenascins (sequences with TENA prefixes), which share the fibrinogen-like domain with the fibrinogen sequences.

Page 88: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Analysis example

Interesting case: FIBX_MOUSE

A mouse enzyme with prothrombinase activity that is similar to fibrinogen beta and gamma chains (Parr et al. 1995)

The node connecting FIBX_MOUSE to the rest of the tree is inferred to be a duplication event.

This is because the placement of FIBX_MOUSE contradicts the species tree; hence FIBX_MOUSE is inferred to be paralogous to the fibrinogen beta chain subfamily (FIBB).

In contrast, BLAST analysis of the FIBX_MOUSE sequence could easily have misannotated it as the mouse fibrinogen beta chain.
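How a node comes to be called a duplication can be sketched with the mapping M: each gene-tree leaf maps to its species, each internal node g maps to the LCA of its children's mappings, and g is called a duplication when M(g) equals the mapping of one of its children. The code below is a toy illustration, not the fibrinogen tree from the paper: the gene names, the three-species tree, the nested-tuple representation and the function names are all hypothetical.

```python
# Hypothetical species tree, preorder numbering: 1 = root, 2 = (HUMAN, MOUSE).
PARENT = {1: None, 2: 1, 3: 2, 4: 2, 5: 1}
SPECIES = {"HUMAN": 3, "MOUSE": 4, "CHICK": 5}


def lca(a, b):
    """LCA by tracing parents (valid because of preorder numbering)."""
    while a != b:
        if a > b:
            a = PARENT[a]
        else:
            b = PARENT[b]
    return a


def infer_events(node, events):
    """Postorder traversal of the gene tree; returns M(node).

    Leaves are strings 'NAME_SPECIES'; internal nodes are 2-tuples.
    Appends (node, M(node), event) for every internal node.
    """
    if isinstance(node, str):
        return SPECIES[node.rsplit("_", 1)[1]]
    left, right = node
    m_left = infer_events(left, events)
    m_right = infer_events(right, events)
    m = lca(m_left, m_right)
    kind = "duplication" if m in (m_left, m_right) else "speciation"
    events.append((node, m, kind))
    return m


if __name__ == "__main__":
    # A second mouse gene sits outside a (human, mouse) pair.
    gene_tree = (("x1_HUMAN", "x1_MOUSE"), "x2_MOUSE")
    events = []
    infer_events(gene_tree, events)
    for node, m, kind in events:
        print(m, kind, node)
```

In the toy tree, the inner (human, mouse) node is a speciation, while the root maps to the same species-tree node as its child and is therefore a duplication, so x2_MOUSE is paralogous to the x1 genes.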

Page 89: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Motivation: Orthologous sequences are more reliable predictors of a new protein’s function than paralogous sequences.

Goal: Automate phylogenomics using explicit phylogenetic inference.

A simple algorithm to infer gene duplication and speciation events on a gene tree

Page 90: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲
Page 91: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Discussion

Performance

Difficulties for practical use: rooting, biological correctness

Reliability

Page 92: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Performance

The comparison of asymptotic worst-case running times may be misleading.

Our algorithm is O(n²).

Empirically, it outperforms Eulenstein’s (1998) algorithm, which is more complex and has an asymptotic bound close to O(n).

Page 93: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Performance

Worst case (pathological): M(g) for every internal node points to the root

No two genes are from the same species, the number of species in S is O(n), and S is maximally unbalanced

In real data: O(n)

Page 94: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

The improved asymptotic bound will not be worth the cost of the extra complexity nor the extra computational overhead

Page 95: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Practical use

Use SDI as part of a system for automating phylogenomics (forester).

Assumption: the gene tree and species tree are both properly rooted and biologically correct.

Page 96: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Rooted

Page 97: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Rooted properly

Molecular clock

•Number of substitutions is proportional to the time back to the common ancestor (constant rate)

•Dubious in sequence families

•In paralogous sequence families, rooting depends on duplication inference

•Minimize the dissimilarity between gene tree and species tree

Outgroup

Page 98: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Molecular Clock

Page 99: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Reliability

Problematic: using duplications to predict function

Multifurcations: lack of resolution

Limitation of algorithm

Concept of orthology and paralogy

Page 100: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Reliability sampling

Bootstrap

MCMC (Markov Chain Monte Carlo)

Integrate orthology assignments over tree space

Probability, confidence value

Rank the inferred orthology

Also helps to root

Page 101: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲

Bootstrap -- resampling
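Bootstrap resampling of an alignment draws columns with replacement until the replicate has as many columns as the original; a tree is then rebuilt from each replicate. A minimal sketch of the resampling step follows. The function name bootstrap_alignment, the toy alignment and the fixed seed are assumptions for illustration, and tree building and orthology inference on each replicate are left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns sampled with replacement.

    alignment is a list of equal-length strings; the same column picks
    are applied to every sequence so that columns stay intact.
    """
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


if __name__ == "__main__":
    toy = ["ACDE", "ACDF", "GCDE"]
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for _ in range(3):
        print(bootstrap_alignment(toy, rng))
```

Repeating an orthology inference over many replicates and counting how often a given assignment recurs yields the kind of confidence value mentioned on the previous slide.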

Page 102: A simple algorithm to infer gene duplication and speciation events on a gene tree 生物資訊相關演算法 期末報告 學生: 陳智豪 王秀綾 王緯誠 江志民 侯藹玲