segment alignment sea b89902010 鄭智懷 b89902037 黃敬強 b89902117 胡書瑜 segment alignment...

Segment alignment SEA

B89902010 鄭智懷, B89902037 黃敬強, B89902117 胡書瑜

Segment Alignment to Compare Protein SEA

B89010 鄭智懷

B89037 黃敬強

B89117 胡書瑜

Introduction

Outline of the paper

As evolutionary distance increases, homologous proteins become hard to compare at the sequence level.

We therefore focus on the folds of the protein and its local structure: LSSs, PLSSs, etc.

A new look at the local structure prediction

Network matching problem

Introduction

LSS (Local Structure Segment)

Maximal structural units that are shared by proteins with different folds.

Predicted by the "Nearest-Neighbor Method".

PLSS (Predicted Local Structure Segment)

An LSS predicted with the method above.

What is the "Nearest-Neighbor Method"?

The Basic Idea

Start with an arbitrarily chosen vertex and try to add edges vertex by vertex.

Let x denote the latest vertex that was added to the path. Among the vertices not yet in the path, pick the one that is closest to x, and add to the path the edge connecting x and this vertex. Repeat until all vertices are included.

Connect the starting vertex and the last vertex added.

[Figure: example weighted graph illustrating the nearest-neighbor method]
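The basic idea above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: vertices are points in the plane, distances are Euclidean, ties are broken by the lower index, and the names `nearest_neighbor_tour` and `tour_length` are hypothetical.

```python
import math

def nearest_neighbor_tour(points, start=0):
    """Greedy tour: repeatedly move to the closest vertex not yet in the path."""
    unvisited = set(range(len(points))) - {start}
    tour = [start]
    while unvisited:
        x = tour[-1]  # latest vertex added to the path
        # Closest unvisited vertex to x; ties broken by the lower index.
        nxt = min(unvisited, key=lambda v: (math.dist(points[x], points[v]), v))
        tour.append(nxt)
        unvisited.remove(nxt)
    tour.append(start)  # connect the last vertex back to the starting vertex
    return tour

def tour_length(points, tour):
    """Total length of the closed tour."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(tour, tour[1:]))

# Hypothetical example: four points on a line.
points = [(0, 0), (3, 0), (1, 0), (6, 0)]
tour = nearest_neighbor_tour(points)
length = tour_length(points, tour)
```

The greedy choice is cheap, but it gives no guarantee that the resulting tour is optimal, which is why a bound on its quality is of interest.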

Introduction

"Nearest-Neighbor Method"

$$\frac{d}{d_0} \le \frac{1}{2}\lceil \lg n \rceil + \frac{1}{2}$$

where $d$ is the length of the nearest-neighbor tour D and $d_0$ is the length of an optimal tour.

Denote $L_i$ as the i-th longest edge in D.

1. $d_0 \ge 2L_1$

2. $d_0 \ge 2\sum_{i=k+1}^{2k} L_i$ for $1 \le k \le \lfloor n/2 \rfloor$

3. $d_0 \ge 2\sum_{i=\lceil n/2 \rceil + 1}^{n} L_i$

Algorithm

Given two networks of PLSSs, find two optimal paths from the source to the sink, one in each network, whose corresponding PLSSs are most similar to each other.

It does not follow the typical position-by-position alignment mode

Algorithm

The point here is that a sequence can be re-expressed in terms of PLSSs to represent the protein structure.

[Figure labels: bit-level sequence; possible PLSSs; another presentation (by arrows); no overlapping PLSSs]

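Algorithm

Definition: for each PLSS $\alpha$,

$first(\alpha)$: the position of the first vertex in this segment

$last(\alpha)$: the position of the last vertex in this segment

[Figure: positions 2, 3, 4, 5, 6, 7, 8, 9, ... with the candidate PLSSs drawn above them]

Taking EEEEE as an example: first(EEEEE) = 4 and last(EEEEE) = 8.

A segment $\alpha$ covers position i if

$$first(\alpha) \le i \le last(\alpha)$$

So EEEEE covers positions 4, 5, 6, 7, 8. Segment $\alpha$ taken at position i is abbreviated as $i_\alpha$.

The set of PLSSs covering position i is denoted E(i).

For position 3: E(3) = {"EEE", "HHHHHHHHHHH"}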
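As a small illustration of these definitions, the sketch below computes E(i) from a list of segments. The `Segment` tuple and the function names are hypothetical, and the start/end positions given to "EEE" and "HHHHHHHHHHH" are assumed for the example; only the span 4 to 8 of "EEEEE" comes from the slide.

```python
from collections import namedtuple

# A PLSS with its label and the positions of its first and last vertex.
Segment = namedtuple("Segment", "label first last")

def covers(segment, i):
    """True if first(segment) <= i <= last(segment)."""
    return segment.first <= i <= segment.last

def covering_set(segments, i):
    """E(i): labels of all segments that cover position i."""
    return [s.label for s in segments if covers(s, i)]

segments = [
    Segment("EEE", 2, 4),            # assumed positions
    Segment("EEEEE", 4, 8),          # covers 4, 5, 6, 7, 8 as on the slide
    Segment("HHHHHHHHHHH", 3, 13),   # assumed positions
]
e3 = covering_set(segments, 3)
```

With these assumed positions, `e3` contains "EEE" and "HHHHHHHHHHH", matching the E(3) shown above.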

Algorithm

What is the difference between the original alignment and segment alignment?

Answer: there is more than one possibility at position (i, j). For any pair of positions i and j, their covering segments are considered combinatorially (|E(i)| x |E(j)| combinations in total). Here i and j are positions in different sequences.

What property do the two alignments share?

Answer: both use the dynamic programming technique.

Algorithm

The idea: use the dynamic programming concept.

Our goal is to find the two PLSS sequences, one from A and one from B, that are most similar to each other; in other words, to align one PLSS sequence to the other with the least effort. With this in mind, we use the dynamic programming method to calculate the maximum score V(i, j).

We define V(i, j) as the maximum similarity score for transforming S1[1…i] to S2[1…j], calculated by

$$V(i,j) = \max_{\text{all combinations } (\alpha,\beta),\ \alpha \in E(i),\ \beta \in E(j)} V(i_\alpha, j_\beta)$$

Algorithm

The NEW scoring scheme

$$\Delta(i_\alpha, j_\beta) = W_a\,\Delta(Aa_i, Aa_j) + W_s\,\Delta(\alpha, \beta)$$

$W_a$ and $W_s$ are weights, with $W_a + W_s = 1$.

$\Delta(Aa_i, Aa_j)$ is taken from the Blosum62 similarity matrix, i.e. a scoring table. As taught in class, given two letters of the alphabet it returns the score of that pair. This term is the sequence similarity defined by the Blosum62 similarity matrix.

$\Delta(\alpha, \beta)$ is taken from another similarity matrix (scoring table), derived from the HOMSTRAD database. Given two PLSSs, it returns the score of that pair. This term is the local structure similarity.
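The weighted sum above can be written as a small function. This is a sketch only: the two lookup tables are passed in as dictionaries, the structure scores are made-up placeholder values rather than values from HOMSTRAD, the three residue scores are a tiny excerpt in the style of a substitution matrix, and the name `combined_score` is hypothetical.

```python
def combined_score(aa_i, aa_j, alpha, beta, aa_matrix, ss_matrix, wa=0.5, ws=0.5):
    """Delta(i_alpha, j_beta) = Wa * Delta(Aa_i, Aa_j) + Ws * Delta(alpha, beta)."""
    if abs(wa + ws - 1.0) > 1e-9:
        raise ValueError("the weights must satisfy Wa + Ws = 1")
    return wa * aa_matrix[(aa_i, aa_j)] + ws * ss_matrix[(alpha, beta)]

# Placeholder scoring tables (assumptions for illustration only).
aa_matrix = {("A", "A"): 4, ("A", "G"): 0, ("G", "G"): 6}
ss_matrix = {("H", "H"): 2, ("E", "E"): 2, ("H", "E"): -2}

# Same residue, same segment type, with more weight on the sequence term.
score = combined_score("A", "A", "H", "H", aa_matrix, ss_matrix, wa=0.7, ws=0.3)
```

Setting `ws=0` reduces the score to plain sequence similarity, while `wa=0` scores on local structure alone.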

Algorithm

HOMSTRAD database (Homologous Structure Alignment Database)

This database provides all known protein structures clustered into homologous families.

Method used: common ancestry

Algorithm

$V(i,j)$ is the maximum similarity score for transforming $S_1[1,\dots,i]$ to $S_2[1,\dots,j]$.

$V(i_\alpha,j_\beta)$ is the maximum similarity score for transforming $S_1[1,\dots,i_\alpha]$ to $S_2[1,\dots,j_\beta]$.

$$V(i,j) = \max_{\text{all combinations } (\alpha,\beta),\ \alpha \in E(i),\ \beta \in E(j)} V(i_\alpha, j_\beta)$$

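Algorithm

$$\begin{aligned}
V(i_\alpha, j_\beta) &= \max[\,S(i_\alpha,j_\beta),\ D(i_\alpha,j_\beta),\ I(i_\alpha,j_\beta),\ 0\,] \\
S(i_\alpha, j_\beta) &= \max_{\gamma,\delta}\,[\,V((i-1)_\gamma,(j-1)_\delta) + \Delta(i_\alpha,j_\beta)\,] \\
D(i_\alpha, j_\beta) &= \max_{\gamma}\{\max[\,V((i-1)_\gamma, j_\beta) - g,\ D((i-1)_\gamma, j_\beta) - h\,]\} \\
I(i_\alpha, j_\beta) &= \max_{\delta}\{\max[\,V(i_\alpha,(j-1)_\delta) - g,\ I(i_\alpha,(j-1)_\delta) - h\,]\}
\end{aligned}$$

g stands for the gap initiating penalty. h stands for the gap extension penalty.

S: the similarity score of aligned positions from (i-1, j-1) to (i, j) is $\Delta(i_\alpha, j_\beta)$.

D: the alignment ends with a deletion.

I: the alignment ends with an insertion.

The term V(...) - g is the total score at that position minus the penalty for opening a deletion (insertion) gap.

The term D(...) - h (or I(...) - h) applies when a deletion (insertion) gap already exists at that position, so only the extension penalty is subtracted.

The outer max picks whichever score is higher over all possible segments.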

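Algorithm

$$\begin{aligned}
S(i_\alpha, j_\beta) &= \max_{\gamma,\delta}\,[\,V((i-1)_\gamma,(j-1)_\delta) + \Delta(i_\alpha,j_\beta)\,] \\
D(i_\alpha, j_\beta) &= \max_{\gamma}\{\max[\,V((i-1)_\gamma, j_\beta) - g,\ D((i-1)_\gamma, j_\beta) - h\,]\} \\
I(i_\alpha, j_\beta) &= \max_{\delta}\{\max[\,V(i_\alpha,(j-1)_\delta) - g,\ I(i_\alpha,(j-1)_\delta) - h\,]\}
\end{aligned}$$

where

$$\gamma = \begin{cases} \alpha & \text{if } first(\alpha) \ne i \\ \{\gamma \in E(i-1)\ \&\ last(\gamma) = i-1\} & \text{else} \end{cases}
\qquad
\delta = \begin{cases} \beta & \text{if } first(\beta) \ne j \\ \{\delta \in E(j-1)\ \&\ last(\delta) = j-1\} & \text{else} \end{cases}$$

Case 1, $first(\alpha) \ne i$: α has already appeared before position i, so choosing the same α at the previous position gives the higher score.

Case 2, $first(\alpha) = i$: α appears for the first time at position i, so we must consider which preceding PLSS gives the maximum score together with α.

Why do we only need to consider last(γ) = i-1?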

Algorithm

When γ belongs to E(i-1) and last(γ) != i-1:

This PLSS also covers i. It covers i-1 but last(γ) != i-1, so last(γ) is larger than i-1, which means γ covers i. This situation is already included in case 1, because case 1 ranges over all PLSSs that cover i, so such a γ belongs to case 1.
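To make the recurrences for S, D, I and V and the two cases for γ and δ concrete, here is a minimal Python sketch. It is not the SEA implementation. Several things are assumptions made for illustration: segments are `(first, last, label)` tuples, every position is covered by at least one segment, the residue and segment similarities are toy values standing in for the Blosum62 and HOMSTRAD-derived tables, the gap penalties default to g = 2.0 and h = 0.5, and the boundary is scored as in local alignment (V = 0 before the first position). All function names are hypothetical.

```python
import math

NEG_INF = -math.inf

def cover_sets(segments, length):
    """E[i]: indices of the (first, last, label) segments covering position i (1-based)."""
    return {i: [k for k, (first, last, _) in enumerate(segments) if first <= i <= last]
            for i in range(1, length + 1)}

def prev_states(segments, cover, i, a):
    """Segments allowed at position i-1 when segment a is used at position i."""
    if i == 1:
        return [None]            # boundary: nothing precedes position 1
    if segments[a][0] != i:
        return [a]               # case 1: a already covers i-1, so it is kept
    # case 2: a starts at i, so the predecessor must end exactly at i-1
    return [c for c in cover[i - 1] if segments[c][1] == i - 1]

def sea_score(s1, segs1, s2, segs2, wa=0.5, ws=0.5, g=2.0, h=0.5):
    """Best local segment-alignment score under toy similarity values."""
    e1, e2 = cover_sets(segs1, len(s1)), cover_sets(segs2, len(s2))
    V, D, I = {}, {}, {}

    def delta(i, a, j, b):
        aa = 2.0 if s1[i - 1] == s2[j - 1] else -1.0      # toy residue similarity
        ss = 1.0 if segs1[a][2] == segs2[b][2] else -1.0  # toy segment similarity
        return wa * aa + ws * ss

    def get(table, i, a, j, b, boundary):
        return boundary if i == 0 or j == 0 else table[(i, a, j, b)]

    best = 0.0
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            for a in e1[i]:
                pa = prev_states(segs1, e1, i, a)
                for b in e2[j]:
                    pb = prev_states(segs2, e2, j, b)
                    sub = max((get(V, i - 1, c, j - 1, d, 0.0) for c in pa for d in pb),
                              default=NEG_INF) + delta(i, a, j, b)
                    dele = max((max(get(V, i - 1, c, j, b, 0.0) - g,
                                    get(D, i - 1, c, j, b, NEG_INF) - h) for c in pa),
                               default=NEG_INF)
                    ins = max((max(get(V, i, a, j - 1, d, 0.0) - g,
                                   get(I, i, a, j - 1, d, NEG_INF) - h) for d in pb),
                              default=NEG_INF)
                    key = (i, a, j, b)
                    D[key], I[key] = dele, ins
                    V[key] = max(sub, dele, ins, 0.0)
                    best = max(best, V[key])
    return best

# Two identical toy proteins: a helix segment followed by a strand segment.
segs = [(1, 2, "H"), (3, 4, "E")]
score = sea_score("ACDE", segs, "ACDE", segs)
```

The sketch returns only the best score. Recovering the two chosen paths through the PLSS networks would need back-pointers stored next to `V`, `D` and `I`, which are omitted here to keep the example short.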

Complexity

We first assume that each dynamic programming entry costs O(1) time, because one substitution, one insertion, and one deletion each take O(1).

This is only a provisional assumption, because some entries do more than O(1) work, namely when γ or δ is a set.

In Sequence 1:
The first vertex is covered by a1 PLSSs (|E(1)| = a1)
The second vertex is covered by a2 PLSSs (|E(2)| = a2)
The third vertex is covered by a3 PLSSs (|E(3)| = a3)
...
The (m-1)th vertex is covered by am-1 PLSSs (|E(m-1)| = am-1)
The mth vertex is covered by am PLSSs (|E(m)| = am)

In Sequence 2:
The first vertex is covered by b1 PLSSs (|E(1)| = b1)
The second vertex is covered by b2 PLSSs (|E(2)| = b2)
The third vertex is covered by b3 PLSSs (|E(3)| = b3)
...
The (n-1)th vertex is covered by bn-1 PLSSs (|E(n-1)| = bn-1)
The nth vertex is covered by bn PLSSs (|E(n)| = bn)

Without loss of generality, we proceed under this assumption.

The entry at row i and column j of the matrix performs (ai) x (bj) operations.

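So, the total number of operations in the matrix is

$$\begin{aligned}
&\underbrace{(a_1b_1 + a_1b_2 + \dots + a_1b_n)}_{\text{first row}} + \underbrace{(a_2b_1 + a_2b_2 + \dots + a_2b_n)}_{\text{second row}} + \dots + \underbrace{(a_mb_1 + a_mb_2 + \dots + a_mb_n)}_{\text{last row}} \\
&= (a_1 + a_2 + \dots + a_m)(b_1 + b_2 + \dots + b_n) \\
&= M\,\frac{a_1 + a_2 + \dots + a_m}{M} \cdot N\,\frac{b_1 + b_2 + \dots + b_n}{N} \\
&= MN\,C_1C_2 = O(MNC_1C_2)
\end{aligned}$$

where M = m and N = n are the lengths of the two sequences, and

$$C_1 = \frac{a_1 + a_2 + \dots + a_m}{M}, \qquad C_2 = \frac{b_1 + b_2 + \dots + b_n}{N}$$

This is the time complexity under our earlier assumption.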
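The counting argument can be checked numerically. The coverage counts below are made-up values chosen only for illustration; the sketch confirms that summing a_i x b_j over every matrix entry gives the same number as M N C1 C2.

```python
import math

# Hypothetical coverage counts: a[i] = |E(i)| in sequence 1, b[j] = |E(j)| in sequence 2.
a = [2, 3, 1, 4]
b = [1, 2, 2]

# Work summed entry by entry over the whole matrix.
total = sum(ai * bj for ai in a for bj in b)

# The same quantity written with the average coverages C1 and C2.
M, N = len(a), len(b)
C1 = sum(a) / M
C2 = sum(b) / N
closed_form = M * N * C1 * C2
```

Here `total` is 50, and `closed_form` agrees with it up to floating-point rounding.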


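The O(1) cost was only an assumption, because some entries do more than O(1) work, namely when γ or δ is a set.

We want to show that the number of cases in which γ or δ is a set is not too large.

When γ or δ is a set, the PLSS is appearing for the first time at that position, so the number of such cases equals the number of PLSSs.

Complexity

Suppose sequence 1 has P1 PLSSs and sequence 2 has P2 PLSSs.

Then for deletion, the total count over the cases where γ is a set is $N C_1 P_1$.

For insertion, the total count over the cases where δ is a set is $M C_2 P_2$.

For substitution, the total count is (#deletion) + (#insertion).

So the time complexity required by these operations is (#deletion) + (#insertion):

$$N C_1 P_1 + M C_2 P_2 \le N C_1 M + M C_2 N \le M N C_1 C_2 = O(MNC_1C_2)$$

The first inequality takes $P_1 \le M$ and $P_2 \le N$; the second holds when $C_1 + C_2 \le C_1 C_2$, for instance when both averages are at least 2.

The remaining cases follow in the same way.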

Example: (1e68A, 1nkl)

<Residue-Number>specifies the number of a residue

<Residue-Number>:<Residue-Number>selects all atoms that have residue numbers greater than or equal to the first residue number but less than or equal to the second residue number.

1e68A: Bacteriocin As-48

1nkl : Nk-lysin

Residue Number

Example: (1e68A, 1nkl)

Each protein is represented as a collection of potentially overlapping and contradictory PLSSs (a network).

SEA finds an optimal alignment between these two proteins

Simultaneously, SEA identifies the optimal subset of PLSSs (a path in the network) describing each protein.


Several Variants of the SEA Algorithm

SEA_true: uses segments derived from the actual 3D structure

SEA_cn: n is the maximum segment coverage (the number of segments that cover a position in each protein), e.g. SEA_c30, SEA_c10, etc.

SEA_1D: uses 1D prediction (a single predicted local structure)

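| Subset | Measure | CE | SEA_true | SEA_c30 | SEA_c10 | SEA_c5 | SEA_1d | BLAST | ALIGN | FFAS |
|---|---|---|---|---|---|---|---|---|---|---|
| Family (409 pairs) | average-shift | - | 0.61 | 0.56 | 0.56 | 0.54 | 0.49 | 0.44 | 0.48 | 0.49 |
| | shift>0.9 | - | 73 | 69 | 63 | 56 | 47 | 51 | 60 | 43 |
| | shift>0.7 | - | 207 | 199 | 192 | 183 | 152 | 146 | 165 | 161 |
| | shift>0.5 | - | 282 | 260 | 259 | 251 | 215 | 197 | 228 | 227 |
| | RMSD<3.0 | 257 | 95 | 82 | 82 | 76 | 63 | 77 | 54 | 40 |
| | RMSD<5.0 | 397 | 237 | 184 | 171 | 177 | 147 | 157 | 138 | 118 |
| | RMSD<8.0 | 408 | 294 | 248 | 249 | 249 | 231 | 196 | 206 | 194 |
| | all | 409 | 345 | 404 | 398 | 368 | 366 | 232 | 372 | 409 |
| Superfamily (225 pairs) | average-shift | - | 0.27 | 0.12 | 0.12 | 0.12 | 0.08 | 0.09 | 0.06 | 0.07 |
| | shift>0.9 | - | 3 | 3 | 3 | 2 | 0 | 1 | 2 | 1 |
| | shift>0.7 | - | 17 | 8 | 9 | 7 | 4 | 10 | 9 | 7 |
| | shift>0.5 | - | 54 | 26 | 23 | 21 | 17 | 18 | 18 | 17 |
| | RMSD<3.0 | 55 | 12 | 6 | 6 | 7 | 6 | 8 | 3 | 1 |
| | RMSD<5.0 | 160 | 44 | 16 | 18 | 18 | 11 | 18 | 11 | 1 |
| | RMSD<8.0 | 163 | 69 | 37 | 34 | 41 | 28 | 23 | 22 | 15 |
| | all | 166 | 128 | 217 | 204 | 181 | 177 | 41 | 149 | 225 |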

General performance of SEA incorporating different local structure diversities

CE (1998): Combinatorial Extension, which builds an alignment by combining a path of aligned fragment pairs (AFPs)

BLAST (1990), ALIGN (1988), and FFAS (2000) are not computed from PLSSs




Shift score: measures the misalignment between a predicted alignment of two proteins and their reference alignment.



Shift Score

Here ε is a small number used as a parameter of the scoring algorithm.

The shift score ranges from -ε (here ε = 0.2) to 1.0.




RMSD: root mean square deviation of Cα positions after optimal superposition (a measure of structural similarity)
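For reference, the RMSD over paired Cα coordinates can be computed as below. This sketch assumes the two coordinate lists are already optimally superposed and paired residue by residue; the superposition step itself is not shown, and the function name `rmsd` and the sample coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Two made-up pairs of C-alpha positions, each 2 units apart.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 2.0), (1.0, 2.0, 0.0)]
value = rmsd(a, b)
```

A row such as RMSD<3.0 in the table then counts the protein pairs whose value falls below that threshold.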




Among the variants that use predicted segments, SEA_c30 and SEA_c10 produced the most accurate alignments.
