Segment alignment SEA
B89902010 鄭智懷, B89902037 黃敬強, B89902117 胡書瑜
Segment Alignment to Compare Protein SEA
B89010 鄭智懷
B89037 黃敬強
B89117 胡書瑜
Introduction
Outline of the paper
As evolutionary distance increases, homologous proteins become hard to compare at the sequence level.
We focus instead on the folds of the protein and on its local structure: LSSs, PLSSs, etc.
A new look at the local structure prediction
Network matching problem
Introduction
LSS (Local Structure Segment): maximal structural units that are shared by proteins with different folds.
PLSS (Predicted Local Structure Segment): an LSS predicted by the "Nearest-Neighbor Method".
What is the "Nearest-Neighbor Method"?
The Basic Idea
Start with an arbitrarily chosen vertex and try to add edges vertex by vertex.
Let x denote the latest vertex that was added to the path. Among the vertices not yet on the path, pick the one that is closest to x, and add to the path the edge connecting x and this vertex. Repeat until all vertices are included.
Connect the starting vertex and the last vertex.
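The steps on this slide can be sketched in a few lines of Python. This is only an illustration of the greedy tour construction described above; the point coordinates, the Euclidean distance, and the function names `nearest_neighbor_tour` and `tour_length` are assumptions made for the example, not part of the slides.

```python
import math


def nearest_neighbor_tour(points, start=0):
    """Greedy tour: repeatedly move to the closest vertex not yet on the path."""
    unvisited = set(range(len(points))) - {start}
    tour = [start]
    while unvisited:
        x = tour[-1]  # latest vertex added to the path
        # Closest unvisited vertex; ties are broken by the smaller index.
        nxt = min(unvisited, key=lambda v: (math.dist(points[x], points[v]), v))
        tour.append(nxt)
        unvisited.remove(nxt)
    tour.append(start)  # connect the last vertex back to the starting vertex
    return tour


def tour_length(points, tour):
    """Total Euclidean length of a closed tour given as a vertex list."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(tour, tour[1:]))


# Hypothetical example coordinates (not taken from the slides).
points = [(0, 0), (4, 0), (1, 0), (1, 2)]
tour = nearest_neighbor_tour(points)
print(tour, round(tour_length(points, tour), 3))
```

The greedy choice is cheap, but the resulting tour is not guaranteed to be the shortest one.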
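Introduction
"Nearest-Neighbor Method"
Let $d$ be the length of the nearest-neighbor tour $D$ on $n$ vertices and $d_0$ the length of an optimal tour (distances are assumed to satisfy the triangle inequality). Then
$$\frac{d}{d_0} \le \frac{1}{2}\lceil \lg n \rceil + \frac{1}{2}$$
Denote $L_i$ as the $i$-th longest edge in $D$.
1. $d_0 \ge 2L_1$
2. $d_0 \ge 2\sum_{i=k+1}^{2k} L_i$, for $1 \le k \le \lfloor n/2 \rfloor$
3. $d_0 \ge 2\sum_{i=\lceil n/2 \rceil + 1}^{n} L_i$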
Algorithm
Given two networks of PLSSs, find two optimal paths from the source to the sink in each of the networks, whose corresponding PLSSs are most similar to each other.
It does not follow the typical position-by-position alignment mode.
Algorithm
What we want to show is that a sequence can be re-expressed as PLSSs that represent the structure of the protein.
[Figure labels: bit level sequence; possible PLSSs; another presentation (by arrows); no overlapped PLSSs]
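Algorithm
Definition: for each PLSS $\alpha$,
$\mathrm{first}(\alpha)$: the position of the first vertex in this segment
$\mathrm{last}(\alpha)$: the position of the last vertex in this segment
[Figure: PLSSs drawn above positions 2 3 4 5 6 7 8 9 …]
Take EEEEE as an example.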
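Algorithm
Definition: for each PLSS $\alpha$,
a segment $\alpha$ covers $i$ if $\mathrm{first}(\alpha) \le i$ and $\mathrm{last}(\alpha) \ge i$.
[Figure: PLSSs drawn above positions 2 3 4 5 6 7 8 9 …]
Take EEEEE as an example:
EEEEE covers 4, 5, 6, 7, 8.
"Segment $\alpha$ at position $i$" is abbreviated as $i_\alpha$.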
Algorithm
Definition: for each PLSS $\alpha$,
the set of PLSSs covering position $i$ is denoted $E(i)$.
[Figure: PLSSs drawn above positions 2 3 4 5 6 7 8 9 …]
For position 3:
E(3) = {"EEE", "HHHHHHHHHHH"}
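The three definitions above (first/last, "covers", and E(i)) can be written down directly. The sketch below is a minimal illustration: the `PLSS` class and the helper names are made up for the example, and the start positions of "EEE" and "HHHHHHHHHHH" are assumptions chosen only so that the slide's statements (EEEEE covers 4 to 8; E(3) contains EEE and HHHHHHHHHHH) hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PLSS:
    label: str  # local-structure string, e.g. "EEEEE"
    first: int  # position of the first vertex in the segment
    last: int   # position of the last vertex in the segment


def covers(seg, i):
    """A segment covers position i if first(seg) <= i <= last(seg)."""
    return seg.first <= i <= seg.last


def E(segments, i):
    """Set of all PLSSs covering position i."""
    return {s for s in segments if covers(s, i)}


# Hypothetical placements; only EEEEE = positions 4..8 is stated on the slide.
segments = [
    PLSS("EEE", 2, 4),
    PLSS("HHHHHHHHHHH", 3, 13),
    PLSS("EEEEE", 4, 8),
]

print(sorted(s.label for s in E(segments, 3)))
```

Because segments may overlap, E(i) usually has more than one element, which is what makes the alignment problem combinatorial.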
Algorithm
The difference between the original alignment and segment alignment:
ans: there is more than one possibility at position (i, j). For any pair of positions i and j, their covering segments are considered in a combinatorial way (|E(i)| × |E(j)| combinations in total). Here i and j are positions in different sequences!
The property shared by these two alignments:
ans: both use the dynamic programming technique.
Algorithm
The idea: using the dynamic programming concept
Our goal is to find the two PLSS sequences (one from A, one from B) with the greatest similarity, that is, to align one PLSS sequence to the other with the least effort. With this in mind we use the dynamic programming method to calculate the maximum score V(i, j).
We define V(i, j) as the maximum similarity score for transforming S1[1…i] to S2[1…j], calculated by
$$V(i, j) = \max_{\text{all combinations } (\alpha, \beta),\ \alpha \in E(i),\ \beta \in E(j)} V(i_\alpha, j_\beta)$$
Algorithm
The NEW scoring scheme
Δ(iα, jβ) = Wa·Δ(Aai, Aaj) + Ws·Δ(α, β)
Wa and Ws are weights, with Wa + Ws = 1.
A is the Blosum62 similarity matrix, i.e. a scoring table.
Δ(α, β) is another similarity matrix (scoring table), derived from the HOMSTRAD database.
First term: as taught in class, given two letters of the alphabet the table returns the score of that pair. Here it gives the sequence similarity defined by the Blosum62 similarity matrix.
Second term: given two PLSSs, the table returns the score of that pair. Here it gives the local structure similarity.
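As a small illustration of the weighted sum above, the sketch below combines a sequence table and a structure table. The three amino-acid entries are the real Blosum62 values for those pairs; the structure table `STRUCT_SIM`, the default weight, and the function name are hypothetical placeholders, since the HOMSTRAD-derived values are not given on the slides.

```python
# Tiny excerpt of Blosum62 (these three pair scores are the real values).
SEQ_SIM = {("A", "A"): 4, ("A", "R"): -1, ("R", "A"): -1, ("R", "R"): 5}

# Hypothetical structure-similarity table; NOT the HOMSTRAD-derived values.
STRUCT_SIM = {("H", "H"): 2.0, ("E", "E"): 2.0, ("H", "E"): -2.0, ("E", "H"): -2.0}


def segment_pair_score(aa_i, aa_j, alpha, beta, seq_sim, struct_sim, w_a=0.5):
    """Weighted sum of sequence similarity and local-structure similarity.

    w_a is the sequence weight; the structure weight is w_s = 1 - w_a,
    so the two weights always sum to 1.
    """
    w_s = 1.0 - w_a
    return w_a * seq_sim[(aa_i, aa_j)] + w_s * struct_sim[(alpha, beta)]


print(segment_pair_score("A", "A", "H", "H", SEQ_SIM, STRUCT_SIM))
print(segment_pair_score("A", "R", "H", "E", SEQ_SIM, STRUCT_SIM))
```

Setting `w_a=1.0` reduces the score to a plain Blosum62 lookup, which makes the role of the structure term easy to see.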
Algorithm
HOMSTRAD database
(Homologous Structure Alignment Database)
This database provides the known protein structures clustered into homologous families.
Clustering criterion: common ancestry
Algorithm
$V(i, j)$ is the maximum similarity score for transforming $S_1[1, \ldots, i]$ to $S_2[1, \ldots, j]$.
$V(i_\alpha, j_\beta)$ is the maximum similarity score for transforming $S_1[1, \ldots, i_\alpha]$ to $S_2[1, \ldots, j_\beta]$.
$$V(i, j) = \max_{\text{all combinations } (\alpha, \beta),\ \alpha \in E(i),\ \beta \in E(j)} V(i_\alpha, j_\beta)$$
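Algorithm
$$V(i_\alpha, j_\beta) = \max\left[\, S(i_\alpha, j_\beta),\ D(i_\alpha, j_\beta),\ I(i_\alpha, j_\beta),\ 0 \,\right]$$
$$S(i_\alpha, j_\beta) = \max_{\gamma, \delta}\left[\, V((i-1)_\gamma, (j-1)_\delta) + \Delta(i_\alpha, j_\beta) \,\right]$$
$$D(i_\alpha, j_\beta) = \max_{\gamma}\left\{ \max\left[\, V((i-1)_\gamma, j_\beta) - g,\ D((i-1)_\gamma, j_\beta) - h \,\right] \right\}$$
$$I(i_\alpha, j_\beta) = \max_{\delta}\left\{ \max\left[\, V(i_\alpha, (j-1)_\delta) - g,\ I(i_\alpha, (j-1)_\delta) - h \,\right] \right\}$$
g stands for the gap initiating penalty. h stands for the gap extension penalty.
S: the similarity score of aligned positions from (i-1, j-1) to (i, j) is $\Delta(i_\alpha, j_\beta)$.
D: end with deletion. I: end with insertion.
V − g: the total score at the previous position minus the penalty for opening a deletion (insertion) gap.
D − h (I − h): a deletion (insertion) gap already exists at this position, so only the extension penalty is subtracted.
The maximum picks whichever score is higher, over all possible segments.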
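Algorithm
$$S(i_\alpha, j_\beta) = \max_{\gamma, \delta}\left[\, V((i-1)_\gamma, (j-1)_\delta) + \Delta(i_\alpha, j_\beta) \,\right]$$
$$D(i_\alpha, j_\beta) = \max_{\gamma}\left\{ \max\left[\, V((i-1)_\gamma, j_\beta) - g,\ D((i-1)_\gamma, j_\beta) - h \,\right] \right\}$$
$$I(i_\alpha, j_\beta) = \max_{\delta}\left\{ \max\left[\, V(i_\alpha, (j-1)_\delta) - g,\ I(i_\alpha, (j-1)_\delta) - h \,\right] \right\}$$
where
$$\gamma = \begin{cases} \alpha & \text{if } \mathrm{first}(\alpha) < i \\ \gamma \in E(i-1)\ \&\ \mathrm{last}(\gamma) = i-1 & \text{else} \end{cases}$$
$$\delta = \begin{cases} \beta & \text{if } \mathrm{first}(\beta) < j \\ \delta \in E(j-1)\ \&\ \mathrm{last}(\delta) = j-1 & \text{else} \end{cases}$$
Case 1 ($\gamma = \alpha$): $\alpha$ has already appeared, so choosing the same $\alpha$ at the previous position gives the higher score.
Case 2 (else): if $i$ is the first position where $\alpha$ appears, we have to consider which preceding PLSS gives the highest score together with $\alpha$.
Why do we only need to consider segments with $\mathrm{last}(\gamma) = i-1$?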
Algorithm
When $\gamma \in E(i-1)$ and $\mathrm{last}(\gamma) \ne i-1$:
this PLSS also covers $i$. It covers $i-1$ but $\mathrm{last}(\gamma) \ne i-1$, so $\mathrm{last}(\gamma)$ is greater than $i-1$, which means $\gamma$ covers $i$. This situation is already included in case 1, because case 1 applies to every PLSS that covers $i$; such a $\gamma$ therefore belongs to case 1.
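To make the recurrences concrete, here is a minimal Python sketch of the S / D / I / V tables with the predecessor rule for γ and δ. It is a simplified illustration under stated assumptions, not the authors' implementation: segments are plain `(label, first, last)` tuples, boundary cells (position 0, or a position with no segment ending just before it) are treated as score 0 via a `None` predecessor, only the best local score is returned (no traceback), and the toy `delta` function and penalties `g`, `h` are invented for the example.

```python
import math
from itertools import product

NEG = -math.inf


def covering(segs, i):
    """E(i): segments (label, first, last) that cover position i."""
    return [s for s in segs if s[1] <= i <= s[2]]


def predecessors(segs, i, seg):
    """gamma = seg if seg already covers i-1, else every segment ending at i-1."""
    if seg[1] < i:
        return [seg]
    # Assumption: with no segment ending at i-1 we fall back to a boundary state.
    return [r for r in segs if r[2] == i - 1] or [None]


def sea_score(n1, segs1, n2, segs2, delta, g=2.0, h=1.0):
    """Best local segment-alignment score (affine gaps: open g, extend h)."""
    V, D, I = {}, {}, {}
    best = 0.0
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            for a, b in product(covering(segs1, i), covering(segs2, j)):
                pa, pb = predecessors(segs1, i, a), predecessors(segs2, j, b)
                # Missing keys are boundary cells: V = 0, D = I = -infinity.
                s = max(V.get((i - 1, r, j - 1, d), 0.0)
                        for r in pa for d in pb) + delta(i, a, j, b)
                dv = max(max(V.get((i - 1, r, j, b), 0.0) - g,
                             D.get((i - 1, r, j, b), NEG) - h) for r in pa)
                iv = max(max(V.get((i, a, j - 1, d), 0.0) - g,
                             I.get((i, a, j - 1, d), NEG) - h) for d in pb)
                key = (i, a, j, b)
                D[key], I[key] = dv, iv
                V[key] = max(s, dv, iv, 0.0)
                best = max(best, V[key])
    return best


# Toy inputs (hypothetical): identical sequences, alternative segments on seq2.
seq1, seq2 = "ACD", "ACD"


def delta(i, a, j, b, w_a=0.5):
    """Toy score: +2/-1 for residue match/mismatch, +1/-1 for structure type."""
    seq = 2.0 if seq1[i - 1] == seq2[j - 1] else -1.0
    struct = 1.0 if a[0][0] == b[0][0] else -1.0
    return w_a * seq + (1.0 - w_a) * struct


segs1 = [("HHH", 1, 3)]
segs2 = [("HHH", 1, 3), ("EEE", 1, 3)]
print(sea_score(len(seq1), segs1, len(seq2), segs2, delta))
```

In this toy case sequence 2 offers two contradictory segments over the same positions; the maximization over all segment combinations selects the helix path because it agrees with sequence 1.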
Complexity
We first assume that every dynamic programming entry takes O(1) time, because doing one substitution, one insertion, and one deletion each takes O(1) time.
This is only an assumption because some entries do more than O(1) work, namely when γ, δ range over a set!
Complexity
Without loss of generality, we assume:
In Sequence 1:
The first vertex is covered by $a_1$ PLSSs ($|E(1)| = a_1$)
The second vertex is covered by $a_2$ PLSSs ($|E(2)| = a_2$)
The third vertex is covered by $a_3$ PLSSs ($|E(3)| = a_3$)
…
The (m-1)-th vertex is covered by $a_{m-1}$ PLSSs ($|E(m-1)| = a_{m-1}$)
The m-th vertex is covered by $a_m$ PLSSs ($|E(m)| = a_m$)
In Sequence 2:
The first vertex is covered by $b_1$ PLSSs ($|E(1)| = b_1$)
The second vertex is covered by $b_2$ PLSSs ($|E(2)| = b_2$)
The third vertex is covered by $b_3$ PLSSs ($|E(3)| = b_3$)
…
The (n-1)-th vertex is covered by $b_{n-1}$ PLSSs ($|E(n-1)| = b_{n-1}$)
The n-th vertex is covered by $b_n$ PLSSs ($|E(n)| = b_n$)
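Complexity
The entry in row $i$ and column $j$ of the matrix requires $a_i \times b_j$ operations.
So the total number of operations in the matrix is
$$\begin{aligned}
&\underbrace{(a_1 b_1 + a_1 b_2 + \cdots + a_1 b_n)}_{\text{first row}} + \underbrace{(a_2 b_1 + a_2 b_2 + \cdots + a_2 b_n)}_{\text{second row}} + \cdots + \underbrace{(a_m b_1 + a_m b_2 + \cdots + a_m b_n)}_{\text{last row}} \\
&= (a_1 + a_2 + \cdots + a_m)(b_1 + b_2 + \cdots + b_n) \\
&= M \cdot \frac{a_1 + a_2 + \cdots + a_m}{M} \cdot N \cdot \frac{b_1 + b_2 + \cdots + b_n}{N} \\
&= M N C_1 C_2 = O(M N C_1 C_2)
\end{aligned}$$
where $M = m$ and $N = n$ are the sequence lengths and
$$C_1 = \frac{a_1 + a_2 + \cdots + a_m}{M}, \qquad C_2 = \frac{b_1 + b_2 + \cdots + b_n}{N}$$
This is the time complexity under our earlier assumption.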
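The identity behind this count (the sum of all a_i × b_j equals M·N·C1·C2) can be checked numerically. The coverage counts below are made-up values for illustration, and the function names are hypothetical.

```python
import math


def total_operations(a, b):
    """Sum of a_i * b_j over every cell (i, j) of the DP matrix."""
    return sum(ai * bj for ai in a for bj in b)


def mean_coverage(counts):
    """Average number of PLSSs covering a position (C1 or C2 on the slide)."""
    return sum(counts) / len(counts)


# Hypothetical coverage counts |E(i)| for each position of the two sequences.
a = [2, 3, 1, 4]  # sequence 1, M = 4
b = [1, 2, 2]     # sequence 2, N = 3

M, N = len(a), len(b)
C1, C2 = mean_coverage(a), mean_coverage(b)

print(total_operations(a, b), M * N * C1 * C2)
```

The two printed numbers agree up to floating-point rounding, because the double sum factors into (a_1 + … + a_m)(b_1 + … + b_n).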
Complexity
This was only an assumption because some entries do more than O(1) work, namely when γ, δ range over a set!
We need to show that the cases in which γ, δ is a set are not too numerous.
When γ, δ is a set, the PLSS appears for the first time at this position, so the number of such cases equals the number of PLSSs.
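Complexity
Suppose the number of PLSSs in sequence 1 is $P_1$ and the number of PLSSs in sequence 2 is $P_2$.
For deletions, the total number of cases in which γ is a set is $N C_1 P_1$.
Similarly, for insertions, the total number of cases in which δ is a set is $M C_2 P_2$.
For substitutions, the total number is (#deletion) + (#insertion).
So the time these operations require is (#deletion) + (#insertion):
$$N C_1 P_1 + M C_2 P_2 \le N C_1 M + M C_2 N \le M N C_1 C_2 = O(M N C_1 C_2)$$
(The first inequality uses $P_1 \le M$ and $P_2 \le N$; the second holds when $C_1, C_2 \ge 2$.)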
Example: (1e68A, 1nkl)
<Residue-Number> specifies the number of a residue.
<Residue-Number>:<Residue-Number> selects all atoms that have residue numbers greater than or equal to the first residue number but less than or equal to the second residue number.
1e68A: Bacteriocin As-48
1nkl : Nk-lysin
Residue Number
Example: (1e68A, 1nkl)
Each protein is represented as a collection of potentially overlapping and contradictory PLSSs (a network).
SEA finds an optimal alignment between these two proteins
Simultaneously, SEA identifies the optimal subset of PLSSs (a path in the network) describing each protein.
1e68A: Bacteriocin As-48
1nkl : Nk-lysin
Residue Number
Several Variants of the SEA Algorithm
SEA_true: using segments derived from the actual 3D structure
SEA_cn: n is the maximum segment coverage (the number of segments that cover a position in each protein), e.g. SEA_c30, SEA_c10, etc.
SEA_1D: using 1D prediction (a single predicted local structure)
Subset                   Measure        CE   SEA_true  SEA_c30  SEA_c10  SEA_c5  SEA_1d  BLAST  ALIGN  FFAS
Family (409 pairs)       average-shift  -    0.61      0.56     0.56     0.54    0.49    0.44   0.48   0.49
                         shift>0.9      -    73        69       63       56      47      51     60     43
                         shift>0.7      -    207       199      192      183     152     146    165    161
                         shift>0.5      -    282       260      259      251     215     197    228    227
                         RMSD3.0        257  95        82       82       76      63      77     54     40
                         RMSD5.0        397  237       184      171      177     147     157    138    118
                         RMSD8.0        408  294       248      249      249     231     196    206    194
                         all            409  345       404      398      368     366     232    372    409
Superfamily (225 pairs)  average-shift  -    0.27      0.12     0.12     0.12    0.08    0.09   0.06   0.07
                         shift>0.9      -    3         3        3        2       0       1      2      1
                         shift>0.7      -    17        8        9        7       4       10     9      7
                         shift>0.5      -    54        26       23       21      17      18     18     17
                         RMSD3.0        55   12        6        6        7       6       8      3      1
                         RMSD5.0        160  44        16       18       18      11      18     11     1
                         RMSD8.0        163  69        37       34       41      28      23     22     15
                         all            166  128       217      204      181     177     41     149    225
General performance of SEA incorporating different local structure diversities
CE (1998): Combinatorial Extension, which builds an alignment by combining a path of aligned fragment pairs (AFPs)
BLAST (1990), ALIGN (1988), and FFAS (2000) are not computed from PLSSs
Shift score: measures the misalignment between a predicted alignment of two proteins and their reference alignment.
Shift Score
ε is a small number used as a parameter to the scoring algorithm.
The shift score ranges from -ε (here -0.2) to 1.0.
RMSD: root mean square deviation of Cα positions after optimal superposition (a measure of structural similarity)
SEA_c30 and SEA_c10 produced the most accurate alignments.