Post on 19-Dec-2015
6B -1
The Prediction of Protein Structures
6B -2
Amino Acids
Amino acids: the basic building blocks of proteins; there are 20 kinds.
6B -3
Amino Acid Molecules
6B -4
Protein Molecules
6B -5
Primary Structure of Protein
Primary structure: the linear sequence of amino acids
Amino acid sequence of bovine insulin (a protein):
6B -6
Secondary Structure of Protein
Secondary structure: α-helix, β-sheet, loop
6B -7
Tertiary Structure of Protein
Tertiary structure of the hemoglobin molecule
6B -8
Quaternary Structure of Protein
Quaternary structure of the hemoglobin molecule
6B -9
Protein animation
Source: http://elearning.bioinfo.ntu.edu.tw/
6B -10
Protein folding animation
Source: http://elearning.bioinfo.ntu.edu.tw/
6B -11
Relation between Structures
Sequence → structure → function
6B -12
Reason for Prediction
Why do we need protein structure prediction?
Biological techniques: X-ray crystallography, nuclear magnetic resonance (NMR)
These are expensive, time-consuming, and limited to small or medium-sized proteins (~700 residues).
Computational strategies
6B -13
Prediction Competition
Advance the methods of identifying protein structure from sequence
CASP (Critical Assessment of Techniques for Protein Structure Prediction) http://predictioncenter.org
Every 2 years (1994 to now)
CASP6 (Gaeta, Italy, Dec. 2004)
CASP7 (Pacific Grove, USA, Nov. 2006)
6B -14
6B -15
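Accuracy Measurement
RMSD (Root Mean Square Deviation):
$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i^{A}-x_i^{B}\right)^{2}}$
Distance RMSD:
$\mathrm{dRMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(d_{ij}^{A}-d_{ij}^{B}\right)^{2}}$
Here $x_i^{A}$ and $x_i^{B}$ are the positions of residue i in structures A and B, and $d_{ij}$ is the distance between residues i and j within one structure.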
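The two accuracy measures on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function names are hypothetical, no superposition of the two structures is performed, and the distance RMSD divides by n as the slide formula appears to read (other conventions divide by n² or by the number of residue pairs).

```python
import math

def rmsd(coords_a, coords_b):
    """Coordinate RMSD between two equally long lists of (x, y, z) points.

    Assumption: both structures are already placed in a common frame.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("structures must be non-empty and equally long")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

def distance_rmsd(coords_a, coords_b):
    """Distance RMSD: compares the intra-structure distances d_ij.

    Assumption: the double sum is divided by n, following the slide.
    """
    n = len(coords_a)
    if n != len(coords_b) or n == 0:
        raise ValueError("structures must be non-empty and equally long")
    total = 0.0
    for i in range(n):
        for j in range(n):
            d_a = math.dist(coords_a[i], coords_a[j])
            d_b = math.dist(coords_b[i], coords_b[j])
            total += (d_a - d_b) ** 2
    return math.sqrt(total / n)
```

Because the distance RMSD only uses distances inside each structure, it does not change when one structure is translated or rotated, whereas the coordinate RMSD does.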
6B -16
Prediction of Protein Structures
Ab initio methods: based on molecular thermodynamics, without reference to other known structures.
Homology modeling: knowledge-based modeling that relies on sequence similarity; more accurate.
6B -17
Previous Works
PHDthreader (http://www.embl-heidelberg.de/predictprotein): < 30% of the predicted first hits are true remote homologues; ab initio method
SWISS-MODEL (http://expasy.hcuge.ch/swissmod/SWISS-MODEL.html): an automated knowledge-based protein modeling server
InsightII (http://www.accelrys.com/products/insight/index.html) (commercial): protein structure prediction
Paircoil (http://ostrich.lcs.mit.edu/cgi-bin/score): prediction of coiled coil regions
List of other methods or programs: http://restools.sdsc.edu/biotools/biotools9.html
6B -18
Properties of Ab Initio Methods
Score functions:
HMM (Hidden Markov Model)
Electrostatics, van der Waals (VdW) forces, hydrogen bonds (H-bonds) and others
Hydrophobic and hydrophilic properties
Protein folding problem
6B -19
Homology Modeling
General presumption: small changes in the protein sequence lead to only small changes in the structure.
Sequence identity > 30%
General procedure:
1. Database searching and template selection
2. Energy minimization
3. Rationality evaluation
6B -20
General Procedure of Protein Structure Prediction by Homology Modeling
Input: S1=SSKCSRLKTFPQNACVYHK
Output: The backbone conformation model of S1.
Step 1: Select a template.
S2=SVYCSSLACSDHN
Step 2: Perform sequence alignment.
S1=SSKCSRLKTFPQNACVYHK
S2=SVYCSSL------ACSDHN
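As a rough illustration of how an alignment such as the one above can be checked against an identity threshold, the sketch below counts identical residues over the aligned (gap-free) columns. The function name and the choice of denominator are assumptions; other definitions divide by the length of the shorter sequence or of the whole alignment.

```python
def sequence_identity(aligned_a, aligned_b, gap="-"):
    """Return (identical, aligned) column counts for two aligned sequences.

    Columns where either sequence has a gap are skipped (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    aligned = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue
        aligned += 1
        if x == y:
            identical += 1
    return identical, aligned

s1 = "SSKCSRLKTFPQNACVYHK"
s2 = "SVYCSSL------ACSDHN"
identical, aligned = sequence_identity(s1, s2)
print(f"{identical}/{aligned} = {100 * identical / aligned:.1f}% identity")
# prints: 7/13 = 53.8% identity
```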
6B -21
Step 3: Find the structurally conserved regions. Copy the coordinates of the structurally conserved regions from S2 to S1.
6B -22
6B -23
Step 4: Apply the folding algorithm to position the residues that lack sequence similarity.
LKTFPQNA 10011001
6B -24
Step 5 :
- Find the proteins of known structure with 70% or higher sequence similarity.
- Construct a segment of B-spline curve for every four points.
[Figure: candidate protein structures 1 to 3 and the folding structures]
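One common way to build a B-spline segment from four points is the uniform cubic B-spline, sketched below. This is an assumption about the spline type (the slides do not say which basis is used), and the function names are hypothetical. Each segment is evaluated from a sliding window of four consecutive control points.

```python
def bspline_point(p0, p1, p2, p3, t):
    """Point on a uniform cubic B-spline segment at parameter t in [0, 1]."""
    b0 = (1 - t) ** 3 / 6
    b1 = (3 * t ** 3 - 6 * t ** 2 + 4) / 6
    b2 = (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6
    b3 = t ** 3 / 6
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))

def bspline_curve(points, samples_per_segment=10):
    """Sample one segment for every four consecutive control points."""
    curve = []
    for k in range(len(points) - 3):
        window = points[k:k + 4]
        for s in range(samples_per_segment):
            curve.append(bspline_point(*window, s / samples_per_segment))
    return curve
```

The four basis weights always sum to 1, so the curve stays inside the convex hull of each window of control points, which smooths the residue trace rather than passing through every point.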
6B -25
Final Conformation
6B -26
Template Search on Protein Databases
PDB (Protein Data Bank) http://www.rcsb.org/pdb/
Swiss-Prot http://tw.expasy.org/sprot/
Classification:
CATH (Class, Architecture, Topology and Homologous superfamily) http://cathwww.biochem.ucl.ac.uk/latest/
SCOP (Structural Classification of Proteins) http://scop.mrc-lmb.cam.ac.uk/scop/index.html
6B -27
6B -28
Template Selection Methods (Tools)
How to select:
Sequence alignment: ClustalW, Blastp and others
Secondary structure prediction [Al-Lazikani et al.]
Structurally conserved blocks
6B -29
PAM250 Score Matrix A C D E F G H I K L M N P Q R S T V W Y
A 2
C -2 12
D 0 -5 4
E 0 -5 3 4
F -4 -4 -6 -5 9
G 1 -3 1 0 -5 5
H -1 -3 1 1 -2 -2 6
I -1 -2 -2 -2 1 -3 -2 5
K -1 -5 0 0 -5 -2 0 -2 5
L -2 -6 -4 -3 2 -4 -2 2 -3 6
M -1 -5 -3 -2 0 -3 -2 2 0 4 6
N 0 -4 2 1 -4 0 2 -2 1 -3 -2 2
P 1 -3 -1 -1 -5 -1 0 -2 -1 -3 -2 -1 6
Q 0 -5 2 2 -5 -1 3 -2 1 -2 -1 1 0 4
R -2 -4 -1 -1 -4 -3 2 -2 3 -3 0 0 0 1 6
S 1 0 0 0 -3 1 -1 -1 0 -3 -2 1 1 -1 0 2
T 1 -2 0 0 -3 0 -1 0 0 -2 -1 0 0 -1 -1 1 3
V 0 -2 -2 -2 -1 -1 -2 4 -2 2 2 -2 -1 -2 -2 -1 0 4
W -6 -8 -7 -7 0 -7 -3 -5 -3 -2 -4 -4 -6 -5 2 -2 -5 -6 17
Y -3 0 -4 -4 7 -5 0 -1 -4 -1 -2 -2 -5 -4 -4 -3 -3 -2 0 10
6B -30
Blosum62 Matrix A C D E F G H I K L M N P Q R S T V W Y
A 4
C 0 9
D -2 -3 6
E -1 -4 2 5
F -2 -2 -3 -3 6
G 0 -3 -1 -2 -3 6
H -2 -3 1 0 -1 -2 8
I -1 -1 -3 -3 0 -4 -3 4
K -1 -3 -1 1 -3 -2 -1 -3 5
L -1 -1 -4 -3 0 -4 -3 2 -2 4
M -1 -1 -3 -2 0 -3 -2 1 -1 2 5
N -2 -3 1 0 -3 0 -1 -3 0 -3 -2 6
P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -1 7
Q -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5
R -1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5
S 1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4
T -1 -1 1 0 -2 1 0 -2 0 -2 -1 0 1 0 -1 1 4
V 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 -2 4
W -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -3 -3 11
Y -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7
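A lower-triangular matrix like the one above is symmetric, so a lookup has to try both orders of a residue pair. The sketch below scores an already aligned pair of sequences with a small excerpt of the Blosum62 table (rows A, C and D only). The function names and the linear gap penalty of -4 are assumptions for illustration, not values from the slides.

```python
# Excerpt of the Blosum62 table above, stored as (row, column) -> score.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4,
    ("C", "A"): 0, ("C", "C"): 9,
    ("D", "A"): -2, ("D", "C"): -3, ("D", "D"): 6,
}

def pair_score(matrix, a, b):
    """Look up a symmetric score stored in triangular form."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]

def alignment_score(aligned_a, aligned_b, matrix, gap_penalty=-4, gap="-"):
    """Sum the column scores of two aligned sequences.

    gap_penalty is an assumed linear penalty applied to every gap column.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            total += gap_penalty
        else:
            total += pair_score(matrix, x, y)
    return total
```

For example, aligning `ACD` with `A-D` gives 4 + (-4) + 6 = 6 under these assumptions.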
6B -31
Protein Folding Problem
Given the primary structure of a protein, compute its 3-dimensional structure.
The H-P model was proposed by Dill in 1985 [Dill’85]: folding minimizes the total free energy.
Each of the 20 amino acids is classified as:
H (hydrophobic, non-polar): 1 (water-hating)
P (hydrophilic, polar): 0 (water-loving)
The amino acid sequence of a protein can then be viewed as a binary sequence of H’s (1’s) and P’s (0’s).
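The encoding step can be sketched as a simple lookup. The hydrophobic set below is an assumption: classifications differ between hydrophobicity scales and the slides do not list theirs. It was chosen so that the segment from the homology example, LKTFPQNA, maps to 10011001 as shown earlier in the deck.

```python
# Assumed hydrophobic (H) residues; every other letter is treated as polar (P).
HYDROPHOBIC = set("ACFILMPVW")

def hp_encode(sequence):
    """Map a one-letter amino acid sequence to a string of 1's (H) and 0's (P)."""
    return "".join("1" if aa in HYDROPHOBIC else "0" for aa in sequence.upper())

print(hp_encode("LKTFPQNA"))  # prints: 10011001
```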
6B -32
Example of H-P Model
Input sequence: 011001001110010
[Figure: two lattice foldings of the input sequence, with Score = 5 and Score = 3]
6B -33
Protein Folding on H-P Model
Protein folding on the H-P model: given a sequence of 1’s (H’s) and 0’s (P’s), find a self-avoiding path embedded in either a 2D or a 3D lattice such that the number of pairs of adjacent 1’s is maximized.
The problem is NP-complete even for the 2D lattice [Hart’97].
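To make the objective concrete, the sketch below scores one candidate folding on the 2D square lattice. It assumes the usual H-P convention that only pairs of 1's that are lattice neighbours but not consecutive in the chain are counted; the move alphabet (U, D, L, R) and the function names are hypothetical.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def fold_path(moves):
    """Turn unit moves into lattice coordinates; None if the path crosses itself."""
    x, y = 0, 0
    path = [(x, y)]
    seen = {(x, y)}
    for m in moves:
        dx, dy = MOVES[m]
        x, y = x + dx, y + dy
        if (x, y) in seen:
            return None  # not self-avoiding
        seen.add((x, y))
        path.append((x, y))
    return path

def hp_score(sequence, moves):
    """Count pairs of 1's that touch on the lattice but are not chain neighbours."""
    if len(moves) != len(sequence) - 1:
        raise ValueError("need exactly one move between consecutive residues")
    path = fold_path(moves)
    if path is None:
        return None
    index_at = {pos: i for i, pos in enumerate(path)}
    score = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "1":
            continue
        # Look right and up only, so that each pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "1" and abs(i - j) > 1:
                score += 1
    return score
```

A folding algorithm then searches over move strings for the one with the highest score; this function only evaluates a given candidate.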
6B -34
U-Fold Algorithm for HP
Find a suitable point at which to split the string into two substrings.
Example: 0100101001110101000010
0100----101001
01000010101--1
6B -35
Ant Colony Optimization System
The ant colony optimization (ACO) algorithm was presented by Dorigo et al. in 1991.
6B -36
General Lattice Model
Square Lattice Model Triangular Lattice Model
6B -37
Experiments of Different Models
        1b1u      1a6n      118l      102l      1b8k
Cubic   12.08891  13.35721  13.01421  13.98656  17.50644
FCC     10.18907  12.09836  12.39913  11.93452  15.06346
• Measured by RMSD (Å)
• Data source: PDB
• Folding by genetic algorithm
• FCC: face-centered cubic model
[Figure: RMSD versus sequence length (residues, 0 to 500) for the cubic and FCC models, with the σ of each]
6B -38
Structure Alignment by Curve Fitting
B-spline curves
6B -39
Curve Matching
Curve matching measure function g: accumulates |ds2 - ds1| + r |dq2 - dq1| along the two curves.
[Figure: curves C1 and C2 with labels T1 = T2, A1 = A2, B1, B2, q1, q2, s1, s2 and s2 - s1]
6B -40
- Apply the curve alignment.
Our score function of the curve alignment:
$A[i,j] = \max\left(A[i-1,j],\ A[i,j-1],\ A[i-1,j-1] + w_{i,j}\right)$
where the weight $w_{i,j}$ takes one of four values according to the range in which the distance $d$ falls.
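The recurrence for the curve alignment score is a standard dynamic program over two point sequences. The sketch below fills the table A. The boundary values (zero) and the thresholds and values inside `example_weight` are assumptions for illustration only; the weights actually used on the slide could not be recovered from this transcript.

```python
import math

def example_weight(p, q):
    """Hypothetical weight: reward nearby points, penalise distant ones."""
    d = math.dist(p, q)
    if d <= 1.0:
        return 2
    if d <= 3.0:
        return 1
    return -1

def alignment_table(points_a, points_b, weight=example_weight):
    """Fill A[i][j] = max(A[i-1][j], A[i][j-1], A[i-1][j-1] + w(i, j))."""
    n, m = len(points_a), len(points_b)
    # Row 0 and column 0 are the empty-prefix boundary (assumed to score 0).
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = weight(points_a[i - 1], points_b[j - 1])
            table[i][j] = max(table[i - 1][j],
                              table[i][j - 1],
                              table[i - 1][j - 1] + w)
    return table
```

The final alignment score is the bottom-right entry, `table[n][m]`.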
6B -41
Additional Constraints
Improvement on the HP model:
Prediction results are not yet successful enough; considering hydrophobicity alone is not enough.
Other features should also be considered:
Secondary structure elements (SSEs): 1. helix 2. sheet
Electrostatic attractions
Disulfide bonds
6B -42
Electrostatic Attractions and Disulfide Bonds
Electrostatic attractions:
Disulfide bond: formed between two cysteines (C’s)
6B -43
Probabilistic Disulfide Bonds
Folding with the constraint of disulfide bonds.
6B -44
Experiments for Disulfide Bonds
Experiments of folding with disulfide constraints
[Figure: RMSD versus sequence length (0 to 500) for folding without SS and with SS (SS fold), with the σ of each]
6B -45
Secondary Structures
Conformations of helix
Distance between the i-th amino acid and the (i+4)-th amino acid
6B -46
Secondary Structures
Conformations of sheet
6B -47
Further Improvement: Sliced Lattice Model
The original lattice models do not work well.
Slice the lattice into smaller lattices.
6B -48
Sliced Lattice Model
6B -49
Global Folding
6B -50
Experimental Materials
Database: PDB (http://www.rcsb.org/pdb/), April 17, 2005; 20,380 proteins
Data of CASP6 (http://predictioncenter.llnl.gov/), 2004
Alignment: Blastp (http://www.ncbi.nlm.nih.gov/); sequence identity < 90%; Blosum-62
6B -51
Experiment Results
Target protein: 1LIN (146)
Template protein  Sequence similarity  RMSD ('03)  RMSD ('04)  RMSD ('05)
1CFD              100%                 7.34        -           -
1TNW              69%                  18.72       13.37       10.56
1IQ5              55%                  15.15       9.18        7.35
1DTL              52.9%                10.22       7.48        6.17
5PAL              36.4%                12.18       8.43        5.89
Measured by RMSD
6B -52
Experiment Results
Target protein: 1QG1 (104)
Template protein  Sequence similarity  RMSD ('03)  RMSD ('04)  RMSD ('05)
1JYQ              90.4%                4.15        -           4.24
1JYU              90.4%                13.89       -           10.89
1SHA              46.7%                4.82        4.82        3.65
1SHD              45.2%                8.89        6.77        5.55
5PDR              24.4%                10.55       8.0         6.76
Measured by RMSD
6B -53
Experimental Results of CASP6
Compared with Chen’03:
# of proteins                 77
# of positive improvement     59
# of negative improvement     12
Average improvement           21.44%
Average sequence length       208 (53~435)
Average template identity     36%
Average template similarity   21%
6B -54
Compared with Palu et al.
Palu et al. [Palu’04]: without template; FCC lattice model
6B -55
Compared with Zheng et al.
Zheng et al. [Zheng02]: homology; lattice model
6B -56
An Example of Our Results PDB code: 7RSA, Length=124, RMSD =1.48Å
Our result Real structure
6B -57
Protein Structure Prediction System
Target protein: 7RSA
Step 1: Prepare
6B -58
Protein Structure Prediction System
http://par.cse.nsysu.edu.tw/main.html
Step 2: Predict
6B -59
Protein Structure Prediction System
Step 3: Display result
6B -60
Protein Structure Prediction System
Step 3: Display result
6B -61
Protein Structure Prediction System
Step 3: Display result
6B -62
Protein Structure Prediction System
Step 4: Compare
Our result, Real structure, RMSD
6B -63
Protein Structure Prediction System
Step 4: Compare
Our result, Real structure
6B -64
Protein Side Chain Packing
6B -65
Amino Acids & Side-chain
Elements of protein Three groups
Lysine (LYS)
Side-chain
6B -66
Protein Structure Prediction
Input: 1D sequence
Output: 3D structure (in general, the 3D backbone structure)
Protein structure = Backbone structure + side-chain structure
ACE GLY ASP VAL GLU LYS GLY LYS LYS ILE PHE VAL GLN
6B -67
Backbone and Side Chain
Protein SAV1595, Journal of Biomolecular NMR (2004) 29: 391–394
Backbone, Side-chain
6B -68
Protein Side Chain Packing Problem (PSCPP)
Given: the fixed backbone of the protein. For each residue of the backbone other than glycine, there is a set of possible rotamers.
Problem: Choose one suitable rotamer for each residue, such that the total energy of the protein is minimized.
The PSCPP is NP-hard.
6B -69
Graph Model of PSCPP
Let R = {r1, r2, . . . , rn} be the set of residues of the target protein.
Let an undirected graph G = (V, E) represent the side chains of the protein. Each vertex vi,j is a rotamer of residue ri.
Vi = {vi,j | vi,j does not collide with any backbone atom}.
Then we have V = ∪Vi and E = {(vi,j , vi+1,k) | vi,j does not collide with vi+1,k}.
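The graph construction can be sketched as two filters: one that discards rotamers clashing with the backbone, and one that connects compatible rotamers of consecutive residues. The function and parameter names are hypothetical, and the two collision tests are passed in as callbacks because the slides do not specify how collisions are detected.

```python
def build_rotamer_graph(rotamers, hits_backbone, rotamers_collide):
    """Build the layered graph G = (V, E) of the side-chain packing problem.

    rotamers[i]            -- candidate rotamers of residue i
    hits_backbone(i, r)    -- True if rotamer r of residue i hits the backbone
    rotamers_collide(u, v) -- True if two rotamers clash with each other
    Returns (layers, edges); a vertex is identified by (residue index, position
    within its filtered layer).
    """
    layers = [[r for r in cands if not hits_backbone(i, r)]
              for i, cands in enumerate(rotamers)]
    edges = set()
    for i in range(len(layers) - 1):
        for j, u in enumerate(layers[i]):
            for k, v in enumerate(layers[i + 1]):
                if not rotamers_collide(u, v):
                    edges.add(((i, j), (i + 1, k)))
    return layers, edges
```

A packing is then a path that visits one vertex in every layer, which is the "route" the ants search for later in the deck.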
6B -70
Dihedral Angles
Side-chain atoms: Cβ, Cγ, Oδ.
Dihedral angles [Iupa70]:
φ: Ci-1-Ni-Cαi-Ci
ψ: Ni-Cαi-Ci-Ni+1
X1: Ni-Cαi-Cβi-Cγi
[Figure: an Asp residue with its backbone and side-chain atoms labeled, showing the side-chain dihedral angles X1 and X2]
6B -71
The Rotamer Library
The accuracy of side chain prediction depends primarily on the quality of the rotamer library.
Our rotamer library is a coordinate rotamer library, which preserves the bond lengths and bond angles that do not appear in a standard rotamer library.
Our rotamer library is built from 850 proteins, the same set used for the backbone-dependent rotamer library proposed by Dunbrack and Karplus [Dunb93].
6B -72
Example of the Rotamer Library
Columns: [A.A.] [φ] [ψ] [X1] [Prob.] [3-D Coordinate]
6B -73
Formulas of ACO for PSCPP
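Pheromone probability formula:
$P_k(s,u) = \frac{[\tau_{s,u}(t)]^{\alpha}\,[\eta_{s,u}(t)]^{\beta}}{\sum_{w \in J_k(s)} [\tau_{s,w}(t)]^{\alpha}\,[\eta_{s,w}(t)]^{\beta}}, \quad s \in V_i,\ u \in V_{i+1}$
Pheromone update formula:
$\tau_{s,u}(t+1) = (1-\rho)\,\tau_{s,u}(t) + \sum_{k=1}^{m} \Delta\tau_{s,u}^{k}(t)$
$0 < \rho < 1$, where $\rho$ is the rate of pheromone evaporation.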
ACO Prediction for PSCPP
Input: Backbone coordinate data.
Output: The route with near-minimum score.
Step 1: Set parameters and initialize pheromone trails.
Step 2: Each ant k chooses one rotamer u of residue i according to the probability function pk(s, u), for all 1 ≤ i ≤ n, u ∈ Vi.
Step 3: Update the pheromone trails.
Step 4: If the current best solution has not improved by some percentage after some predefined number of generations, or the number of generations has reached the predefined value, return the route with minimum score; otherwise, go to Step 2.
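The four steps can be sketched as a short loop. This is a simplified illustration rather than the authors' implementation: pheromone is stored per (residue, rotamer) choice instead of per edge, the stopping rule is a fixed number of generations only, the deposit rule 1 / (1 + energy) assumes non-negative energies, and `energy` is a caller-supplied function. All names and default values are hypothetical.

```python
import random

def aco_pack(n_options, energy, ants=20, generations=50, rho=0.1, seed=0):
    """Choose one rotamer index per residue so that energy(route) is small.

    n_options[i] -- number of candidate rotamers of residue i
    energy(route) -- non-negative score of a complete route (lower is better)
    """
    rng = random.Random(seed)
    # Step 1: initialise the pheromone trails.
    tau = [[1.0] * k for k in n_options]
    best_route, best_energy = None, float("inf")
    for _ in range(generations):
        routes = []
        # Step 2: each ant picks one rotamer per residue, biased by pheromone.
        for _ in range(ants):
            route = [rng.choices(range(k), weights=tau[i])[0]
                     for i, k in enumerate(n_options)]
            routes.append((energy(route), route))
        # Step 3: evaporate, then deposit more pheromone on better routes.
        tau = [[(1.0 - rho) * t for t in row] for row in tau]
        for e, route in routes:
            for i, u in enumerate(route):
                tau[i][u] += 1.0 / (1.0 + e)
            if e < best_energy:
                best_energy, best_route = e, route
    # Step 4 (simplified): stop after the predefined number of generations.
    return best_route, best_energy
```

A toy call such as `aco_pack([3, 2, 3], my_energy)` returns the best route found and its energy; a realistic `energy` would combine the score terms described on the next slide.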
6B -75
The Score Function
Features in ACO score functions:
The disulfide bonds: S1 = BonS × #(disulfide bonds)
The hydrogen bonds: S2 = BonH × #(hydrogen bonds)
The charge-charge interactions: S3 = BonC × (#(different charge pairs) − #(same charge pairs))
The van der Waals interactions: S4 = BonV × Σ Ei,j
Energy score function: E = S1 + S2 + S3 + S4
6B -76
Experiments
Two test sets:
25 proteins from Xiang and Honig 2001
5 proteins from Canutescu et al. 2003
Cutoff value: 20° [Xie06, R3]
If X1 is within 20° of the corresponding angle in the real structure, the predicted angle is considered correct.
Compared with SCWRL 3.0 [Canu03] and R3 [Xie06]
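The 20° criterion can be sketched as follows. Dihedral angles wrap around at ±180°, so the difference has to be taken on the circle. The function names are hypothetical, and treating a difference of exactly 20° as correct is an assumption.

```python
def angle_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def chi1_accuracy(predicted, actual, cutoff=20.0):
    """Percentage of predicted X1 angles within `cutoff` degrees of the real ones."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("angle lists must be non-empty and equally long")
    correct = sum(1 for p, r in zip(predicted, actual)
                  if angle_difference(p, r) <= cutoff)
    return 100.0 * correct / len(predicted)
```

For example, 170° and -175° differ by only 15° on the circle, so that prediction counts as correct even though the raw values differ by 345.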
6B -77
Parameters in Experiments
Weights of features in score function:
Feature  Value
BonS     0.5S4
BonH     5
BonC     2
BonV     1
Parameters used in ACO algorithm:
Parameter          Value
Population         50
Generation         300~600
α                  1.0
β                  1.0
Initial pheromone  1.0
6B -78
Experimental Results (First Case)
No. Target protein, Length, Our method X1 (%), SCWRL 3.0 X1 (%), R3 method X1 (%)
1 1AAC 85 87.1 84.7/95 76.5/86
2 1AHO 54 85.2 68.5/67 64.8/65
3 1B9O 112 70.5 68.8/73 66.1/77
4 1C5E 71 81.7 81.7/86 73.2/82
5 1C9O 53 84.9 66.0/72 71.7/70
6 1CC7 66 80.3 68.2/83 63.6/79
7 1CEX 146 85.6 76.7/82 75.3/77
8 1CKU 60 81.7 76.7/82 68.3/80
Column 5-6: IUPAC-IUB rules / Xie and Sahinidis’s (R3) result
6B -79
Experimental Results (First Case)
No. Target protein, Length, Our method X1 (%), SCWRL 3.0 X1 (%), R3 method X1 (%)
9 1CTJ 61 77.0 68.9/79 70.5/80
10 1CZ9 111 70.3 64.0/73 64.0/76
11 1CZP 83 79.5 77.1/86 73.5/81
12 1D4T 89 77.5 76.4/86 67.4/82
13 1IGD 50 82.0 68.0/74 54.0/68
14 1MFM 118 75.4 68.6/80 70.3/81
15 1PLC 82 72.0 67.1/72 70.7/71
16 1QJ4 221 71.5 72.9/84 67.9/80
17 1QQ4 143 83.9 73.4/78 71.3/78
6B -80
Experimental Results (First Case)
No. Target protein, Length, Our method X1 (%), SCWRL 3.0 X1 (%), R3 method X1 (%)
18 1QTN 134 86.6 74.6/82 67.9/78
19 1QU9 99 79.8 71.7/81 73.7/78
20 1RCF 142 79.6 83.8/86 81.7/80
21 1VFY 63 79.4 69.8/76 71.4/75
22 2PTH 151 82.1 78.8/83 78.1/84
23 3LZT 105 73.3 78.1/86 69.5/82
24 5P2L 144 78.5 70.8/78 63.2/71
25 7RSA 109 75.2 65.1/75 61.5/67
Column 5-6: IUPAC-IUB rules / Xie and Sahinidis’s (R3) result
6B -81
Experimental Results (Second Case)
No. Target protein, Length, Our method X1 (%), SCWRL 3.0 X1 (%), R3 method X1 (%)
1 1A8I 704 73.4 71.3 / 80 64.1 / 75
2 1B0P 978 70.8 62.3 / 69 - / 66
3 1BU7 399 74.9 70.4 / 78 64.4 / 72
4 1GAI 386 73.6 72.8 / 81 66.6 / 72
5 1XWL 496 71.5 66.7 / 73 61.5 / 72
Column 5-6: IUPAC-IUB rules / Xie and Sahinidis’s (R3) result