6B -1

The Prediction of Protein Structures

6B -2

Amino Acids (胺基酸)

Amino acids are the basic building blocks of proteins; there are 20 kinds.

6B -3

Amino Acid (胺基酸) Molecules

6B -4

Protein (蛋白質) Molecules

6B -5

Primary Structure ( 一級結構 ) of Protein

Primary structure: primary sequence of amino acids

Amino acid sequence of bovine insulin (a protein):

6B -6

Secondary Structure ( 二級結構 ) of Protein

Secondary structure: α-helix, β-sheet, loop

6B -7

Tertiary Structure ( 三級結構 ) of Protein

Tertiary structure of the hemoglobin molecule

6B -8

Quaternary Structure ( 四級結構 ) of Protein

Quaternary structure of the hemoglobin molecule

6B -9

Protein animation

Source: http://elearning.bioinfo.ntu.edu.tw/

6B -10

Protein folding animation

Source: http://elearning.bioinfo.ntu.edu.tw/

6B -11

Relation between Structures

Sequence → structure → function

6B -12

Reason for Prediction

Why do we need protein structure prediction?

Biological techniques:

- X-ray crystallography (X-ray 結晶法)
- Nuclear magnetic resonance (NMR) (核磁共振)

These techniques are expensive, time-consuming, and limited to small or medium-sized proteins (~700 residues).

Computational strategies

6B -13

Prediction Competition

Advance the methods of identifying protein structure from sequence.

CASP (Critical Assessment of Techniques for Protein Structure Prediction): http://predictioncenter.org

Held every 2 years (1994 to now):

- CASP6 (Gaeta, Italy, Dec. 2004)
- CASP7 (Pacific Grove, USA, Nov. 2006)

6B -14

6B -15

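Accuracy Measurement

RMSD (Root Mean Square Deviation):

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i^{A}-x_i^{B}\right)^{2}}$$

Distance RMSD:

$$\mathrm{dRMSD} = \frac{1}{n}\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n}\left(d_{ij}^{A}-d_{ij}^{B}\right)^{2}}$$

Here $x_i^{A}$ and $x_i^{B}$ are the positions of residue $i$ in structures A and B, and $d_{ij}$ is the distance between residues $i$ and $j$ within one structure.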
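The two measures can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for the experiments in these slides: the function names are hypothetical, the coordinate RMSD assumes the two structures have already been superimposed, and the distance RMSD is normalized by n (equivalently 1/n² under the square root), which is one common convention.

```python
import math

def rmsd(coords_a, coords_b):
    """Coordinate RMSD between two equally long lists of (x, y, z) points.

    Assumes the structures are already superimposed (no fitting is done here).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("structures must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

def distance_rmsd(coords_a, coords_b):
    """Distance RMSD: compares the internal distance matrices of A and B."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("structures must be non-empty and of equal length")
    n = len(coords_a)
    total = 0.0
    for i in range(n):
        for j in range(n):
            d_a = math.dist(coords_a[i], coords_a[j])
            d_b = math.dist(coords_b[i], coords_b[j])
            total += (d_a - d_b) ** 2
    return math.sqrt(total) / n
```

Because the distance RMSD uses only internal distances, it does not change when one structure is translated or rotated, whereas the coordinate RMSD does.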

6B -16

Prediction of Protein Structures

Ab initio methods (重頭起算法):

- Thermodynamics (分子熱力學)
- Without reference to other known structures

Homology modeling (同源模擬法):

- Knowledge-based modeling
- Sequence similarity
- More accurate

6B -17

Previous Works

PHDthreader(http://www.embl-heidelberg.de/predictprotein) < 30% of the predicted first hits are true remote homologues Ab initio method

SWISS-MODEL(http://expasy.hcuge.ch/swissmod/SWISS-MODEL.html) An automated knowledge-based protein modeling server

InsightII (http://www.accelrys.com/products/insight/index.html) (commercial): protein structure prediction

Paircoil(http://ostrich.lcs.mit.edu/cgi-bin/score) Prediction of coiled coil regions

List of other methods or programs http://restools.sdsc.edu/biotools/biotools9.html

6B -18

Properties of Ab Initio Methods

Score functions:

- HMM (Hidden Markov Model)
- Electrostatics (電性), VdW (凡得瓦力), H-bonds (氫鍵), and others
- Hydrophobic (疏水性) and hydrophilic (親水性)

Protein folding problem

6B -19

Homology Modeling

General presumption: small changes in a protein sequence cause only small changes in its structure.

Protein sequence identity > 30%

General procedure:

1. Database searching and template selection (模版選擇)
2. Energy minimization (能量最小化)
3. Rationality evaluation (合理性評估)

6B -20

General Procedure of Protein Structure Prediction on Homology Model

Input: S1 = SSKCSRLKTFPQNACVYHK

Output: The backbone conformation model of S1.

Step 1: Select a template.

S2 = SVYCSSLACSDHN

Step 2: Perform sequence alignment.

S1 = SSKCSRLKTFPQNACVYHK
S2 = SVYCSSL------ACSDHN
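Given the alignment of S1 and S2 from Step 2, the matched columns and the unaligned stretch of S1 can be read off mechanically. The sketch below is illustrative only; the function name is hypothetical, and identity is computed over columns where neither sequence has a gap, which is one of several conventions in use.

```python
def alignment_stats(aligned_a, aligned_b, gap="-"):
    """Return (matches, aligned_columns, unaligned_segments_of_a).

    unaligned_segments_of_a lists the stretches of sequence A that face
    gaps in sequence B, i.e. residues with no template counterpart.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    aligned = 0
    segments = []
    current = ""
    for a, b in zip(aligned_a, aligned_b):
        if b == gap and a != gap:
            current += a            # residue of A opposite a gap in B
            continue
        if current:                 # a gap run in B has just ended
            segments.append(current)
            current = ""
        if a != gap and b != gap:
            aligned += 1
            if a == b:
                matches += 1
    if current:
        segments.append(current)
    return matches, aligned, segments

s1 = "SSKCSRLKTFPQNACVYHK"
s2 = "SVYCSSL------ACSDHN"
matches, aligned, segments = alignment_stats(s1, s2)
identity = matches / aligned
```

For this pair, the segment of S1 facing the gap is the part that has no coordinates to copy from the template and must be positioned by a folding step.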

6B -21

Step 3: Find the structurally conserved regions. Copy the coordinates of the structurally conserved regions from S2 to S1.

6B -22

6B -23

Step 4: Apply the folding algorithm to position the residues that lack sequence similarity.

LKTFPQNA 10011001

6B -24

Step 5:

- Find the structure-known proteins with 70% or higher sequence similarity.

- Construct a segment of B-spline curve for every four points.

[Figure: candidate protein structures and the resulting folding structures (1, 2, 3) for the residues L, K, T, F, P, Q, N, A]

6B -25

Final Conformation

6B -26

Template Search on Protein Databases

PDB(Protein Data Bank) http://www.rcsb.org/pdb/

Swiss-prot http://tw.expasy.org/sprot/

Classification:

CATH (Class, Architecture, Topology and Homologous superfamily) http://cathwww.biochem.ucl.ac.uk/latest/

SCOP(Structural Classification of Proteins) http://scop.mrc-lmb.cam.ac.uk/scop/index.html

6B -27

6B -28

Template Selection Methods (Tools)

How to select:

- Sequence alignment: ClustalW, Blastp and others
- Secondary structure prediction [Al-Lazikani et al.]
- Structurally conserved blocks (結構保留區塊)

6B -29

PAM250 Score Matrix

A C D E F G H I K L M N P Q R S T V W Y
A 2
C -2 12
D 0 -5 4
E 0 -5 3 4
F -4 -4 -6 -5 9
G 1 -3 1 0 -5 5
H -1 -3 1 1 -2 -2 6
I -1 -2 -2 -2 1 -3 -2 5
K -1 -5 0 0 -5 -2 0 -2 5
L -2 -6 -4 -3 2 -4 -2 2 -3 6
M -1 -5 -3 -2 0 -3 -2 2 0 4 6
N 0 -4 2 1 -4 0 2 -2 1 -3 -2 2
P 1 -3 -1 -1 -5 -1 0 -2 -1 -3 -2 -1 6
Q 0 -5 2 2 -5 -1 3 -2 1 -2 -1 1 0 4
R -2 -4 -1 -1 -4 -3 2 -2 3 -3 0 0 0 1 6
S 1 0 0 0 -3 1 -1 -1 0 -3 -2 1 1 -1 0 2
T 1 -2 0 0 -3 0 -1 0 0 -2 -1 0 0 -1 -1 1 3
V 0 -2 -2 -2 -1 -1 -2 4 -2 2 2 -2 -1 -2 -2 -1 0 4
W -6 -8 -7 -7 0 -7 -3 -5 -3 -2 -4 -4 -6 -5 2 -2 -5 -6 17
Y -3 0 -4 -4 7 -5 0 -1 -4 -1 -2 -2 -5 -4 -4 -3 -3 -2 0 10

6B -30

Blosum62 Matrix A C D E F G H I K L M N P Q R S T V W Y

A 4

C 0 9

D -2 -3 6

E -1 -4 2 5

F -2 -2 -3 -3 6

G 0 -3 -1 -2 -3 6

H -2 -3 1 0 -1 -2 8

I -1 -1 -3 -3 0 -4 -3 4

K -1 -3 -1 1 -3 -2 -1 -3 5

L -1 -1 -4 -3 0 -4 -3 2 -2 4

M -1 -1 -3 -2 0 -3 -2 1 -1 2 5

N -2 -3 1 0 -3 0 -1 -3 0 -3 -2 6

P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -1 7

Q -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5

R -1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5

S 1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4

T -1 -1 1 0 -2 1 0 -2 0 -2 -1 0 1 0 -1 1 4

V 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 -2 4

W -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -3 -3 11

Y -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7

6B -31

Protein Folding Problem

Given the primary structure of a protein, compute its 3-dimensional structure.

The H-P model was proposed by Dill in 1985 [Dill'85]: minimize the total free energy.

Each of the 20 amino acids is classified as one of two types:

H (hydrophobic, non-polar): 1 (water-hating, 疏水性)

P (hydrophilic, polar): 0 (water-loving, 親水性)

The amino acid sequence of a protein can then be viewed as a binary sequence of H's (1's) and P's (0's).
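Turning a sequence into its H-P string is a simple lookup. The sketch below is an illustration under an assumption: the slides do not list which residues count as hydrophobic, so the set used here (A, V, L, I, M, F, P, W) is a hypothetical choice that happens to reproduce the earlier example LKTFPQNA → 10011001. Other classifications exist and would give different strings.

```python
# Assumed hydrophobic (H) residues; this set is NOT given in the slides.
HYDROPHOBIC = set("AVLIMFPW")

def hp_encode(sequence):
    """Map a one-letter amino acid sequence to a binary H-P string.

    '1' marks a hydrophobic (H) residue, '0' a hydrophilic (P) residue.
    """
    return "".join("1" if aa in HYDROPHOBIC else "0" for aa in sequence.upper())

loop_hp = hp_encode("LKTFPQNA")
```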

6B -32

Example of H-P Model

Input sequence: 011001001110010

[Figure: two lattice foldings of the input sequence, one with Score = 5 and one with Score = 3]

6B -33

Protein Folding on H-P Model

The protein folding problem on the H-P model: given a sequence of 1's (H's) and 0's (P's), find a self-avoiding path embedded in either a 2D or 3D lattice such that the number of pairs of adjacent 1's is maximized.

NP-complete even for the 2D lattice [Hart'97].
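Scoring a given fold is easy even though finding the best fold is hard. The sketch below evaluates one candidate path on the 2D square lattice. It is a minimal illustration with a hypothetical function name, and it assumes the usual convention that two 1's that are consecutive in the chain do not count as a contact.

```python
def hp_score(sequence, path):
    """Count adjacent pairs of 1's for a fold on the 2D square lattice.

    sequence: binary string such as "1001".
    path: list of (x, y) lattice points, one per residue, in chain order.
    """
    if len(sequence) != len(path):
        raise ValueError("need exactly one lattice point per residue")
    if len(set(path)) != len(path):
        raise ValueError("path is not self-avoiding")
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        if abs(x0 - x1) + abs(y0 - y1) != 1:
            raise ValueError("consecutive residues must be lattice neighbours")
    index_at = {pos: i for i, pos in enumerate(path)}
    score = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "1":
            continue
        # Look only right and up so that every pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "1" and abs(i - j) > 1:
                score += 1
    return score
```

A search method (exhaustive, genetic, or ant colony based) would call such a function on many candidate paths and keep the highest-scoring one.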

6B -34

U-Fold Algorithm for HP

Find a suitable point at which to split the string into two substrings.

Example: 0100101001110101000010

0100----101001

01000010101--1

6B -35

Ant Colony Optimization System

The ant colony optimization (ACO) algorithm was presented by Dorigo et al. in 1991.

6B -36

General Lattice Model

Square Lattice Model Triangular Lattice Model

6B -37

Experiments of Different Models

Model   1b1u      1a6n      118l      102l      1b8k
Cubic   12.08891  13.35721  13.01421  13.98656  17.50644
FCC     10.18907  12.09836  12.39913  11.93452  15.06346

• Measured by RMSD (Å)

• Data source: PDB

• Folding by genetic algorithm

• FCC: face-centered cubic model

[Figure: RMSD versus sequence length (residues, 0 to 500) for the cubic and FCC models, with the standard deviation σ of each]

6B -38

Structure Alignment by Curve Fitting

B-spline curves

6B -39

Curve Matching

Curve matching measure function:

$$g = \left[\,|ds_2 - ds_1| + r\,|dq_2 - dq_1|\,\right]$$

[Figure: two curves C1 and C2 labeled with A1 = A2, T1 = T2, B1, B2, s1, s2, s2 − s1, q1, and q2]

6B -40

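- Apply the curve alignment.

Our score function of the curve alignment:

$$A[i,j] = \max\left(A[i-1,j],\; A[i,j-1],\; A[i-1,j-1] + w_{i,j}\right)$$

where the weight $w_{i,j}$ is defined piecewise by the distance $d$ (three "if" conditions on $d$ and an "else" case).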

6B -41

Additional Constraints

Improvement on the HP model:

- Prediction results are not successful enough; considering hydrophobicity alone is not enough.
- Other features should also be considered:
  - Secondary structure elements (SSEs): 1. α-helix, 2. β-sheet
  - Electrostatic attractions
  - Disulfide bonds

6B -42

Electrostatic Attractions and Disulfide Bonds

Electrostatic attractions:

Disulfide bond: formed between two C’s

6B -43

Probabilistic Disulfide Bonds

Folding with the constraint of disulfide bonds.

6B -44

Experiments for Disulfide Bonds

Experiments of folding with disulfide constraints

[Figure: RMSD versus sequence length (0 to 500) for folding without SS and with SS (disulfide) constraints, with the standard deviation σ of each]

6B -45

Secondary Structures

Conformations of the α-helix

Distance between the i-th amino acid and the (i+4)-th amino acid

6B -46

Secondary Structures

Conformations of the β-sheet

6B -47

Further Improvement--Sliced Lattice Model

The original lattice models do not work well. Slice the lattice into smaller lattices.

6B -48

Sliced Lattice Model

6B -49

Global Folding

6B -50

Experimental Materials

Database: PDB (http://www.rcsb.org/pdb/), April 17, 2005; 20,380 proteins

Data of CASP6 (http://predictioncenter.llnl.gov/), 2004

Alignment: Blastp (http://www.ncbi.nlm.nih.gov/), sequence identity < 90%, Blosum-62

6B -51

Experiment Results

Target protein: 1LIN (146)

Template protein  Sequence similarity  RMSD('03)  RMSD('04)  RMSD('05)
1CFD              100%                 7.34       -          -
1TNW              69%                  18.72      13.37      10.56
1IQ5              55%                  15.15      9.18       7.35
1DTL              52.9%                10.22      7.48       6.17
5PAL              36.4%                12.18      8.43       5.89

Measured by RMSD

6B -52

Experiment Results

Target protein: 1QG1 (104)

Template protein  Sequence similarity  RMSD('03)  RMSD('04)  RMSD('05)
1JYQ              90.4%                4.15       -          4.24
1JYU              90.4%                13.89      -          10.89
1SHA              46.7%                4.82       4.82       3.65
1SHD              45.2%                8.89       6.77       5.55
5PDR              24.4%                10.55      8.0        6.76

Measured by RMSD

6B -53

Experimental Results of CASP6

# of proteins 77

# of positive improvement 59

# of negative improvement 12

Average improvement 21.44%

Average sequence length 208(53~435)

Average template identity 36%

Average template similarity 21%

Compared with Chen’03:

6B -54

Compared with Palu et al.

Palu et al.[Palu’04], without template FCC lattice model

6B -55

Comparing with Zheng et al.

Zheng et al. [Zheng02] Homology Lattice model

6B -56

An Example of Our Results PDB code: 7RSA, Length=124, RMSD =1.48Å

Our result Real structure

6B -57

Protein Structure Prediction System

Target protein: 7RSA

Step 1: Prepare

http://par.cse.nsysu.edu.tw/~pspk/2.gif

6B -58

Protein Structure Prediction System

http://par.cse.nsysu.edu.tw/main.html

Step 2: Predict


6B -59

Protein Structure Prediction System

Step 3: Display result


6B -60



6B -61



6B -62

Protein Structure Prediction System

Step 4: Compare

Our result, real structure, RMSD


6B -63

Protein Structure Prediction System

Step 4: Compare

Our result, real structure


6B -64

Protein Side Chain Packing

6B -65

Amino Acids & Side-chain

Elements of protein Three groups

Lysine (LYS)

Side-chain

6B -66

Protein Structure Prediction

Input: 1D sequence

Output: 3D structure (3D backbone structure in general)

Protein structure = backbone structure + side-chain structure

ACE GLY ASP VAL GLU LYS GLY LYS LYS ILE PHE VAL GLN

6B -67

Backbone and Side Chain

Protein SAV1595, Journal of Biomolecular NMR (2004) 29: 391–394

Backbone, side-chain

6B -68

Protein Side Chain Packing Problem

PSCPP: Given the fixed backbone of the protein, for each residue of the backbone other than glycine there is a set of possible rotamers.

Problem: Choose one suitable rotamer for each residue such that the total energy of the protein is minimized.

The PSCPP is NP-hard.

6B -69

Graph Model of PSCPP

Let R = {r_1, r_2, ..., r_n} be the set of residues of the target protein.

Let an undirected graph G = (V, E) represent the side chains of the protein, where each vertex v_{i,j} is a rotamer of residue r_i.

V_i = {v_{i,j} | v_{i,j} does not collide with any backbone atom}.

Then we have V = ∪ V_i and E = {(v_{i,j}, v_{i+1,k}) | v_{i,j} does not collide with v_{i+1,k}}.

6B -70

Dihedral Angles

Side-chain atoms: Cβ, Cγ, Oδ.

Dihedral angles [Iupa70]:

φ: C_{i-1}-N_i-Cα_i-C_i

ψ: N_i-Cα_i-C_i-N_{i+1}

χ1: N_i-Cα_i-Cβ_i-O_i

[Figure: backbone and side-chain atoms of an Asp residue, with the dihedral angles χ1 and χ2 marked]
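Each dihedral angle on this slide is defined by four consecutive atoms, so a single routine covers all of them once atom coordinates are available. The sketch below is a generic, hypothetical helper (not taken from the slides) that uses the standard atan2 formulation and returns a signed angle in degrees in the range (−180, 180].

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points in 3D."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)   # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)   # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

For example, a backbone angle would be obtained by passing the coordinates of the four atoms that define it, in chain order.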

6B -71

The Rotamer Library

The accuracy of side-chain prediction depends primarily on the quality of the rotamer library.

Our rotamer library is a coordinate rotamer library, which preserves the bond lengths and bond angles that do not appear in the standard rotamer library.

Our rotamer library is built from 850 proteins, the same set used for the backbone-dependent rotamer library proposed by Dunbrack and Karplus [Dunb93].

6B -72

Example of the Rotamer Library [A.A.] [φ] [ψ] [X1] [Prob.][3-D Coordinate]

6B -73

Formulas of ACO for PSCPP

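Pheromone probability formula:

$$P_k(s,u) = \frac{[\tau_{s,u}(t)]^{\alpha}\,[\eta_{s,u}(t)]^{\beta}}{\sum_{w \in J_k(s)} [\tau_{s,w}(t)]^{\alpha}\,[\eta_{s,w}(t)]^{\beta}}, \qquad s \in V_i,\; u \in V_{i+1}$$

Pheromone update formula:

$$\tau_{s,u}(t+1) = (1-\rho)\,\tau_{s,u}(t) + \sum_{k=1}^{m} \Delta\tau_{s,u}^{k}(t)$$

where $0 < \rho < 1$ is the rate of pheromone evaporation.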
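In the standard ant colony formulation, an ant picks the next rotamer with probability proportional to pheromone raised to α times heuristic desirability raised to β, and after each generation the pheromone evaporates at rate ρ before the ants' deposits are added. The sketch below illustrates those two rules, assuming this slide follows the standard form; the function names and the dictionary-based data layout are hypothetical.

```python
def transition_probabilities(tau, eta, candidates, alpha=1.0, beta=1.0):
    """Probability of choosing each candidate rotamer u from the current one.

    tau, eta: dicts mapping candidate -> pheromone / heuristic value.
    candidates: the allowed next rotamers (the set J_k(s) of ant k).
    """
    weights = {u: (tau[u] ** alpha) * (eta[u] ** beta) for u in candidates}
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("no candidate has positive weight")
    return {u: w / total for u, w in weights.items()}

def update_pheromone(tau, deposits, rho):
    """Apply tau(t+1) = (1 - rho) * tau(t) + sum over ants of their deposit.

    tau: dict mapping edge -> pheromone.
    deposits: one dict per ant, mapping edge -> deposited amount.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    new_tau = {}
    for edge, value in tau.items():
        added = sum(d.get(edge, 0.0) for d in deposits)
        new_tau[edge] = (1.0 - rho) * value + added
    return new_tau
```

With α = β = 1, a candidate with three times the pheromone of another (and equal heuristic value) is three times as likely to be chosen.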

6B -74

ACO Prediction for PSCPP

Input: Backbone coordinate data.

Output: The route with near-minimum score.

Step 1: Set parameters and initialize pheromone trails.

Step 2: Each ant k chooses one rotamer u of residue i according to the probability function p_k(s, u), for all 1 ≤ i ≤ n, u ∈ V_i.

Step 3: Update the pheromone trails.

Step 4: If the current best solution has not improved by more than a given percentage after a predefined number of generations, or the number of generations has reached the predefined value, return the route with minimum score; otherwise, go to Step 2.

6B -75

The Score Function

Features in the ACO score function:

The disulfide bonds: S1 = BonS × #(disulfide bonds)

The hydrogen bonds: S2 = BonH × #(hydrogen bonds)

The charge-charge interactions: S3 = BonC × (#(different charge pairs) − #(same charge pairs))

The van der Waals interactions: S4 = BonV × Σ E_{i,j}

Energy score function: E = S1 + S2 + S3 + S4
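The score is a weighted sum of four feature terms, which can be written out directly. The sketch below is only an illustration of that arithmetic: the function and argument names are hypothetical, the counts and pairwise van der Waals energies are assumed to be computed elsewhere, and the default weights for BonH, BonC, and BonV are the values listed in the experiment parameters, while BonS must be supplied by the caller.

```python
def energy_score(n_disulfide, n_hbonds, n_opposite_pairs, n_like_pairs,
                 vdw_energies, bon_s, bon_h=5.0, bon_c=2.0, bon_v=1.0):
    """Combine the four feature terms into E = S1 + S2 + S3 + S4."""
    s1 = bon_s * n_disulfide                        # disulfide bonds
    s2 = bon_h * n_hbonds                           # hydrogen bonds
    s3 = bon_c * (n_opposite_pairs - n_like_pairs)  # charge-charge interactions
    s4 = bon_v * sum(vdw_energies)                  # van der Waals interactions
    return s1 + s2 + s3 + s4

example = energy_score(1, 2, 3, 1, [0.5, -1.5], bon_s=2.0)
```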

6B -76

Experiments

Two test sets:

- 25 proteins from Xiang and Honig 2001
- 5 proteins from Canutescu et al. 2003

Cutoff value: 20° [Xie06, R3]

If χ1 is within 20° of the corresponding angle in the real structure, the predicted angle is considered correct.

Compared with SCWRL 3.0 [Canu03] and R3 [Xie06]
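The 20° criterion has to account for angles wrapping around at ±180°. The sketch below shows one way to compute the percentage of correct χ1 predictions; the function names are hypothetical, and treating a difference of exactly 20° as correct is an assumption.

```python
def angle_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    diff = abs(a - b) % 360.0
    return 360.0 - diff if diff > 180.0 else diff

def chi1_accuracy(predicted, native, cutoff=20.0):
    """Percentage of residues whose predicted chi1 is within cutoff of native."""
    if len(predicted) != len(native) or not predicted:
        raise ValueError("angle lists must be non-empty and of equal length")
    correct = sum(1 for p, n in zip(predicted, native)
                  if angle_difference(p, n) <= cutoff)
    return 100.0 * correct / len(predicted)

accuracy = chi1_accuracy([60.0, 170.0, -60.0, 180.0],
                         [65.0, -175.0, 60.0, 150.0])
```

Here 170° and −175° differ by only 15° once wrap-around is taken into account, so that residue counts as correct.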

6B -77

Parameters in Experiments

Weights of features in the score function:

Feature  Value
BonS     0.5S4
BonH     5
BonC     2
BonV     1

Parameters used in the ACO algorithm:

Parameter          Value
Population         50
Generation         300~600
α                  1.0
β                  1.0
Initial pheromone  1.0

6B -78

Experimental Results (First Case)

No.  Target protein  Length  Our method (χ1)  SCWRL 3.0 (χ1)  R3 method (χ1)

1 1AAC 85 87.1 84.7/95 76.5/86

2 1AHO 54 85.2 68.5/67 64.8/65

3 1B9O 112 70.5 68.8/73 66.1/77

4 1C5E 71 81.7 81.7/86 73.2/82

5 1C9O 53 84.9 66.0/72 71.7/70

6 1CC7 66 80.3 68.2/83 63.6/79

7 1CEX 146 85.6 76.7/82 75.3/77

8 1CKU 60 81.7 76.7/82 68.3/80

Columns 5-6: IUPAC-IUB rules / Xie and Sahinidis's (R3) result

6B -79



9 1CTJ 61 77.0 68.9/79 70.5/80

10 1CZ9 111 70.3 64.0/73 64.0/76

11 1CZP 83 79.5 77.1/86 73.5/81

12 1D4T 89 77.5 76.4/86 67.4/82

13 1IGD 50 82.0 68.0/74 54.0/68

14 1MFM 118 75.4 68.6/80 70.3/81

15 1PLC 82 72.0 67.1/72 70.7/71

16 1QJ4 221 71.5 72.9/84 67.9/80

17 1QQ4 143 83.9 73.4/78 71.3/78


6B -80



18 1QTN 134 86.6 74.6/82 67.9/78

19 1QU9 99 79.8 71.7/81 73.7/78

20 1RCF 142 79.6 83.8/86 81.7/80

21 1VFY 63 79.4 69.8/76 71.4/75

22 2PTH 151 82.1 78.8/83 78.1/84

23 3LZT 105 73.3 78.1/86 69.5/82

24 5P2L 144 78.5 70.8/78 63.2/71

25 7RSA 109 75.2 65.1/75 61.5/67


Column 5-6: IUPAC-IUB rules / Xie and Sahinidis’s (R3) result

6B -81

Experimental Results (Second Case)



1 1A8I 704 73.4 71.3 / 80 64.1 / 75

2 1B0P 978 70.8 62.3 / 69 - / 66

3 1BU7 399 74.9 70.4 / 78 64.4 / 72

4 1GAI 386 73.6 72.8 / 81 66.6 / 72

5 1XWL 496 71.5 66.7 / 73 61.5 / 72

Column 5-6: IUPAC-IUB rules / Xie and Sahinidis’s (R3) result
