beta-barrel discrimination

Babak Alipanahi · Prof. Ming Li · CS882-Fall 2006 · Beta-Barrel Discrimination

Upload: kirti

Post on 14-Jan-2016


Page 1: Beta-Barrel Discrimination

Babak Alipanahi

Prof. Ming Li

CS882-Fall 2006

Beta-Barrel Discrimination

Page 2: Beta-Barrel Discrimination

Outline:

• A tale of two barrels

• Membrane proteins

• A review of β-barrels

• Folding Mechanism

• Seven families, some examples

• Literature Review

• What I have done

• What will I do…

Page 3: Beta-Barrel Discrimination

Two Kinds of Closed Barrels

• There are two kinds of closed barrels

– α/β barrels (globular)

– β barrels (transmembrane)

• The two types are similar in that, in both (Branden 99):

– Similar structures have very different a.a. sequences

– The function of the protein is determined by the loops and not by the strands or helices (α/β barrels only). (In fact, the strands and helices are needed only to form the barrel, and usually the β strands and α helices are structurally equivalent)

• They differ in that:

– In α/β barrels, the β strands are parallel and connected to each other by α helices, while in β barrels they are anti-parallel and connected to each other by (usually) simple loops

– They have a very fundamental difference (actually this is the important difference between all transmembrane and globular proteins). I will come back to this later…

Page 4: Beta-Barrel Discrimination

An example of α/β Barrel (Branden 99)

• The picture on the right shows the β core of glycolate oxidase, an enzyme that contains an eight-stranded α/β barrel. Note that all β-strands are parallel

• The eight-stranded α/β barrel is one of the largest and most regular of all domain structures

• At least 200 a.a. are required for formation of this structure

• Most of them are enzymes with completely different a.a. sequences and diverse functions

Page 5: Beta-Barrel Discrimination

An example of α/β Barrel (cntd.)

• As can be seen, the parallel β strands are connected to each other by α helices

• The eight β strands enclose a tightly packed hydrophobic core formed entirely by side chains from the β strands

• The active site in all α/β barrels is formed by loops at one end of the barrel

Page 6: Beta-Barrel Discrimination


β-Barrels

• β-barrel proteins are found in the outer membranes of Gram-negative bacteria, mitochondria and chloroplasts (Schulz 00)

• It has been hypothesized that possibly most of integral outer membrane proteins of mitochondria and chloroplasts are β-barrels because these are relics of their evolutionary history as symbiotic intracellular Gram-negative bacteria (Wimley 03)

• The abundant mitochondrial voltage-dependent anion channel (VDAC) has long been thought to be a β-barrel (Wimley 03)

Page 7: Beta-Barrel Discrimination


Membrane Proteins

• Hallmark of Gram-negative bacteria is their cell envelope which has two membranes (inner and outer, called IM and OM respectively) separated by periplasm (Ruiz 05)

Image from Nature

Page 8: Beta-Barrel Discrimination

Membrane Proteins

• The structure, function, and composition of the IM and OM are dramatically different. The IM is in direct contact with the cytoplasm and periplasm, while the OM is in contact with the extracellular environment (Ruiz 05)

Image from Nature

Page 9: Beta-Barrel Discrimination


Analysis of E. coli cell envelope: IM (Ruiz 05)

• IM, which is the major permeability barrier between cell’s inside and outside (Tamm 04), is a bilayer composed of phospholipids (PL) and proteins:

– Integral IM proteins: Span the IM with α-helical transmembrane domains

– Lipoproteins: Anchored to the outer leaflet of the IM by lipid modifications of the N-terminus

• All of the membrane-bound biochemical processes found in eukaryotic cells, such as oxidative phosphorylation, lipid biosynthesis and protein translocation, occur in the IM (Ruiz 05). In other words, most membrane-associated metabolic functions are carried out in the IM (Tamm 04)

• It should be noted that the surface of integral IM proteins is less hydrophobic than that of OM proteins, and that they have a less complex folding mechanism (Tamm 04)

Page 10: Beta-Barrel Discrimination

Analysis of E. coli cell envelope: Periplasm (Ruiz 05)

• About 10% of the cell volume is occupied by the periplasm, which comprises soluble proteins and the peptidoglycan layer. The periplasm is an oxidizing environment and contains enzymes that catalyse the formation of disulphide bonds

• The periplasm is ATP free, so all activities there take place in the absence of an obvious energy source

• Peptidoglycan functions as an extracytoplasmic cytoskeleton and prevents the cell from lysing in dilute environments

Page 11: Beta-Barrel Discrimination

Analysis of E. coli cell envelope: OM (Ruiz 05)

• The OM is unique in that, unlike most other eukaryotic and prokaryotic membranes, it is asymmetric. The upper and lower leaflets are composed mainly of LPS¹ and PL respectively

• The OM functions as a selective barrier and inhibits the entry of toxic and unwanted molecules, which is crucial for bacterial survival in many (possibly hostile) environments. For example, E. coli is resistant to bile salts, which helps the bacteria to live in the intestines

• There are two kinds of proteins in the OM:

– Lipoproteins: 90% of lipoproteins are in the OM

– β-barrels: These are called OM proteins (OMPs). Some of them act as channels. Since membranes are impermeable to hydrophilic solutes, these channels are necessary for nutrient intake and excretion of toxic waste products (we will revisit the diverse functions of OMPs later)

1: Lipopolysaccharide

Page 12: Beta-Barrel Discrimination


Barrel Construction Principles (Schulz 00)

1. “The number of β strands is even and both N and C terminal are at the periplasmic barrel end”

2. “The β-strand tilt is always around 45° and corresponds to the common β-sheet twist. Only one of the two possible tilt directions is assumed, the other one is an energetically disfavored mirror image”

3. “All β strands are anti-parallel and connected locally to their next neighbors along the chain, resulting in a maximum neighborhood correlation”

Image from Schulz 00

OmpX, a defense protein which is a toxin binder

Page 13: Beta-Barrel Discrimination


Barrel Construction Principles (cntd.)

4. “The shear number of an n-stranded barrel is positive and around n+2, in agreement with the observed tilt”

5. “The strand connections at the periplasmic barrel end are short turns of a couple of residues named T1, T2 and so on”

6. “At the external barrel end, the strand connections are usually long loops named L1, L2 and so on”
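
Principles 2 and 4 are two views of the same geometry. For a regular barrel the strand tilt α, the strand number n and the shear number S are commonly related by tan α = S·a / (n·b), where a is the rise per residue along a strand and b is the distance between neighbouring strands. This relation and the values a ≈ 3.3 Å and b ≈ 4.4 Å are not given on the slides; they are the usual textbook values and are treated as assumptions in the sketch below, which checks that S ≈ n+2 does give a tilt near the quoted 45°.

```python
import math

RISE_PER_RESIDUE = 3.3  # Å along the strand (assumed textbook value, not from the slides)
STRAND_SPACING = 4.4    # Å between neighbouring strands (assumed textbook value)


def tilt_degrees(n_strands: int, shear: int) -> float:
    """Strand tilt (degrees) of an n-stranded barrel with shear number S."""
    ratio = (shear * RISE_PER_RESIDUE) / (n_strands * STRAND_SPACING)
    return math.degrees(math.atan(ratio))


# Barrel sizes that appear later in the talk (8 to 22 strands), with S = n + 2
for n in (8, 12, 16, 22):
    print(n, round(tilt_degrees(n, n + 2), 1))
```

Under these assumptions every barrel size lands between roughly 39° and 43°, which is consistent with "around 45°" in principle 2.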

Images from (Waldispühl 06) with complete modifications

Page 14: Beta-Barrel Discrimination


Barrel Construction Principles (cntd.)

7. “The β-barrel surface contacting the nonpolar membrane interior consists of aliphatic side chains forming a nonpolar ribbon with a width of about” 27 Å (Tamm 04)

8. “The aliphatic ribbon is lined by two girdles of aromatic side chains, which have intermediate polarity and contact the two nonpolar–polar interface layers of the membrane”

9. “The sequence variability of all parts of the β barrel during evolution is high when compared with soluble proteins”

10. “The external loops show exceptionally high sequence variability and they are usually mobile”. “The loops exhibit the largest sequence variability and thus contain the most of functional characteristics of each protein…” (Tamm 04)

Image from (Wimley 02) with complete modification

Page 15: Beta-Barrel Discrimination


β-Barrels folding mechanism (Tamm 04)

• Folding and membrane insertion of OmpA

– The unfolded state U hydrophobically collapses into a water-soluble intermediate state IW

– This intermediate binds to the membrane and forms intermediate state IM1

– IM1 proceeds to intermediate state IM2, or “molten disk”. Parts of the β-strands are formed in this state

– Next, four Trps on the four β-hairpins move to the center of the bilayer (intermediate state IM3)

– IM3 is more globular and is called a “molten globule”, but it has still not reached its native tertiary structure

• Folding and membrane insertion are coupled processes

• Membrane interface is involved in the folding

Image from (Tamm 04)

Blue balls in the above image are tryptophans (Trp). The technique used to find these steps is time-resolved distance determination by Trp fluorescence quenching (TDFQ)

Page 16: Beta-Barrel Discrimination


Assisted folding of β-Barrels (Tamm 04)

• As noted earlier, the periplasm is ATP free, so mechanisms have evolved that let OMPs insert spontaneously into the OM after being translocated to the periplasm

• Two periplasmic proteins have been proposed to help the β-barrel folding process:

– Skp is a soluble protein that can also bind to the phospholipid bilayer. Three or four Skps bind to a newly synthesized, unfolded OMP immediately after it is translocated through the IM. They act as a passive chaperone (remember that the periplasm is ATP free) and prevent aggregation, but they do not assist the folding process

– SurA is a periplasmic peptidyl-prolyl isomerase that has been shown to assist the folding of OMPs. Experiments show that “Sequences containing aromatic-random-aromatic motifs bind particularly to SurA”. It has a long (50 Å) docking cleft for accommodating unfolded peptide chains

Page 17: Beta-Barrel Discrimination


Features of OMPs

• About 2–3% of the genes in Gram-negative bacterial genomes encode β barrels. In the E. coli genome, 60 proteins are annotated as known or probable OMPs (Wimley 03)

• The average length of β-strands is 11 a.a. residues in trimeric porins and 13–14 residues in monomeric β-barrels (Tamm 04)

• Given the 40–45° tilt of the β-strands from the membrane normal, the average rise per residue is 3.8 Å × sin(45°) ≈ 2.7 Å (Tamm 04)

• Most OMPs lack cysteines, so they cannot form disulphide bonds
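
The rise per residue ties the strand lengths above to the width of the nonpolar ribbon quoted earlier (about 27 Å). The following sketch simply redoes that arithmetic; the 3.8 Å residue spacing and the 45° tilt are the values used on this slide, and the function names are illustrative only.

```python
import math

RESIDUE_SPACING = 3.8  # Å between consecutive residues along a strand (value used on the slide)


def rise_per_residue(tilt_deg: float = 45.0) -> float:
    """Rise per residue as computed on the slide: 3.8 Å x sin(tilt)."""
    return RESIDUE_SPACING * math.sin(math.radians(tilt_deg))


def residues_to_span(thickness: float, tilt_deg: float = 45.0) -> int:
    """Smallest whole number of residues whose total rise covers `thickness` (in Å)."""
    return math.ceil(thickness / rise_per_residue(tilt_deg))


print(round(rise_per_residue(), 1))  # 2.7
print(residues_to_span(27.0))        # 11
```

About 11 residues are needed to cross a 27 Å ribbon, which agrees with the average strand length reported for trimeric porins.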

Page 18: Beta-Barrel Discrimination

Features of OMPs (cntd.)

• The interior-facing TM β-strands of β-barrels are rich in small and polar a.a. such as glycine (Gly), threonine (Thr), serine (Ser), asparagine (Asn) and glutamine (Gln) (Tamm 04), (Wimley 03)

• 40% of lipid-exposed residues are aromatic (Wimley 03). The aromatic residues tyrosine (Tyr) and tryptophan (Trp) are also abundant in loop regions (Tamm 04)

Images from (Wimley 03)

Page 19: Beta-Barrel Discrimination


Six families of OMPs (based on Tamm 04)

• General Porins: porins typically control the diffusion of small metabolites like sugars, ions, and amino acids

• Passive Transporters: these proteins are selective passive transporters of maltose, sucrose and fatty acids

• Active Transporters of Siderophores and Vitamin B12: They receive their energy through interaction with IM proteins

• Enzymes: proteases and phospholipases

• Defensive Proteins: fight hostile molecules

• Structural Proteins: membrane anchors

• Toxins (non-constitutive): kill target cell

Page 20: Beta-Barrel Discrimination

Some examples of OMPs

• Name: OmpA

• β-Strands: 8

• Oligomeric State: monomer

• Organism: E. coli

• Residues: 171

• Function: Structural protein

• Features:

– The residues inside the barrel are tightly packed: the lumen is filled with polar side chains that interact with each other through hydrogen bonds and electrostatic interactions. Groups of water molecules can also be found in the lumen

– They link the outer membrane to the periplasmic peptidoglycan; in other words, they are a kind of membrane anchor

– “Extensive mutagenesis studies show that OmpA is quite robust against many mutations especially in the loop, turn and bilayer facing area.” A surprising fact is that the transmembrane domain of OmpA “can even be circularly permutated without impairing its assembly and functions” (Tamm 04)

Page 21: Beta-Barrel Discrimination

Some examples of OMPs

• Name: FepA

• β-Strands: 22

• Oligomeric State: monomer

• Organism: E. coli

• Residues: 724

• Function: iron transporter (active transporter)

• Features: FepA, which is a TonB-dependent active Fe-siderophore transporter, uses metabolic energy through interaction with IM proteins. The C-terminal part forms the β-barrel domain, while the N-terminal part forms a hatch domain that plugs the barrel and regulates iron transport (Tamm 04), (Wimley 03)

Page 22: Beta-Barrel Discrimination

MspA: a very long porin

• Name: MspA

• β-Strands: 8x2

• Oligomeric State: octamer

• Organism: M. smegmatis

• Residues: 184

• Function: mycobacterial porin

• Features: It has two sequential β-barrels of different diameters. The narrow barrel has a hydrophobic surface that is 37 Å long, because the mycobacterial membrane does not contain LPS but very long mycolic fatty acids. It should be noted that some mycobacteria cause tuberculosis (Tamm 04)

Bottom image from (Tamm 04)

Page 23: Beta-Barrel Discrimination


TolC: involved in multi-drug resistance

• Name: TolC

• β-Strands: 3x4

• Oligomeric State: trimer

• Organism: E. coli

• Residues: 428

• Function: active export channel

• Features: TolC is a small molecule transporter that is involved in multi-drug resistance of bacteria (it facilitates drug efflux (Bigelow 04)). It derives its energy from its interactions with IM proteins. Lumen of β-barrel is connected to the lumen of an α-helical bundle that extends through periplasm to IM (i.e. a direct path to cytoplasm) (Wimley 03), (Tamm 04)

Page 24: Beta-Barrel Discrimination


OmpLA: an enzyme

• Name: OmpLA

• β-Strands: 12

• Oligomeric State: dimer

• Organism: E. coli

• Residues: 269

• Function: enzyme

• Features: The phospholipase OmpLA is only active in its dimer form. The active site is at the outer edge of the barrels, in the interface between the two barrels. Its role is possibly to hydrolyze PL that have migrated to the extracellular leaflet of the OM, where they should not normally be (Tamm 04), (Wimley 03)

Active site

Page 25: Beta-Barrel Discrimination

α-Hemolysin: a deadly toxin

• Name: α-Hemolysin

• β-Strands: 7x2

• Oligomeric State: heptamer

• Organism: S. aureus

• Residues: 293

• Function: toxin

• Features: This toxin is secreted as a monomeric protein that ultimately forms a 14-stranded β-barrel, with each monomer contributing a β-hairpin to the heptamer. After insertion into the victim cell’s membrane, it forms an ungated pore that leads to osmotic cytolysis. Note how clean the pore is (Wimley 03), (Tamm 04)

Page 26: Beta-Barrel Discrimination

β-barrel discrimination: Literature review

• The research done on β-barrels can be categorized into two major groups (both of them rely only on the a.a. sequence):

– Secondary structure (hereinafter: S.S.) prediction

– Discrimination of β-barrels from globular and IM proteins

• Most methods for secondary structure prediction also provide a companion algorithm for discrimination, because:

– Unlike globular (water soluble) proteins that have a hydrophobic core and a hydrophilic surface, β-barrels have a hydrophilic core (interior wall of lumen) and a hydrophobic surface (lipid exposed)

– Two very similar β-barrels can have very different sequences that show hardly any sign of homology

• The accuracy of discriminating α-helical TM proteins from non-α-helical TM proteins is very high (99% accuracy is reported) because of their unique features (Hirokawa 98)

Page 27: Beta-Barrel Discrimination

Some definitions

• After an a.a. sequence is fed into the discrimination algorithm, the algorithm determines whether it is an OMP (positive) or not (negative). A positive answer can be true (true positive, TP) or false (false positive, FP). Likewise, a negative answer can be true (true negative, TN) or false (false negative, FN). So, we define:

– TP: # of correctly classified OMPs

– TN: # of correctly classified non-OMPs

– FP: # of non-OMPs classified as OMP

– FN: # of OMPs classified as non-OMP

Page 28: Beta-Barrel Discrimination


Some definitions (cntd)

$$\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{\#\text{ of all OMPs}}$$

$$\text{Specificity} = \frac{TN}{TN + FP} = \frac{TN}{\#\text{ of all non-OMPs}}$$

• Sensitivity (SEN): fraction of OMPs correctly discovered by the algorithm. This shows the ability to correctly predict OMPs (Park 05)

• Specificity (SPC): fraction of non-OMPs correctly rejected by the algorithm. This shows the ability to reject non-OMPs (Park 05)

• A dumb algorithm that declares every input to be an OMP will have a sensitivity of 100% and a specificity of 0%!

• Some people really cheat! We will see…

Page 29: Beta-Barrel Discrimination


Some definitions (cntd)

$$\text{overall accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\#\text{ of correctly assigned proteins}}{\#\text{ of all proteins}}$$

• Overall accuracy (ACC) is very useful for determination of overall performance, but it is not enough. Our dumb algorithm will have a 50% accuracy! (assuming # of OMPs and non-OMPs are the same)

Page 30: Beta-Barrel Discrimination

Some definitions (cntd)

• Matthews correlation coefficient (MCC) is a very powerful measure of performance. It is zero for completely random algorithms (our dumb algorithm’s MCC is zero) and a perfect algorithm’s MCC is one (Park 05)

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
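
The four measures defined on these slides can be computed directly from the confusion counts. The sketch below is a minimal illustration rather than any published implementation; the function name and the example counts are hypothetical. One detail is an assumption: for the dumb always-OMP algorithm TN and FN are both zero, so the MCC denominator vanishes, and the code follows the common convention of reporting MCC = 0 in that case.

```python
import math


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Return SEN, SPC, ACC and MCC for the given confusion counts."""
    sen = tp / (tp + fn) if tp + fn else 0.0
    spc = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): MCC is reported as 0 when the denominator is 0
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SEN": sen, "SPC": spc, "ACC": acc, "MCC": mcc}


# The dumb algorithm on a balanced set of 100 OMPs and 100 non-OMPs:
# everything is called OMP, so TN = FN = 0.
print(confusion_metrics(tp=100, tn=0, fp=100, fn=0))
# A perfect algorithm on the same set
print(confusion_metrics(tp=100, tn=100, fp=0, fn=0))
```

The first call reproduces the numbers quoted for the dumb algorithm (SEN 100%, SPC 0%, ACC 50%, MCC 0), which is why SEN or ACC alone cannot be trusted.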

Page 31: Beta-Barrel Discrimination


Prediction approaches (1)

• Profile-based HMMs: HMM is trained by sequence profiles computed from a multiple sequence alignment. Two major studies are

– (Martelli 02): A very successful and highly cited study. Every residue is modeled as either loop or β-strand. Discrimination is done by calculating the posterior probability of the sequence given the model. S.S. prediction accuracy is 84%, discrimination accuracy (ACC) is 84% and the false positive rate is 10% (SEN=90%)

– (Bigelow 04): The algorithm, PROFtmb, is mainly based on (Martelli 02) with some modifications, such as having four states for each residue: up-strand, down-strand, periplasmic-loop and outer-loop. S.S. prediction accuracy is 86%, SPC=100% and SEN=45%

Page 32: Beta-Barrel Discrimination


Prediction approaches (2)

• (Zhai 02): in β-barrel finder (BBF), hydropathy and amphipathicity values are used for discrimination. A sliding-window of size seven residues is used to calculate hydropathy and amphipathicity values for all a.a. in the protein sequence. Since the resulting function is noisy, it is averaged over multiple aligned sequences. They claim that every TM β-strand corresponds in position to a peak of hydropathy and one of amphipathicity
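
The sliding-window step can be sketched as follows. This is not the BBF code: the Kyte-Doolittle hydropathy scale is an assumption (the slide does not say which scale BBF uses), and the amphipathicity measure here, the gap between the mean hydropathy of alternating positions in the window, is only an illustrative stand-in for a β-strand periodicity score. The averaging over aligned sequences is left out.

```python
# Kyte-Doolittle hydropathy scale (assumed here; the slide does not name a scale)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq: str, window: int = 7) -> list:
    """Mean hydropathy of every full window of `window` residues."""
    values = [KYTE_DOOLITTLE[a] for a in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def window_amphipathicity(seq: str, window: int = 7) -> list:
    """Illustrative stand-in: |mean of even offsets - mean of odd offsets| per window."""
    values = [KYTE_DOOLITTLE[a] for a in seq]
    scores = []
    for i in range(len(values) - window + 1):
        w = values[i:i + window]
        even, odd = w[0::2], w[1::2]
        scores.append(abs(sum(even) / len(even) - sum(odd) / len(odd)))
    return scores


# Hypothetical sequence: an alternating stretch flanked by polar residues
seq = "DDKNGVSVSVSVSGNKDD"
print([round(h, 2) for h in window_hydropathy(seq)])
print([round(a, 2) for a in window_amphipathicity(seq)])
```

A strand-like stretch that alternates hydrophobic and polar residues gives a high amphipathicity value even when its mean hydropathy is modest, which is why the two profiles are examined together.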

Page 33: Beta-Barrel Discrimination

Prediction approaches (3)

• (Waldispühl 06): This method uses pairwise inter-strand residue statistical potentials derived from globular proteins to predict the super-secondary structure of OMPs. The transFOLD algorithm employs a generalized HMM (multi-tape S-attribute grammar (MTSAG)) to describe potential β-barrel structures and then computes the minimum free energy by dynamic programming

– They claim that, unlike other approaches, they consider long-range interactions between residues

– S.S. prediction accuracy is 79% but the rate of correctly predicted structures is 93%

– For OMP discrimination, they use four parameters: sequence length, folding pseudo-energy in the water-filled and non-water-filled lumen models, and overall hydrophobicity. Discrimination is performed by SVM. SEN=88%, SPC=63% and ACC=75%

Page 34: Beta-Barrel Discrimination

Prediction approaches (4)

• Neural Network based (Jacoboni 01): This work has been cited many times and is highly appreciated as one of the first reasonably good prediction methods

– A feed-forward neural network is implemented and trained using the error back-propagation algorithm for discrimination of β-strands from extra-membrane regions (i.e. a two-state prediction, β-strand or non-β-strand)

– Evolutionary information is given as input in the form of a sequence profile obtained from multiple-sequence alignments

– S.S. prediction accuracy is nearly 78%

Page 35: Beta-Barrel Discrimination


Methods based on peptide and dipeptide composition

• In these methods, abundance of single a.a. or a.a. pairs is used for discrimination of OMPs

• It has been shown that a.a. and a.a. pair composition is reasonably different in OMPs and non-OMPs

• Methods that use a.a. composition as classification features perform much better than methods that use other features, such as hydrophobicity or the posterior probability in HMM-based methods

• With these features at hand, several techniques have been applied for classification such as k-nearest neighbors (k-NN), SVM, simple a.a. weighting and neural network

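$$comp(i) = \frac{n_i}{N}$$

$$dipep(i,j) = \frac{n_{ij}}{N - 1}$$

where N is the sequence length, n_i is the number of residues of type i and n_ij is the number of adjacent residue pairs (i, j)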
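
As a concrete illustration, the two feature vectors can be computed as below. The sketch assumes that comp(i) is the count of residue type i divided by the sequence length N, and that dipep(i,j) is the count of the adjacent pair (i,j) divided by the N−1 adjacent pairs in the sequence; the function names and the example sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def comp(seq: str) -> dict:
    """Fraction of each of the 20 amino acids in the sequence (20 features)."""
    n = len(seq)
    return {a: seq.count(a) / n for a in AMINO_ACIDS}


def dipep(seq: str) -> dict:
    """Fraction of each ordered adjacent pair among the N-1 pairs (400 features)."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for a, b in zip(seq, seq[1:]):
        if a + b in counts:  # skip pairs containing non-standard residues
            counts[a + b] += 1
    total = len(seq) - 1
    return {pair: c / total for pair, c in counts.items()}


seq = "GTYSNQGTYAL"  # hypothetical sequence
print(comp(seq)["G"], dipep(seq)["GT"])
```

Each protein is thus reduced to a fixed-length vector (20 or 400 numbers) regardless of its length, which is what makes k-NN, SVM and neural networks directly applicable.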

Page 36: Beta-Barrel Discrimination


Methods based on peptide and dipeptide composition (cntd)

• a.a. abundance in lipid-exposed and barrel-interior positions (Wimley 02): this study makes the clever observation that the relative abundances of a.a. (relative to the whole genome) in interior and lipid-exposed positions are very different

• If we show a lipid exposed a.a. by E and barrel interior a.a. by I, a β-strand will have this pattern:

– …EIEIEIEIEIE…

Images from (Wimley 03)

Page 37: Beta-Barrel Discrimination


(Wimley 02) (cntd.)

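$$\text{strand score of position } j \text{ of sequence} = \max \begin{cases} \sum\limits_{i=1,3,5,7,9} (A_{j+i} \text{ is } E) + \sum\limits_{i=2,4,6,8,10} (A_{j+i} \text{ is } I) \\ \sum\limits_{i=1,3,5,7,9} (A_{j+i} \text{ is } I) + \sum\limits_{i=2,4,6,8,10} (A_{j+i} \text{ is } E) \end{cases}$$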

• Under the assumption “A_{j+i} is I”, residue j+i of the sequence is taken to face the barrel interior, so it is scored from the barrel-interior relative abundance table, and vice versa for E

• The β-strand length is assumed to be 10, which is not very realistic

• No performance measure is given
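
The alternating E/I scoring can be sketched as follows. This is an interpretation of the slide, not the code of (Wimley 02): the terms are combined by summation, positions are 0-based, and the two abundance tables are tiny made-up placeholders (only two residue types, everything else scored as a neutral 1.0) so that the example stays readable. Real use would need the full 20-residue tables from the paper.

```python
# Placeholder relative-abundance tables (hypothetical values, NOT from Wimley 02)
EXPOSED = {"L": 2.0, "S": 0.5}   # abundance when lipid exposed (E)
INTERIOR = {"L": 0.5, "S": 2.0}  # abundance when facing the barrel interior (I)
NEUTRAL = 1.0                    # score used for residues missing from a table


def strand_score(seq: str, j: int, length: int = 10) -> float:
    """Best alternating E/I score of residues A[j+1] .. A[j+length] (0-based j)."""
    window = seq[j + 1: j + 1 + length]
    if len(window) < length:
        return 0.0  # not enough residues left for a full strand
    odd = window[0::2]   # offsets i = 1, 3, 5, ...
    even = window[1::2]  # offsets i = 2, 4, 6, ...
    e_first = (sum(EXPOSED.get(a, NEUTRAL) for a in odd)
               + sum(INTERIOR.get(a, NEUTRAL) for a in even))
    i_first = (sum(INTERIOR.get(a, NEUTRAL) for a in odd)
               + sum(EXPOSED.get(a, NEUTRAL) for a in even))
    return max(e_first, i_first)


print(strand_score("GLSLSLSLSLS", 0))  # alternating pattern: high score
print(strand_score("GLLLLLLLLLL", 0))  # no alternation: lower score
```

Taking the maximum of the two phases means the score does not depend on whether the strand starts with an exposed or an interior residue; the `length` parameter makes it easy to try strand lengths other than 10.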

Page 38: Beta-Barrel Discrimination

Methods based on peptide and dipeptide composition (cntd)

• k-NN: (Garrow 05) in TMBhunt, the features are the comp(i) values. For a new query, its k nearest neighbors are found (by Euclidean distance) and its class is decided by majority vote. Performance is improved by including differentially weighted a.a. and evolutionary information, and by calibrating the scoring system. SEN=91%, SPC=93.8% and ACC=92.5% (these results were questioned in (Park 05), which reports 89.2%)

• sum-of-deviations: (Gromiha 04) in this study, the average comp(i) over all proteins of each class (OMP or non-OMP) is computed. For a new query, the comp(i) values are computed, and the absolute deviations of comp(i) from each class average are summed. The query is assigned to the class from which its total deviation is smaller (they could have used the Euclidean distance, which is more meaningful). SPC=80%, SEN=84%
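
The sum-of-deviations rule is simple enough to write out in full. The sketch below follows the description above rather than the published code of (Gromiha 04); the training sequences are short made-up strings used only to make the example run, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def comp(seq: str) -> list:
    """Amino acid composition as a list of 20 fractions."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def class_average(seqs: list) -> list:
    """Average composition over all sequences of one class."""
    vectors = [comp(s) for s in seqs]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def classify(query: str, avg_omp: list, avg_non: list) -> str:
    """Assign the query to the class with the smaller summed absolute deviation."""
    c = comp(query)
    dev_omp = sum(abs(x - m) for x, m in zip(c, avg_omp))
    dev_non = sum(abs(x - m) for x, m in zip(c, avg_non))
    return "OMP" if dev_omp < dev_non else "non-OMP"


# Toy training data (hypothetical sequences, for illustration only)
avg_omp = class_average(["GTYGTYSNQ", "TYGSNQGTY"])
avg_non = class_average(["LLKKEEAAL", "AALKELLKE"])
print(classify("GTYSNQGTY", avg_omp, avg_non))
print(classify("KKELLAAEL", avg_omp, avg_non))
```

Replacing the absolute deviation with a squared difference would give the Euclidean-distance variant suggested in the parenthetical remark.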

Page 39: Beta-Barrel Discrimination

Methods based on peptide and dipeptide composition (cntd)

• sum-of-deviations: (Gromiha 05-a) this study is virtually the same as the previous one, but the new algorithm works only with averaged dipeptide abundance values (dipep(i,j)). For a new query, the dipep(i,j) values are computed (400 values) and then weighted according to a pre-calculated table of dipeptide abundance differences between OMPs and non-OMPs (globular proteins only). The decision is based on the sign of the sum of the weighted terms. SEN=94.7%, SPC=79.2% and ACC=84.8%. The major problem of this method is that the training data were not filtered for homologous sequences, which gives overestimated results

• Neural-Network: (Gromiha 05-b) the discrimination method is exactly the same as in (Gromiha 04), but a neural network is introduced for S.S. prediction, with a prediction accuracy of 73.2%

Page 40: Beta-Barrel Discrimination


Methods based on peptide and dipeptide composition (cntd)

• SVM: (Park 05) (note: Gromiha is the second author!) the sequences used for training are filtered by an all-to-all sequence similarity check using CD-HIT (Li 01), which produces a non-redundant protein database. They used an SVM with a radial basis function (RBF) kernel for discrimination. This is actually the first systematic study with well-defined terms and a clear presentation of results

• They use composition values (xC means that x comp(i) values have been used for discrimination) and dipeptide values (yD means that y dipep(i,j) values have been used). x and y are found using backward and forward feature selection algorithms

• I have defined some notation for ease of presenting the results:

– OMP: outer membrane proteins

– TMH: transmembrane α-helical proteins

– GLB: globular proteins

– NOM: non-outer membrane proteins

• So, OMP-TMH classification means discrimination of OMP and TMH proteins

Page 41: Beta-Barrel Discrimination


Results of SVM-peptide composition method

Prediction rate        SEN (%)  SPC (%)  ACC (%)  MCC
OMP-TMH (15C)          99       92.7     95.9     0.920
OMP-GLB (17C+8D)       88       96.4     94.4     0.846
OMP-NOM (18C+10D)      90.9     94.7     93.9     0.816
OMP-NOM (20C+400D)     79.3     99.0     95.2     0.840

• Results are better than any previous methods but are far from the accuracy rates for TMH set (99%)

• Interestingly, the discrimination rate between OMP and NOM (which is TMH+GLB) is lower than that of either OMP-TMH or OMP-GLB. Also, OMP-TMH has the highest discrimination rate

Page 42: Beta-Barrel Discrimination

What I have done: 1-Data Set

• The data set I have used is the same as in (Park 05); it has been shown to be one of the most comprehensive and challenging data sets. It contains

– 208 non-homologous OMPs

– 206 non-homologous TMHs

– 673 non-homologous GLBs that consist of

• 155 all α proteins

• 156 all β proteins

• 184 α+β proteins

• 179 α/β proteins

• To find the optimal features, I first started with the a.a. composition ratios (20C), then added the sequence length (L), and finally found that the β-strand score (B) (as defined in (Wimley 02)) can enhance the performance

Page 43: Beta-Barrel Discrimination


Averaged a.a. composition

Page 44: Beta-Barrel Discrimination


β-strand quality factor

• I have assumed that mean β-strand length is 12 because it is the best choice for covering all β-barrels (including newly discovered ones)

• β-factor is calculated (and is called B feature) by summing squared values of β-strand quality factor for all residues

Page 45: Beta-Barrel Discrimination


What I have done: 2-Feature Selection

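• The Fisher Discrimination Ratio (FDR) is a useful, widely used and scaling-insensitive measure for linear classification that can also give some information for non-linear classification. It is defined as (Park 05):

$$FDR_i = \frac{\left(\mu_i^{\text{class I}} - \mu_i^{\text{class II}}\right)^2}{\left(\sigma_i^{\text{class I}}\right)^2 + \left(\sigma_i^{\text{class II}}\right)^2}$$

where μ_i and σ_i are the mean and standard deviation of feature i within each class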
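
The FDR of a single feature can be computed with the standard library alone. The sketch assumes the usual form of the ratio, the squared difference of the two class means divided by the sum of the two class variances; the feature values are made up for illustration.

```python
from statistics import fmean, pvariance


def fdr(class_1: list, class_2: list) -> float:
    """Fisher discrimination ratio of one feature measured in two classes.

    Raises ZeroDivisionError if the feature is constant in both classes.
    """
    mean_gap = fmean(class_1) - fmean(class_2)
    return mean_gap ** 2 / (pvariance(class_1) + pvariance(class_2))


# Hypothetical values of one feature (e.g. a composition ratio) in two classes
omp_values = [0.10, 0.12, 0.11, 0.13]
non_values = [0.05, 0.07, 0.06, 0.06]

print(fdr(omp_values, non_values))
# Rescaling the feature leaves the ratio unchanged (up to rounding)
print(fdr([100 * v for v in omp_values], [100 * v for v in non_values]))
```

Because numerator and denominator scale by the same factor, features as different as sequence length and composition ratios can be ranked on the same axis.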

Page 46: Beta-Barrel Discrimination


FDR values for all the features

Page 47: Beta-Barrel Discrimination


FDR values for all features for OMP and TMH classification

Page 48: Beta-Barrel Discrimination


Two good features: β-factor and sequence length

Page 49: Beta-Barrel Discrimination


Another good feature: Serine composition

Page 50: Beta-Barrel Discrimination


3-Algorithms used for prediction

• I have used several algorithms for classification including:

– Support Vector Machine (SVM): SVM with radial basis function (RBF) kernel

– Locally Linear Neurofuzzy Model (LLNM): LLNM with locally linear model tree (Lolimot) model construction method

– Neural Network: multi-layer perceptron (MLP) feed-forward network with error back propagation learning algorithm

• The prediction accuracy is nearly the same for all algorithms so none has clear advantage over the others, however since SVM is much faster, I have chosen it

• A very real danger when using powerful algorithms is overfitting, which destroys the generalization capability. When the training data set is small, overfitting is a serious risk

• To detect and limit overfitting, n-fold cross validation is usually used, especially when the training data set is small

– Data set is divided into n subsets, at each step algorithm is trained by n-1 subsets and validated by the remaining 1 subset. This process is repeated for all n subsets and performance is averaged over all n experiments
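
The n-fold procedure just described can be written as a small index generator. This is a generic sketch, not the code used for the results in this talk; `train_and_score` is a hypothetical callback that trains on the first index list and returns a performance value measured on the second.

```python
import random


def k_fold_indices(n_samples: int, n_folds: int, seed: int = 0):
    """Yield (train_indices, validation_indices) for each of the n folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        validation = folds[k]
        train = [i for f, fold in enumerate(folds) if f != k for i in fold]
        yield train, validation


def cross_validate(n_samples: int, n_folds: int, train_and_score) -> float:
    """Average the score returned by `train_and_score(train, validation)` over all folds."""
    scores = [train_and_score(train, validation)
              for train, validation in k_fold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)


for train, validation in k_fold_indices(n_samples=10, n_folds=5):
    print(sorted(validation), len(train))
```

Every sample is used for validation exactly once and never appears in the training part of its own fold, which is what makes the averaged score an honest estimate.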

Page 51: Beta-Barrel Discrimination


A little note about over-fitting

Page 52: Beta-Barrel Discrimination


Another important factor: Scaling

• It has been shown that most machine learning algorithms are sensitive to scaling, especially when different features have different natures. For example, sequence length (with a mean of 550) is on a completely different scale from composition ratios (mostly on the order of 0.1)

• Data is usually scaled to [-1, 1], so that all data points lie within an n-dimensional hypercube (n is the dimension of the data, i.e. the # of features)

• A really common mistake is to scale the training and validation data together, which lets information about the validation set leak into training and leads to better but wrong results

• Finally, performance measures highly depend on the data set used (I will give an example why later). So, results of different studies are not easily comparable to each other
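
The scaling rule and the mistake to avoid can be made concrete as follows. In this minimal sketch the per-feature minimum and maximum are learned from the training rows only and then reused, unchanged, for the validation rows; the function names and the numbers are hypothetical.

```python
def fit_minmax(train_rows: list) -> list:
    """Learn (min, max) for each feature from the training data only."""
    return [(min(col), max(col)) for col in zip(*train_rows)]


def apply_minmax(rows: list, ranges: list) -> list:
    """Map each feature to [-1, 1] using the ranges learned from the training data."""
    return [
        [2 * (x - lo) / (hi - lo) - 1 if hi > lo else 0.0
         for x, (lo, hi) in zip(row, ranges)]
        for row in rows
    ]


# Two features of very different scale: sequence length and a composition ratio
train = [[300, 0.05], [800, 0.15]]
validation = [[550, 0.10], [1050, 0.20]]

ranges = fit_minmax(train)  # never fit on train + validation together
print(apply_minmax(train, ranges))
print(apply_minmax(validation, ranges))
```

Validation points may fall outside [-1, 1] because their range was never seen during fitting; that is expected and is the price of an unbiased estimate.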

Page 53: Beta-Barrel Discrimination


4-Performance results

Prediction rate                  SEN (%)  SPC (%)  ACC (%)  MCC
OMP-NOM (L+20C+B)                85       97.6     95.1     0.844
OMP-NOM (L+20C+B)-(Xdata)        92.6     96.5     95.3     0.888
OMP-NOM (20C)                    77.5     95.6     91.8     0.7347
OMP-NOM (18C+10D)-(Park)         90.9     94.7     93.9     0.816
OMP-TMH (20C+B)                  96       93.8     95       0.899
OMP-GLB (L+20C+B)                88.3     98.1     95.7     0.883
TMH-NTM (L+20C+10D+B)            85.9     99.3     96.6     0.892

Page 54: Beta-Barrel Discrimination


Estimated Pr(error) for all proteins

• The above figure is not very promising. Pr(error)=1 means that whenever the protein is not in the training set, it will be classified incorrectly

Page 55: Beta-Barrel Discrimination

What will I do?

• I aim to raise the prediction accuracy to 99%, so that, as with TMH discrimination, the discrimination problem can be considered solved. To do so:

– I will examine the proteins that are always misclassified and determine the major weakness of the features that leads to the wrong decision

– Perform a large and sophisticated feature selection

– Possibly, add some new features

• Also, I want to use three state-of-the-art newly proposed algorithms that improve classification accuracy

– Metric Learning

– Metric Learning by collapsing classes

– Metric Learning for Large Margin Nearest neighbor classification

Page 56: Beta-Barrel Discrimination


New classification methods

• Metric Learning: In metric learning, a linear transformation L is found so that the distance between similar transformed data points (data points in the same class) is reduced while the distance between dissimilar transformed data points (data points in different classes) is increased. Classification is then done on the transformed data points (Xing 02)

• Metric Learning by collapsing classes: as in the previous method, a linear transformation is found, but here the goal is for all transformed data points of the same class to collapse to a single point while the data points of other classes are pushed infinitely far away from it (Globerson 05)

Page 57: Beta-Barrel Discrimination


New classification methods (cntd)

• Metric Learning for Large Margin Nearest neighbor classification: In this study, the linear transformation is trained so that the k nearest neighbors of each data point always belong to its own class, while data points from other classes are kept far enough away that they do not “invade” its local neighborhood (Weinberger 06)

• In the figure on the right, local neighborhood is purified after training
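
All three methods produce a linear transformation L, and classification then uses ordinary distances between transformed points, d(x, y) = ||Lx − Ly||. The sketch below shows only that last step with a nearest-neighbor rule; the matrix L is hand-picked for illustration and is not learned by any of the cited algorithms, and the data points are hypothetical.

```python
import math


def transform(L: list, x: list) -> list:
    """Apply the linear transformation L (a list of rows) to the vector x."""
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in L]


def distance(L: list, x: list, y: list) -> float:
    """Euclidean distance between x and y after transforming both with L."""
    return math.dist(transform(L, x), transform(L, y))


def nearest_neighbor_label(L: list, query: list, points: list, labels: list) -> str:
    """Label of the training point closest to the query in the transformed space."""
    best = min(range(len(points)), key=lambda i: distance(L, query, points[i]))
    return labels[best]


# Feature 0 is informative, feature 1 is large-scale noise (hypothetical data)
points = [[0.0, 10.0], [1.0, -10.0]]
labels = ["OMP", "non-OMP"]
query = [0.1, -9.0]

identity = [[1.0, 0.0], [0.0, 1.0]]
shrink_noise = [[1.0, 0.0], [0.0, 0.01]]  # hand-picked stand-in for a learned L

print(nearest_neighbor_label(identity, query, points, labels))
print(nearest_neighbor_label(shrink_noise, query, points, labels))
```

With the identity the noisy feature dominates the distance, while the second matrix down-weights it and the query joins the neighbor it resembles on the informative feature; finding such an L automatically is the purpose of the three methods.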

image from (Weinberger 06)

Page 58: Beta-Barrel Discrimination


References

• (Bigelow 04) Bigelow,H.R., Petrey,D.S., Liu,J., Przybylski,D. and Rost,B. (2004) Predicting transmembrane beta-barrels in proteomes. Nucleic Acids Res., 32, 2566–2577

• (Branden 99) Branden,C. and Tooze,J. (1999) Introduction to Protein Structure. Garland Publishing Inc., New York.

• (Globerson 05) Amir Globerson, Sam Roweis (2005), Metric Learning by Collapsing Classes, Neural Information Processing Systems 18 (NIPS'05). pp. 451-458

• (Gromiha 05-a) Gromiha MM, Suwa M., 2005. A simple statistical method for discriminating outer membrane proteins with better accuracy. Bioinformatics, doi:10.1093/bioinformatics/bti126.

• (Hirokawa 98) Hirokawa,T., Boon-Chieng,S. and Mitaku,S. (1998) SOSUI: classification and secondary structure prediction system for membrane proteins. Bioinformatics, 14, 378–379

• (Jacoboni 01) Jacoboni,I., Martelli,P.L., Fariselli,P., De Pinto,V. and Casadio,R. (2001) Prediction of the transmembrane regions of beta-barrel membrane proteins with a neural network-based predictor. Protein Sci., 10, 779–787.

Page 59: Beta-Barrel Discrimination


References (cntd.)

• (Martelli 02) Martelli,P.L., Fariselli,P., Krogh,A. and Casadio,R. (2002) A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics, 18, S46–S53.

• (Schulz 00) Schulz, G.E. 2000. β-Barrel membrane proteins. Curr. Opin. Struct. Biol. 10: 443–447.

• (Tamm 04) Tamm LK, Hong H, Liang B. Folding and assembly of beta-barrel membrane proteins. Biochim Biophys Acta 2004;1666:250–263.

• (Weinberger 06) K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. In Y. Weiss, B. Schoelkopf, and J. Platt (eds.), Advances in Neural Information Processing Systems 18. MIT Press: Cambridge, MA

• (Wimley 02) Wimley, W.C. 2002. Toward genomic identification of β-barrel membrane proteins: Composition and architecture of known structures. Protein Sci. 11: 301–312.

• (Wimley 03) Wimley WC. The versatile beta-barrel membrane protein. Curr. Opin Struct Biol 2003;13:404–411.

• (Xing 02) E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press

• (Zhai 02) Zhai,Y. and Saier,M.H.,Jr (2002) The beta-barrel finder (BBF) program, allowing identification of outer membrane beta-barrel proteins encoded within prokaryotic genomes. Protein Sci., 11, 2196–2207.

Page 60: Beta-Barrel Discrimination


Thanks for your patience!

….any questions?