DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Proteins. 2013 Jun 5.
20131019 Young Biophysics Researchers, Kansai Branch, Journal Club
Topics
Prediction of protein-DNA binding residues
Statistics of network
Machine learning
Result: DNABind, a hybrid of machine learning and template-based approaches, showed excellent performance in predicting DNA-binding residues.
Figure: query protein and template protein, with TP, FP, and FN residues marked. Panels: Machine learning, Template, DNABind. Examples: CprK (3E6C:C) and EcoRV (1RVE:A).
DNABind improves classification.
True positive residues.
Aim
Protein-DNA interactions are important in cell biology.
Determining them experimentally is time-consuming and costly.
Computational approaches are desirable.
Computational approaches
Data bank (PDB)
Characteristics of binding residues:
  Solvent-exposed
  Higher electrostatic potential
  More conserved
  Hotspots as clusters of conserved residues
Structural properties (DNA-binding residues vs. surface):
  Packing density
  Surface curvature
  B-factor
  Residue fluctuation
  Hydrogen bond donors
http://www.rcsb.org/pdb/home/home.do
Computational algorithms
Feature-based: extract effective features
Template-based: align to templates and retrieve the best match
Features used in machine learning
Structure-based:
  PSSM (position-specific scoring matrix)
  Evolutionary conservation
  Solvent accessibility
  Local geometry (depth and protrusion index)
  Topological features: degree, closeness, betweenness, clustering coefficient
  Relative position (distance to centroid)
  Statistical potential (Boltzmann distribution)
Sequence-based (more difficult than structure-based):
  Amino acid identity
  Residue physicochemical properties: polarity, secondary structure, molecular volume, codon diversity, electrostatic charge
  Predicted structure (no 3D structure needed!!)
Features used in machine learning
Structure-based: PSSM, relative solvent accessibility, depth and protrusion index, topological features, distance to centroid, statistical potentials
Sequence-based: PSSM, predicted structures, amino acid indices, statistical potentials
Construct machine learning models (SVM)
SVM outputs: $STR_{score}$ (structure-based) and $SEQ_{score}$ (sequence-based)
$$ML_{score} = \alpha\, STR_{score} + (1-\alpha)\, SEQ_{score}$$
Template-based approach
Used in image recognition, etc., for example recognizing faces in a camera image.
Template-based prediction
Template-based: structural alignment and statistical potential
Binding residue prediction is carried out only if the target protein is judged to be a DNA-binding protein.
312 templates were selected.
Network
Degree is a commonly used measure to reflect the local connectivity of a node.
Closeness is a global centrality metric used to determine how critical a residue is in a residue interaction network.
Betweenness of residue i is defined to be the sum of the fraction of shortest paths between all pairs of residues that pass through residue i.
Clustering coefficient (transitivity) quantifies how close a node's neighbors are to being a clique, i.e. the probability that two adjacent vertices of a vertex are themselves connected.
Motif, hub, and community are also important…
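The four topological features defined above can be computed directly from an adjacency list. The following is a minimal sketch in plain Python, not the DNABind implementation: the five-node graph and its node names are hypothetical, closeness is taken here as (number of reachable nodes) / (sum of shortest-path lengths), and betweenness uses Brandes' algorithm without normalization.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree(adj, node):
    """Number of direct neighbors (local connectivity)."""
    return len(adj[node])

def closeness(adj, node):
    """Reachable nodes divided by the sum of distances to them."""
    dist = bfs_distances(adj, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

def clustering(adj, node):
    """Fraction of neighbor pairs that are themselves connected."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}

# Hypothetical toy contact network: triangle A-B-C with a tail C-D-E.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
toy_betweenness = betweenness(toy)
```

In this toy graph node C has the highest degree and betweenness, while its clustering coefficient is lower than that of A or B because only one of its three neighbor pairs is connected.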
Network sample; human protein interactome
Scale-free, small-world, cluster
Power law (Pareto distribution)
Bioinformatics. 2012 Jan 1;28(1):84-90.
Machine learning
Example: spam data, 4601 samples, 57 parameters.
Classification: spam or nonspam
Machine learning
  Support vector machine (SVM)
  Decision tree
  RandomForest
  Logistic regression
  LASSO (Elastic net and Ridge)
  Neural networks (Deep learning)
  Evolutionary algorithm
  Gaussian process
  k-nearest neighbor
  Clustering
  Bayesian networks
  Association rule learning
  Inductive logic programming (ILP)
Support vector machine (SVM)
Finds a hyperplane that divides the groups.
Kernel method: turns a non-linear problem into a linear one.
Easy to run.
Takes much computational time.
Tuning is very difficult.
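The "non-linear to linear" idea can be illustrated without any SVM library. This is a sketch under stated assumptions: the 1D data, the explicit feature map phi(x) = (x, x^2), and the hand-picked hyperplane (w, b) are all hypothetical. A real SVM would learn the maximum-margin hyperplane and would use a kernel function instead of computing the map explicitly.

```python
def feature_map(x):
    """Explicit non-linear map phi(x) = (x, x^2) from 1D to 2D."""
    return (x, x * x)

def linear_classify(point, w, b):
    """Return +1 or -1 depending on the side of the hyperplane w . p + b = 0."""
    s = sum(wi * pi for wi, pi in zip(w, point)) + b
    return 1 if s > 0 else -1

def separable_by_threshold(xs, labels):
    """True if a single threshold on x alone splits the two classes."""
    pairs = sorted(zip(xs, labels))
    changes = sum(1 for a, b in zip(pairs, pairs[1:]) if a[1] != b[1])
    return changes <= 1

# Hypothetical data: the outer points are class +1, the inner points class -1.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

# Hand-picked hyperplane in the mapped space: x^2 - 1 = 0.
w, b = (0.0, 1.0), -1.0
predicted = [linear_classify(feature_map(x), w, b) for x in xs]
```

No single threshold on x separates the classes, but after the mapping one straight line in the (x, x^2) plane does.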
Decision tree
Builds a tree of branching rules.
Easy to understand graphically.
Performance is not so good.
RandomForest
Builds many decision trees.
Much more accurate.
Somewhat time-consuming.
Logistic regression
Many medical researchers use it…
Easy to use, but tuning is very difficult (to tell the truth…).
LASSO, Elastic net, and Ridge regression
LASSO: Least Absolute Shrinkage and Selection Operator
$$\alpha = \begin{cases} 1 & \text{LASSO} \\ 0 < \alpha < 1 & \text{Elastic net} \\ 0 & \text{Ridge} \end{cases}$$
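How the mixing parameter moves between the three methods can be seen from the penalty term alone. The sketch below assumes the glmnet-style parameterization, lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2); the slide does not state which parameterization was used, and the coefficient values are hypothetical.

```python
def elastic_net_penalty(coefs, alpha, lam=1.0):
    """Penalty lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2).

    alpha = 1 gives the LASSO penalty, alpha = 0 the Ridge penalty,
    and values in between give the Elastic net.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    l1 = sum(abs(c) for c in coefs)
    l2_squared = sum(c * c for c in coefs)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

# Hypothetical coefficient vector.
coefs = [1.0, -2.0, 0.0]
lasso = elastic_net_penalty(coefs, alpha=1.0)
ridge = elastic_net_penalty(coefs, alpha=0.0)
mixed = elastic_net_penalty(coefs, alpha=0.5)
```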
Neural networks
Artificial model of the mammalian brain (perceptron).
Multiple hidden layers.
Deep learning is a hot topic!! (hard to understand…)
http://opencv.jp/opencv-1.0.0/document/opencvref_ml_nn.html
n-fold cross validation
To evaluate how the results of a statistical analysis will generalize to an independent data set.
Figure: the data set is split into n parts; each part serves once as the test set while the remaining parts are the train data.
Leave-one-out CV: each test set contains 1 sample.
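The splitting scheme can be sketched as a small index generator. This is an illustration only: the function name is hypothetical, and no shuffling or stratification is done, which a real analysis would usually add.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of n_folds folds.

    Folds are contiguous; when n_samples is not divisible by n_folds,
    the first folds get one extra sample each.
    """
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 3-fold split of 10 samples, and leave-one-out on 5 samples.
three_fold = list(k_fold_indices(10, 3))
leave_one_out = list(k_fold_indices(5, 5))
```

Setting the number of folds equal to the number of samples gives leave-one-out CV, where every sample is used exactly once as a test set of size 1.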
Performance
            SVM    Tree   RandomForest  LASSO  Elastic net  Ridge  Logistic  nnet
Recall      0.917  0.872  0.927         0.894  0.892        0.852  0.893     0.930
Precision   0.948  0.914  0.954         0.932  0.926        0.926  0.930     0.935
F           0.932  0.893  0.940         0.913  0.911        0.887  0.911     0.932
MCC         0.890  0.826  0.902         0.858  0.856        0.821  0.856     0.888
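The four measures in the table follow from the confusion-matrix counts. The sketch below uses the standard definitions of recall, precision, F-measure, and the Matthews correlation coefficient; the counts in the example are hypothetical and are not taken from the slides.

```python
import math

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, fn, tn):
    """Recall, precision, F-measure, and MCC from confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "f": f_measure(precision, recall),
        "mcc": mcc,
    }

# Hypothetical counts for illustration.
example = classification_metrics(tp=90, fp=10, fn=10, tn=90)
```

As a consistency check, `f_measure(0.948, 0.917)` rounds to 0.932, which matches the F value listed for SVM.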
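Combine two approaches
$$C_{score} = \begin{cases} \beta\, ML_{score} + (1-\beta)\, TL_{score} & \text{if } \ldots \\ ML_{score} \end{cases}$$
$$ML_{score} = \alpha\, STR_{score} + (1-\alpha)\, SEQ_{score}$$
$\alpha$ and $\beta$ are determined by CV and ROC analysis.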
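The two weighted sums can be written as a pair of small functions. This is a sketch, not the DNABind code: the switching condition is not legible on the slide, so the example assumes that the template-based score is simply missing (`None`) when the fallback to the machine learning score applies. The function names and the weights and scores in the example are hypothetical.

```python
def ml_score(str_score, seq_score, alpha):
    """Weighted sum of the structure-based and sequence-based scores."""
    return alpha * str_score + (1.0 - alpha) * seq_score

def combined_score(ml, tl, beta):
    """Combine the machine learning score with the template-based score.

    Assumption: tl is None when no template-based score is available,
    in which case the machine learning score is returned unchanged.
    """
    if tl is None:
        return ml
    return beta * ml + (1.0 - beta) * tl

# Hypothetical per-residue scores and weights.
ml = ml_score(str_score=0.8, seq_score=0.4, alpha=0.5)
with_template = combined_score(ml, tl=1.0, beta=0.5)
without_template = combined_score(ml, tl=None, beta=0.5)
```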
Statistical features of structure
A: Binding residues are highly solvent accessible.
B, C: Binding residues have low depth and high protrusion.
D-G: Not much difference in the network features.
H: Binding residues are closer to the centroid.
Performance
Proteins. 2004 Dec 1;57(4):702-10.
Nucleic Acids Res. 2005 Apr 22;33(7):2302-9.
TM-score measures the similarity between the tertiary structures of two proteins. A score < 0.2 indicates a random relation and a score > 0.5 indicates highly related structures.
A higher TM-score is required for good prediction.
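For orientation, the TM-score sum can be evaluated for a fixed alignment once the distances between aligned residue pairs are known. The sketch below uses the published form, (1/L) * sum of 1 / (1 + (d_i / d0)^2) with d0 = 1.24 * (L - 15)^(1/3) - 1.8. It is a simplification: the real TM-score is the maximum of this sum over all superpositions, which is not attempted here, and the guard for short proteins and the example distances are assumptions of this sketch.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 in angstroms."""
    if target_length < 19:
        # Assumption of this sketch: below this length d0 would not be positive.
        raise ValueError("target_length must be at least 19")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8

def tm_score_for_alignment(distances, target_length):
    """TM-score sum for one fixed superposition.

    distances: distance in angstroms for each aligned residue pair.
    target_length: number of residues in the target protein.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length

# Hypothetical cases for a 100-residue target.
identical = tm_score_for_alignment([0.0] * 100, 100)
half_aligned = tm_score_for_alignment([0.0] * 50, 100)
```

Unaligned residues contribute nothing to the sum, so a perfect match over only half of the target gives 0.5.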
Performance
Comparison among ML, TL, and DNABind.
Comparison between DNABind and other software.