SUMOylation-site Prediction
Denis C. Bauer
Fabian A. Buske
Mikael Bodén
Overview
• Background
– SUMOylation: what is that?
• Published predictors
• Our approach
• What makes SUMO hard to tackle
SUMO is not 相撲 (sumo wrestling)
• Small Ubiquitin-related Modifier (SUMO) is a small protein of 97 amino acids
• 20% homology to ubiquitin
• Post-translational modification
• Covalently attached to lysines
• Involved in many pathways/mechanisms
– Transcriptional regulation
– Compartmentalisation
SUMOylation pathway
SUMOylation motif
• One consensus motif, [ILV]K.E, covers about 60% of known sites
However
• Not all [ILV]K.E sites are SUMOylated
• Not all SUMOylated sites have the consensus motif
[Figure labels: TP, FP, FN]
Baseline prediction
Method                        CC
Regular Expression scanner    0.68
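The baseline above can be sketched in a few lines of Python. This is a minimal illustration of a regular expression scanner for the [ILV]K.E motif, not the authors' implementation; the function names `scan_consensus` and `predict_all_lysines` are hypothetical.

```python
import re

# Consensus motif from the slides: [ILV]K.E
# (I, L or V, then the target lysine, then any residue, then glutamate).
CONSENSUS = re.compile(r"[ILV]K.E")


def scan_consensus(sequence):
    """Return 1-based positions of lysines that sit inside a motif match."""
    # m.start() is the 0-based index of [ILV]; the lysine is the next residue,
    # so its 1-based position is m.start() + 2.
    return [m.start() + 2 for m in CONSENSUS.finditer(sequence.upper())]


def predict_all_lysines(sequence):
    """Map every lysine position (1-based) to 1 (motif match) or 0 (no match)."""
    hits = set(scan_consensus(sequence))
    return {
        i + 1: int(i + 1 in hits)
        for i, aa in enumerate(sequence.upper())
        if aa == "K"
    }


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    print(predict_all_lysines("MKAIKQEK"))
```

Every lysine outside a motif match is predicted as not SUMOylated, which is why this scanner misses the non-consensus sites and flags unmodified [ILV]K.E sites.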
Comparison with existing predictors
Method                        CC
Regular Expression scanner    0.68
SUMOpre+                      0.64
SUMOsp‡                       0.26
SUMOplot†                     0.48
+ Xu J., BMC Bioinformatics 2008, 9:8
‡ Xue Y., Nucleic Acids Res 2006, W254-W257
† http://www.abgent.com/doc/sumoplot (commercial)
Case study : Core histones in yeast
• Identified SUMOylation sites+
– H2B: K6/7, K16/17
– H2A: K2, K126
– H4: somewhere in the tail
• No SUMOylation consensus site
• Predictors to date cannot predict even a single SUMOylation site in the histone sequences
+ Nathan D., Genes Dev 2006, 20(8):966-76
Our approach
• Identify
– window size
– which ML method is best
• Voilà: a better predictor!
[Figure: Sequence window xxxxKxxxx → ML → SUMOylation 1/0]
Training in more Detail
• Imbalance in the dataset: more negatives than positives
[Figure: each K in the protein sequence, with flanking windows wU and wD, is presented to the ML method; T = 0 1 0, P = 1 1 0. Legend: SUMOylated K, not SUMOylated K.]
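The windowing step on this slide can be sketched as follows. This is an illustration under assumptions, not the authors' code: wU and wD are read here as the number of residues taken upstream and downstream of each lysine, the default of 4 on each side only mirrors the xxxxKxxxx drawing, and the padding character "-" and the function name `lysine_windows` are hypothetical.

```python
def lysine_windows(sequence, sumo_sites, w_up=4, w_down=4, pad="-"):
    """Build one labelled example per lysine.

    sequence   : protein sequence as a string
    sumo_sites : set of 1-based positions of known SUMOylated lysines
    w_up/w_down: residues kept upstream/downstream of the lysine (assumed meaning of wU/wD)
    Returns a list of (position, window, label) with label 1 = SUMOylated, 0 = not.
    """
    seq = sequence.upper()
    # Pad both termini so that lysines near the ends still get full-length windows.
    padded = pad * w_up + seq + pad * w_down
    examples = []
    for i, aa in enumerate(seq):
        if aa != "K":
            continue
        # seq[i] sits at padded[i + w_up], so the window starts at padded[i].
        window = padded[i : i + w_up + 1 + w_down]
        label = 1 if (i + 1) in sumo_sites else 0
        examples.append((i + 1, window, label))
    return examples


if __name__ == "__main__":
    # Toy sequence with one annotated site at position 5.
    for example in lysine_windows("MKAIKQE", {5}, w_up=2, w_down=2):
        print(example)
```

Because every lysine becomes an example and only a few are annotated as SUMOylated, the negative class dominates the resulting training set.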
Prediction in more Detail
[Figure: each K in the protein sequence, with flanking windows wU and wD, is passed to the trained ML method, which outputs 1, 1, 0. Legend: SUMOylated K, not SUMOylated K.]
ML methods
• Bidirectional Recurrent Neural Network (BRNN)
– Uses information from flanking windows
– Influence decays with distance from the center window
– Prone to overfitting
• Support Vector Machine (SVM)
– Regularized
– Requires a suitable kernel and feature representation
– Standard kernels: linear, polynomial, RBF
– String kernels: P-kernel, local-alignment kernel
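To make the standard kernels concrete, here is a minimal sketch of the linear, polynomial, and RBF kernels evaluated on two sequence windows. The one-hot feature representation, the parameter defaults (`degree`, `coef0`, `gamma`), and all function names are assumptions made for illustration; the slides do not state which encoding or settings were used, and the string kernels are not shown.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(window):
    """Encode a window as a flat 0/1 vector, 20 entries per residue (assumed encoding)."""
    vec = []
    for aa in window:
        # Unknown or padding characters become an all-zero block.
        vec.extend(1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS)
    return vec


def linear_kernel(x, y):
    """k(x, y) = <x, y>"""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """k(x, y) = (<x, y> + coef0) ** degree"""
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.1):
    """k(x, y) = exp(-gamma * ||x - y||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


if __name__ == "__main__":
    x, y = one_hot("AIKQE"), one_hot("AVKQE")
    print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```

With one-hot vectors the linear kernel simply counts the positions at which two windows carry the same residue, so it cannot reward similar but non-identical residues; that limitation is one motivation for the string kernels listed above.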
Data set
• Training/testing data
– 144 proteins with
– 241 SUMOylation sites
– 5,741 non-SUMOylated lysines
– 68% of the SUMOylated sites conform to the consensus motif
• Hold-out
– 13 proteins with
– 27 SUMOylation sites
– 48% consensus motif
Xu J., BMC Bioinformatics 2008, 9:8
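The counts above quantify the class imbalance. The short calculation below works it out and, as one common way to compensate, derives inverse-frequency ("balanced") class weights. The weighting scheme is an assumption added for illustration; the slides do not say how the imbalance was handled.

```python
# Counts taken from the training/testing data described above.
n_pos = 241    # SUMOylated lysines
n_neg = 5741   # non-SUMOylated lysines

n_total = n_pos + n_neg
positive_fraction = n_pos / n_total   # share of lysines that are SUMOylated
neg_per_pos = n_neg / n_pos           # negatives per positive example


def balanced_class_weights(n_pos, n_neg):
    """Inverse-frequency weights so both classes contribute equally in total.

    Hypothetical helper: weight_c = n_total / (2 * n_c).
    """
    n = n_pos + n_neg
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}


weights = balanced_class_weights(n_pos, n_neg)

if __name__ == "__main__":
    print(f"total lysines:      {n_total}")
    print(f"positive fraction:  {positive_fraction:.4f}")
    print(f"negatives/positive: {neg_per_pos:.1f}")
    print(f"class weights:      {weights}")
```

At roughly 4% positives, a classifier that always answers "not SUMOylated" would already reach about 96% accuracy, which is why accuracy alone is a weak yardstick for this task.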
Evaluation
• 5-fold cross-validation
• Matthews correlation coefficient (CC)
• Sensitivity, Specificity, Accuracy
• Area under the curve (AUC)
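The measures listed above can be computed from a confusion matrix and from ranked scores. The sketch below uses the standard definitions; the function names are hypothetical, and the AUC is computed with the pairwise (rank-based) formulation of the area under the ROC curve rather than any specific tool the authors may have used.

```python
import math


def confusion_counts(targets, predictions):
    """Count TP, FP, TN, FN for binary labels (1 = SUMOylated, 0 = not)."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # CC is defined as 0 when any marginal count is zero.
        "cc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def auc(targets, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(targets, scores) if t == 1]
    neg = [s for t, s in zip(targets, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: three lysines with targets 0 1 0 and predictions 1 1 0.
    print(metrics(*confusion_counts([0, 1, 0], [1, 1, 0])))
    print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))
```

Unlike accuracy, CC uses all four cells of the confusion matrix, so it stays informative when negatives far outnumber positives.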
Performance overview
SUMOsvm
Comparison with existing methods
Quest to improve performance
• Protein structural features and evolutionary features
• Separating SUMOylation sites by species or compartment
• Clustering for other motifs using kernel hierarchical clustering
Summary
• Regular Expression Scanner is still the best classifier.
• SUMO is more versatile than expected!
• The road to better predictions
– Are there other motifs?
– Which features can discriminate?
– Is the dataset biased?
Image: http://spot.colorado.edu/~colemab/Theatre_Resources/SumoBallerina.jpg
Acknowledgment
Predictor/Analysis
– Mikael Bodén
– Fabian Buske
Dataset
– Xu et al.
PhD Supervisors
– Tim Bailey
– Andrew Perkins
– Mikael Bodén
Other bioinformatics tools
– STREAM: a practical workbench for modeling transcriptional regulation
www.bioinformatics.org.au/stream/