
SUMOylation-site Prediction

Denis C. Bauer

Fabian A. Buske

Mikael Bodén

Overview

• Background
– SUMOylation - what is that?
• Published predictors
• Our approach
• What makes SUMO hard to tackle

SUMO is not 相撲 (sumo wrestling)

• Small Ubiquitin-related Modifier is a small protein of 97 amino acids.

• 20% homology to ubiquitin
• Post-translational modification
• Covalently attached to lysines
• Involved in many pathways/mechanisms
– Transcriptional regulation
– Compartmentalisation

SUMOylation pathway

SUMOylation motif

• One consensus motif [ILV]K.E for about 60% of known sites

However
• Not all [ILV]K.E sites are SUMOylated

• Not all SUMOylated sites have the consensus motif

[Figure labels: TP, FP, FN]

Baseline prediction

Method                       CC
Regular Expression scanner   0.68
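The baseline scanner can be sketched in a few lines of Python. This is a minimal illustration that assumes the scanner simply flags every lysine inside an [ILV]K.E match; the function name `scan` and the toy sequence are hypothetical, and the slides do not say how the original scanner was implemented.

```python
import re

# Consensus motif from the slides: [ILV]K.E
# (I, L or V, then the lysine, then any residue, then glutamate).
# The lookahead lets overlapping motif matches be reported as well;
# group 1 captures the lysine that is the predicted site.
MOTIF = re.compile(r"(?=[ILV](K).E)")


def scan(sequence):
    """Return 1-based positions of lysines that sit inside a motif match."""
    return [m.start(1) + 1 for m in MOTIF.finditer(sequence.upper())]


# Hypothetical toy sequence, not a real protein.
toy = "MAVKEEPLKTEGGIKQEK"
print(scan(toy))
```

On the toy sequence the scanner reports lysines 4, 9 and 15; the final lysine is not embedded in the motif and is skipped. A lysine reported here that is not SUMOylated counts as a false positive, and a SUMOylated lysine outside the motif as a false negative.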

Comparison with existing predictors

Method                       CC
Regular Expression scanner   0.68
SUMOpre+                     0.64
SUMOsp‡                      0.26
SUMOplot†                    0.48

+ Xu J., BMC Bioinformatics 2008, 9:8
‡ Xue Y., Nucleic Acids Res 2006, W254-W257
† http://www.abgent.com/doc/sumoplot (commercial)

Case study : Core histones in yeast

• Identified SUMOylation sites+

– H2B: K6/7, K16/17
– H2A: K2, K126
– H4: somewhere in the tail

• No SUMOylation consensus site

• Predictors to date are unable to predict even a single SUMOylation site in the histone sequences

+ Nathan D., Genes Dev 2006, 20(8):966-76

Our approach

• Identify
– window size
– which ML method is best
• Voilà: better predictor!

[Figure: Sequence xxxxKxxxx → ML → SUMOylation 1/0]

Training in more Detail

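[Figure: protein sequence with three lysines (K); a window (labels wU, wD) around each K is fed to the ML method. Targets T: 0 1 0; predictions P: 1 1 0. Legend: SUMOylated K / not SUMOylated K]

Imbalance in the dataset - more negatives than positives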
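The training setup on this slide can be sketched as follows: every lysine becomes one example, consisting of the residues around it and a 1/0 target. The sketch assumes that wU and wD denote the number of upstream and downstream residues in the window, and it pads windows that run past the ends of the protein with "-"; both choices, like the function name `lysine_windows` and the toy data, are assumptions made for illustration.

```python
def lysine_windows(sequence, sumo_sites, w_up=4, w_down=4, pad="-"):
    """Build one (position, window, label) example per lysine.

    sequence   -- protein sequence as a string of one-letter codes
    sumo_sites -- set of 1-based positions of known SUMOylated lysines
    w_up/w_down -- residues taken upstream/downstream of the lysine (assumed
                   meaning of wU and wD on the slide)
    pad        -- filler for windows that extend past either terminus
    """
    padded = pad * w_up + sequence + pad * w_down
    examples = []
    for i, residue in enumerate(sequence):
        if residue != "K":
            continue
        # Position i in `sequence` is position i + w_up in `padded`,
        # so the window starts at index i in `padded`.
        window = padded[i : i + w_up + 1 + w_down]
        label = 1 if (i + 1) in sumo_sites else 0
        examples.append((i + 1, window, label))
    return examples


# Hypothetical toy data: one SUMOylated lysine at position 9.
for position, window, label in lysine_windows("MAVKEEPLKTEGGIKQEK", {9}):
    print(position, window, label)
```

Each window has the lysine at its centre, so the same routine can produce inputs for prediction by passing an empty set of known sites and ignoring the labels.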

Prediction in more Detail

[Figure: protein sequence with lysines (K); a window (labels wU, wD) around each K is fed to the trained ML method, which outputs 1, 1, 0. Legend: SUMOylated K / not SUMOylated K]

ML methods

• Bidirectional Recurrent Neural Network (BRNN)
– Uses information from flanking windows
– Influence decays with distance to the center window
– Prone to overfitting
• Support Vector Machine (SVM)
– Regularized
– Requires a suitable kernel and feature representation
– Standard kernels
  • Linear, Polynomial, RBF
– String kernels
  • P-kernel, local-alignment kernel
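To make "kernel and feature representation" concrete, here is a minimal sketch of two of the standard kernels applied to sequence windows. It assumes a one-hot encoding of each window position, which the slides do not specify; the gamma value and the two example windows are arbitrary. The string kernels named above (P-kernel, local-alignment kernel) are not reproduced here.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(window):
    """Encode a window as a flat 0/1 vector with 20 entries per position.

    Characters outside the 20 standard residues (for example padding)
    are encoded as all zeros.
    """
    vec = []
    for ch in window:
        vec.extend(1.0 if ch == aa else 0.0 for aa in AMINO_ACIDS)
    return vec


def linear_kernel(a, b):
    """Dot product; on one-hot windows this counts identical positions."""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a, b, gamma=0.1):
    """Gaussian kernel exp(-gamma * ||a - b||^2); gamma is an arbitrary choice."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


# Two hypothetical 9-residue windows centred on a lysine.
u = one_hot("EEPLKTEGG")
v = one_hot("AVIKKQEGG")
print(linear_kernel(u, v), rbf_kernel(u, v))
```

With this encoding the linear kernel is simply the number of positions at which two windows carry the same residue, which is one reason alignment-aware string kernels are an attractive alternative for sequence data.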

Data set

• Training/Testing data
– 144 proteins with
– 241 SUMOylation sites
– 5,741 non-SUMOylated lysines
– 68% of the SUMOylated sites conform to the consensus motif
• Hold-out
– 13 proteins with
– 27 SUMOylation sites
– 48% consensus motif

Xu J., BMC Bioinformatics 2008, 9:8
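The counts above quantify the class imbalance. The short calculation below uses only the numbers on this slide; the variable names are illustrative, and the motif counts are approximate because the percentages are rounded.

```python
positives = 241    # SUMOylation sites in the training/testing data
negatives = 5741   # non-SUMOylated lysines in the training/testing data
total = positives + negatives

positive_fraction = positives / total      # share of lysines that are SUMOylated
negatives_per_positive = negatives / positives
majority_accuracy = negatives / total      # accuracy of always predicting "not SUMOylated"

# Approximate number of sites that conform to the consensus motif.
consensus_train = round(0.68 * positives)  # training/testing data
consensus_holdout = round(0.48 * 27)       # hold-out set

print(f"positives: {positive_fraction:.1%} of {total} lysines")
print(f"negatives per positive: {negatives_per_positive:.1f}")
print(f"always-negative accuracy: {majority_accuracy:.1%}")
print(f"consensus sites: ~{consensus_train} (train/test), ~{consensus_holdout} (hold-out)")
```

Only about 4% of the lysines are positives, roughly 24 negatives per positive, so a predictor that never reports a site would already reach about 96% accuracy. Accuracy alone is therefore a weak measure on this data set.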

Evaluation

• 5-fold cross-validation
• Matthews correlation coefficient (CC)
• Sensitivity, Specificity, Accuracy

• Area under the curve (AUC)
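The threshold-based measures listed here can be computed from the four confusion-matrix counts. The sketch below uses the standard definitions; the fold assignment is a simple interleaved split chosen for illustration, since the slides do not say how the five folds were drawn. AUC is left out because it needs ranked scores rather than 1/0 predictions.

```python
import math


def confusion(targets, predictions):
    """Count TP, FP, TN, FN for 1/0 targets and predictions."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # By convention CC is set to 0 when the denominator vanishes.
        "cc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists; item i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


# Targets and predictions from the training slide: T = 0 1 0, P = 1 1 0.
print(metrics(*confusion([0, 1, 0], [1, 1, 0])))
```

For the three lysines on the training slide this gives a sensitivity of 1, a specificity of 0.5 and a CC of 0.5. In practice it may be preferable to assign whole proteins rather than individual lysines to folds, so that windows from one protein do not appear in both training and test data.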

Performance overview

SUMOsvm

Comparison with existing methods

Quest to improve performance

• Protein structural features and evolutionary features

• Separating SUMOylation sites from different species or compartments

• Clustering for other motifs using kernel hierarchical clustering

Summary

• Regular Expression Scanner is still the best classifier.

• SUMO is more versatile than expected!

• The road to better predictions
– Are there other motifs?
– Which features can discriminate?
– Is the dataset biased?

http://spot.colorado.edu/~colemab/Theatre_Resources/SumoBallerina.jpg

Acknowledgment

Predictor/Analysis
– Mikael Bodén
– Fabian Buske

Dataset
– Xu et al.

PhD Supervisors
– Tim Bailey
– Andrew Perkins
– Mikael Bodén

Other bioinformatics tools:
STREAM – a practical workbench for modeling transcriptional regulation.
www.bioinformatics.org.au/stream/
