Molecular Biomedical Informatics Lab
Machine Learning and Bioinformatics
Evaluation: the key to success
Three datasets, whose answers must be known
Note on parameter tuning
It is important that the testing data is not used in any way to create the classifier. Some learning schemes operate in two stages:
– build the basic structure
– optimize parameters
The testing data cannot be used for parameter tuning; the proper procedure uses three sets: training, tuning and testing data (see the sketch below).
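A minimal sketch of the three-set procedure in Python, assuming a feature matrix X and a label vector y (hypothetical names) and using scikit-learn's train_test_split:

```python
from sklearn.model_selection import train_test_split

# Carve out the testing set first; it must not influence any later choice.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the remainder into training and tuning (validation) sets.
X_train, X_tune, y_train, y_tune = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Build candidate classifiers on (X_train, y_train), pick parameters on
# (X_tune, y_tune), and report performance once, on the untouched (X_test, y_test).
```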
Data is usually limited
The error on the training data is NOT a good indicator of performance on future data
– otherwise 1-NN would be the optimum classifier
This is not a problem if lots of (answered) data is available
– split the data into training, tuning and testing sets
However, (answered) data is usually limited, so more sophisticated techniques are needed.
Issues in evaluation
Statistical reliability of estimated differences in performance (significance tests)
Choice of performance measures
– number of correctly classified samples
– ratio of correctly classified samples
– error in numeric predictions
Costs assigned to different types of errors
– many practical applications involve costs
Training and testing sets
The testing set must play no part, including parameter tuning, in classifier formation.
Ideally, both training and testing sets are representative samples of the underlying problem, but they may differ in nature
– e.g., we have data from two different towns, A and B, and want to estimate the performance of our classifier in a completely new town
Which set (training vs. tuning/testing) should be more similar to the target new town?
Making the most of the data
Once evaluation is complete, all the data can be used to build the final classifier for real (unknown) data.
A dilemma
– generally, the larger the training data, the better the classifier (but returns diminish)
– the larger the testing data, the more accurate the error estimate
Holdout procedure
A method of splitting the original data into training and testing sets
Reserve a certain amount for testing and use the remainder for training
– usually one third for testing and the rest for training
The samples might not be representative
– e.g., a class might be missing in the testing data
Stratification ensures that each class is represented with approximately equal proportions in both subsets (see the sketch below)
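A minimal sketch of a stratified holdout split, again assuming arrays X and y; passing stratify=y makes train_test_split preserve the class proportions in both subsets:

```python
from sklearn.model_selection import train_test_split

# One third for testing, the rest for training, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
```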
Repeated holdout procedure
The holdout estimate can be made more reliable by repeating the process with different subsamples (sketched below)
– in each iteration, a certain proportion is randomly selected for testing (possibly with stratification)
– the error rates from the different iterations are averaged to yield an overall error rate
This is called the repeated holdout procedure. A problem is that the different testing sets overlap.
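A sketch of the repeated holdout procedure under the same assumptions; make_classifier is a hypothetical factory for whatever learning scheme is being evaluated:

```python
import numpy as np
from sklearn.model_selection import train_test_split

errors = []
for seed in range(10):
    # A different random (stratified) subsample in each iteration.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    clf = make_classifier().fit(X_tr, y_tr)  # make_classifier: hypothetical
    errors.append(1 - clf.score(X_te, y_te))

print(np.mean(errors))  # overall error rate; note the testing sets overlap
```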
Cross-validation
Cross-validation avoids overlapping testing sets
– split the data into n subsets of equal size
– use each subset in turn for testing and the remainder for training
– the error estimates are averaged to yield an overall error estimate
This is called n-fold cross-validation. Often the subsets are stratified before the cross-validation is performed (see the sketch below).
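A sketch of stratified n-fold cross-validation with scikit-learn, assuming clf is any scikit-learn-style classifier:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Each of the n folds serves once as the testing set.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(scores.mean())  # averaged estimate (the classifier's default score)
```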
More on cross-validation
Stratified ten-fold cross-validation. Why ten?
– extensive experiments have shown that this is the best choice to get an accurate estimate
– there is also some theoretical evidence for this
Repeated stratified cross-validation (sketched below)
– e.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance)
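Under the same assumptions, repeated stratified cross-validation is one class swap away:

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Ten-fold stratified CV repeated ten times = 100 fold estimates;
# averaging over them reduces the variance of the overall estimate.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(scores.mean(), scores.std())
```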
Leave-One-Out cross-validation
A particular form of cross-validation
– set the number of folds to the number of training instances
Advantages and disadvantages
– makes the best use of the data
– involves no random subsampling
– very computationally expensive
LOO-CV and stratification
Stratification is not possible
– there is only one instance in the testing set
An extreme example (reproduced in the sketch below)
– a random dataset split equally into two classes
– the best inducer predicts the majority class
– 50% accuracy on fresh data
– the LOO-CV estimate is 100% error, because leaving one instance out always makes the other class the majority in the training set
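The extreme example can be reproduced in a few lines; a minimal sketch with a majority-class inducer on 100 random instances, 50 per class:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.permutation(np.repeat([0, 1], 50))  # balanced random labels

errors = 0
for i in range(len(y)):
    train = np.delete(y, i)                  # leave instance i out
    majority = np.bincount(train).argmax()   # 49 vs. 50: always the other class
    errors += majority != y[i]               # so the prediction is always wrong

print(errors / len(y))  # 1.0, i.e., the LOO-CV estimate is 100% error
```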
Cost
Counting the cost
In practice, different types of classification errors often incur different costs
Examples
– terrorist profiling, where always predicting 'negative' achieves 99.99% accuracy
– loan decisions
– oil-slick detection
– fault diagnosis
– promotional mailing
Confusion matrix

                      Predicted class
                      Yes               No
Actual class   Yes    true positive     false negative
               No     false positive    true negative
Classification with costs
Two cost matrices (entries are the cost of each prediction); the error rate is replaced by the average cost per prediction

Two-class case:
                     Predicted class
                     Yes   No
Actual class  Yes    0     1
              No     1     0

Three-class case:
                     Predicted class
                     0     1     2
Actual class  0      0     1     1
              1      1     0     1
              2      1     1     0
Cost-sensitive learning
A basic idea is to predict a high-cost class only when very confident about the prediction
Instead of predicting the most likely class, we should make the prediction that minimizes the expected cost
– take the dot product of the vector of class probabilities and the appropriate column of the cost matrix
– choose the column (class) that minimizes the expected cost (see the sketch below)
This happens at prediction time, not at training time; most learning schemes do not perform cost-sensitive learning
– they generate the same classifier no matter what costs are assigned to the different classes
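A minimal sketch of minimum-expected-cost prediction; the probability vector and the cost values below are assumed for illustration:

```python
import numpy as np

# cost[i][j] = cost of predicting class j when the actual class is i;
# here a false negative is assumed to be ten times as costly.
cost = np.array([[0, 1],
                 [10, 0]])

proba = np.array([0.7, 0.3])  # class probabilities from some classifier

expected = proba @ cost       # dot product with each column: [3.0, 0.7]
print(expected.argmin())      # predicts class 1, although class 0 is more likely
```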
A simple method for cost-sensitive learning
Resampling of instances according to their costs (see the sketch below)
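One way to realize this, sketched below with assumed arrays X and y (y assumed to hold integer class labels) and assumed per-class costs: draw a new training set in which each instance's sampling probability is proportional to the cost of misclassifying its class.

```python
import numpy as np

class_cost = np.array([1.0, 10.0])   # assumed misclassification cost per class

rng = np.random.default_rng(0)
w = class_cost[y]                    # per-instance weight from its class cost
idx = rng.choice(len(y), size=len(y), replace=True, p=w / w.sum())
X_res, y_res = X[idx], y[idx]

# Costly classes are now over-represented, so even a cost-blind learner
# trained on (X_res, y_res) is pushed to avoid the expensive errors.
```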
Measures
Lift charts
In practice, costs are rarely known; decisions are usually made by comparing possible scenarios
E.g., promotional mail to 1,000,000 households
– mail to all: 0.1% respond (1,000)
– a data mining tool identifies a subset of the 100,000 most promising households, of which 0.4% respond (400)
– another tool identifies a subset of the 400,000 most promising, of which 0.2% respond (800)
Which is better? A lift chart allows a visual comparison.
Generating a lift chart
Sort the instances according to their predicted probability of being positive; the x-axis is the sample size and the y-axis is the number of true positives (see the sketch below)

Rank   Predicted probability   Actual class
1      0.95                    Yes
2      0.93                    Yes
3      0.93                    No
4      0.88                    Yes
…      …                       …
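A sketch of the underlying computation, assuming proba (predicted probability of being positive) and actual (1 for Yes, 0 for No) as numpy arrays:

```python
import numpy as np

order = np.argsort(proba)[::-1]            # descending predicted probability
tp = np.cumsum(actual[order])              # true positives in the top-k sample
sample_size = np.arange(1, len(proba) + 1)

# Plotting tp (y-axis) against sample_size (x-axis) gives the lift chart.
```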
A hypothetical lift chart
ROC curves
ROC curves are similar to lift charts
– ROC stands for "receiver operating characteristic"
– used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
Differences from lift charts
– the y-axis shows the percentage of true positives rather than their absolute number
– the x-axis shows the percentage of false positives rather than the sample size
A sample ROC curve
– a jagged curve comes from a single testing set
– a smooth curve is obtained with cross-validation
(the sketch below computes both axes)
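Continuing the lift chart sketch above (same assumed proba and actual arrays), the ROC curve only changes the normalization of the two axes:

```python
import numpy as np

order = np.argsort(proba)[::-1]
tp = np.cumsum(actual[order])
fp = np.cumsum(1 - actual[order])

tpr = tp / actual.sum()          # y-axis: percentage of true positives
fpr = fp / (1 - actual).sum()    # x-axis: percentage of false positives
```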
More measures
Precision = TP/(TP+FP), the percentage of reported samples that are positive
Recall = TP/(TP+FN), the percentage of positive samples that are reported
Precision/recall curves have a hyperbolic shape
The three-point average is the average precision at 20%, 50% and 80% recall
F-measure = 2·precision·recall/(precision+recall), the harmonic mean of precision and recall
– the harmonic mean is high only when precision and recall are both high, so it pushes them to be as equal as possible
Specificity = TN/(TN+FP), the percentage of negative samples that are not reported (computed in the sketch below)
Area under the ROC curve (AUC)
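A minimal helper that computes these measures from the four confusion-matrix counts:

```python
def binary_measures(tp, fp, tn, fn):
    precision = tp / (tp + fp)            # reported samples that are positive
    recall = tp / (tp + fn)               # positive samples that are reported
    f_measure = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)          # negative samples that are not reported
    return precision, recall, f_measure, specificity
```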
Summary of some measures

Plot                     Domain                  Axes                   Formulas
Lift chart               Marketing               TP; subset size        TP; (TP+FP)/(TP+FP+TN+FN)
ROC curve                Communications          TP rate; FP rate       TP/(TP+FN); FP/(FP+TN)
Recall-precision curve   Information retrieval   recall; precision      TP/(TP+FN); TP/(TP+FP)
Evaluating numeric prediction
The same strategies apply, including independent testing sets, cross-validation, significance tests, etc.
Measures in numeric prediction
Actual target values: a1, a2, …, an
Predicted target values: p1, p2, …, pn
The most popular measure is the mean squared error (MSE), ((p1 − a1)² + … + (pn − an)²) / n, because it is easy to manipulate mathematically
Other measures
Root mean squared error (RMSE) = sqrt(MSE)
Mean absolute error (MAE) = (|p1 − a1| + … + |pn − an|) / n, which is less sensitive to outliers than MSE
Sometimes relative error values (the error divided by the actual value) are more appropriate
Improvement on the mean
How much does the scheme improve on simply predicting the average ā of the actual values?
Relative squared error = Σ(pi − ai)² / Σ(ā − ai)²
Relative absolute error = Σ|pi − ai| / Σ|ā − ai|
Correlation coefficient
Measures the statistical correlation between the predicted and the actual values: ρ = Σ(pi − p̄)(ai − ā) / sqrt(Σ(pi − p̄)² · Σ(ai − ā)²)
Scale independent, between −1 and +1; good performance leads to large values
Figure: scatter plots illustrating various correlation coefficients (http://upload.wikimedia.org/wikipedia/commons/8/86/Correlation_coefficient.gif)
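A sketch gathering the numeric-prediction measures above, assuming predicted and actual values as numpy arrays p and a:

```python
import numpy as np

def numeric_measures(p, a):
    mse = np.mean((p - a) ** 2)                 # mean squared error
    rmse = np.sqrt(mse)                         # root mean squared error
    mae = np.mean(np.abs(p - a))                # mean absolute error
    rse = np.sum((p - a) ** 2) / np.sum((a.mean() - a) ** 2)   # relative squared
    rae = np.sum(np.abs(p - a)) / np.sum(np.abs(a.mean() - a)) # relative absolute
    corr = np.corrcoef(p, a)[0, 1]              # correlation coefficient
    return mse, rmse, mae, rse, rae, corr
```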
Which measure?
It is best to look at all of them, though often it doesn't matter. Here D is the best, C the second best, and A and B are arguable:

Measure                       A       B       C       D
Root mean squared error       67.8    91.7    63.3    57.4
Mean absolute error           41.3    38.5    33.4    29.2
Root relative squared error   42.2%   57.2%   39.4%   35.8%
Relative absolute error       43.1%   40.1%   34.8%   30.4%
Correlation coefficient       0.88    0.88    0.89    0.91
Today’s exercise
Parameter tuning
Design your own select, feature, buy and sell programs. Upload and test them in our simulation system. Finally, commit your best version and send TA Jang a report before 23:59 11/5 (Mon).
Possible ways
Enlarge the parameter range in CV; use stratified, repeated cross-validation
– to minimize the variance
Make a tuning set
– use a large training set; make the tuning set as similar to the target stocks as possible
Use a cost matrix
– via resampling; otherwise it would be very difficult
Change measures
– or plot ROC curves to understand your classifiers
The best measure is the transaction profit, but it requires the simulation system. Instead, you can develop a compromise evaluation script, which is more complicated than any theoretical measure but simpler than the real problem. This is usually required in practice.