Machine Learning and Bioinformatics 機器學習與生物資訊學


Page 1: Machine Learning and Bioinformatics 機器學習與生物資訊學

Molecular Biomedical Informatics 分子生醫資訊實驗室
Machine Learning and Bioinformatics 機器學習與生物資訊學

Page 2: Machine Learning and Bioinformatics 機器學習與生物資訊學

Evaluation
The key to success

Page 3: Machine Learning and Bioinformatics 機器學習與生物資訊學

Three datasets, of which the answers must be known

Page 4: Machine Learning and Bioinformatics 機器學習與生物資訊學

Note on parameter tuning
It is important that the testing data is not used in any way to create the classifier. Some learning schemes operate in two stages:
– build the basic structure
– optimize parameters
The testing data cannot be used for parameter tuning; the proper procedure uses three sets: training, tuning and testing data, as in the sketch below.
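A minimal sketch of the three-set protocol (not from the original slides; scikit-learn is assumed, and X, y and the split ratios are placeholders):

```python
# Three-way split: training / tuning / testing (illustrative sketch).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))         # placeholder feature matrix
y = rng.integers(0, 2, size=1000)      # placeholder binary labels

# Carve off the testing set first; it must play no part in building the classifier.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

# Split the remainder: training data builds the structure, tuning data sets parameters.
X_train, X_tune, y_train, y_tune = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)
```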

Page 5: Machine Learning and Bioinformatics 機器學習與生物資訊學

Data is usually limited
Error on the training data is NOT a good indicator of performance on future data
– otherwise 1-NN would be the optimum classifier
This is not a problem if lots of (answered) data is available
– split the data into training, tuning and testing sets
However, (answered) data is usually limited, so more sophisticated techniques need to be used.

Page 6: Machine Learning and Bioinformatics 機器學習與生物資訊學

Issues in evaluation
Statistical reliability of estimated differences in performance
– significance tests
Choice of performance measures
– number of correctly classified samples
– ratio of correctly classified samples
– error in numeric predictions
Costs assigned to different types of errors
– many practical applications involve costs

Page 7: Machine Learning and Bioinformatics 機器學習與生物資訊學

Training and testing sets
The testing set must play no part, including parameter tuning, in classifier formation. Ideally, both the training and testing sets are representative samples of the underlying problem, but they may differ in nature
– e.g., we have data from two different towns A and B and want to estimate the performance of our classifier in a completely new town

Page 8: Machine Learning and Bioinformatics 機器學習與生物資訊學

Which (training vs. tuning/testing) should be more similar to the target new town?

Page 9: Machine Learning and Bioinformatics 機器學習與生物資訊學

Making the most of the data
Once evaluation is complete, all the data can be used to build the final classifier for real (unknown) data. A dilemma:
– generally, the larger the training data, the better the classifier (but returns diminish)
– the larger the testing data, the more accurate the error estimate

Page 10: Machine Learning and Bioinformatics 機器學習與生物資訊學

Holdout procedure
A method of splitting the original data into training and testing sets
– reserve a certain amount for testing and use the remainder for training, usually one third for testing and the rest for training
The samples might not be representative
– e.g., a class might be missing in the testing data
Stratification ensures that each class is represented with approximately equal proportions in both subsets, as in the sketch below.
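A minimal sketch of a stratified holdout split (not from the slides; scikit-learn is assumed, and the data is a placeholder):

```python
# Stratified holdout: one third for testing, preserving class proportions (sketch).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(-1, 1)                  # placeholder features
y = np.array([0] * 20 + [1] * 10)                 # imbalanced placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

# Both subsets keep roughly the 2:1 class ratio of the full data,
# so neither class can go missing from the testing set.
print(np.bincount(y_train), np.bincount(y_test))  # e.g., [13 7] [7 3]
```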

Page 11: Machine Learning and Bioinformatics 機器學習與生物資訊學

Repeated holdout procedure
The holdout procedure can be made more reliable by repeating the process with different subsamples (sketched below)
– in each iteration, a certain proportion is randomly selected for testing (possibly with stratification)
– the error rates of the different iterations are averaged to yield an overall error rate
This is called the repeated holdout procedure. A problem is that the different testing sets overlap.
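One way the repetition might look in code; a sketch under the assumption that scikit-learn and a decision tree stand in for the data and the learning scheme:

```python
# Repeated holdout: average the error over several random splits (sketch).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                     # placeholder data
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

errors = []
for seed in range(10):                            # different subsample each iteration
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    errors.append(1 - clf.score(X_te, y_te))      # error rate of this iteration

print(f"repeated-holdout error: {np.mean(errors):.3f} +/- {np.std(errors):.3f}")
```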

Page 12: Machine Learning and Bioinformatics 機器學習與生物資訊學

Cross-validation
Cross-validation avoids overlapping testing sets
– split the data into n subsets of equal size
– use each subset in turn for testing and the remainder for training
– the error estimates are averaged to yield an overall error estimate
This is called n-fold cross-validation (sketched below). Often the subsets are stratified before the cross-validation is performed.
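A minimal stratified n-fold cross-validation sketch (scikit-learn assumed; the data and classifier are placeholders):

```python
# Stratified 10-fold cross-validation: every sample is tested exactly once (sketch).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                     # placeholder data
y = (X[:, 0] > 0).astype(int)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(f"10-fold CV error: {1 - scores.mean():.3f}")
```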

Page 13: Machine Learning and Bioinformatics 機器學習與生物資訊學

More on cross-validation
Stratified ten-fold cross-validation. Why ten?
– extensive experiments have shown that this is the best choice to get an accurate estimate
– there is also some theoretical evidence for this
Repeated stratified cross-validation
– e.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance)

Page 14: Machine Learning and Bioinformatics 機器學習與生物資訊學

Leave-One-Out cross-validation
A particular form of cross-validation (sketched below)
– set the number of folds to the number of training instances
Advantages and disadvantages
– makes the best use of the data
– involves no random subsampling
– very computationally expensive
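A minimal LOO-CV sketch (scikit-learn assumed; the dataset is a small placeholder, since LOO trains one model per instance):

```python
# Leave-one-out cross-validation: n folds of size one, no random subsampling (sketch).
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # placeholder data: 100 models trained
y = (X[:, 0] > 0).astype(int)

scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print(f"LOO-CV error: {1 - scores.mean():.3f}")
```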

Page 15: Machine Learning and Bioinformatics 機器學習與生物資訊學

LOO-CV and stratification
Stratification is not possible
– there is only one instance in the testing set
An extreme example (demonstrated below)
– a random dataset split equally into two classes
– the best inducer predicts the majority class
– 50% accuracy on fresh data
– the LOO-CV estimate is 100% error
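A small demonstration of that failure mode (not from the slides; pure NumPy). Leaving one instance out always makes the opposite class the majority of the remaining data, so the majority-class predictor is wrong on every fold:

```python
# LOO-CV on a perfectly balanced random dataset with a majority-class predictor.
import numpy as np

y = np.array([0] * 50 + [1] * 50)           # two classes, split equally
errors = 0
for i in range(len(y)):
    train = np.delete(y, i)                 # leave instance i out
    majority = np.bincount(train).argmax()  # majority class of the 99 remaining labels
    errors += (majority != y[i])            # always disagrees with the held-out label

print(errors / len(y))                      # 1.0, i.e., a 100% error estimate
```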

Page 16: Machine Learning and Bioinformatics 機器學習與生物資訊學


Cost

Page 17: Machine Learning and Bioinformatics 機器學習與生物資訊學

Counting the cost
In practice, different types of classification errors often incur different costs. Examples:
– terrorist profiling, where always predicting 'negative' achieves 99.99% accuracy
– loan decisions
– oil-slick detection
– fault diagnosis
– promotional mailing

Page 18: Machine Learning and Bioinformatics 機器學習與生物資訊學

Confusion matrix

                      Predicted class
                      Yes               No
Actual class   Yes    True positive     False negative
               No     False positive    True negative

Page 19: Machine Learning and Bioinformatics 機器學習與生物資訊學

Classification with costs
Two cost matrices (rows are the actual class, columns the predicted class):

              Predicted class
              Yes   No
Actual  Yes    0     1
class   No     1     0

              Predicted class
                0    1    2
Actual    0     0    1    1
class     1     1    0    1
          2     1    1    0

The error rate is replaced by the average cost per prediction.

Page 20: Machine Learning and Bioinformatics 機器學習與生物資訊學

Cost-sensitive learning
A basic idea is to predict the high-cost class only when very confident about the prediction. Instead of predicting the most likely class, we should make the prediction that minimizes the expected cost (see the sketch below)
– take the dot product of the class probabilities and the appropriate column of the cost matrix
– choose the column (class) that minimizes the expected cost
This happens at prediction time, not at training time. Most learning schemes do not perform cost-sensitive learning
– they generate the same classifier no matter what costs are assigned to the different classes
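A minimal sketch of minimum-expected-cost prediction (NumPy only; the cost matrix and probabilities are illustrative placeholders):

```python
# Choose the prediction that minimizes expected cost rather than the likeliest class.
import numpy as np

# Placeholder cost matrix: rows = actual class, columns = predicted class.
cost = np.array([[0, 1],
                 [10, 0]])                  # missing an actual class 1 is 10x worse

probs = np.array([0.8, 0.2])                # class probabilities from some classifier

# Expected cost of each possible prediction: dot product of the probabilities
# with the corresponding column of the cost matrix.
expected = probs @ cost                     # [0.2*10, 0.8*1] = [2.0, 0.8]
print(expected.argmin())                    # predicts class 1 despite P(class 0) = 0.8
```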

Page 21: Machine Learning and Bioinformatics 機器學習與生物資訊學

A simple method for cost-sensitive learning

Page 22: Machine Learning and Bioinformatics 機器學習與生物資訊學

Resampling of instances according to costs
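One common way to realize this idea, sketched here with per-class costs (an assumption, since the slide gives no detail; NumPy only): instances of a costly class are drawn more often, so a cost-blind learner effectively becomes cost-sensitive:

```python
# Resample the training instances in proportion to their misclassification cost.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))              # placeholder data
y = rng.integers(0, 2, size=1000)

class_cost = np.array([1.0, 10.0])          # cost of misclassifying class 0 / class 1
weights = class_cost[y]
weights = weights / weights.sum()           # sampling probability per instance

# Draw a new training set in which costly instances are over-represented.
idx = rng.choice(len(y), size=len(y), replace=True, p=weights)
X_res, y_res = X[idx], y[idx]
print(np.bincount(y_res))                   # class 1 now dominates, roughly 10:1
```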

Page 23: Machine Learning and Bioinformatics 機器學習與生物資訊學


Measures

Page 24: Machine Learning and Bioinformatics 機器學習與生物資訊學

Lift charts
In practice, costs are rarely known; decisions are usually made by comparing possible scenarios. E.g., promotional mail to 1,000,000 households
– mail to all; 0.1% respond (1,000)
– a data mining tool identifies a subset of the 100,000 most promising households, of which 0.4% respond (400)
– another tool identifies a subset of the 400,000 most promising households, of which 0.2% respond (800)
Which is better? A lift chart allows a visual comparison.

Page 25: Machine Learning and Bioinformatics 機器學習與生物資訊學

Generating a lift chart
Sort the instances according to their predicted probability of being positive; the x-axis is the sample size and the y-axis is the number of true positives (see the sketch below).

Rank   Predicted probability   Actual class
1      0.95                    Yes
2      0.93                    Yes
3      0.93                    No
4      0.88                    Yes
…      …                       …
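A minimal sketch of computing the points of a lift chart (NumPy only; the predictions are placeholders):

```python
# Lift chart points: sort by predicted probability, accumulate true positives.
import numpy as np

probs = np.array([0.95, 0.93, 0.93, 0.88, 0.42, 0.30])  # placeholder predictions
actual = np.array([1, 1, 0, 1, 0, 1])                   # 1 = Yes

order = np.argsort(-probs)                  # most promising instances first
tp_cumulative = np.cumsum(actual[order])    # y-axis: true positives found so far
sample_size = np.arange(1, len(probs) + 1)  # x-axis: size of the selected subset

for x, tp in zip(sample_size, tp_cumulative):
    print(x, tp)
```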

Page 26: Machine Learning and Bioinformatics 機器學習與生物資訊學


A hypothetical lift chart

Page 27: Machine Learning and Bioinformatics 機器學習與生物資訊學

ROC curves
ROC curves are similar to lift charts
– ROC stands for "receiver operating characteristic"
– used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
Differences from lift charts
– the y-axis shows the percentage of true positives in the sample rather than the absolute number
– the x-axis shows the percentage of false positives in the sample rather than the sample size
The points of such a curve can be computed as sketched below.
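A minimal sketch using scikit-learn's roc_curve (the labels and scores are placeholders):

```python
# ROC curve points: TP rate against FP rate as the decision threshold varies.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

actual = np.array([1, 1, 0, 1, 0, 0, 1, 0])                  # placeholder labels
probs = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.2])  # placeholder scores

fpr, tpr, thresholds = roc_curve(actual, probs)
print(np.c_[fpr, tpr])                      # the points of the (jagged) curve
print(roc_auc_score(actual, probs))         # area under the ROC curve (AUC)
```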

Page 28: Machine Learning and Bioinformatics 機器學習與生物資訊學

A sample ROC curve
– a jagged curve results from one set of testing data
– a smooth curve results from cross-validation

Page 29: Machine Learning and Bioinformatics 機器學習與生物資訊學

More measures
Precision = TP/(TP+FP), the percentage of reported samples that are positive
Recall = TP/(TP+FN), the percentage of positive samples that are reported
Precision/recall curves have a hyperbolic shape
The three-point average is the average precision at 20%, 50% and 80% recall
F-measure = 2·precision·recall/(precision+recall), the harmonic mean of precision and recall
– makes precision and recall as equal as possible
Specificity = TN/(TN+FP), the percentage of negative samples that are not reported
Area under the ROC curve (AUC)
All of these can be computed directly from confusion-matrix counts, as sketched below.
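A small helper (not from the slides) that computes the measures above from the four confusion-matrix counts:

```python
# Classification measures from confusion-matrix counts (sketch).
def classification_measures(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # also the TP rate of the ROC curve
    f_measure = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)            # negatives correctly left unreported
    return precision, recall, f_measure, specificity

print(classification_measures(tp=40, fp=10, fn=20, tn=30))
```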

Page 30: Machine Learning and Bioinformatics 機器學習與生物資訊學

Summary of some measures

Chart                    Domain                  Plot                   Explanation
Lift chart               Marketing               TP vs. subset size     TP; (TP+FP)/(TP+FP+TN+FN)
ROC curve                Communications          TP rate vs. FP rate    TP/(TP+FN); FP/(FP+TN)
Recall-precision curve   Information retrieval   Precision vs. recall   TP/(TP+FN); TP/(TP+FP)

Page 31: Machine Learning and Bioinformatics 機器學習與生物資訊學

Evaluating numeric prediction
The same strategies apply, including independent testing sets, cross-validation, significance tests, etc.

Page 32: Machine Learning and Bioinformatics 機器學習與生物資訊學

Measures in numeric prediction
Actual target values: a_1, a_2, …, a_n
Predicted target values: p_1, p_2, …, p_n
The most popular measure is the mean squared error (MSE), ((p_1 - a_1)² + … + (p_n - a_n)²) / n, because it is easy to manipulate mathematically.

Page 33: Machine Learning and Bioinformatics 機器學習與生物資訊學

Other measures
Root mean squared error (RMSE) = √MSE
Mean absolute error (MAE), (|p_1 - a_1| + … + |p_n - a_n|) / n, is less sensitive to outliers than MSE
Sometimes relative error values are more appropriate

Page 34: Machine Learning and Bioinformatics 機器學習與生物資訊學

Improvement on the mean
How much does the scheme improve on simply predicting the average?
Relative squared error = ((p_1 - a_1)² + … + (p_n - a_n)²) / ((ā - a_1)² + … + (ā - a_n)²)
Relative absolute error = (|p_1 - a_1| + … + |p_n - a_n|) / (|ā - a_1| + … + |ā - a_n|)
where ā is the mean of the actual values; these measures are computed together in the sketch below.
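A minimal sketch (NumPy only) computing the numeric-prediction measures from the last three slides, plus the correlation coefficient discussed next; the data is a placeholder:

```python
# Numeric-prediction measures (sketch); p = predicted values, a = actual values.
import numpy as np

def numeric_measures(p, a):
    mse = np.mean((p - a) ** 2)                                 # mean squared error
    rmse = np.sqrt(mse)                                         # root mean squared error
    mae = np.mean(np.abs(p - a))                                # mean absolute error
    rse = np.sum((p - a) ** 2) / np.sum((a.mean() - a) ** 2)    # relative squared error
    rae = np.sum(np.abs(p - a)) / np.sum(np.abs(a.mean() - a))  # relative absolute error
    corr = np.corrcoef(p, a)[0, 1]                              # correlation coefficient
    return mse, rmse, mae, rse, rae, corr

a = np.array([10.0, 12.0, 15.0, 20.0])      # placeholder actual values
p = np.array([11.0, 13.0, 14.0, 18.0])      # placeholder predictions
print(numeric_measures(p, a))
```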

Page 35: Machine Learning and Bioinformatics 機器學習與生物資訊學

Correlation coefficient / 相關係數
Measures the statistical correlation between the predicted values and the actual values. It is scale independent and lies between -1 and +1; good performance leads to large values.

Page 37: Machine Learning and Bioinformatics 機器學習與生物資訊學

Which measure?
Best to look at all of them; often it doesn't matter. In the example below, D is the best, C the second-best, and A and B are arguable.

                               A       B       C       D
Root mean squared error        67.8    91.7    63.3    57.4
Mean absolute error            41.3    38.5    33.4    29.2
Root relative squared error    42.2%   57.2%   39.4%   35.8%
Relative absolute error        43.1%   40.1%   34.8%   30.4%
Correlation coefficient        0.88    0.88    0.89    0.91

Page 38: Machine Learning and Bioinformatics 機器學習與生物資訊學


Today’s exercise

Page 39: Machine Learning and Bioinformatics 機器學習與生物資訊學

Parameter tuning
Design your own select, feature, buy and sell programs. Upload and test them in our simulation system. Finally, commit your best version and send TA Jang a report before 23:59 11/5 (Mon).

Page 40: Machine Learning and Bioinformatics 機器學習與生物資訊學

Possible ways
Enlarge the parameter range in CV
Stratified, repeated…
– minimize the variance
Make a tuning set
– use a large training set; make the tuning set as similar to the target stocks as possible
Cost matrix
– use resampling, otherwise it would be very difficult
Change measures
– or plot ROC curves to understand your classifiers
The best measure is the transaction profit, but it requires the simulation system. Instead, you can develop a compromise evaluation script, which is more complicated than any theoretical measure but simpler than the real problem. This is usually required in practice.