Machine Learning and Bioinformatics 機器學習與生物資訊學


Page 1: Machine Learning and Bioinformatics 機器學習與生物資訊學

Molecular Biomedical Informatics 分子生醫資訊實驗室
Machine Learning and Bioinformatics 機器學習與生物資訊學

Page 2: Machine Learning and Bioinformatics 機器學習與生物資訊學

Evaluation
The key to success

Page 3: Machine Learning and Bioinformatics 機器學習與生物資訊學

Three datasets, of which the answers must be known

Page 4: Machine Learning and Bioinformatics 機器學習與生物資訊學

Note on parameter tuning
It is important that the testing data is not used in any way to create the classifier. Some learning schemes operate in two stages:
– build the basic structure
– optimize parameters
The testing data cannot be used for parameter tuning; the proper procedure uses three sets: training, tuning and testing data, as in the sketch below.
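A minimal sketch of the three-set protocol (not from the original slides; scikit-learn is assumed, and X, y and the split ratios are placeholders):

```python
# Three-way split: training / tuning / testing (illustrative sketch).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))         # placeholder feature matrix
y = rng.integers(0, 2, size=1000)      # placeholder binary labels

# Carve off the testing set first; it must play no part in building the classifier.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

# Split the remainder: training data builds the structure, tuning data sets parameters.
X_train, X_tune, y_train, y_tune = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)
```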

Page 5: Machine Learning and Bioinformatics 機器學習與生物資訊學

Data is usually limited
Error on the training data is NOT a good indicator of performance on future data
– otherwise 1-NN would be the optimum classifier
This is not a problem if lots of (answered) data is available
– split the data into training, tuning and testing sets
However, (answered) data is usually limited, so more sophisticated techniques need to be used.

Page 6: Machine Learning and Bioinformatics 機器學習與生物資訊學

Issues in evaluation
Statistical reliability of estimated differences in performance
– significance tests
Choice of performance measures
– number of correctly classified samples
– ratio of correctly classified samples
– error in numeric predictions
Costs assigned to different types of errors
– many practical applications involve costs

Page 7: Machine Learning and Bioinformatics 機器學習與生物資訊學

Training and testing sets
The testing set must play no part, including parameter tuning, in classifier formation. Ideally, both the training and testing sets are representative samples of the underlying problem, but they may differ in nature
– e.g., we have data from two different towns A and B and want to estimate the performance of our classifier in a completely new town

Page 8: Machine Learning and Bioinformatics 機器學習與生物資訊學

Which (training vs. tuning/testing) should be more similar to the target new town?

Page 9: Machine Learning and Bioinformatics 機器學習與生物資訊學

Making the most of the data
Once evaluation is complete, all the data can be used to build the final classifier for real (unknown) data. A dilemma:
– generally, the larger the training data, the better the classifier (but returns diminish)
– the larger the testing data, the more accurate the error estimate

Page 10: Machine Learning and Bioinformatics 機器學習與生物資訊學

Holdout procedure
A method of splitting the original data into training and testing sets
– reserve a certain amount for testing and use the remainder for training, usually one third for testing and the rest for training
The samples might not be representative
– e.g., a class might be missing in the testing data
Stratification ensures that each class is represented with approximately equal proportions in both subsets, as in the sketch below.
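A minimal sketch of a stratified holdout split (not from the slides; scikit-learn is assumed, and the data is a placeholder):

```python
# Stratified holdout: one third for testing, preserving class proportions (sketch).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(-1, 1)                  # placeholder features
y = np.array([0] * 20 + [1] * 10)                 # imbalanced placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

# Both subsets keep roughly the 2:1 class ratio of the full data,
# so neither class can go missing from the testing set.
print(np.bincount(y_train), np.bincount(y_test))  # e.g., [13 7] [7 3]
```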

Page 11: Machine Learning and Bioinformatics 機器學習與生物資訊學

Repeated holdout procedure
The holdout procedure can be made more reliable by repeating the process with different subsamples (sketched below)
– in each iteration, a certain proportion is randomly selected for testing (possibly with stratification)
– the error rates of the different iterations are averaged to yield an overall error rate
This is called the repeated holdout procedure. A problem is that the different testing sets overlap.
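One way the repetition might look in code; a sketch under the assumption that scikit-learn and a decision tree stand in for the data and the learning scheme:

```python
# Repeated holdout: average the error over several random splits (sketch).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                     # placeholder data
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

errors = []
for seed in range(10):                            # different subsample each iteration
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    errors.append(1 - clf.score(X_te, y_te))      # error rate of this iteration

print(f"repeated-holdout error: {np.mean(errors):.3f} +/- {np.std(errors):.3f}")
```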

Page 12: Machine Learning and Bioinformatics 機器學習與生物資訊學

Cross-validation
Cross-validation avoids overlapping testing sets
– split the data into n subsets of equal size
– use each subset in turn for testing and the remainder for training
– the error estimates are averaged to yield an overall error estimate
This is called n-fold cross-validation (sketched below). Often the subsets are stratified before the cross-validation is performed.
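A minimal stratified n-fold cross-validation sketch (scikit-learn assumed; the data and classifier are placeholders):

```python
# Stratified 10-fold cross-validation: every sample is tested exactly once (sketch).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                     # placeholder data
y = (X[:, 0] > 0).astype(int)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(f"10-fold CV error: {1 - scores.mean():.3f}")
```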

Page 13: Machine Learning and Bioinformatics 機器學習與生物資訊學

More on cross-validation
Stratified ten-fold cross-validation. Why ten?
– extensive experiments have shown that this is the best choice to get an accurate estimate
– there is also some theoretical evidence for this
Repeated stratified cross-validation
– e.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance)

Page 14: Machine Learning and Bioinformatics 機器學習與生物資訊學

Leave-One-Out cross-validation
A particular form of cross-validation (sketched below)
– set the number of folds to the number of training instances
Advantages and disadvantages
– makes the best use of the data
– involves no random subsampling
– very computationally expensive
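A minimal LOO-CV sketch (scikit-learn assumed; the dataset is a small placeholder, since LOO trains one model per instance):

```python
# Leave-one-out cross-validation: n folds of size one, no random subsampling (sketch).
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # placeholder data: 100 models trained
y = (X[:, 0] > 0).astype(int)

scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print(f"LOO-CV error: {1 - scores.mean():.3f}")
```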

Page 15: Machine Learning and Bioinformatics 機器學習與生物資訊學

LOO-CV and stratification
Stratification is not possible
– there is only one instance in the testing set
An extreme example (demonstrated below)
– a random dataset split equally into two classes
– the best inducer predicts the majority class
– 50% accuracy on fresh data
– the LOO-CV estimate is 100% error
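A small demonstration of that failure mode (not from the slides; pure NumPy). Leaving one instance out always makes the opposite class the majority of the remaining data, so the majority-class predictor is wrong on every fold:

```python
# LOO-CV on a perfectly balanced random dataset with a majority-class predictor.
import numpy as np

y = np.array([0] * 50 + [1] * 50)           # two classes, split equally
errors = 0
for i in range(len(y)):
    train = np.delete(y, i)                 # leave instance i out
    majority = np.bincount(train).argmax()  # majority class of the 99 remaining labels
    errors += (majority != y[i])            # always disagrees with the held-out label

print(errors / len(y))                      # 1.0, i.e., a 100% error estimate
```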

Page 16: Machine Learning and Bioinformatics 機器學習與生物資訊學


Cost

Page 17: Machine Learning and Bioinformatics 機器學習與生物資訊學

Counting the cost
In practice, different types of classification errors often incur different costs. Examples:
– terrorist profiling, where always predicting 'negative' achieves 99.99% accuracy
– loan decisions
– oil-slick detection
– fault diagnosis
– promotional mailing

Page 18: Machine Learning and Bioinformatics 機器學習與生物資訊學

Confusion matrix

                      Predicted class
                      Yes               No
Actual class   Yes    True positive     False negative
               No     False positive    True negative

Page 19: Machine Learning and Bioinformatics 機器學習與生物資訊學

Classification with costs
Two cost matrices (rows are the actual class, columns the predicted class):

              Predicted class
              Yes   No
Actual  Yes    0     1
class   No     1     0

              Predicted class
                0    1    2
Actual    0     0    1    1
class     1     1    0    1
          2     1    1    0

The error rate is replaced by the average cost per prediction.

Page 20: Machine Learning and Bioinformatics 機器學習與生物資訊學

Cost-sensitive learning
A basic idea is to predict the high-cost class only when very confident about the prediction. Instead of predicting the most likely class, we should make the prediction that minimizes the expected cost (see the sketch below)
– take the dot product of the class probabilities and the appropriate column of the cost matrix
– choose the column (class) that minimizes the expected cost
This happens at prediction time, not at training time. Most learning schemes do not perform cost-sensitive learning
– they generate the same classifier no matter what costs are assigned to the different classes
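A minimal sketch of minimum-expected-cost prediction (NumPy only; the cost matrix and probabilities are illustrative placeholders):

```python
# Choose the prediction that minimizes expected cost rather than the likeliest class.
import numpy as np

# Placeholder cost matrix: rows = actual class, columns = predicted class.
cost = np.array([[0, 1],
                 [10, 0]])                  # missing an actual class 1 is 10x worse

probs = np.array([0.8, 0.2])                # class probabilities from some classifier

# Expected cost of each possible prediction: dot product of the probabilities
# with the corresponding column of the cost matrix.
expected = probs @ cost                     # [0.2*10, 0.8*1] = [2.0, 0.8]
print(expected.argmin())                    # predicts class 1 despite P(class 0) = 0.8
```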

Page 21: Machine Learning and Bioinformatics 機器學習與生物資訊學

A simple method for cost-sensitive learning

Page 22: Machine Learning and Bioinformatics 機器學習與生物資訊學

Resampling of instances according to costs
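One common way to realize this idea, sketched here with per-class costs (an assumption, since the slide gives no detail; NumPy only): instances of a costly class are drawn more often, so a cost-blind learner effectively becomes cost-sensitive:

```python
# Resample the training instances in proportion to their misclassification cost.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))              # placeholder data
y = rng.integers(0, 2, size=1000)

class_cost = np.array([1.0, 10.0])          # cost of misclassifying class 0 / class 1
weights = class_cost[y]
weights = weights / weights.sum()           # sampling probability per instance

# Draw a new training set in which costly instances are over-represented.
idx = rng.choice(len(y), size=len(y), replace=True, p=weights)
X_res, y_res = X[idx], y[idx]
print(np.bincount(y_res))                   # class 1 now dominates, roughly 10:1
```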

Page 23: Machine Learning and Bioinformatics 機器學習與生物資訊學


Measures

Page 24: Machine Learning and Bioinformatics 機器學習與生物資訊學

Lift charts
In practice, costs are rarely known; decisions are usually made by comparing possible scenarios. E.g., promotional mail to 1,000,000 households
– mail to all; 0.1% respond (1,000)
– a data mining tool identifies a subset of the 100,000 most promising households, of which 0.4% respond (400)
– another tool identifies a subset of the 400,000 most promising households, of which 0.2% respond (800)
Which is better? A lift chart allows a visual comparison.

Page 25: Machine Learning and Bioinformatics 機器學習與生物資訊學

Generating a lift chart
Sort the instances according to their predicted probability of being positive; the x-axis is the sample size and the y-axis is the number of true positives (see the sketch below).

Rank   Predicted probability   Actual class
1      0.95                    Yes
2      0.93                    Yes
3      0.93                    No
4      0.88                    Yes
…      …                       …
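A minimal sketch of computing the points of a lift chart (NumPy only; the predictions are placeholders):

```python
# Lift chart points: sort by predicted probability, accumulate true positives.
import numpy as np

probs = np.array([0.95, 0.93, 0.93, 0.88, 0.42, 0.30])  # placeholder predictions
actual = np.array([1, 1, 0, 1, 0, 1])                   # 1 = Yes

order = np.argsort(-probs)                  # most promising instances first
tp_cumulative = np.cumsum(actual[order])    # y-axis: true positives found so far
sample_size = np.arange(1, len(probs) + 1)  # x-axis: size of the selected subset

for x, tp in zip(sample_size, tp_cumulative):
    print(x, tp)
```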

Page 26: Machine Learning and Bioinformatics 機器學習與生物資訊學


A hypothetical lift chart

Page 27: Machine Learning and Bioinformatics 機器學習與生物資訊學

ROC curves
ROC curves are similar to lift charts
– ROC stands for "receiver operating characteristic"
– used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
Differences from lift charts
– the y-axis shows the percentage of true positives in the sample rather than the absolute number
– the x-axis shows the percentage of false positives in the sample rather than the sample size
The points of such a curve can be computed as sketched below.
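A minimal sketch using scikit-learn's roc_curve (the labels and scores are placeholders):

```python
# ROC curve points: TP rate against FP rate as the decision threshold varies.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

actual = np.array([1, 1, 0, 1, 0, 0, 1, 0])                  # placeholder labels
probs = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.2])  # placeholder scores

fpr, tpr, thresholds = roc_curve(actual, probs)
print(np.c_[fpr, tpr])                      # the points of the (jagged) curve
print(roc_auc_score(actual, probs))         # area under the ROC curve (AUC)
```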

Page 28: Machine Learning and Bioinformatics 機器學習與生物資訊學

A sample ROC curve
– a jagged curve results from one set of testing data
– a smooth curve results from cross-validation

Page 29: Machine Learning and Bioinformatics 機器學習與生物資訊學

More measures
Precision = TP/(TP+FP), the percentage of reported samples that are positive
Recall = TP/(TP+FN), the percentage of positive samples that are reported
Precision/recall curves have a hyperbolic shape
The three-point average is the average precision at 20%, 50% and 80% recall
F-measure = 2·precision·recall/(precision+recall), the harmonic mean of precision and recall
– makes precision and recall as equal as possible
Specificity = TN/(TN+FP), the percentage of negative samples that are not reported
Area under the ROC curve (AUC)
All of these can be computed directly from confusion-matrix counts, as sketched below.
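A small helper (not from the slides) that computes the measures above from the four confusion-matrix counts:

```python
# Classification measures from confusion-matrix counts (sketch).
def classification_measures(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # also the TP rate of the ROC curve
    f_measure = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)            # negatives correctly left unreported
    return precision, recall, f_measure, specificity

print(classification_measures(tp=40, fp=10, fn=20, tn=30))
```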

Page 30: Machine Learning and Bioinformatics 機器學習與生物資訊學

Summary of some measures

Chart                    Domain                  Plot                   Explanation
Lift chart               Marketing               TP vs. subset size     TP; (TP+FP)/(TP+FP+TN+FN)
ROC curve                Communications          TP rate vs. FP rate    TP/(TP+FN); FP/(FP+TN)
Recall-precision curve   Information retrieval   Precision vs. recall   TP/(TP+FN); TP/(TP+FP)

Page 31: Machine Learning and Bioinformatics 機器學習與生物資訊學

Evaluating numeric prediction
The same strategies apply, including independent testing sets, cross-validation, significance tests, etc.

Page 32: Machine Learning and Bioinformatics 機器學習與生物資訊學

Measures in numeric prediction
Actual target values: a_1, a_2, …, a_n
Predicted target values: p_1, p_2, …, p_n
The most popular measure is the mean squared error (MSE), ((p_1 - a_1)² + … + (p_n - a_n)²) / n, because it is easy to manipulate mathematically.

Page 33: Machine Learning and Bioinformatics 機器學習與生物資訊學

Other measures
Root mean squared error (RMSE) = √MSE
Mean absolute error (MAE), (|p_1 - a_1| + … + |p_n - a_n|) / n, is less sensitive to outliers than MSE
Sometimes relative error values are more appropriate

Page 34: Machine Learning and Bioinformatics 機器學習與生物資訊學

Improvement on the mean
How much does the scheme improve on simply predicting the average?
Relative squared error = ((p_1 - a_1)² + … + (p_n - a_n)²) / ((ā - a_1)² + … + (ā - a_n)²)
Relative absolute error = (|p_1 - a_1| + … + |p_n - a_n|) / (|ā - a_1| + … + |ā - a_n|)
where ā is the mean of the actual values; these measures are computed together in the sketch below.
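A minimal sketch (NumPy only) computing the numeric-prediction measures from the last three slides, plus the correlation coefficient discussed next; the data is a placeholder:

```python
# Numeric-prediction measures (sketch); p = predicted values, a = actual values.
import numpy as np

def numeric_measures(p, a):
    mse = np.mean((p - a) ** 2)                                 # mean squared error
    rmse = np.sqrt(mse)                                         # root mean squared error
    mae = np.mean(np.abs(p - a))                                # mean absolute error
    rse = np.sum((p - a) ** 2) / np.sum((a.mean() - a) ** 2)    # relative squared error
    rae = np.sum(np.abs(p - a)) / np.sum(np.abs(a.mean() - a))  # relative absolute error
    corr = np.corrcoef(p, a)[0, 1]                              # correlation coefficient
    return mse, rmse, mae, rse, rae, corr

a = np.array([10.0, 12.0, 15.0, 20.0])      # placeholder actual values
p = np.array([11.0, 13.0, 14.0, 18.0])      # placeholder predictions
print(numeric_measures(p, a))
```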

Page 35: Machine Learning and Bioinformatics 機器學習與生物資訊學

Correlation coefficient / 相關係數
Measures the statistical correlation between the predicted values and the actual values. It is scale independent and lies between -1 and +1; good performance leads to large values.

Page 37: Machine Learning and Bioinformatics 機器學習與生物資訊學

Which measure?
Best to look at all of them; often it doesn't matter. In the example below, D is the best, C the second-best, and A and B are arguable.

                               A       B       C       D
Root mean squared error        67.8    91.7    63.3    57.4
Mean absolute error            41.3    38.5    33.4    29.2
Root relative squared error    42.2%   57.2%   39.4%   35.8%
Relative absolute error        43.1%   40.1%   34.8%   30.4%
Correlation coefficient        0.88    0.88    0.89    0.91

Page 38: Machine Learning and Bioinformatics 機器學習與生物資訊學


Today’s exercise

Page 39: Machine Learning and Bioinformatics 機器學習與生物資訊學

Parameter tuning
Design your own select, feature, buy and sell programs. Upload and test them in our simulation system. Finally, commit your best version and send TA Jang a report before 23:59 11/5 (Mon).

Page 40: Machine Learning and Bioinformatics 機器學習與生物資訊學

Possible ways
Enlarge the parameter range in CV
Stratified, repeated…
– minimize the variance
Make a tuning set
– use a large training set; make the tuning set as similar to the target stocks as possible
Cost matrix
– use resampling, otherwise it would be very difficult
Change measures
– or plot ROC curves to understand your classifiers
The best measure is the transaction profit, but it requires the simulation system. Instead, you can develop a compromise evaluation script, which is more complicated than any theoretical measure but simpler than the real problem. This is usually required in practice.