Costs and Filters

Dr. Avi Rosenfeld

Department of Industrial Engineering, Jerusalem College of Technology

[email protected]

Concepts in This Lecture

Minority Class Problem
Underfitting
Overfitting
Feature Selection
Correlation Feature Selection
SMOTE
Cost-Based Learning
MetaCost

Under-fitting / Over-fitting

Underfitting: poor accuracy on the TEST DATA because the model did not fit the TRAINING data.
Maybe you need more DATA in the number of rows (instances).
Maybe you need more DATA in the number of columns (features).

Overfitting: poor accuracy on the TEST DATA because the model was too good (!) on the TRAINING data.
The model's ability to predict what will happen on future DATA is poor!
The TRAINING picked up unimportant details.
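The contrast between the two failure modes can be made concrete with a minimal sketch. The data below is hand-made for illustration (it is not from the lecture): the true rule is "label 1 when x > 0.5", and two training labels are deliberately noisy. The three model functions are hypothetical stand-ins for a model that is too simple, one that is about right, and one that memorizes the training rows.

```python
# Hand-made toy data (an assumption for illustration, not lecture data).
# True rule: label = 1 when x > 0.5; the rows at x=0.3 and x=0.8 are noisy.
train = [(0.1, 0), (0.2, 0), (0.3, 1), (0.4, 0),
         (0.6, 1), (0.7, 1), (0.8, 0), (0.9, 1)]
test = [(0.15, 0), (0.28, 0), (0.32, 0), (0.45, 0),
        (0.55, 1), (0.78, 1), (0.82, 1), (0.95, 1)]


def constant_model(x):
    """Too simple: ignores x entirely, so it underfits."""
    return 0


def threshold_model(x):
    """Follows the general trend and ignores the two noisy rows."""
    return 1 if x > 0.5 else 0


def memorizing_model(x):
    """1-nearest-neighbour: copies the label of the closest training row."""
    _, nearest_label = min(train, key=lambda row: abs(row[0] - x))
    return nearest_label


def accuracy(model, rows):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(model(x) == label for x, label in rows) / len(rows)


for name, model in [("constant", constant_model),
                    ("threshold", threshold_model),
                    ("memorizing", memorizing_model)]:
    print(f"{name:10s} train={accuracy(model, train):.2f} "
          f"test={accuracy(model, test):.2f}")
```

In this toy setup the memorizing model is perfect on the training rows but poor on the test rows, because it also memorized the noisy labels, while the constant model is poor on both.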

The General Problem: Minority Class Problem

What do you do if the data is not balanced?
A simple but unreasonable solution: ignore the MINORITY.

Example: suppose 99% of the people do not have cancer and 1% have cancer.
What will the accuracy of ZERO be?
What will the PRECISION and RECALL of each category be?
Is this closer to Underfitting or Overfitting?
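The questions above can be worked through numerically. This sketch assumes that "ZERO" refers to a majority-class baseline (such as WEKA's ZeroR), which always predicts the most common class. The population of 1,000 people and the class names are made up to match the 99% / 1% split.

```python
from collections import Counter

# Assumed toy population matching the 99% / 1% split on the slide.
actual = ["healthy"] * 990 + ["cancer"] * 10

# Majority-class baseline: always predict the most frequent class.
majority_class = Counter(actual).most_common(1)[0][0]
predicted = [majority_class] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def precision_recall(label):
    """Return (precision, recall) for one class; None when undefined (0/0)."""
    true_pos = sum(a == label and p == label for a, p in zip(actual, predicted))
    n_predicted = sum(p == label for p in predicted)
    n_actual = sum(a == label for a in actual)
    precision = true_pos / n_predicted if n_predicted else None
    recall = true_pos / n_actual if n_actual else None
    return precision, recall


print("accuracy:", accuracy)
print("healthy :", precision_recall("healthy"))
print("cancer  :", precision_recall("cancer"))
```

Under these assumptions the baseline reaches 99% accuracy while finding none of the cancer cases: recall for the minority class is 0 and its precision is undefined, since the class is never predicted.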

Quick Overview of Feature Selection

Dr. Jaume Bacardit (now at Newcastle)

Topic 2: Data Preprocessing
Lecture 3: Feature and prototype selection

http://www.cs.nott.ac.uk/~jqb/G54DMT

Feature Selection: a possible solution to OVERFITTING

Transforming a dataset by removing some of its columns

A1 A2 A3 A4 C  ->  A2 A4 C
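As a minimal sketch of the transformation, the snippet below keeps columns A2, A4 and the class column C and drops the rest. The row values are invented for illustration, and which columns to keep is simply taken from the slide's example rather than computed.

```python
# Invented example rows; only the column names come from the slide.
rows = [
    {"A1": 5.1, "A2": 3.5, "A3": 1.4, "A4": 0.2, "C": "yes"},
    {"A1": 6.2, "A2": 2.9, "A3": 4.3, "A4": 1.3, "C": "no"},
]

# Columns that survive feature selection (the class column C is always kept).
keep = ["A2", "A4", "C"]

# Build the reduced dataset by copying only the kept columns from each row.
reduced = [{name: row[name] for name in keep} for row in rows]

for row in reduced:
    print(row)
```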

Feature Selection

Great article: An Introduction to Variable and Feature Selection (Guyon and Elisseeff, 2003)

Three basic approaches: filters, wrappers, and embedded methods

In our lesson we will use a filter and also a WRAPPER.

Dataset -> Filter -> Classification method

How do we choose what to remove?

ENTROPY / INFOGAIN: remove features whose INFOGAIN is not above a certain threshold.

Remove features that are close to one another. One option: CFS (Correlation Feature Selection)

"Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other"

We want features that are different from each other and also good.
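The INFOGAIN filter mentioned above can be sketched in a few lines. This is an illustrative implementation for categorical features only; the four-row dataset, the feature names A1 and A2, and the threshold of 0.1 are all assumptions chosen so that one feature is perfectly informative and the other is useless.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, feature, target):
    """Entropy of the class minus the weighted entropy after splitting on feature."""
    labels = [row[target] for row in rows]
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [row[target] for row in rows if row[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


# Invented data: A1 determines the class C, A2 tells us nothing about it.
rows = [
    {"A1": "x", "A2": "p", "C": "yes"},
    {"A1": "x", "A2": "q", "C": "yes"},
    {"A1": "y", "A2": "p", "C": "no"},
    {"A1": "y", "A2": "q", "C": "no"},
]

threshold = 0.1  # hypothetical cut-off; in practice this is tuned
selected = [f for f in ["A1", "A2"] if info_gain(rows, f, "C") >= threshold]

print({f: info_gain(rows, f, "C") for f in ["A1", "A2"]})
print("kept features:", selected)
```

Note that this filter scores each feature on its own. It does not address the second criterion on the slide (features that are close to one another), which is what CFS adds.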

To solve the MINORITY problem, I need:

Cost-Sensitive Learning
Add a cost to each type of classification.
Usually, algorithms do not distinguish between categories.
But there are m-a-n-y applications where this matters:
Cancer diagnosis
Risk detection (computer attacks, fraud, etc.)

Class Imbalance vs. Asymmetric Misclassification Costs

Class imbalance: one class occurs much more often than the other.

Asymmetric misclassification costs: the cost of misclassifying an example from one class is much larger than the cost of misclassifying an example from the other class.

In my experience, the two concepts usually come together, and the solutions are similar too.

Examples:
Remove rows from the MAJORITY CLASS
Add rows to the MINORITY CLASS
Add a Filter to the DATA: SMOTE
Add a cost to the MINORITY: METACOST

Which situation is better???

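Cost matrix (rows: actual, columns: predicted)

              Predicted P   Predicted N
Actual P           0             2
Actual N           1             0

Confusion matrix 1 (rows: actual, columns: predicted)

              Predicted P   Predicted N
Actual P          10          20 (FN)
Actual N          15 (FP)    105

Error rate: 35/150
Cost: 15x1 + 20x2 = 55

Confusion matrix 2 (rows: actual, columns: predicted)

              Predicted P   Predicted N
Actual P          20          10 (FN)
Actual N          30 (FP)     90

Error rate: 40/150
Cost: 30x1 + 10x2 = 50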
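The arithmetic on this slide can be reproduced with a short sketch. The matrices are the slide's own numbers, laid out with rows as the actual class and columns as the predicted class, both in the order (P, N); the function names are just illustrative.

```python
# Rows are the actual class, columns the predicted class, both ordered (P, N).
cost_matrix = [[0, 2],   # actual P: correct = 0, false negative = 2
               [1, 0]]   # actual N: false positive = 1, correct = 0

confusion_1 = [[10, 20],
               [15, 105]]
confusion_2 = [[20, 10],
               [30, 90]]


def error_count(confusion):
    """Return (number of misclassified examples, total examples)."""
    total = sum(sum(row) for row in confusion)
    errors = confusion[0][1] + confusion[1][0]
    return errors, total


def total_cost(confusion, costs):
    """Sum of count * cost over every cell of the confusion matrix."""
    return sum(confusion[i][j] * costs[i][j]
               for i in range(2) for j in range(2))


for name, confusion in [("matrix 1", confusion_1), ("matrix 2", confusion_2)]:
    errors, total = error_count(confusion)
    print(f"{name}: error rate {errors}/{total}, "
          f"cost {total_cost(confusion, cost_matrix)}")
```

Matrix 1 makes fewer errors (35 versus 40) but costs more (55 versus 50), because more of its errors are the expensive false negatives. Which situation is "better" therefore depends on whether you count errors or costs.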

Making a Classifier Balanced by Changing the Data (Filter)

Baseline Methods
  Random over-sampling
  Random under-sampling

Under-sampling Methods
  Tomek links
  Condensed Nearest Neighbor Rule
  One-sided selection
  CNN + Tomek links
  Neighborhood Cleaning Rule

Over-sampling Methods
  Smote

Combination of Over-sampling method with Under-sampling method
  Smote + Tomek links
  Smote + ENN
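To give a feel for the baseline methods and for Smote, here is a minimal sketch on made-up 2-D points. The `smote_like` function is a simplified stand-in for SMOTE, not a library implementation: it picks a minority point, picks one of its `k` nearest minority neighbours, and creates a new point somewhere on the line between them. All names, the seed, and the data are assumptions for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up data: 20 majority points near the origin, 4 minority points far away.
majority = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(20)]
minority = [(2.0, 2.0), (2.2, 2.1), (2.1, 2.4), (2.5, 2.2)]


def random_under_sample(points, size):
    """Keep a random subset of the majority class."""
    return rng.sample(points, size)


def random_over_sample(points, size):
    """Duplicate random minority points until the class reaches `size`."""
    return points + [rng.choice(points) for _ in range(size - len(points))]


def smote_like(points, n_new, k=2):
    """Create n_new synthetic points by interpolating towards near neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        neighbours = sorted(
            (p for p in points if p != base),
            key=lambda p: (p[0] - base[0]) ** 2 + (p[1] - base[1]) ** 2,
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


print(len(random_under_sample(majority, 4)), "majority rows after under-sampling")
print(len(random_over_sample(minority, 20)), "minority rows after over-sampling")
print(len(minority + smote_like(minority, 16)), "minority rows after SMOTE-like step")
```

Unlike random over-sampling, the SMOTE-like step adds new points rather than exact copies, so the classifier sees a filled-in minority region instead of the same few rows repeated.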

MetaCost

Wraps a cost-minimizing procedure (a "meta-learning" stage) around the classifier.

Treat the learning method as a "black box".
You can change the COST and influence the results.

Example: when do people activate CRUISE CONTROL?

[Figure: Overall Accuracy (y-axis, 75 to 95) versus Recall of Minority Case (x-axis, 0 to 0.7); series: All, Without]

Metacost from WEKA, without added cost

Metacost from WEKA, with cost
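The cost-minimizing stage that MetaCost wraps around a black-box classifier can be sketched as follows. As I understand the published procedure, it estimates class probabilities for each training example (using bagging), relabels each example with the class that has the lowest expected cost, and then retrains the classifier on the relabeled data. The sketch below covers only the relabeling step; the probabilities are hypothetical inputs, and the cost values are the ones from the cost matrix earlier in the lecture (false negative = 2, false positive = 1). This is not WEKA's implementation.

```python
CLASSES = ["P", "N"]

# COST[actual][predicted], taken from the lecture's cost matrix.
COST = {"P": {"P": 0, "N": 2},
        "N": {"P": 1, "N": 0}}


def expected_cost(predicted, probabilities):
    """Expected cost of predicting `predicted`, given P(actual class | x)."""
    return sum(probabilities[actual] * COST[actual][predicted]
               for actual in CLASSES)


def min_cost_class(probabilities):
    """Class with the lowest expected cost under the cost matrix."""
    return min(CLASSES, key=lambda c: expected_cost(c, probabilities))


def relabel(training_probabilities):
    """Relabeling step: replace each example's label by its min-cost class."""
    return [min_cost_class(p) for p in training_probabilities]


# Hypothetical probability estimates from some black-box classifier.
estimates = [{"P": 0.4, "N": 0.6}, {"P": 0.3, "N": 0.7}, {"P": 0.9, "N": 0.1}]
print(relabel(estimates))
```

With these costs, an example is labeled P whenever its estimated probability of P exceeds 1/3, not 1/2. Changing the COST moves that boundary, which is how the cost setting trades overall accuracy against recall of the minority class.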

Preparation for the exercise...

AUC (Area under ROC)


Lift: Note top Left


Attribute Selection and Smote


MetaCost


Choosing a learning method and COST

