Machine Learning: An Introduction by Assoc. Prof. Dr. Surapong Auwatanamongkol
TRANSCRIPT
The First NIDA Business Analytics and Data Sciences Contest/Conference, 1-2 September 2016, Navamindradhiraj Building, National Institute of Development Administration (NIDA)
https://businessanalyticsnida.wordpress.com
https://www.facebook.com/BusinessAnalyticsNIDA/
By Assoc. Prof. Dr. Surapong Auwatanamongkol, Data Science Program,
Graduate School of Applied Statistics, National Institute of Development Administration
Machine Learning: An Introduction
How do machines learn? What can machines learn? What applications can machine learning be put to? Does studying machine learning require advanced mathematics? What software is available for machine learning? How many types of machine learning are there, and what is each type used for?
Navamindradhiraj 3003, 1 September 2016, 10:15-12:30
Machine Learning
An Introduction
Types of Machine Learning
• Supervised Learning (Classification, Prediction)
• Unsupervised Learning (Cluster Analysis)
• Association Analysis
• Reinforcement Learning
• Evolutionary Learning
Classification
• Based on Supervised Learning
• Given a collection of records (training set)
– Each record contains a set of attributes, one of the attributes is the class.
• Find a model for class attribute as a function of the values of other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
Classification Task

[Figure: a learning algorithm performs induction on the Training Set to learn a Model; the Model is then applied, by deduction, to the Test Set.]

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of Classification Tasks
• Predicting potential customers of a new product
• Identifying spam emails or network intrusion
connections
• Classifying credit risks of customers
• Categorizing news stories as finance, weather,
entertainment, sports, etc.
Classification Techniques
• Decision Trees
• K-nearest Neighbors
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
• Ensemble Methods
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
Decision Tree Classification Task

[Figure: the same induction/deduction workflow as the classification task above, with a Tree Induction algorithm learning a Decision Tree model from the Training Set (Tids 1-10) and applying it to the Test Set (Tids 11-15).]
Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branches that match the test record:
1. Refund = No, so take the "No" branch down to the MarSt node.
2. Marital Status = Married, so take the "Married" branch, which leads to a NO leaf.
3. Assign Cheat to "No".
Decision Boundary

[Figure: points of two classes on the unit square, partitioned by a decision tree that tests x < 0.43 at the root, then y < 0.47 on one side and y < 0.33 on the other; each resulting rectangular region contains records of only one class.]

• The border between two neighboring regions of different classes is
known as the decision boundary
• The decision boundary is parallel to the axes because each test condition
involves a single attribute at a time
Tree Induction
• Greedy strategy
– Recursively split the training records assigned to each
node, from the root node down to the leaf nodes, using
the attribute test that optimizes a certain criterion, e.g.,
the gain in homogeneity of the training records at each
node in the tree
– Measures of homogeneity of the training records at
a tree node: Entropy, GINI
– Stop splitting when some predefined criterion is
met, e.g., the measure reaches a predefined
threshold
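As a sketch of tree induction in practice, the following uses scikit-learn's DecisionTreeClassifier on data modeled on the Refund / Marital Status / Taxable Income example above; the numeric encoding of the categorical attributes is an illustrative choice, not part of the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Encode: Refund (Yes=1/No=0), MaritalStatus (Married=0/Single=1/Divorced=2),
# TaxableIncome in thousands. The encoding is an illustrative assumption.
X = np.array([
    [1, 1, 125], [0, 0, 100], [0, 1, 70], [1, 0, 120], [0, 2, 95],
    [0, 0, 60],  [1, 2, 220], [0, 1, 85], [0, 0, 75],  [0, 1, 90],
])
y = np.array(["No", "No", "No", "No", "Yes",
              "No", "No", "Yes", "No", "Yes"])

# Entropy as the homogeneity measure; splitting stops when leaves are pure
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Classify the test record from the slides: Refund=No, Married, 80K
print(tree.predict([[0, 0, 80]]))  # -> ['No']
```

The fitted tree reproduces the slide's prediction: a married, non-refund record with 80K taxable income lands in a pure "No" leaf.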
Measure of Impurity: GINI
• Gini Index for a given node t:

    GINI(t) = 1 - Σ_j [ p(j | t) ]²

• p(j | t) is the relative frequency of class j at node t.
– Maximum (1 - 1/n_c, where n_c is the number of classes)
when records are equally distributed among all classes,
implying the least interesting information
– Minimum (0.0) when all records belong to one class,
implying the most interesting information
Measure of Impurity: Entropy
• Entropy at a given node t:

    Entropy(t) = - Σ_j p(j | t) log p(j | t)

• p(j | t) is the relative frequency of class j at node t.
– Measures the impurity of a node:
• Maximum (log n_c) when records are equally distributed
among all classes, implying the least information
• Minimum (0.0) when all records belong to one class,
implying the most information
Nearest Neighbor Classifiers

[Figure: given a test record, compute its distance to all training records and choose the k "nearest" ones.]
Nearest-Neighbor Classifiers
Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve
To classify an unknown record:
– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
Nearest Neighbor Classification
• Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from
other classes
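The three ingredients listed above (stored records, a distance metric, and k) can be sketched in plain Python with Euclidean distance and majority voting; the toy points are illustrative, not from the slides.

```python
from collections import Counter
import math

def knn_classify(train, test_point, k):
    """train: list of (point, label) pairs; returns the majority label
    among the k training records nearest to test_point."""
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Illustrative training records: two well-separated classes
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]

print(knn_classify(train, (2, 2), k=3))  # -> A (all 3 nearest are class A)
print(knn_classify(train, (7, 8), k=3))  # -> B
```

Choosing k = 3 here smooths over a single noisy neighbor while the neighborhood stays within one class, matching the trade-off described above.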
Bayesian Classifiers
• Consider each attribute and the class label as random
variables
• Given a record with attributes (A1, A2, …, An)
– Goal is to predict class C
– Specifically, we want to find the value of C that
maximizes P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from
data?
Bayesian Classifiers
• Approach:
– Compute the posterior probability P(C | A1, A2, …, An) for
all values of C using Bayes' theorem:

    P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

– Choose the value of C that maximizes P(C | A1, A2, …, An)
– Equivalent to choosing the value of C that maximizes
P(A1, A2, …, An | C) P(C), since the denominator does not depend on C
• How to estimate P(A1, A2, …, An | C)?
Naïve Bayes Classifier
• Assume independence among the attributes Ai when the
class is given:
– P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
– Can estimate P(Ai | Cj) for all Ai and Cj.
– A new unknown record is classified as Cj if P(Cj) Π_i P(Ai | Cj)
is maximal.
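A hand-rolled sketch of this rule for categorical attributes follows: it estimates P(Cj) and each P(Ai | Cj) from counts and picks the class maximizing their product. The weather-style toy data is illustrative, not from the slides.

```python
from collections import Counter, defaultdict

def train_nb(records, labels):
    n = len(labels)
    class_count = Counter(labels)
    prior = {c: cnt / n for c, cnt in class_count.items()}   # P(Cj)
    cond = defaultdict(Counter)   # cond[(i, c)][value] = count of Ai=value given Cj=c
    for rec, c in zip(records, labels):
        for i, v in enumerate(rec):
            cond[(i, c)][v] += 1

    def classify(rec):
        def score(c):
            p = prior[c]
            for i, v in enumerate(rec):
                p *= cond[(i, c)][v] / class_count[c]   # P(Ai | Cj)
            return p
        return max(prior, key=score)   # class with maximal P(Cj) * prod_i P(Ai|Cj)
    return classify

records = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"),
           ("rain", "cool"), ("sunny", "hot"), ("rain", "cool")]
labels  = ["no", "no", "yes", "yes", "no", "yes"]

classify = train_nb(records, labels)
print(classify(("rain", "mild")))   # -> yes
print(classify(("sunny", "hot")))   # -> no
```

In practice a smoothing term (e.g. Laplace correction) is usually added so that an unseen attribute value does not zero out the whole product; it is omitted here to keep the formula visible.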
Artificial Neural Networks (ANN)

Perceptron Model:

    Y = I( Σ_i w_i X_i - t )    or    Y = sign( Σ_i w_i X_i - t )

[Figure: a "black box" perceptron with input nodes X1, X2, X3 connected by weighted links w1, w2, w3 to an output node Y with threshold t.]

• The model is an assembly of inter-connected
nodes and weighted links
• The output node sums up each of its input values
according to the weights of its links
• The output node's sum is compared against some
threshold t
General Structure of ANN

[Figure: neuron i receives inputs I1, I2, I3 through weights wi1, wi2, wi3; the weighted sum Si is passed through an activation function g(Si) to produce output Oi, compared against threshold t. A full network stacks an input layer (x1 … x5), a hidden layer, and an output layer (y).]

Training an ANN means learning the weights of the neurons as well as the thresholds t.

A common activation function is the sigmoid:

    sigmoid(x) = 1 / (1 + e^(-x))

so that

    Y = sigmoid( Σ_i w_i X_i - t )
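The two unit outputs above can be sketched in a few lines of NumPy; the weights, inputs, and thresholds are illustrative values, not from the slides.

```python
import numpy as np

def perceptron(w, x, t):
    """Y = sign(sum_i w_i * x_i - t)"""
    return np.sign(np.dot(w, x) - t)

def sigmoid_unit(w, x, t):
    """Y = sigmoid(sum_i w_i * x_i - t), where sigmoid(s) = 1 / (1 + e^-s)"""
    s = np.dot(w, x) - t
    return 1.0 / (1.0 + np.exp(-s))

w = np.array([0.5, 0.5, -0.4])   # illustrative link weights
x = np.array([1.0, 1.0, 1.0])    # illustrative inputs

print(perceptron(w, x, 0.4))     # sign(0.6 - 0.4) -> 1.0
print(sigmoid_unit(w, x, 0.6))   # sigmoid(0.6 - 0.6) = sigmoid(0) -> 0.5
```

The sigmoid replaces the hard sign threshold with a smooth, differentiable function, which is what makes gradient-based training (below) possible.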
Backpropagation Algorithm
• Gradient Descent, illustrated using a single weight
w1 of w:
– Preferred values for w1 minimize the sum of squared errors

    SSE = Σ_i ( Y_i - f(w, X_i) )²

– The optimal value for w1 is w1*

[Figure: SSE plotted as a curve over w1, with the minimum at w1*, between points w1L and w1R on either side.]
Backpropagation Algorithm
– The direction for adjusting w_CURRENT is the negative of the
sign of the derivative of SSE at w_CURRENT:

    direction = - sign( ∂SSE/∂w  at w_CURRENT )

– The size of the adjustment uses the magnitude of the derivative
of SSE at w_CURRENT:
– When the curve is steep, the adjustment is large
– When the curve is nearly flat, the adjustment is small
– The learning rate η takes values in [0, 1]:

    w_NEW = w_CURRENT - η · ( ∂SSE/∂w  at w_CURRENT )
Support Vector Machines
• Find the hyperplane that maximizes the margin => B1 is better than B2

[Figure: two-class data separated by hyperplanes B1 and B2; B1's margin (between b11 and b12) is wider than B2's (between b21 and b22).]

Support Vector Machines
• The decision boundary is w · x + b = 0, with margin hyperplanes
w · x + b = 1 and w · x + b = -1:

    f(x) = +1 (positive class)  if w · x + b ≥ 1
    f(x) = -1 (negative class)  if w · x + b ≤ -1

    Margin = 2 / ||w||²
Support Vector Machines
• We want to maximize:

    Margin = 2 / ||w||²

– Which is equivalent to minimizing:

    L(w) = ||w||² / 2

– But subject to the following constraints:

    f(x_i) = +1  if w · x_i + b ≥ 1
    f(x_i) = -1  if w · x_i + b ≤ -1

– This is a constrained optimization problem.
Numerical approaches, e.g. quadratic
programming, can be used to solve it.
Support Vector Machines
• Decision function for classifying a given data point z:

    f(z) = sign( Σ_{i ∈ SV} λ_i y_i x_i · z + b )

where SV is the set of support vectors and the λ_i are the Lagrange multipliers from the optimization.
Nonlinear Support Vector Machines
• What if the decision boundary is not linear?
Nonlinear Support Vector Machines
• Transform the data vectors X into a new, higher-dimensional space
• Some kernel functions can be used to compute the dot
product between any two given original data vectors in
the new data space (without the need for actual data
transformation).
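As a sketch, scikit-learn's SVC with an RBF kernel handles a linearly inseparable (XOR-style) data set; the data points and the gamma/C values are illustrative choices, not from the slides.

```python
from sklearn.svm import SVC

# XOR pattern: no single hyperplane separates the two classes
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [0.1, 0.1], [0.9, 0.9], [0.1, 0.9], [0.9, 0.1]]
y = [0, 1, 1, 0, 0, 0, 1, 1]

# The RBF kernel computes dot products in an implicit high-dimensional
# space, so no explicit transformation of X is needed
clf = SVC(kernel="rbf", gamma=2.0, C=10.0)
clf.fit(X, y)

print(clf.predict([[0.05, 0.05], [0.05, 0.95]]))  # -> [0 1]
```

A point near the (0, 0) corner is classified with the class-0 corners and a point near (0, 1) with the class-1 corners, even though the boundary between them cannot be linear.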
Ensemble Methods
• Construct a set of classifiers from the training
data
• Predict class label of previously unseen records
by aggregating predictions made by multiple
classifiers
General Idea

[Figure: Step 1: create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D. Step 2: build a classifier Ci on each data set. Step 3: combine the classifiers into a single classifier C*.]
Why does it work?
• Suppose there are 25 base classifiers
– Each classifier has error rate ε = 0.35
– Assume the classifiers are independent
– The ensemble (majority vote) makes a wrong prediction only
when at least 13 base classifiers are wrong, with probability:

    Σ_{i=13}^{25} C(25, i) ε^i (1 - ε)^{25-i} ≈ 0.06
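The binomial sum above can be checked directly:

```python
from math import comb

eps, n = 0.35, 25
# Probability that a majority (at least 13 of 25) of independent base
# classifiers are wrong at the same time
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 2))   # -> 0.06
```

So 25 weak, independent classifiers with 35% individual error combine into an ensemble with roughly 6% error, which is why aggregating many base classifiers works.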
Examples of Ensemble Methods
• How to generate an ensemble of classifiers?
– Bagging
– Boosting
Bagging
• Sampling with replacement
• Build a classifier on each bootstrap sample
• Each record has probability 1 - (1 - 1/n)^n of being
selected at least once in a given bootstrap sample
(approaching 1 - 1/e ≈ 0.632 for large n)

Original Data      1  2  3  4  5  6  7  8  9  10
Bagging (Round 1)  7  8  10 8  2  5  10 10 5  9
Bagging (Round 2)  1  4  9  1  2  3  2  7  3  2
Bagging (Round 3)  1  8  5  10 5  5  9  6  3  7
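The selection probability stated above can be sketched both in closed form and empirically; the seed and round count are illustrative.

```python
import random

n = 10
p_theory = 1 - (1 - 1/n) ** n   # probability a record appears in a bootstrap sample
print(round(p_theory, 3))       # -> 0.651 for n = 10

# Empirical check: average fraction of distinct records per bootstrap sample
random.seed(0)
rounds = 10000
hits = sum(len(set(random.choices(range(n), k=n))) for _ in range(rounds))
print(round(hits / (rounds * n), 2))   # ≈ 0.65
```

On average about 65% of the records appear in each round's sample (with the classic 0.632 limit for large n); the remaining "out-of-bag" records are often used to estimate the classifier's error.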
Boosting
• An iterative procedure that adaptively changes the
distribution of the training data by focusing more on
previously misclassified records
– Initially, all N records are assigned equal
weights
– Unlike bagging, the weights may change at the
end of each boosting round
Boosting
• Records that are wrongly classified will have their
weights increased
• Records that are classified correctly will have
their weights decreased
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4
• Example 4 is hard to classify
• Its weight is increased, therefore it is more
likely to be chosen again in subsequent rounds
What is Cluster Analysis?
• Finding groups of objects such that the objects in
a group will be similar (or related) to one another
and different from (or unrelated to) the objects in
other groups
– Intra-cluster distances are minimized
– Inter-cluster distances are maximized
Applications of Cluster Analysis
• Understanding
– Group related documents for browsing, group
customers into segments or group stocks with similar
price fluctuations
• Summarization
– Reduce the size of large data sets by sampling data
from each cluster
K-means Clustering
• Each cluster is associated with a centroid
(center point)
• Each data point is assigned to the cluster with
the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
K-Means Algorithm

[Figure: six snapshots (Iteration 1 through Iteration 6) of K-means on a 2-D data set with x in [-2, 2] and y in [0, 3]; the centroids move at each iteration and the cluster assignments stabilize by iteration 6.]
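The assign-then-recenter loop behind the iterations above can be sketched in NumPy; the two-blob toy data, the seeded initialization, and K = 2 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),   # blob near (0, 0)
                  rng.normal(3.0, 0.3, size=(20, 2))])  # blob near (3, 3)

K = 2
centroids = data[[0, 20]]          # initialize with one point from each blob
for _ in range(10):
    # assign each point to the cluster with the closest centroid
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # move each centroid to the mean of its assigned points
    centroids = np.array([data[labels == k].mean(axis=0) for k in range(K)])

print(centroids.round(2))   # one centroid near (0, 0), one near (3, 3)
```

Note the simple implementation above assumes no cluster ever becomes empty; production code (or sklearn.cluster.KMeans) handles empty clusters and runs multiple random restarts, since K-means can converge to a poor local optimum.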
Limitations of K-means
• K-means has problems when clusters are of
differing
– Sizes
– Densities
– Non-globular shapes
• K-means has problems when the data contains
outliers.
Hierarchical Clustering
• Produces a set of nested clusters organized as a
hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences
of merges or splits

[Figure: six points (1-6) in the plane and the corresponding dendrogram, whose merge heights range from 0 to about 0.2.]
Agglomerative Clustering Algorithm
• A popular hierarchical clustering technique
• Basic algorithm is straightforward
1. Compute the proximity matrix (similarities between
pairs of clusters)
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
May not be suitable for large datasets due to the cost
of computing and updating the proximity matrix
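The merge loop above is what SciPy's hierarchical-clustering routines implement; the following sketch uses six illustrative 2-D points (not the ones in the slides) and single linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],   # one tight group
                   [5.0, 5.0], [5.1, 5.1], [5.0, 5.2]])  # another tight group

# linkage records the n-1 merges row by row; method='single' corresponds to
# MIN inter-cluster similarity ('complete' = MAX, 'average' = Group Average,
# 'ward' = Ward's Method)
Z = linkage(points, method="single")

# cut the dendrogram so that exactly 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # first three points share one label, last three the other
```

The linkage matrix Z is the dendrogram in tabular form, so the same fit can be cut at any height without re-clustering.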
How to Define Inter-Cluster Similarity

[Figure: a proximity matrix over points p1 … p5; which of its entries define the similarity between two clusters?]

• MIN (single link)
• MAX (complete link)
• Group Average
• Distance Between Centroids
• Ward's Method (uses squared error)
Hierarchical Clustering: Comparison

[Figure: clusterings of the same six points (1-6) produced by MIN, MAX, Group Average, and Ward's Method; the four criteria merge the points in different orders and produce different cluster shapes.]
Other Issues
• Data Cleaning
• Data Sampling
• Dimension Reduction
• Data Visualization
• Overfitting and Underfitting Problems
• Class Imbalance Issues