Machine Learning: An Introduction, by รศ.ดร.สุรพงค์ เอื้อวัฒนามงคล


The First NIDA Business Analytics and Data Sciences Contest/Conference, 1-2 September 2016 (B.E. 2559), Navamindradhiraj Building, National Institute of Development Administration (NIDA)

https://businessanalyticsnida.wordpress.com
https://www.facebook.com/BusinessAnalyticsNIDA/

By รศ.ดร.สุรพงค์ เอื้อวัฒนามงคล, Data Science Program,

School of Applied Statistics, National Institute of Development Administration

Machine Learning: An Introduction

How do machines learn? What can machines learn? What applications can machine learning be put to? Does studying machine learning require advanced mathematics? What software is available for machine learning? How many types of machine learning are there, and what is each type used for?

Navamindradhiraj Room 3003, 1 September 2016, 10:15-12:30

Machine Learning

An Introduction

Types of Machine Learning

• Supervised Learning (Classification, Prediction)

• Unsupervised Learning (Cluster Analysis)

• Association Analysis

• Reinforcement Learning

• Evolutionary Learning

Classification

• Based on Supervised Learning

• Given a collection of records (training set)

– Each record contains a set of attributes; one of the attributes is the class.

• Find a model for the class attribute as a function of the values of the other attributes.

• Goal: previously unseen records should be assigned a class as accurately as possible.

– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
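As a concrete illustration (not part of the original slides), this train/test workflow can be sketched with scikit-learn; the Iris data merely stands in for a record collection with attributes and a class:

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # attribute values and class labels
# Divide the given data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # induction: learn the model
y_pred = model.predict(X_test)             # deduction: apply it to unseen records
print("Test accuracy:", accuracy_score(y_test, y_pred))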

Classification Task

[Diagram: a learning algorithm performs induction on the Training Set to learn a Model; the Model is then applied (deduction) to the Test Set.]

Training Set:

Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes

Test Set:

Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?

Examples of Classification Tasks

• Predicting potential customers of a new product

• Identifying spam emails or network intrusion connections

• Classifying credit risks of customers

• Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques

• Decision Trees

• K-nearest Neighbors

• Neural Networks

• Naïve Bayes and Bayesian Belief Networks

• Support Vector Machines

• Ensemble Methods

Example of a Decision Tree

Training Data:

Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc):

Refund?
  Yes -> NO (leaf)
  No  -> MarSt?
           Married -> NO (leaf)
           Single, Divorced -> TaxInc?
                                 < 80K -> NO (leaf)
                                 > 80K -> YES (leaf)
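For illustration (not from the slides), the pictured tree can be written out directly as code; a minimal Python sketch, assuming taxable income is given in thousands:

def classify_cheat(refund, marital_status, taxable_income_k):
    """Apply the example decision tree to one record."""
    if refund == "Yes":
        return "No"                   # Refund = Yes -> leaf NO
    if marital_status == "Married":
        return "No"                   # MarSt = Married -> leaf NO
    # Single or Divorced: split on taxable income at 80K
    return "No" if taxable_income_k < 80 else "Yes"

print(classify_cheat("No", "Married", 80))   # -> "No", as in the walk-through below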

Decision Tree Classification Task

[Diagram: the same induction/deduction flow as above, with the learning algorithm instantiated as a tree-induction algorithm; a decision tree is learned from the Training Set (records 1-10) and then applied to the Test Set (records 11-15).]

Apply Model to Test Data

Test Data:

Refund | Marital Status | Taxable Income | Cheat
No | Married | 80K | ?

Start from the root of the tree and follow the branches that match the record:

1. Refund = No, so take the No branch to the MarSt node.
2. Marital Status = Married, so take the Married branch, which reaches a leaf.
3. The leaf is labeled NO.

Assign Cheat to “No”.

Decision Boundary

[Figure: a two-dimensional data set (attributes x and y, each in [0, 1]) partitioned by a decision tree with the tests x < 0.43?, y < 0.33?, and y < 0.47?; each split produces an axis-parallel rectangular region dominated by one class.]

• The border line between two neighboring regions of different classes is known as the decision boundary.

• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.

Tree Induction

• Greedy strategy

– Split the training records assigned to each node, from the root node down to the leaf nodes, based on an attribute test that optimizes a certain criterion, e.g., the gain in homogeneity of the training records at each node in the tree.

– Measures of homogeneity of the training records at a tree node: Entropy, GINI.

– Stop splitting when some predefined criterion is met, e.g., the measures reach certain predefined thresholds.

Measure of Impurity: GINI

• Gini index for a given node t:

$$GINI(t) = 1 - \sum_{j} \big[p(j \mid t)\big]^2$$

• p(j | t) is the relative frequency of class j at node t.

– Maximum (1 − 1/nc) when records are equally distributed among all classes, implying the least interesting information.

– Minimum (0.0) when all records belong to one class, implying the most interesting information.

Measure of Impurity: Entropy

• Entropy at a given node t:

$$Entropy(t) = -\sum_{j} p(j \mid t) \log p(j \mid t)$$

• p(j | t) is the relative frequency of class j at node t; entropy measures the impurity of the node.

– Maximum (log nc) when records are equally distributed among all classes, implying the least information.

– Minimum (0.0) when all records belong to one class, implying the most information.
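Both impurity measures are easy to compute from the class counts at a node. A minimal Python sketch (illustrative; it uses log base 2, one common choice for the unspecified log above):

from math import log2

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, from the class counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) log2 p(j|t)."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(gini([5, 5]), entropy([5, 5]))    # evenly split node: 0.5, 1.0 (maximum impurity)
print(gini([10, 0]), entropy([10, 0]))  # pure node: 0.0, 0.0 (minimum impurity)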

Nearest Neighbor Classifiers

[Diagram: to classify a test record, compute its distance to the training records and choose the k “nearest” ones.]

Nearest-Neighbor Classifiers

Requires three things:

– The set of stored records

– A distance metric to compute the distance between records

– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:

– Compute its distance to the other training records

– Identify the k nearest neighbors

– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
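To make the procedure concrete, here is a minimal NumPy sketch (illustrative only; the toy records, labels, and the Euclidean metric are assumptions of the example):

import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    """Classify record x by majority vote among its k nearest training records."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance metric
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    labels, votes = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(votes)]               # majority vote

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [9.0, 8.5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(X_train, y_train, np.array([8.5, 9.0])))  # -> "B"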

Nearest Neighbor Classification

• Choosing the value of k:

– If k is too small, the classifier is sensitive to noise points

– If k is too large, the neighborhood may include points from other classes

Bayesian Classifiers

• Consider each attribute and class label as random variables

• Given a record with attributes (A1, A2, …, An)

– The goal is to predict class C

– Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)

• Can we estimate P(C | A1, A2, …, An) directly from data?

Bayesian Classifiers

• Approach:

– Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes' theorem:

$$P(C \mid A_1, A_2, \ldots, A_n) = \frac{P(A_1, A_2, \ldots, A_n \mid C)\, P(C)}{P(A_1, A_2, \ldots, A_n)}$$

– Choose the value of C that maximizes P(C | A1, A2, …, An)

– Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)

• How to estimate P(A1, A2, …, An | C)?

Naïve Bayes Classifier

• Assume independence among the attributes Ai when the class is given:

– P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)

– Can estimate P(Ai | Cj) for all Ai and Cj.

– A new unknown record is classified to class Cj if P(Cj) ∏i P(Ai | Cj) is maximal.
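A minimal hand-rolled sketch of these estimates for categorical attributes (illustrative; the tiny record set is made up and no smoothing is applied):

from collections import Counter, defaultdict

def train_nb(records, labels):
    """Estimate P(Cj) and P(Ai|Cj) from categorical training data."""
    prior = Counter(labels)
    cond = defaultdict(Counter)            # (attribute index, class) -> value counts
    for rec, c in zip(records, labels):
        for i, v in enumerate(rec):
            cond[(i, c)][v] += 1
    return prior, cond, len(labels)

def classify_nb(rec, prior, cond, n):
    """Pick the class Cj maximizing P(Cj) * prod_i P(Ai|Cj)."""
    best, best_p = None, -1.0
    for c, nc in prior.items():
        p = nc / n
        for i, v in enumerate(rec):
            p *= cond[(i, c)][v] / nc      # conditional independence assumption
        if p > best_p:
            best, best_p = c, p
    return best

records = [("Yes", "Single"), ("No", "Married"), ("No", "Single"), ("Yes", "Married")]
labels = ["No", "No", "Yes", "No"]
print(classify_nb(("No", "Single"), *train_nb(records, labels)))  # -> "Yes"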

Artificial Neural Networks (ANN)

Perceptron Model:

$$Y = I\Big(\sum_i w_i X_i - t\Big) \quad \text{or} \quad Y = \mathrm{sign}\Big(\sum_i w_i X_i - t\Big)$$

• The model is an assembly of inter-connected nodes and weighted links

• The output node sums up each of its input values according to the weights of its links

• The output sum is compared against some threshold t

[Diagram: input nodes X1, X2, X3 feed an output node Y (a “black box”) through links with weights w1, w2, w3 and threshold t.]

General Structure of ANN

[Diagram: a feed-forward network with an input layer (x1, …, x5), a hidden layer, and an output layer (y). Each neuron i receives inputs I1, I2, I3 over weighted links wi1, wi2, wi3, forms the sum Si, and emits the output Oi = g(Si) through the activation function g, subject to threshold t.]

Training an ANN means learning the weights of the neurons as well as the threshold t. A common activation function is the sigmoid:

$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}, \qquad Y = \mathrm{sigmoid}\Big(\sum_i w_i X_i - t\Big)$$
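A minimal sketch of one neuron's forward pass (illustrative; the weight, input, and threshold values are made up):

import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(w, X, t):
    """Y = sigmoid(sum_i w_i * X_i - t): weighted sum, threshold, activation."""
    return sigmoid(np.dot(w, X) - t)

w = np.array([0.5, -0.3, 0.8])   # link weights w1, w2, w3
X = np.array([1.0, 2.0, 1.5])    # inputs X1, X2, X3
print(neuron_output(w, X, t=0.4))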

Backpropagation Algorithm

– Gradient descent is illustrated using a single weight w1 of w

– Preferred values for w1 minimize the sum of squared errors:

$$SSE = \sum_i \big(Y_i - f(w, X_i)\big)^2$$

– The optimal value for w1 is w1*

[Figure: SSE plotted as a function of w1, with the minimum at w1* lying between w1L and w1R.]

– The direction for adjusting wCURRENT is the negative sign of the derivative of SSE at wCURRENT

– The magnitude of the adjustment uses the magnitude of the derivative of SSE at wCURRENT

– When the curve is steep, the adjustment is large

– When the curve is nearly flat, the adjustment is small

– The learning rate η takes values in [0, 1]

$$w_{NEW} = w_{CURRENT} - \eta \, \frac{\partial SSE}{\partial w}\bigg|_{w_{CURRENT}}$$
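A minimal sketch of this update rule on a toy one-weight model f(w, x) = w·x (the data and learning rate are made up for the example):

import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.1, 3.9, 6.2])            # roughly Y = 2x, so w* is near 2

w, eta = 0.0, 0.05                       # initial weight and learning rate in [0, 1]
for _ in range(100):
    residuals = Y - w * X                # Y_i - f(w, X_i)
    grad = -2.0 * np.sum(residuals * X)  # d(SSE)/dw
    w -= eta * grad                      # step against the gradient
print(w)                                 # converges near the optimal w*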

Support Vector Machines

• Find the hyperplane that maximizes the margin => B1 is better than B2

[Figure: two candidate separating hyperplanes B1 and B2, with margin boundaries b11, b12 and b21, b22; B1 has the wider margin.]

• The separating hyperplane is $\vec{w} \cdot \vec{x} + b = 0$, and the margin boundaries are $\vec{w} \cdot \vec{x} + b = 1$ and $\vec{w} \cdot \vec{x} + b = -1$:

$$f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \ge 1 \ (\text{positive class}) \\ -1 & \text{if } \vec{w} \cdot \vec{x} + b \le -1 \ (\text{negative class}) \end{cases}$$

$$\text{Margin} = \frac{2}{\|\vec{w}\|^2}$$

Support Vector Machines

• We want to maximize:

$$\text{Margin} = \frac{2}{\|\vec{w}\|^2}$$

– Which is equivalent to minimizing:

$$L(w) = \frac{\|\vec{w}\|^2}{2}$$

– But subject to the following constraints:

$$f(\vec{x}_i) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \le -1 \end{cases}$$

– This is a constrained optimization problem. Numerical approaches, e.g., quadratic programming, can be used to solve it.

Support Vector Machines

• Decision function for classifying a given data vector z, where the sum runs over the support vectors and the λi are the multipliers found by the optimization:

$$f(\vec{z}) = \mathrm{sign}\Big(\sum_{i \in SV} \lambda_i\, y_i\, \vec{x}_i \cdot \vec{z} + b\Big)$$

Nonlinear Support Vector Machines

• What if the decision boundary is not linear?

• Transform the data vector X into a new, higher-dimensional space

• Some kernel functions can be used to compute the dot product between any two given original data vectors in the new data space (without the need for actual data transformation).
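An illustrative sketch (not from the slides) using scikit-learn's SVC, whose RBF kernel plays exactly this role; the circular toy data is an assumption of the example:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)  # circular class boundary

clf = SVC(kernel="rbf")   # the RBF kernel computes dot products in the new space
clf.fit(X, y)
print(clf.score(X, y))    # near 1.0; a linear boundary would do much worse here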

Ensemble Methods

• Construct a set of classifiers from the training data

• Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers

General Idea

[Diagram: Step 1: create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D; Step 2: build multiple classifiers C1, C2, …, Ct-1, Ct, one per data set; Step 3: combine the classifiers into a single ensemble classifier C*.]

Why does it work?

• Suppose there are 25 base classifiers

– Each classifier has error rate ε = 0.35

– Assume the classifiers are independent

– The probability that the ensemble classifier makes a wrong prediction (a majority of at least 13 of the 25 base classifiers must err):

$$\sum_{i=13}^{25} \binom{25}{i} \varepsilon^i (1-\varepsilon)^{25-i} = 0.06$$
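The 0.06 figure can be checked directly with a short Python computation:

# The ensemble errs when at least 13 of the 25 base classifiers err.
from math import comb

eps, n = 0.35, 25
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 2))   # 0.06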

Examples of Ensemble Methods

• How to generate an ensemble of classifiers?

– Bagging

– Boosting

Bagging

• Sampling with replacement

• Build a classifier on each bootstrap sample

• Each record has probability 1 − (1 − 1/n)^n of being selected at least once in each round

Original Data 1 2 3 4 5 6 7 8 9 10

Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9

Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2

Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
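The sampling step itself is one line of Python; a minimal sketch (the random seed, and hence the rounds drawn, are arbitrary):

import random

random.seed(1)
data = list(range(1, 11))                          # original data: records 1..10
for r in range(1, 4):
    sample = [random.choice(data) for _ in data]   # sampling with replacement
    print(f"Bagging (Round {r}):", sample)

# For large n, a record appears in a given round with probability
# 1 - (1 - 1/n)**n, which approaches 1 - 1/e (about 0.632).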

Boosting

• An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records

– Initially, all N records are assigned equal weights

– Unlike bagging, the weights may change at the end of each boosting round

Boosting

• Records that are wrongly classified will have their weights increased

• Records that are classified correctly will have their weights decreased

Original Data: 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1): 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2): 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3): 4 4 8 10 4 5 4 6 3 4

• Example 4 is hard to classify

• Its weight is increased, so it is more likely to be chosen again in subsequent rounds
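A minimal sketch of the reweighting idea only (not a full AdaBoost implementation; the doubling/halving factor is a made-up stand-in for the rate a real booster derives from the round's error):

def reweight(weights, correct, factor=2.0):
    """Raise the weights of misclassified records, lower the rest, renormalize."""
    new = [w / factor if ok else w * factor for w, ok in zip(weights, correct)]
    s = sum(new)
    return [w / s for w in new]

weights = [0.1] * 10              # initially, equal weights for N = 10 records
correct = [True] * 10
correct[3] = False                # record 4 is hard to classify
weights = reweight(weights, correct)
print(weights[3], weights[0])     # record 4's weight is now much larger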

What is Cluster Analysis?

• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups

– Inter-cluster distances are maximized

– Intra-cluster distances are minimized

Applications of Cluster Analysis

• Understanding

– Group related documents for browsing, group customers into segments, or group stocks with similar price fluctuations

• Summarization

– Reduce the size of large data sets by sampling data from each cluster

K-means Clustering

• Each cluster is associated with a centroid (center point)

• Each data point is assigned to the cluster with the closest centroid

• The number of clusters, K, must be specified

• The basic algorithm is very simple
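A minimal sketch of that basic algorithm (illustrative; random initial centroids and a fixed iteration count are simplifying assumptions):

import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Basic K-means: assign points to the closest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # K initial centroids
    for _ in range(iters):
        # Assignment step: each data point goes to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it.
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
print(kmeans(X, k=2)[0])   # two centroids, one near (0, 0) and one near (2, 2)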

K-Means Algorithm

[Figure: six snapshots (Iteration 1 through Iteration 6) of K-means on a two-dimensional data set (x from -2 to 2, y from 0 to 3); the centroids move and the point assignments stabilize as the algorithm converges.]

Limitations of K-means

• K-means has problems when clusters are of differing

– Sizes

– Densities

– Non-globular shapes

• K-means has problems when the data contains

outliers.

Hierarchical Clustering

• Produces a set of nested clusters organized as a hierarchical tree

• Can be visualized as a dendrogram

– A tree-like diagram that records the sequences of merges or splits

[Figure: six points and the corresponding dendrogram; the vertical axis (0 to 0.2) gives the distance at which each pair of clusters is merged.]

Agglomerative Clustering Algorithm

• A popular hierarchical clustering technique

• The basic algorithm is straightforward:

1. Compute the proximity matrix (similarities between pairs of clusters)
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains

• May not be suitable for large datasets due to the cost of computing and updating the proximity matrix
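A hedged sketch using SciPy's hierarchical-clustering routines, which implement this loop (the six random points are made up; method="average" corresponds to the Group Average criterion below):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.random((6, 2))              # six 2-D points, as in the figures

Z = linkage(X, method="average")    # "single"=MIN, "complete"=MAX, "ward" also available
print(fcluster(Z, t=2, criterion="maxclust"))   # cut the tree into two clusters
# dendrogram(Z) draws the merge sequence when matplotlib is available.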

How to Define Inter-Cluster Similarity

[Figure: points p1, …, p5 and their proximity matrix; which entries define the similarity between two clusters?]

• MIN

• MAX

• Group Average

• Distance Between Centroids

• Ward's Method uses squared error

Hierarchical Clustering: Comparison

[Figure: the same six points clustered with MIN, MAX, Group Average, and Ward's Method; the four criteria merge the points in different orders and produce different hierarchies.]

Other Issues

• Data Cleaning

• Data Sampling

• Dimension Reduction

• Data Visualization

• Overfitting and Underfitting Problems

• Imbalance Issues
