
An Introduction to Support Vector Machine (SVM)

Presenter: Ahey
Date: 2007/07/20

The slides are based on lecture notes of Prof. 林智仁 and Daniel Yeung

Outline

Background
Linear Separable SVM
Lagrange Multiplier Method
Karush-Kuhn-Tucker (KKT) Conditions
Non-linear SVM: Kernel
Non-separable SVM
libsvm

Background – Classification Problem

The goal of classification is to organize and categorize data into distinct classes.
A model is first created based on previous data (training samples).
This model is then used to classify new data (unseen samples).
A sample is characterized by a set of features.
Classification is essentially finding the best boundary between classes.

Background – Classification Problem

Applications:
Personal Identification
Credit Rating
Medical Diagnosis
Text Categorization
Denial of Service Detection
Character Recognition
Biometrics
Image Classification

Classification Formulation

Given
an input space
a set of classes $\Omega = \{\omega_1, \omega_2, \ldots, \omega_c\}$

the classification problem is to define a mapping $f$ from the input space to $\Omega$, where each $x$ in the input space is assigned to one class. This mapping function is called a Decision Function.

Decision Function

The basic problem in classification is to find $c$ decision functions $d_1(x), d_2(x), \ldots, d_c(x)$ with the property that, if a pattern $x$ belongs to class $i$, then

$$d_i(x) > d_j(x), \qquad j = 1, 2, \ldots, c; \; j \neq i$$

$d_i(x)$ is some similarity measure between $x$ and class $i$, such as a distance or a probability.
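As a concrete illustration (not from the original slides), here is a minimal Python sketch of the decision rule above: evaluate all $c$ decision functions and pick the largest. The decision functions used here are hypothetical placeholders based on distance to a class centre.

```python
import numpy as np

def classify(x, decision_functions):
    """Assign x to the class whose decision function d_i(x) is largest."""
    scores = [d(x) for d in decision_functions]
    return int(np.argmax(scores))  # index of the winning class

# Hypothetical example: three classes scored by (negative) distance to a centre.
centres = [np.array([0.0, 0.0]), np.array([5.0, 5.0]), np.array([0.0, 5.0])]
decision_functions = [lambda x, c=c: -np.linalg.norm(x - c) for c in centres]

print(classify(np.array([4.0, 4.5]), decision_functions))  # -> 1 (second class)
```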

Decision Function

Example:
[Figure: a 2-D feature space partitioned into regions for Class 1, Class 2 and Class 3 by the decision boundaries d1 = d2, d1 = d3 and d2 = d3; inside each region one d_i is larger than the others, e.g. d1, d3 < d2.]

Single Classifier

Most popular single classifiers:
Minimum Distance Classifier
Bayes Classifier
K-Nearest Neighbor
Decision Tree
Neural Network
Support Vector Machine

Minimum Distance Classifier

Simplest approach to the selection of decision boundaries.

Each class is represented by a prototype (or mean) vector:

$$\mathbf{m}_j = \frac{1}{N_j} \sum_{\mathbf{x} \in \omega_j} \mathbf{x}, \qquad j = 1, 2, \ldots, M$$

where $N_j$ is the number of pattern vectors from class $\omega_j$.

A new unlabelled sample is assigned to the class whose prototype is closest to the sample:

$$d_j(\mathbf{x}) = \|\mathbf{x} - \mathbf{m}_j\|$$
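A minimal sketch of the minimum distance classifier just described, assuming the data are held in NumPy arrays; the function and variable names are illustrative.

```python
import numpy as np

def fit_prototypes(X, y):
    """Compute one prototype (mean) vector m_j per class from labelled data."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_min_distance(x, classes, prototypes):
    """Assign x to the class whose prototype is closest: d_j(x) = ||x - m_j||."""
    distances = np.linalg.norm(prototypes - x, axis=1)
    return classes[np.argmin(distances)]

# Toy data: two clusters
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]])
y = np.array([0, 0, 1, 1])
classes, prototypes = fit_prototypes(X, y)
print(predict_min_distance(np.array([4.5, 4.8]), classes, prototypes))  # -> 1
```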

Bayes Classifier

Bayes rule:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$$

$p(x)$ is the same for each class, therefore use

$$d_j(x) = p(x \mid \omega_j)\, P(\omega_j)$$

Assign $x$ to class $j$ if $d_j(x) > d_i(x)$ for all $i \neq j$.

Bayes Classifier

The following information must be known:
the probability density functions $p(x \mid \omega_j)$ of the patterns in each class
the probability of occurrence $P(\omega_j)$ of each class

Training samples may be used to obtain estimates of these probability functions.
The samples are assumed to follow a known distribution pattern.
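A minimal sketch of the Bayes discriminant d_j(x) = p(x|ω_j) P(ω_j), under the assumption (common in practice but not stated on the slide) that each class-conditional density is a Gaussian estimated from the training samples.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_bayes(X, y):
    """Estimate p(x|w_j) as a Gaussian and P(w_j) as the class frequency."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def predict_bayes(x, params):
    """Assign x to the class j maximizing d_j(x) = p(x|w_j) * P(w_j)."""
    scores = {c: multivariate_normal.pdf(x, mean=m, cov=S) * prior
              for c, (m, S, prior) in params.items()}
    return max(scores, key=scores.get)

# Toy usage: two 2-D classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(predict_bayes(np.array([3.5, 4.2]), fit_gaussian_bayes(X, y)))  # -> 1
```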

K-Nearest Neighbor

K-Nearest Neighbor Rule (k-NNR): examine the labels of the k nearest samples and classify by using a majority voting scheme.

[Figure: a query point at (7, 3) in a 2-D feature space, with circles marking its 1, 3, 5, 7 and 9 nearest neighbours (1NN, 3NN, 5NN, 7NN, 9NN).]
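A minimal sketch of the k-NN rule with majority voting, assuming Euclidean distance; the data reuse the figure's query point (7, 3) but the training samples are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by a majority vote among the labels of its k nearest samples."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the k closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage with the query point (7, 3) from the figure
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(np.array([7, 3]), X_train, y_train, k=3))  # -> 1
```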

Decision Tree

The decision boundaries are hyper-planes parallel to the feature axes.

A sequential classification procedure may be developed by considering successive partitions of the feature space R.

Decision Trees

Example

Neural Network

A Neural Network generally maps a set of inputs to a set of outputs.
The number of inputs and outputs may vary.
The network itself is composed of an arbitrary number of nodes with an arbitrary topology.
It is a universal approximator.

Neural Network

[Figure: a network mapping Input 0 ... Input n to Output 0 ... Output m. Each node computes a weighted sum of its inputs (weights W0 ... Wn plus a bias weight Wb) passed through an activation function fH(x); nodes are linked by connections.]

Neural Network

A popular NN is the feed-forward neural network, e.g.
Multi-Layer Perceptron (MLP)
Radial Basis Function (RBF) network

Learning algorithm: back-propagation. The weights of the nodes are adjusted based on how well the current weights match an objective.
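To make the node computation in the figure concrete, here is a minimal sketch of one forward pass through a single-hidden-layer feed-forward network; the layer sizes and the sigmoid activation are illustrative assumptions (not from the slides), and the back-propagation step is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, b_hidden, W_out, b_out):
    """One forward pass: each node applies an activation to a weighted sum plus bias."""
    hidden = sigmoid(W_hidden @ x + b_hidden)   # hidden-layer outputs
    return sigmoid(W_out @ hidden + b_out)      # network outputs

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                              # 3 inputs
W_hidden, b_hidden = rng.normal(size=(4, 3)), rng.normal(size=4)    # 4 hidden nodes
W_out, b_out = rng.normal(size=(2, 4)), rng.normal(size=2)          # 2 outputs
print(forward(x, W_hidden, b_hidden, W_out, b_out))
```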

Support Vector Machine

Basically a 2-class classifier developed by Vapnik and Chervonenkis (1992)

Which line is optimal?

Support Vector Machine

Training vectors: xi, i = 1, ..., n.

Consider a simple case with two classes. Define a label vector y with
yi = +1 if xi is in class 1
yi = -1 if xi is in class 2

Goal: a hyperplane which separates all the data.

[Figure: a separating plane between Class 1 and Class 2 with margin ρ; r denotes the distance from a sample to the plane. The closest points from each class, lying on the margin, are the support vectors (Class 1 and Class 2).]

Linear Separable SVM

Label the training data {(xi, yi)}, with yi ∈ {-1, +1}.

Suppose we have some hyperplane which separates the "+" from the "-" examples (a separating hyperplane).

Points x which lie on the hyperplane satisfy w·x + b = 0, where w is normal to the hyperplane and |b|/||w|| is the perpendicular distance from the hyperplane to the origin.

Linear Separable SVM

Define two support hyperplane as

H1:wTx = b +δ and H2:wTx = b –δ To solve over-parameterized problem, set δ=1 Define the distance between OSH and two support

hyperplanes as

Margin = distance between H1 and H2 = 2/||w||

The Primal problem of SVM

Goal: find a separating hyperplane with the largest margin. An SVM is to find w and b that

(1) minimize ||w||^2/2 = wTw/2
(2) subject to yi(xi·w + b) - 1 ≥ 0 for all i

Switch the above problem to a Lagrangian formulation for two reasons:
(1) it is easier to handle, since it becomes a quadratic programming problem
(2) the training data only appear in the form of dot products between vectors, so the method can be generalized to the nonlinear case

Lagrange Multiplier Method

A method to find the extremum of a multivariate function f(x1, x2, ..., xn) subject to the constraint g(x1, x2, ..., xn) = 0.

For an extremum of f to exist on g, the gradient of f must line up with the gradient of g:

$$\frac{\partial f}{\partial x_k} = \lambda \frac{\partial g}{\partial x_k} \quad \text{for all } k = 1, \ldots, n,$$

where the constant λ is called the Lagrange multiplier.

The Lagrangian transformation of the primal problem is given below.
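The slide's own equation was an image in the original deck; the standard primal Lagrangian for the problem stated above, with one multiplier αi ≥ 0 per constraint, is:

$$L_P(\mathbf{w}, b, \boldsymbol{\alpha}) = \tfrac{1}{2}\,\mathbf{w}^{T}\mathbf{w} - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \right], \qquad \alpha_i \ge 0$$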

Lagrange Multiplier Method

At the solution, the gradient of L with respect to w and b must vanish:

(1) $\partial L_P / \partial \mathbf{w} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
(2) $\partial L_P / \partial b = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$

Substituting them into the Lagrangian form, we obtain the dual problem (written out below).

It is in inner-product form, so it can be generalized to the nonlinear case by applying a kernel.
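The resulting dual, written out (the slide's own equation was an image, so this is the standard form implied by the substitution above):

$$\max_{\boldsymbol{\alpha}} \; L_D(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j) \quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0$$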

KKT Conditions

Since the optimization problem for SVM is convex, the KKT conditions are necessary and sufficient for w, b and α to be a solution.

w is determined by the training procedure. b is easily found by using the KKT complementary slackness condition, by choosing any i for which αi ≠ 0.
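For reference, the complementary slackness condition referred to above and the resulting expression for b, written out in the standard form (using yi = ±1):

$$\alpha_i \left[ y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \right] = 0 \quad \Longrightarrow \quad b = y_i - \mathbf{x}_i \cdot \mathbf{w} \quad \text{for any } i \text{ with } \alpha_i \neq 0$$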

Non-Linear SVM: Kernel

To extend to the non-linear case, we need to map the data to some other Euclidean space.

Kernel

Φ is a mapping function. Since the training algorithm depends on the data only through dot products, we can use a "kernel function" K such that

$$K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$$

One commonly used example is the radial basis function (RBF).

An RBF is a real-valued function whose value depends only on the distance from the origin, so that Φ(x) = Φ(||x||); or alternatively on the distance from some other point c, called a center, so that Φ(x, c) = Φ(||x - c||).
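A minimal sketch of the Gaussian RBF kernel commonly used with SVMs, K(x, x') = exp(-γ ||x - x'||^2); the value of γ here is an illustrative assumption.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """Gaussian RBF kernel: K(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def kernel_matrix(X, gamma=0.5):
    """Gram matrix K[i, j] = K(x_i, x_j); this is all the training algorithm needs."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(kernel_matrix(X, gamma=0.5))
```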

Non-separable SVM

Real-world applications usually have no OSH, i.e. the data are not linearly separable. We need to add an error (slack) term ζi for each sample:

the constraints become yi(xi·w + b) ≥ 1 - ζi, with ζi ≥ 0.

To penalize the error terms, a penalty parameter C is added to the objective; the new primal (and hence the new Lagrangian form) is given below.
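Written out in the standard form (the slide's own equations were images), the soft-margin primal with slack variables ζi and penalty parameter C is:

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\zeta}} \; \tfrac{1}{2}\,\mathbf{w}^{T}\mathbf{w} + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to} \quad y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \ge 1 - \zeta_i, \;\; \zeta_i \ge 0$$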

Non-separable SVM

New KKT Conditions
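Since the outline lists libsvm, here is a minimal usage sketch of a soft-margin RBF-kernel SVM via scikit-learn's SVC, which wraps libsvm; the toy data and the parameter values C and gamma are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn's SVC wraps libsvm

# Toy non-separable data: two overlapping clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.5, (50, 2)), rng.normal(2, 1.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# C penalizes the slack (error) terms; the RBF kernel handles the non-linear case.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)
print(clf.support_vectors_.shape)   # support vectors found by training
print(clf.predict([[1.0, 1.0]]))    # classify a new sample
```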