An Introduction to Support Vector Machine (SVM)
Presenter: Ahey
Date: 2007/07/20
The slides are based on lecture notes of Prof. Chih-Jen Lin (林智仁) and Daniel Yeung.
Outline
Background
Linear Separable SVM
Lagrange Multiplier Method
Karush-Kuhn-Tucker (KKT) Conditions
Non-linear SVM: Kernel
Non-Separable SVM
libsvm
Background – Classification Problem
The goal of classification is to organize and categorize data into distinct classes.
A model is first created based on previous data (training samples).
This model is then used to classify new data (unseen samples).
A sample is characterized by a set of features.
Classification is essentially finding the best boundary between classes.
Background – Classification Problem
Applications:
Personal Identification
Credit Rating
Medical Diagnosis
Text Categorization
Denial of Service Detection
Character Recognition
Biometrics
Image Classification
Classification Formulation
Given:
an input space X
a set of classes Ω = {ω_1, ω_2, ..., ω_c}
the classification problem is to define a mapping f: X → Ω where each x in X is assigned to one class.
This mapping function is called a Decision Function.
Decision Function
The basic problem in classification is to find c decision functions
d_1(x), d_2(x), ..., d_c(x)
with the property that, if a pattern x belongs to class i, then
d_i(x) > d_j(x) for all j = 1, 2, ..., c; j ≠ i
d_i(x) is some similarity measure between x and class i, such as a distance or a probability.
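To make the rule concrete, here is a minimal sketch in Python (names and toy data are illustrative, not from the slides) of classifying by the largest decision-function value:

```python
# Assign x to the class whose decision function d_i(x) is largest.
import numpy as np

def classify(x, decision_functions):
    """Return the index i maximizing d_i(x)."""
    scores = [d(x) for d in decision_functions]
    return int(np.argmax(scores))

# Example: two classes scored by (negative) distance to a class prototype.
prototypes = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
ds = [lambda x, m=m: -np.linalg.norm(x - m) for m in prototypes]
print(classify(np.array([4.0, 6.0]), ds))  # -> 1 (closer to the second prototype)
```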
Single Classifier
Most popular single classifiers:
Minimum Distance Classifier
Bayes Classifier
K-Nearest Neighbor
Decision Tree
Neural Network
Support Vector Machine
Minimum Distance Classifier
Simplest approach to the selection of decision boundaries.
Each class is represented by a prototype (or mean) vector:
m_j = (1/N_j) Σ_{x ∈ ω_j} x,  j = 1, 2, ..., M
where N_j = the number of pattern vectors from class ω_j.
A new unlabelled sample is assigned to the class whose prototype is closest to the sample:
d_j(x) = ||x − m_j||
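A minimal sketch of this classifier, assuming the training data is a NumPy array with integer class labels (names and toy data are illustrative):

```python
import numpy as np

def fit_prototypes(X_train, y_train):
    """Compute the mean (prototype) vector m_j of each class."""
    classes = np.unique(y_train)
    return classes, np.array([X_train[y_train == c].mean(axis=0) for c in classes])

def predict(x, classes, prototypes):
    """Assign x to the class with the closest prototype: d_j(x) = ||x - m_j||."""
    distances = np.linalg.norm(prototypes - x, axis=1)
    return classes[np.argmin(distances)]

# Toy usage with made-up data:
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]])
y_train = np.array([0, 0, 1, 1])
classes, protos = fit_prototypes(X_train, y_train)
print(predict(np.array([4.5, 4.8]), classes, protos))  # -> 1
```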
Bayes Classifier
Bayes rule:
P(ω_j | x) = P(x | ω_j) P(ω_j) / P(x)
P(x) is the same for each class; therefore, define
d_j(x) = P(x | ω_j) P(ω_j)
Assign x to class j if d_j(x) > d_i(x) for all i ≠ j.
Bayes Classifier
The following information must be known:
the probability density function of the patterns in each class, P(x | ω_j)
the probability of occurrence of each class, P(ω_j)
Training samples may be used to obtain estimates of these probability functions.
Samples are assumed to follow a known distribution.
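As a concrete illustration, here is a sketch that assumes one-dimensional Gaussian class-conditional densities estimated from training samples (the Gaussian choice and all names are assumptions for the example):

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_bayes(X_train, y_train):
    """Estimate P(x|w_j) as a Gaussian and P(w_j) as the class frequency."""
    model = {}
    for c in np.unique(y_train):
        xs = X_train[y_train == c]
        model[c] = (xs.mean(), xs.std(ddof=1), len(xs) / len(X_train))
    return model

def predict(x, model):
    """Assign x to the class maximizing d_j(x) = P(x|w_j) P(w_j)."""
    scores = {c: norm.pdf(x, mu, sigma) * prior
              for c, (mu, sigma, prior) in model.items()}
    return max(scores, key=scores.get)

X_train = np.array([1.0, 1.2, 0.8, 4.0, 4.2, 3.9])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(predict(2.0, fit_gaussian_bayes(X_train, y_train)))  # -> 0
```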
K-Nearest Neighbor
K-Nearest Neighbor Rule (k-NNR): examine the labels of the k nearest samples and classify using a majority voting scheme (a minimal sketch follows the figure below).
[Figure: a query point at (7, 3) plotted with its 1NN, 3NN, 5NN, 7NN, and 9NN neighborhoods.]
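A minimal sketch of the rule, with illustrative toy data:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest neighbours."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]      # indices of the k closest samples
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(np.array([7, 3]), X_train, y_train, k=3))
```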
Decision Tree
The decision boundaries are hyperplanes parallel to the feature axes.
A sequential classification procedure may be developed by considering successive partitions of the feature space R.
Neural Network
A neural network generally maps a set of inputs to a set of outputs.
The number of inputs and outputs may vary.
The network itself is composed of an arbitrary number of nodes with an arbitrary topology.
It is a universal approximator.
Neural Network
[Figure: a network mapping inputs 0..n to outputs 0..m, with a node detail showing inputs weighted by W0..Wn, a bias weight Wb, a summation, and an activation f_H(x); nodes are linked by connections.]
Neural Network
A popular NN is the feed-forward neural network, e.g.:
Multi-Layer Perceptron (MLP)
Radial Basis Function (RBF) network
Learning algorithm: backpropagation. The weights of the nodes are adjusted based on how well the current weights match an objective (a sketch of the single-node computation follows).
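The node computation from the diagram can be sketched as follows (the sigmoid activation is an assumption for the example; any activation f_H would do):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def node_output(inputs, weights, bias):
    """One node: output = f_H(w . x + w_b)."""
    return sigmoid(np.dot(weights, inputs) + bias)

x = np.array([0.5, -1.0, 2.0])   # Input 0..n
w = np.array([0.1, 0.4, -0.2])   # W0..Wn
print(node_output(x, w, bias=0.3))
```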
Support Vector Machine
Basically a two-class classifier, developed by Vapnik and co-workers (Boser, Guyon & Vapnik, 1992)
Which line is optimal?
Support Vector Machine
Training vectors: x_i, i = 1, ..., n
Consider a simple case with two classes. Define a label vector y with
y_i = +1 if x_i is in class 1, y_i = −1 if x_i is in class 2.
We look for a hyperplane which separates all the data.
[Figure: a separating plane between Class 1 and Class 2, with margin ρ, distance r from a sample to the plane, and the support vectors of each class marked.]
Linear Separable SVM
Label the training data (x_i, y_i), y_i ∈ {−1, +1}.
Suppose we have some hyperplane which separates the "+" examples from the "−" examples (a separating hyperplane).
Points x which lie on the hyperplane satisfy w·x + b = 0, where w is normal to the hyperplane and |b|/||w|| is the perpendicular distance from the hyperplane to the origin.
Linear Separable SVM
Define the two support hyperplanes as
H1: w·x + b = +δ and H2: w·x + b = −δ
To resolve the over-parameterization, set δ = 1.
The distance between the OSH (optimal separating hyperplane) and each support hyperplane is then 1/||w||.
Margin = distance between H1 and H2 = 2/||w||
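As a quick numerical illustration (with made-up w and b), the signed distance of a point to the hyperplane w·x + b = 0 and the margin 2/||w|| can be computed as:

```python
import numpy as np

w = np.array([2.0, 1.0])
b = -4.0

def signed_distance(x, w, b):
    """Signed perpendicular distance of x from the plane w.x + b = 0."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

print(signed_distance(np.array([3.0, 2.0]), w, b))  # distance from the plane
print(2.0 / np.linalg.norm(w))                      # margin = 2/||w||
```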
The Primal problem of SVM
Goal: find the separating hyperplane with the largest margin. An SVM is to find w and b that satisfy
(1) minimize ||w||²/2 = wᵀw/2
(2) y_i(x_i·w + b) − 1 ≥ 0 for all i
We switch the above problem to a Lagrangian formulation for two reasons:
(1) the constraints become easier to handle, since the problem turns into a quadratic program;
(2) the training data only appear in the form of dot products between vectors, so the method can be generalized to the nonlinear case.
Lagrange Multiplier Method
A method to find the extremum of a multivariate function f(x_1, x_2, ..., x_n) subject to the constraint g(x_1, x_2, ..., x_n) = 0.
For an extremum of f to exist on g, the gradient of f must line up with the gradient of g:
∂f/∂x_k = λ ∂g/∂x_k for all k = 1, ..., n, where the constant λ is called the Lagrange multiplier.
The Lagrangian formulation of the SVM problem is
L_P = ||w||²/2 − Σ_i α_i [y_i(x_i·w + b) − 1]
Lagrange Multiplier Method
To find the minimum, we set the gradient of L_P with respect to w and b to zero:
(1) ∂L_P/∂w = 0 ⇒ w = Σ_i α_i y_i x_i
(2) ∂L_P/∂b = 0 ⇒ Σ_i α_i y_i = 0
Substituting these into the Lagrangian form, we obtain the dual problem:
maximize L_D = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j), subject to α_i ≥ 0 and Σ_i α_i y_i = 0
The data appear only in inner product form ⇒ the problem can be generalized to the nonlinear case by applying a kernel.
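The dual is a quadratic program, so any QP solver can illustrate it; the sketch below uses cvxopt (an assumed dependency), rewriting the maximization of L_D as a minimization:

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n = len(X)

P = matrix(np.outer(y, y) * (X @ X.T))  # P_ij = y_i y_j (x_i . x_j)
q = matrix(-np.ones(n))                 # linear term: -1^T a
G = matrix(-np.eye(n))                  # a_i >= 0, written as -a_i <= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))            # equality constraint: sum_i a_i y_i = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alphas = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
print(alphas)  # nonzero entries mark the support vectors
```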
KKT Conditions
Since the problem for SVM is convex, the KKT conditions are necessary and sufficient for w, b, and α to be a solution.
w is determined by the training procedure: w = Σ_i α_i y_i x_i.
b is easily found by using the KKT complementary slackness condition
α_i [y_i(x_i·w + b) − 1] = 0
by choosing any i for which α_i ≠ 0, which gives b = y_i − w·x_i.
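Continuing the sketch above, w and b can be recovered from the dual solution using exactly these conditions:

```python
import numpy as np

def recover_w_b(alphas, X, y, tol=1e-6):
    """Recover w = sum_i alpha_i y_i x_i and b from one support vector."""
    w = (alphas * y) @ X                 # stationarity condition from the dual
    sv = int(np.argmax(alphas > tol))    # any index with alpha_i != 0
    b = y[sv] - np.dot(w, X[sv])         # complementary slackness: y_sv(w.x_sv + b) = 1
    return w, b
```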
Non-Linear Separable SVM: Kernel
To extend to the non-linear case, we need to map the data to some other Euclidean space.
Kernel
Φ is a mapping function. Since the training algorithm only depends on the data through dot products, we can use a "kernel function" K such that
K(x_i, x_j) = Φ(x_i) · Φ(x_j)
One commonly used example is the radial basis function (RBF).
An RBF is a real-valued function whose value depends only on the distance from the origin, so that Φ(x) = Φ(||x||); or alternatively on the distance from some other point c, called a center, so that Φ(x, c) = Φ(||x − c||).
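A minimal sketch of the Gaussian RBF kernel, which replaces every dot product x_i·x_j in the dual with K(x_i, x_j) (σ is a free parameter chosen here for illustration):

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)), i.e. Phi(x_i).Phi(x_j)
    without ever computing Phi explicitly."""
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

# Building the kernel (Gram) matrix for a toy dataset:
X = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]])
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
print(K)
```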
Non-separable SVM
Real-world applications usually have no OSH: the data are not perfectly separable. We need to add an error (slack) term ζ_i to the constraints:
y_i(x_i·w + b) ≥ 1 − ζ_i, ζ_i ≥ 0
To penalize the error terms, define the new objective
minimize ||w||²/2 + C Σ_i ζ_i
The new Lagrangian form leads to the same dual problem as before, except that the multipliers are now bounded: 0 ≤ α_i ≤ C.
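Since the outline ends with libsvm, here is a minimal soft-margin example using scikit-learn's SVC, which wraps libsvm; the data and parameter values are illustrative. C controls the penalty on the error terms: a small C tolerates more violations, while a large C approaches the separable case.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 4.0], [5.0, 5.0], [2.5, 2.5]])
y = np.array([-1, -1, 1, 1, 1])          # note the overlapping point at (2.5, 2.5)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # RBF kernel, soft margin
clf.fit(X, y)
print(clf.support_vectors_)              # the support vectors found
print(clf.predict([[3.0, 3.0]]))
```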