-
An Introduction to Support Vector Machines
Presenter: 黃立德
References:
Simon Haykin, "Neural Networks: A Comprehensive Foundation", 2nd edition, 1999, Chapters 2 and 6
Nello Cristianini and John Shawe-Taylor, "An Introduction to Support Vector Machines", 2000, Chapters 3-6
-
2
Outline
• Drawbacks of learning
• Overview of SVM
• The Empirical Risk Minimization Principle
• VC-dimension
• Structural Risk Minimization
• Linearly separable patterns
• Non-linearly separable patterns
• How to build a SVM for pattern recognition
• Example: XOR problem
• Properties and expansions of SVM
• Conclusion
• Applications of SVM
• LIBSVM
-
3
Drawbacks of learning
• A key difficulty is the choice of the class of functions from which the input/output mapping must be sought.
  – Learning in three-node neural networks is known to be NP-complete.
-
4
Drawbacks of learning (cont.)
• In practice, the following problems arise:
  – The learning algorithm may prove inefficient, as for example in the case of local minima.
  – The size of the output hypothesis can frequently become very large and impractical.
  – If there are only a limited number of training examples, the hypothesis found by the learning algorithm will overfit and hence generalize poorly.
  – The learning algorithm is usually controlled by a large number of parameters that are often chosen by tuning heuristics, making the system difficult and unreliable to use.
-
5
Overview of SVM
• What is SVM?
  – A linear machine with some very nice properties.
  – The goal is to construct a decision surface such that the margin of separation between positive and negative samples is maximized.
  – SVM is a learning system that uses a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory.
-
6
The Empirical Risk Minimization Principle
• Given a set of data
  $(x_1, y_1), \ldots, (x_N, y_N)$, with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$
• Also given a set of decision functions
  $\{ f_\lambda : \lambda \in I \}$, where $f_\lambda : \mathbb{R}^n \to \{-1, +1\}$
• The expected risk is
  $R(\lambda) = \int \left| f_\lambda(x) - y \right| \, dP(x, y)$
-
7
The Empirical Risk Minimization Principle (cont.)
• The approximation (empirical risk) is
  $R_{\mathrm{emp}}(\lambda) = \frac{1}{N} \sum_{i=1}^{N} \left| f_\lambda(x_i) - y_i \right|$
• Theory of uniform convergence in probability:
  $\lim_{N \to \infty} P\left\{ \sup_{\lambda \in I} \left( R(\lambda) - R_{\mathrm{emp}}(\lambda) \right) > \varepsilon \right\} = 0, \quad \forall \varepsilon > 0$
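For concreteness, here is a minimal Python sketch (not part of the original slides) of the empirical risk for a {-1, +1} classifier; the decision function and toy data below are purely illustrative.

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp = (1/N) * sum_i |f(x_i) - y_i| for labels in {-1, +1};
    each misclassified point contributes 2, so this is twice the error rate."""
    predictions = np.array([f(x) for x in X])
    return np.mean(np.abs(predictions - y))

# Illustrative data and decision function (assumed for this example only).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
f = lambda x: np.sign(x[0] + x[1])
print(empirical_risk(f, X, y))   # 0.0 on this separable toy set
```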
-
8
Vapnik-Chervonenkis dimension
• VC-dimension
  – It is a measure of the capacity or expressive power of the family of classification functions realized by the learning machine.
-
9
Structural Risk Minimization
• Let $I_k$ be a subset of $I$, and define $S_k = \{ f_\lambda : \lambda \in I_k \}$
• Define a structure of nested subsets
  $S_1 \subset S_2 \subset \cdots \subset S_n \subset \cdots$
• Each subset satisfies the condition
  $h_1 \le h_2 \le \cdots \le h_n \le \cdots$, where $h_i$ is the VC dimension of $S_i$
-
10
Structural Risk Minimization (cont.)
• Implementing SRM can be difficult because the VC dimension of $S_n$ could be hard to compute.
• Support Vector Machines (SVMs) are able to achieve the goal of minimizing the upper bound of $R(\lambda)$ by minimizing a bound on the VC dimension $h$ and $R_{\mathrm{emp}}(\lambda)$ at the same time:
  $\min_{n} \left( R_{\mathrm{emp}}(\lambda) + \frac{h_n}{N} \right)$
-
11
Concepts of SVM
• SVM is an approximate implementation of the method of structural risk minimization
• It does not incorporate problem-domain knowledge
-
12
Linearly separable patterns
• A training sample $\{(x_i, d_i)\}_{i=1}^{N}$
• The patterns are linearly separable.
• The equation of a decision surface that does the separation is
  $w^t x + b = 0$
  with
  $w^t x_i + b \ge 0$ for $d_i = +1$
  $w^t x_i + b < 0$ for $d_i = -1$
• And, for a suitable rescaling of $w$ and $b$, we can write
  $d_i \left( w^t x_i + b \right) \ge 1$ for $i = 1, 2, \ldots, N$
-
13
Linearly separable patterns (cont.)
• The discriminant function of the optimal hyperplane is
  $g(x) = w_0^t x + b_0$
• Maximum margin rule
  – We select the hyperplane with the maximum margin of separation; the data points nearest to it are the support vectors.
-
14
Linearly separable patterns (cont.)
[Figure: the optimal hyperplane $w_0^t x + b_0 = 0$ with the two margin boundaries $w_0^t x + b_0 = +1$ and $w_0^t x + b_0 = -1$; the margin of separation is $\frac{2}{\| w_0 \|}$]
-
15
Linearly separable patterns (cont.)
• The final goal is to minimize the cost function
  $\Phi(w) = \frac{1}{2} w^t w$
  subject to the constraints $d_i \left( w^t x_i + b \right) \ge 1$.
• We may solve the constrained optimization problem using the method of Lagrange multipliers.
• Lagrangian function
  $J(w, b, \alpha) = \frac{1}{2} w^t w - \sum_{i=1}^{N} \alpha_i \left[ d_i \left( w^t x_i + b \right) - 1 \right]$
  where the $\alpha_i$ are the Lagrange multipliers.
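The step from this Lagrangian to the dual problem on the next slide follows from the stationarity conditions; a brief sketch of the standard derivation (not spelled out in the slides):

```latex
\[
\frac{\partial J}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i d_i x_i,
\qquad
\frac{\partial J}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i d_i = 0 .
\]
```

Substituting these back into $J(w, b, \alpha)$ eliminates $w$ and $b$ and leaves the dual objective $Q(\alpha)$ of the next slide, to be maximized subject to $\alpha_i \ge 0$.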
-
16
Linearly separable patterns (cont.)
• Dual form
  – Given the training sample $\{(x_i, d_i)\}_{i=1}^{N}$, find the Lagrange multipliers $\{\alpha_i\}_{i=1}^{N}$ that maximize the objective function
  $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_i^t x_j$
  subject to the constraints
  (1) $\sum_{i=1}^{N} \alpha_i d_i = 0$
  (2) $\alpha_i \ge 0$ for $i = 1, 2, \ldots, N$
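As an illustration (not from the slides), the dual QP above can be solved numerically for a small linearly separable toy set; the sketch below uses scipy's SLSQP solver, and the data are assumed for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable sample {(x_i, d_i)}, d_i in {-1, +1} (illustrative data).
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
d = np.array([+1.0, +1.0, -1.0, -1.0])
N = len(d)

# H_ij = d_i d_j x_i^t x_j, so Q(alpha) = sum(alpha) - 0.5 * alpha^t H alpha.
H = (d[:, None] * X) @ (d[:, None] * X).T

def neg_Q(alpha):                     # minimize -Q(alpha) to maximize Q(alpha)
    return -(alpha.sum() - 0.5 * alpha @ H @ alpha)

constraints = {"type": "eq", "fun": lambda a: a @ d}   # (1) sum_i alpha_i d_i = 0
bounds = [(0.0, None)] * N                             # (2) alpha_i >= 0
res = minimize(neg_Q, np.zeros(N), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x

# Recover the optimum weight vector and bias from a support vector (largest alpha).
w = (alpha * d) @ X
s = np.argmax(alpha)
b = d[s] - w @ X[s]
print(alpha.round(3), w.round(3), round(float(b), 3))
```

The same sketch covers the soft-margin dual of a later slide by changing the bounds to (0, C).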
-
17
Linearly separable patterns (cont.)
• Having determined the optimum Lagrange multipliers, we can compute the optimum weight vector and bias.
  $w_0 = \sum_{i=1}^{N} \alpha_{0,i} d_i x_i$
  $b_0 = 1 - w_0^t x^{(s)}$ for a support vector $x^{(s)}$ with $d^{(s)} = 1$
-
18
Non-linearly separable patterns
• Allow training errors via slack variables $\xi_i \ge 0$.
• The definition of the decision surface becomes
  $d_i \left( w^t x_i + b \right) \ge 1 - \xi_i$ for $i = 1, 2, \ldots, N$
• Finally, we only have to minimize the following function
  $\Phi(w, \xi) = \frac{1}{2} w^t w + C \sum_{i=1}^{N} \xi_i$
-
19
Non-linearly separable patterns (cont.)
• Dual form
  – Given the training sample $\{(x_i, d_i)\}_{i=1}^{N}$, find the Lagrange multipliers $\{\alpha_i\}_{i=1}^{N}$ that maximize the objective function
  $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_i^t x_j$
  subject to the constraints
  (1) $\sum_{i=1}^{N} \alpha_i d_i = 0$
  (2) $0 \le \alpha_i \le C$ for $i = 1, 2, \ldots, N$, where C is a user-specified positive parameter
-
20
Non-linearly separable patterns (cont.)
• After the optimum Lagrange multipliers have been determined, we can compute the optimum weight vector and bias.
  $w_0 = \sum_{i=1}^{N_s} \alpha_{0,i} d_i x_i$, where $N_s$ is the number of support vectors
  $b_0 = 1 - w_0^t x^{(s)}$ for a support vector $x^{(s)}$ with $d^{(s)} = 1$
-
21
How to build a SVM for pattern recognition
• Steps for constructing a SVM
  – Nonlinear mapping of input vectors into a high-dimensional feature space that is hidden from both the input and output.
  – Construction of an optimal hyperplane for separating the features.
-
22
How to build a SVM for pattern recognition (cont.)
• x denotes a vector from the input space.
• $\{ \varphi_j(x) \}_{j=1}^{m_1}$ denotes a set of nonlinear mappings from the input space to the feature space.
• Define a hyperplane as follows:
  $\sum_{j=1}^{m_1} w_j \varphi_j(x) + b = 0$
  which, with $\varphi_0(x) = 1$ and $w_0 = b$, may be written as
  $\sum_{j=0}^{m_1} w_j \varphi_j(x) = 0$
• Define the vector
  $\varphi(x) = \left[ \varphi_0(x), \varphi_1(x), \ldots, \varphi_{m_1}(x) \right]^t$
-
23
How to build a SVM for pattern recognition (cont.)
• We can write the equation in the compact form
  $w^t \varphi(x) = 0$    (1)
• Because the features are linearly separable, we may write
  $w = \sum_{i=1}^{N} \alpha_i d_i \varphi(x_i)$    (2)
• Substituting Eq. (2) into Eq. (1), we get
  $\sum_{i=1}^{N} \alpha_i d_i \varphi^t(x_i) \varphi(x) = 0$
-
24
How to build a SVM for pattern recognition (cont.)
• Define the inner-product kernel denoted by
  $K(x, x_i) = \varphi^t(x) \varphi(x_i) = \sum_{j=0}^{m_1} \varphi_j(x) \varphi_j(x_i)$, for $i = 1, \ldots, N$
• Now, we may use the inner-product kernel to construct the optimal decision surface in the feature space without considering the feature space in explicit form:
  $\sum_{i=1}^{N} \alpha_i d_i K(x, x_i) = 0$
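A small sketch (illustrative, not from the slides) of evaluating the decision surface through the kernel alone, using the polynomial kernel that appears in the XOR example on the following slides:

```python
import numpy as np

def poly_kernel(x, xi, degree=2):
    """Inner-product kernel K(x, x_i) = (1 + x^t x_i)^degree; degree=2 is the
    kernel used in the XOR example later in these slides. Inputs are numpy arrays."""
    return (1.0 + x @ xi) ** degree

def decision(x, X_train, d, alpha, kernel=poly_kernel):
    """Evaluate sum_i alpha_i d_i K(x_i, x) without ever forming phi(x) explicitly."""
    return sum(a * di * kernel(xi, x) for a, di, xi in zip(alpha, d, X_train))
```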
-
25
How to build a SVM for pattern recognition (cont.)
• Optimum design of a SVM (dual form)
  – Given the training sample $\{(x_i, d_i)\}_{i=1}^{N}$, find the Lagrange multipliers $\{\alpha_i\}_{i=1}^{N}$ that maximize the objective function
  $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j K(x_i, x_j)$
  subject to the constraints
  (1) $\sum_{i=1}^{N} \alpha_i d_i = 0$
  (2) $0 \le \alpha_i \le C$ for $i = 1, 2, \ldots, N$, where C is a user-specified positive parameter
-
26
How to build a SVM for pattern recognition (cont.)
• We may view $K(x_i, x_j)$ as the ij-th element of a symmetric N-by-N matrix K:
  $\mathbf{K} = \left\{ K(x_i, x_j) \right\}_{(i, j) = 1}^{N}$
• Having found the optimum values $\alpha_{0,i}$, we can get
  $w_0 = \sum_{i=1}^{N} \alpha_{0,i} d_i \varphi(x_i)$
-
27
How to build a SVM for pattern recognition (cont.)
-
28
Example: XOR problem
• First, we choose the kernel as
  $K(x, x_i) = \left( 1 + x^t x_i \right)^2$
• With $x = [x_1, x_2]^t$ and $x_i = [x_{i1}, x_{i2}]^t$, we get
  $K(x, x_i) = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}$
• The induced feature maps are
  $\varphi(x) = \left[ 1,\ x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2 \right]^t$
  $\varphi(x_i) = \left[ 1,\ x_{i1}^2,\ \sqrt{2}\, x_{i1} x_{i2},\ x_{i2}^2,\ \sqrt{2}\, x_{i1},\ \sqrt{2}\, x_{i2} \right]^t, \quad i = 1, 2, 3, 4$
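A quick numerical check (assumed code, not from the slides) that the kernel and the explicit feature map agree on the XOR patterns, using the usual bipolar coding of the four inputs:

```python
import numpy as np

# XOR patterns in bipolar coding; d_i = +1 when exactly one input is +1.
X = np.array([[-1.0, -1.0], [-1.0, +1.0], [+1.0, -1.0], [+1.0, +1.0]])
d = np.array([-1.0, +1.0, +1.0, -1.0])

def phi(x):
    """Explicit feature map induced by K(x, x_i) = (1 + x^t x_i)^2."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

K_kernel = (1.0 + X @ X.T) ** 2                               # Gram matrix via the kernel
K_phi = np.array([[phi(a) @ phi(b) for b in X] for a in X])   # Gram matrix via phi
print(K_kernel)                        # 9 on the diagonal, 1 elsewhere (the K of the next slide)
print(np.allclose(K_kernel, K_phi))    # True: the two constructions coincide
```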
-
29
Example: XOR problem (cont.)
• We also find that
  $\mathbf{K} = \begin{bmatrix} 9 & 1 & 1 & 1 \\ 1 & 9 & 1 & 1 \\ 1 & 1 & 9 & 1 \\ 1 & 1 & 1 & 9 \end{bmatrix}$
• The objective function for the dual form is
  $Q(\alpha) = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2} \left( 9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2 \right)$
-
30
Example: XOR problem (cont.)
• Optimizing $Q(\alpha)$ with respect to the Lagrange multipliers (setting $\partial Q / \partial \alpha_i = 0$) yields
  $9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 = 1$
  $-\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 = 1$
  $-\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 = 1$
  $\alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 = 1$
-
31
Example: XOR problem (cont.)
• The optimum values of $\alpha_{0,i}$ are
  $\alpha_{0,1} = \alpha_{0,2} = \alpha_{0,3} = \alpha_{0,4} = \frac{1}{8}$
  $Q(\alpha_0) = \frac{1}{4}$
  $\frac{1}{2} \| w_0 \|^2 = \frac{1}{4} \;\Rightarrow\; \| w_0 \| = \frac{1}{\sqrt{2}}$
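These values can be checked numerically; the sketch below (illustrative, not from the slides) solves the four stationarity equations of the previous slide and forms $w_0 = \sum_i \alpha_{0,i} d_i \varphi(x_i)$:

```python
import numpy as np

# Stationarity conditions dQ/dalpha_i = 0 from the previous slide, written as A alpha = 1.
A = np.array([[ 9.0, -1.0, -1.0,  1.0],
              [-1.0,  9.0,  1.0, -1.0],
              [-1.0,  1.0,  9.0, -1.0],
              [ 1.0, -1.0, -1.0,  9.0]])
alpha = np.linalg.solve(A, np.ones(4))
print(alpha)                                   # [0.125 0.125 0.125 0.125], i.e. 1/8 each

d = np.array([-1.0, +1.0, +1.0, -1.0])
X = np.array([[-1.0, -1.0], [-1.0, +1.0], [+1.0, -1.0], [+1.0, +1.0]])

def phi(x):
    """Feature map induced by the polynomial kernel (1 + x^t x_i)^2."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

w = sum(a * di * phi(xi) for a, di, xi in zip(alpha, d, X))
print(w.round(3))                              # only the sqrt(2)*x1*x2 component survives: -1/sqrt(2)
print(round(float(np.linalg.norm(w)), 3))      # 0.707 = 1/sqrt(2), as on this slide
```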
-
32
Example: XOR problem (cont.)
• We find that the optimum weight vector is
  $w_0 = \frac{1}{8} \left[ -\varphi(x_1) + \varphi(x_2) + \varphi(x_3) - \varphi(x_4) \right] = \left[ 0,\ 0,\ -\frac{1}{\sqrt{2}},\ 0,\ 0,\ 0 \right]^t$
-
33
Example: XOR problem (cont.)
• The optimal hyperplane is defined by $w_0^t \varphi(x) = 0$:
  $\left[ 0,\ 0,\ -\tfrac{1}{\sqrt{2}},\ 0,\ 0,\ 0 \right] \left[ 1,\ x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2 \right]^t = 0$
  which reduces to
  $-x_1 x_2 = 0$
-
34
Properties and expansions of SVM
• Two important features:
  – Duality is the first feature of SVM
  – Operation in a kernel-induced feature space
• Several expansions of SVM:
  – C-Support Vector Classification (binary case)
  – ν-Support Vector Classification (binary case)
  – Distribution Estimation (one-class SVM)
  – ε-Support Vector Regression (ε-SVR)
  – ν-Support Vector Regression (ν-SVR)
-
35
Conclusion
• The SVM is an elegant and highly principled learning method for the design of classifiers for nonlinear input data.
• Compared with the back-propagation algorithm:
  – It operates only in a batch mode.
  – Whatever the learning task, it provides a method for controlling model complexity independently of dimensionality.
  – It is guaranteed to find a global extremum of the error surface.
  – The computation can be performed efficiently.
  – By using a suitable inner-product kernel, the SVM computes all the important network parameters automatically.
-
36
Applications of SVM
• Classification
• Regression
• Recognition
• Bioinformatics
-
37
LIBSVM
• A Library for Support Vector Machines
• Made by Chih-Jen Lin and Chih-Chung Chang
• Both C++ and Java sources
• http://www.csie.ntu.edu.tw/~cjlin/
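As a usage illustration (not part of the slides): scikit-learn's SVC class is built on top of LIBSVM and exposes C-SVC training from Python, while LIBSVM itself also ships svm-train / svm-predict command-line tools. The snippet below trains it on the XOR example with the polynomial kernel (1 + x^t x_i)^2.

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's SVC wraps the LIBSVM implementation

X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
d = np.array([-1, +1, +1, -1])

# kernel="poly" with gamma=1, coef0=1, degree=2 gives K(x, x') = (1 + x^t x')^2;
# a large C approximates the hard-margin machine of the XOR example.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6)
clf.fit(X, d)
print(clf.predict(X))    # [-1  1  1 -1]: all four XOR patterns classified correctly
print(clf.support_)      # indices of the support vectors (all four points here)
```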