TRANSCRIPT
[Page 1]
Support Vector Machines
Nicholas Ruozzi
University of Texas at Dallas
Slides adapted from David Sontag and Vibhav Gogate
[Page 2]
Announcements
• Homework 1 is now available online
• Join the Piazza discussion group
• Reminder: my office hours are 11am-12pm on Tuesdays
[Page 3]
Binary Classification
• Input (x^(1), y_1), …, (x^(n), y_n) with x^(i) ∈ ℝ^m and y_i ∈ {−1, +1}
• We can think of the observations as points in ℝ^m, each with an associated sign (+ or −, corresponding to its label y_i)
• An example with m = 2
[Figure: + and − points in the plane separated by the line w^T x + b = 0, with w^T x + b > 0 on the + side and w^T x + b < 0 on the − side]
[Page 4]
Binary Classification
(Same setup and figure as the previous slide.)
𝑤 is called the vector of weights and 𝑏 is called the bias
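The decision rule above classifies a point by the sign of w^T x + b. A minimal sketch in NumPy, with hypothetical weights and bias chosen purely for illustration:

```python
import numpy as np

# Hypothetical weights and bias for a 2D example (m = 2)
w = np.array([1.0, -2.0])
b = 0.5

def classify(x, w, b):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if w @ x + b > 0 else -1

print(classify(np.array([3.0, 0.0]), w, b))  # w.x + b = 3.5 > 0 -> +1
print(classify(np.array([0.0, 2.0]), w, b))  # w.x + b = -3.5 < 0 -> -1
```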
[Page 5]
What If the Data Isn't Separable?
• Input (x^(1), y_1), …, (x^(n), y_n) with x^(i) ∈ ℝ^m and y_i ∈ {−1, +1}
• We can think of the observations as points in ℝ^m, each with an associated sign (+ or −, corresponding to its label y_i)
• An example with m = 2
[Figure: + and − points intermixed in the plane; no single line separates them]
[Page 6]
What If the Data Isn't Separable?

(Same setup and figure as the previous slide.)
[Page 7]
Adding Features
• The idea:
– Given the observations 𝑥(1), … , 𝑥(𝑛), construct a feature vector
𝜙(𝑥)
– Use 𝜙 𝑥(1) , … , 𝜙 𝑥(𝑛) instead of 𝑥(1), … , 𝑥(𝑛) in the
learning algorithm
– Goal is to choose 𝜙 so that 𝜙 𝑥(1) , … , 𝜙 𝑥(𝑛) are linearly
separable
– Learn linear separators of the form 𝑤𝑇𝜙 𝑥 (instead of 𝑤𝑇𝑥)
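The idea can be illustrated with a toy example. The data and the feature map φ(x) = (x, x²) below are assumptions chosen for illustration: the 1D data is not linearly separable, but its image under φ is:

```python
import numpy as np

# Toy 1D data that is NOT linearly separable on the line:
# the negatives sit between the positives.
X = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])

# Hypothetical feature map phi(x) = (x, x^2): in this 2D feature space
# the classes can be split by the horizontal line x_2 = 1 (i.e., x^2 = 1).
def phi(x):
    return np.array([x, x**2])

# Check separability with the separator w = (0, 1), b = -1:
w, b = np.array([0.0, 1.0]), -1.0
preds = np.sign([w @ phi(x) + b for x in X])
print(np.all(preds == y))  # True: phi makes the data linearly separable
```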
[Page 8]
Adding Features
• Sometimes it is convenient to group the bias together with
the weights
• To do this
– Let φ(x_1, x_2) = (x_1, x_2, 1)^T and w = (w_1, w_2, b)^T
– This gives
  w^T φ(x_1, x_2) = w_1 x_1 + w_2 x_2 + b,
  the same value as the original w^T x + b
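A quick numerical check of this bias-absorption trick (the particular values of x, w, and b are arbitrary):

```python
import numpy as np

x = np.array([2.0, -1.0])
w = np.array([0.5, 1.5])
b = -0.25

# Append a constant 1 to x and the bias b to w; a single dot product
# then reproduces w^T x + b.
phi_x = np.append(x, 1.0)   # (x1, x2, 1)
w_aug = np.append(w, b)     # (w1, w2, b)

assert np.isclose(w_aug @ phi_x, w @ x + b)
```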
[Page 9]
Support Vector Machines
[Figure: linearly separable + and − points with several candidate separating lines]
• How can we decide between perfect classifiers?
[Page 11]
Support Vector Machines
[Figure: linearly separable + and − points with a separator and its margin]
• Define the margin to be the distance of the closest data
point to the classifier
[Page 12]
Support Vector Machines

• Support vector machines (SVMs): choose the classifier with the largest margin
– Has good practical and theoretical performance

[Figure: the maximum-margin separator for the same data]
[Page 13]
Some Geometry

• In n dimensions, a hyperplane is the set of solutions to the equation
  w^T x + b = 0
  with w ∈ ℝ^n, b ∈ ℝ
• The vector w is sometimes called the normal vector of the hyperplane

[Figure: the hyperplane w^T x + b = 0 with its normal vector w]
[Page 14]
Some Geometry

• In n dimensions, a hyperplane is the set of solutions to the equation
  w^T x + b = 0
• Note that this equation is scale invariant: for any nonzero scalar c,
  c · (w^T x + b) = 0
  defines the same hyperplane

[Figure: the hyperplane w^T x + b = 0 with its normal vector w]
[Page 15]
Some Geometry

• The distance between a point y and the hyperplane w^T x + b = 0 is the length of the perpendicular dropped from y to the hyperplane; if z is the point where the perpendicular meets the hyperplane, then
  y − z = ‖y − z‖ · w/‖w‖

[Figure: the hyperplane w^T x + b = 0, a point y, and its projection z onto the hyperplane]
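This perpendicularity argument yields the familiar point-to-hyperplane distance |w^T y + b| / ‖w‖, which can be checked numerically. The values of w, b, and y below are arbitrary illustrations:

```python
import numpy as np

# Distance from a point y to the hyperplane w^T x + b = 0.
# The perpendicular from y meets the hyperplane at
# z = y - ((w^T y + b) / ||w||^2) * w, and ||y - z|| = |w^T y + b| / ||w||.
def distance(y, w, b):
    return abs(w @ y + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0
y = np.array([5.0, 5.0])
print(distance(y, w, b))  # |15 + 20 - 5| / 5 = 6.0

# The projected point z does lie on the hyperplane:
z = y - ((w @ y + b) / (w @ w)) * w
print(np.isclose(w @ z + b, 0.0))  # True
```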
[Page 16]
Scale Invariance

• By scale invariance, we can assume that c = 1
• The maximum margin is always attained by choosing w^T x + b = 0 so that it is equidistant from the closest data point classified as +1 and the closest data point classified as −1

[Figure: the hyperplane w^T x + b = 0 and a parallel hyperplane w^T x + b = c through the closest data point]
[Page 17]
Scale Invariance

• We want to maximize the margin subject to the constraints
  y_i (w^T x^(i) + b) ≥ 1
• But how do we compute the size of the margin?

[Figure: the hyperplane w^T x + b = 0 with parallel hyperplanes w^T x + b = c and w^T x + b = −c]
[Page 18]
Some Geometry

Putting it all together:
  y − z = ‖y − z‖ · w/‖w‖
and
  w^T y + b = 1,  w^T z + b = 0
so
  w^T (y − z) = 1
and
  w^T (y − z) = ‖y − z‖ ‖w‖
which gives
  ‖y − z‖ = 1/‖w‖

[Figure: the separator w^T x + b = 0 with margin hyperplanes w^T x + b = 1 and w^T x + b = −1, a point y on the +1 hyperplane, and its projection z]
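The conclusion ‖y − z‖ = 1/‖w‖ can be verified numerically; the particular w, b, and the point y on the hyperplane w^T x + b = 1 below are assumptions chosen for illustration:

```python
import numpy as np

# Numerical check of ||y - z|| = 1 / ||w||: pick w, b, a point y with
# w^T y + b = 1, and its projection z onto w^T x + b = 0.
w, b = np.array([2.0, 0.0]), 1.0
y = np.array([0.0, 3.0])               # w^T y + b = 1
z = y - ((w @ y + b) / (w @ w)) * w    # foot of the perpendicular

assert np.isclose(w @ y + b, 1.0)
assert np.isclose(w @ z + b, 0.0)
assert np.isclose(np.linalg.norm(y - z), 1.0 / np.linalg.norm(w))
```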
[Page 19]
SVMs
• This analysis yields the following optimization problem
  max_w 1/‖w‖
  such that
  y_i (w^T x^(i) + b) ≥ 1, for all i
• Or, equivalently,
  min_w ‖w‖²
  such that
  y_i (w^T x^(i) + b) ≥ 1, for all i
[Page 20]
SVMs
  min_w ‖w‖²
  such that
  y_i (w^T x^(i) + b) ≥ 1, for all i
• This is a standard quadratic programming problem
– Falls into the class of convex optimization problems
– Can be solved with many specialized optimization tools (e.g.,
quadprog() in MATLAB)
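As a sketch, this quadratic program can be handed to a generic constrained optimizer; the tiny dataset below is a made-up, linearly separable example, and a dedicated QP solver (such as MATLAB's quadprog) would normally be used instead:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (an illustration, not from the lecture).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):          # v = (w1, w2, b); minimize ||w||^2
    return v[0]**2 + v[1]**2

# One margin constraint y_i (w^T x^(i) + b) >= 1 per training point.
constraints = [
    {"type": "ineq", "fun": lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print(np.sign(X @ w + b))  # should match the labels y
```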
[Page 21]
SVMs
• Where does the name come from?
– The data points such that y_i (w^T x^(i) + b) = 1 are called support vectors
[Figure: support vectors lying on the hyperplanes w^T x + b = 1 and w^T x + b = −1 on either side of the separator w^T x + b = 0]
[Page 22]
SVMs
• What if the data isn't linearly separable?
– Use feature vectors
• What if we want to do more than just binary classification (e.g., y ∈ {1, 2, 3})?
– One versus all: for each class, compute a linear separator between that class and all other classes
– All versus all: for each pair of classes, compute a linear separator between the two classes
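The one-versus-all scheme can be sketched with any binary linear learner; here a simple perceptron stands in for the SVM (the data, the class layout, and the use of a perceptron are all assumptions for illustration):

```python
import numpy as np

# One-versus-all: train one binary linear separator per class.
def perceptron(X, y, epochs=100):
    """A simple linear learner: perceptron with the bias absorbed."""
    Xa = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
    return w

# Toy 3-class data, one well-separated cluster per class.
X = np.array([[0.0, 5.0], [1.0, 6.0],       # class 1
              [5.0, 0.0], [6.0, 1.0],       # class 2
              [-5.0, -5.0], [-6.0, -4.0]])  # class 3
labels = np.array([1, 1, 2, 2, 3, 3])

# One separator per class: the current class vs. all the rest.
W = {c: perceptron(X, np.where(labels == c, 1.0, -1.0)) for c in (1, 2, 3)}

# Predict with the separator that gives the largest score.
Xa = np.hstack([X, np.ones((len(X), 1))])
preds = np.array([max(W, key=lambda c: W[c] @ xi) for xi in Xa])
print(preds)  # ideally recovers the labels
```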