lect-26.pdf
TRANSCRIPT
7/25/2019 Lect-26.pdf
Machine Learning
Computational Learning Theory
Introduction

Computational learning theory asks:
- Is it possible to identify classes of learning problems that are inherently easy or difficult?
- Can we characterize the number of training examples necessary or sufficient to assure successful learning?
- How is this number affected if the learner is allowed to pose queries to the trainer?
- Can we characterize the number of mistakes that a learner will make before learning the target function?
- Can we characterize the inherent computational complexity of classes of learning problems?

General answers to all these questions are not yet known.

This chapter covers:
- Sample complexity
- Computational complexity
- Mistake bounds
Probably Approximately Correct (PAC) Learning [1]

Problem setting for concept learning:
X : Set of all possible instances over which target functions may be defined.
Training and testing instances are generated from X according to some unknown distribution D. We assume that D is stationary.
C : Set of target concepts that our learner might be called upon to learn. A target concept is a Boolean function c : X → {0, 1}.
H : Set of all possible hypotheses.
Goal : Produce a hypothesis h ∈ H which is an estimate of c.
Evaluation : Performance of h is measured over new samples drawn randomly according to distribution D.
Error of a hypothesis:

The training error (sample error) of hypothesis h with respect to target concept c and a data sample S of size n is

    error_S(h) = (1/n) Σ_{x∈S} [c(x) ≠ h(x)]

The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:

    error_D(h) = Pr_{x~D} [c(x) ≠ h(x)] = Σ_{x : h(x)≠c(x)} D(x)

error_D(h) depends strongly on D, where D(x) is the probability of presenting instance x.
h is approximately correct if error_D(h) ≤ ε.

[Figure: instance space X, with + and - regions for c and h; the regions where c and h disagree contribute to the true error.]
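The two error definitions above can be sketched directly. This is a minimal toy illustration (the concept, hypothesis, and uniform distribution are my own, not from the lecture): error_S(h) counts disagreements on a sample, while error_D(h) sums D(x) over the disagreement region.

```python
import random

# Hypothetical target concept and hypothesis over X = {0, ..., 99};
# they disagree exactly on the points 50..54.
c = lambda x: x >= 50          # target concept
h = lambda x: x >= 55          # learned hypothesis

def training_error(h, c, S):
    """error_S(h) = fraction of sample points where h disagrees with c."""
    return sum(1 for x in S if h(x) != c(x)) / len(S)

def true_error(h, c, X, D):
    """error_D(h) = sum of D(x) over the points where h and c disagree."""
    return sum(D(x) for x in X if h(x) != c(x))

X = range(100)
D = lambda x: 1 / 100          # uniform distribution over X

print(true_error(h, c, X, D))  # 0.05: h and c disagree on 5 of 100 points
S = [random.randrange(100) for _ in range(1000)]
print(training_error(h, c, S)) # close to 0.05 for a large random sample
```

For a sample drawn from D, error_S(h) is an estimate of error_D(h); the rest of the chapter asks how large the sample must be for that estimate to be trustworthy.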
Probably Approximately Correct (PAC) Learning [2]

We are trying to characterize the number of training examples needed to learn a hypothesis h for which error_D(h) = 0.

Problems:
- There may be multiple consistent hypotheses, and the learner cannot tell which one to pick.
- Since the training set is chosen randomly, the true error may not be zero.

To accommodate these difficulties, we weaken our demands on the learner:
- We do not require zero true error; we require only that the true error is bounded by some constant ε.
- We do not require that the learner succeed for every training sequence; we require only that its failure probability is bounded by some constant δ, the confidence parameter.

In short, we require only that the learner probably learns a hypothesis that is approximately correct.

PAC learnability:
Definition : C is PAC-learnable by L using H if for all c ∈ C, all distributions D over X, all ε such that 0 < ε < 1/2, and all δ such that 0 < δ < 1/2, learner L will, with probability at least (1 - δ), output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and |C|.

If L requires some minimum processing time per training example, then for C to be PAC-learnable by L, L must learn from a polynomial number of training examples.
Probably Approximately Correct (PAC) Learning [3]

Sample complexity : The growth in the number of required training examples with problem size.
The most limiting factor for the success of a learner is the limited availability of training data.
Consistent learner : A learner is consistent if it outputs hypotheses that perfectly fit the training data, whenever possible.
Our concern : Can we bound the true error of h, given the training error of h?
Definition : The version space VS_{H,D} is said to be ε-exhausted with respect to c and D if every h ∈ VS_{H,D} has true error less than ε with respect to c and D (∀h ∈ VS_{H,D} . error_D(h) < ε).
[Figure: hypothesis space H containing the version space VS_{H,D}; each hypothesis is annotated with its training error r and true error, e.g. (error = 0.1, r = 0.0), (error = 0.2, r = 0.0) inside VS_{H,D}, and (error = 0.1, r = 0.2), (error = 0.3, r = 0.1), (error = 0.2, r = 0.3), (error = 0.3, r = 0.4) outside.]
Probably Approximately Correct (PAC) Learning [4]

Theorem [Haussler, 1988]
If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than or equal to

    |H| e^{-εm}

Important result! This bounds the probability that any consistent learner will output a hypothesis h with error_D(h) ≥ ε.
We want this probability to be below a specified threshold δ:

    |H| e^{-εm} ≤ δ

To achieve this, solve the inequality for m:

    m ≥ (1/ε)(ln |H| + ln (1/δ))

We need to see at least this many examples. Note that the bound can be loose: it is possible that |H| e^{-εm} > 1.
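The bound above is easy to evaluate numerically. A minimal sketch (function name is my own) that returns the smallest integer m satisfying m ≥ (1/ε)(ln |H| + ln (1/δ)):

```python
import math

def sample_complexity(h_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# EnjoySport-style hypothesis space: |H| = 973, eps = 0.1, delta = 0.05.
print(sample_complexity(973, 0.1, 0.05))  # 99 (98.8 rounded up)
```

Because the dependence on |H| is only logarithmic, even large finite hypothesis spaces can have modest sample complexity.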
Probably Approximately Correct (PAC) Learning [5]

Example :
H : conjunctions of constraints on up to n Boolean attributes (n Boolean literals).
Each attribute can appear as a positive literal, as a negative literal, or not at all, so |H| = 3^n, and

    m ≥ (1/ε)(ln 3^n + ln (1/δ)) = (1/ε)(n ln 3 + ln (1/δ))

How about EnjoySport?
H as given in EnjoySport (conjunctive concepts with "don't cares"): |H| = 973.

    m ≥ (1/ε)(ln |H| + ln (1/δ))

Example goal : with probability 1 - δ = 95%, output a hypothesis with error_D(h) < 0.1:

    m ≥ (1/0.1)(ln 973 + ln (1/0.05)) ≈ 98.8
Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
0        Sunny  Warm     Normal    Strong  Warm   Same      Yes
1        Sunny  Warm     High      Strong  Warm   Same      Yes
2        Rainy  Cold     High      Strong  Warm   Change    No
3        Sunny  Warm     High      Strong  Cool   Change    Yes
Probably Approximately Correct (PAC) Learning [6]

Unbiased learner
Recall the sample-complexity bound: m ≥ (1/ε)(ln |H| + ln (1/δ)).
Sample complexity is not always polynomial.
Example : for an unbiased learner, |H| = 2^{|X|}.
Suppose X consists of n Boolean (binary-valued) attributes. Then |X| = 2^n and |H| = 2^{2^n}, so

    m ≥ (1/ε)(2^n ln 2 + ln (1/δ))

Sample complexity for this H is exponential in n.

Agnostic learner : A learner that makes no assumption that the target concept is representable by H and that simply finds the hypothesis with minimum error.
How hard is this? Sample complexity:

    m ≥ (1/(2ε²))(ln |H| + ln (1/δ))

derived from the Hoeffding bound: P[error_D(h) > error_S(h) + ε] ≤ e^{-2mε²}.
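The cost of dropping the realizability assumption can be made concrete. A sketch (function names are my own) comparing the consistent-learner bound with the agnostic bound on the same |H| = 973, ε = 0.1, δ = 0.05 setting:

```python
import math

def consistent_sample_complexity(h_size, eps, delta):
    """Realizable case: m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

def agnostic_sample_complexity(h_size, eps, delta):
    """Agnostic case: m >= (1/(2 eps^2)) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * eps ** 2))

print(consistent_sample_complexity(973, 0.1, 0.05))  # 99
print(agnostic_sample_complexity(973, 0.1, 0.05))    # 494
```

The 1/ε² factor in the agnostic bound dominates: for ε = 0.1 the agnostic learner needs roughly five times as many examples as the consistent learner.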
Probably Approximately Correct (PAC) Learning [7]

Drawbacks of the sample-complexity bound:
- The bound is not tight; when |H| is large, the probability bound may be greater than 1.
- It does not apply when |H| is infinite.

Vapnik-Chervonenkis dimension (VC(H)) : The VC dimension measures the complexity of hypothesis space H not by the number of distinct hypotheses |H|, but by the number of distinct instances from X that can be completely discriminated using H.

Dichotomies : A dichotomy (concept) of a set S is a partition of S into two subsets S1 and S2.
Shattering : A set of instances S is shattered by hypothesis space H if and only if for every dichotomy (concept) of S, there exists a hypothesis in H consistent with this dichotomy.
Intuition : a rich set of functions shatters a larger instance space.
Vapnik-Chervonenkis Dimension [1]

From Chapter 2, an unbiased hypothesis space is capable of representing every possible concept (dichotomy) defined over the instance space X.
An unbiased hypothesis space H can shatter the instance space X.
If H cannot shatter X, but can shatter some large subset S of X, what happens? This is what VC(H) measures.

Vapnik-Chervonenkis dimension (VC(H)) :
VC(H) of a hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞.
For any finite H, VC(H) ≤ log2 |H|, since shattering a set of size d requires at least 2^d distinct hypotheses.
Vapnik-Chervonenkis Dimension [2]

Example :
X = R : the set of real numbers.
H : the set of intervals on the real line of the form a < x < b, where a and b may be any real constants.
What is VC(H)?
Let S = {3.1, 5.7}. Can S be shattered by H? Yes; for example:
- 1 < x < 2 includes neither point
- 1 < x < 4 includes only 3.1
- 4 < x < 7 includes only 5.7
- 1 < x < 7 includes both points
Let S = {x0, x1, x2} with x0 < x1 < x2. Can S be shattered by H?
No: the dichotomy that includes x0 and x2 but not x1 cannot be represented by any interval in H. Thus VC(H) = 2.
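The shattering argument above can be verified mechanically. A brute-force sketch (my own, not from the lecture) that checks whether open intervals a < x < b realize every dichotomy of a small point set; only finitely many candidate endpoints between and around the points need to be tried:

```python
from itertools import combinations

def interval_covers(a, b, points, subset):
    """Does the interval (a, b) include exactly the points in `subset`?"""
    return all((a < x < b) == (x in subset) for x in points)

def shattered_by_intervals(points):
    """True iff every dichotomy of `points` is realized by some interval."""
    pts = sorted(points)
    # Candidate endpoints: two below the smallest point (so an empty
    # interval is available), the midpoints, and one above the largest.
    cands = ([pts[0] - 2, pts[0] - 1]
             + [(p + q) / 2 for p, q in zip(pts, pts[1:])]
             + [pts[-1] + 1])
    for r in range(len(pts) + 1):
        for subset in combinations(pts, r):
            if not any(interval_covers(a, b, pts, set(subset))
                       for a in cands for b in cands if a < b):
                return False
    return True

print(shattered_by_intervals([3.1, 5.7]))       # True: VC(intervals) >= 2
print(shattered_by_intervals([1.0, 2.0, 3.0]))  # False: {1.0, 3.0} alone fails
```

Every 2-point set is shattered but no 3-point set is, which is exactly the VC(H) = 2 conclusion above.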
Vapnik-Chervonenkis Dimension [3]

Example :
X : the set of instances corresponding to points in the x-y plane.
H : the set of all linear decision surfaces in the plane (such as the decision function of a perceptron).
What is VC(H)?
Collinear points cannot be shattered, but VC(H) requires only that some set of a given size be shattered: any three non-collinear points can be shattered by lines, so VC(H) ≥ 3.
To show VC(H) < d, we must show that no set of size d can be shattered.
In this example, no set of size 4 can be shattered; hence VC(H) = 3.
It can be shown that the VC dimension of linear decision surfaces in r-dimensional space is r + 1.

Example :
X : the set of instances corresponding to points in the x-y plane.
H : the set of all axis-aligned rectangles in two dimensions.
What is VC(H)?
Vapnik-Chervonenkis Dimension [4]

Example : feed-forward neural networks with N free parameters
- The VC dimension of a network with linear activation functions is O(N).
- The VC dimension of a network with threshold activation functions is O(N log N).
- The VC dimension of a network with sigmoid activation functions is O(N^2).
Mistake Bounds

How many mistakes will the learner make in its predictions before it learns the target concept?

Example : Find-S
Suppose H is the set of conjunctions of up to n Boolean literals and their negations.

Find-S:
- Initialize h to the most specific hypothesis l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ … ∧ ln ∧ ¬ln
- For each positive training instance x, remove from h any literal that is not satisfied by x
- Output hypothesis h

How many mistakes before converging to the correct h?
- Once a literal is removed, it is never put back.
- There are no false positives (we started with the most restrictive h), so we only count false negatives.
- The first positive example will remove n of the 2n candidate literals.
- Worst case: every remaining literal is also removed, incurring one mistake each.
- Find-S makes at most n + 1 mistakes.
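The argument above can be sketched in code. This is a minimal illustration (representation and names are my own): h is a set of literals (i, v) meaning "attribute i must equal v", initialized to all 2n literals, and positive examples strip away violated literals.

```python
def find_s_mistakes(n, target, stream):
    """Run Find-S over a stream of n-tuples of booleans, counting mistakes."""
    h = {(i, v) for i in range(n) for v in (False, True)}  # most specific h
    mistakes = 0
    for x in stream:
        pred = all(x[i] == v for i, v in h)   # h predicts positive iff all hold
        if pred != target(x):
            mistakes += 1                     # only false negatives can occur
        if target(x):                         # positive example: generalize h
            h = {(i, v) for i, v in h if x[i] == v}
    return mistakes

# Hypothetical target over n = 3 attributes: x0 AND x1.
target = lambda x: x[0] and x[1]
stream = [(True, True, True), (True, True, False), (False, True, True)]
print(find_s_mistakes(3, target, stream))  # 2 here; never more than n + 1 = 4
```

The first positive example cuts h from 2n literals to n in one mistake; each later mistake removes at least one more literal, giving the n + 1 bound.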