
Machine Learning

Computational Learning Theory


    Introduction

    Computational learning theory

Is it possible to identify classes of learning problems that are inherently easy or difficult?

Can we characterize the number of training examples necessary or sufficient to assure successful learning?

How is this number affected if the learner is allowed to pose queries to the trainer?

Can we characterize the number of mistakes that a learner will make before learning the target function?

Can we characterize the inherent computational complexity of classes of learning problems?

General answers to all these questions are not yet known.

This chapter

Sample complexity

Computational complexity

Mistake bounds


    Probably Approximately Correct (PAC) Learning [1]

    Problem Setting for concept learning:

    X : Set of all possible instances over which target functions may be defined.

Training and testing instances are generated from X according to some unknown distribution D.

We assume that D is stationary.

C : Set of target concepts that our learner might be called upon to learn. A target concept is a Boolean function c : X → {0,1}.

    H : Set of all possible hypotheses.

Goal : Produce a hypothesis h ∈ H which is an estimate of c.

    Evaluation: Performance of h measured over new samples drawn randomly using distribution D.

    Error of a Hypothesis :

The training error (sample error) of hypothesis h with respect to target concept c and data sample S of size n is

error_S(h) ≡ (1/n) Σ_{x∈S} [c(x) ≠ h(x)]

The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:

error_D(h) ≡ Pr_{x∈D} [c(x) ≠ h(x)]

error_D(h) depends strongly on D; D(x) is the probability of presenting x.

h is approximately correct if error_D(h) ≤ ε.

(Figure: instance space X, showing the + and − regions where c and h disagree.)

error_D(h) = Σ_{x : h(x) ≠ c(x)} D(x)
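These two definitions translate directly into code. Below is a minimal Python sketch (not from the slides; the toy distribution, concept c, hypothesis h, and function names are illustrative assumptions) that computes the sample error over a finite sample S and estimates the true error by drawing fresh instances from D.

```python
import random

def sample_error(h, c, S):
    """Training (sample) error: fraction of x in S where h(x) != c(x)."""
    return sum(h(x) != c(x) for x in S) / len(S)

def estimate_true_error(h, c, draw_x, trials=100_000):
    """Monte-Carlo estimate of error_D(h) = Pr_{x ~ D}[h(x) != c(x)],
    using fresh instances drawn from D via draw_x()."""
    xs = (draw_x() for _ in range(trials))
    return sum(h(x) != c(x) for x in xs) / trials

# Toy setup (illustrative assumption): D is uniform on [0, 1),
# the target concept is c(x) = [x < 0.5], and h uses a slightly wrong threshold.
c = lambda x: x < 0.5
h = lambda x: x < 0.6
S = [random.random() for _ in range(20)]
print("sample error:", sample_error(h, c, S))
print("true error  :", estimate_true_error(h, c, random.random))  # close to 0.1
```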


    Probably Approximately Correct (PAC) Learning [2]

We are trying to characterize the number of training examples needed to learn a hypothesis h for which error_D(h) = 0.

    Problems :

There may be multiple hypotheses consistent with the training data, and the learner cannot be certain to pick the one corresponding to the target concept.

Since the training set is chosen randomly, the true error may not be zero.

To accommodate these difficulties, we weaken our demands on the learner:

We do not require zero true error; we require only that the true error be bounded by some constant ε.

We do not require that the learner succeed for every training sequence; we require only that its probability of failure be bounded by some constant δ. δ is the confidence parameter.

In short, we require only that the learner probably learns a hypothesis that is approximately correct.

    PAC Learnability :

Definition : C is PAC-learnable by L using H if for all c ∈ C, all distributions D over X, all ε such that 0 < ε < 1/2, and all δ such that 0 < δ < 1/2, learner L will, with probability at least (1 − δ), output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and |C|.

If L requires some minimum processing time per training example, then for C to be PAC-learnable by L, L must learn from a polynomial number of training examples.


    Probably Approximately Correct (PAC) Learning [3]

Sample Complexity : The growth in the number of required training examples with problem size.

The most limiting factor for the success of a learner is the limited availability of training data.

Consistent learner : A learner is consistent if it outputs hypotheses that perfectly fit the training data, whenever possible.

Our concern : Can we bound the true error of h, given the training error of h?

Definition : The version space VS_{H,D} is said to be ε-exhausted with respect to c and D if every h ∈ VS_{H,D} has true error less than ε with respect to c and D (∀h ∈ VS_{H,D} : error_D(h) < ε).

(Figure: hypothesis space H containing the version space VS_{H,D}; each hypothesis is labeled with its true error and training error r, e.g. error = 0.1, r = 0.0; error = 0.2, r = 0.0; error = 0.1, r = 0.2; error = 0.3, r = 0.1; error = 0.2, r = 0.3; error = 0.3, r = 0.4. Here r = training error and error = true error.)


    Probably Approximately Correct (PAC) Learning [4]

    Theorem [Haussler, 1988]

If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than or equal to |H| e^(−εm).

    Important Result!

This bounds the probability that any consistent learner will output a hypothesis h with error(h) ≥ ε.

We want this probability to be below a specified threshold δ:

|H| e^(−εm) ≤ δ

To achieve this, solve the inequality for m:

m ≥ (1/ε) (ln |H| + ln(1/δ))

We need to see at least this many examples.

Note that it is possible that |H| e^(−εm) > 1, in which case the bound tells us nothing.
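As a quick illustration of the bound just derived, here is a small Python sketch (the helper name and the choice of n are my own assumptions, not from the slides) that computes the smallest m satisfying m ≥ (1/ε)(ln |H| + ln(1/δ)).

```python
import math

def sample_complexity(H_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * (ln|H| + ln(1/delta)):
    enough examples to epsilon-exhaust the version space with probability >= 1 - delta."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

# Conjunctions of up to n boolean literals give |H| = 3^n (see the next slide).
print(sample_complexity(3 ** 10, eps=0.1, delta=0.05))  # 140 examples for n = 10
```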


    Probably Approximately Correct (PAC) Learning [5]

    Example :

H : conjunctions of constraints on up to n boolean attributes (n boolean literals)

|H| = 3^n, so m ≥ (1/ε)(ln 3^n + ln(1/δ)) = (1/ε)(n ln 3 + ln(1/δ))

How About EnjoySport?

H as given in EnjoySport (conjunctive concepts with don't cares)

|H| = 973

m ≥ (1/ε)(ln |H| + ln(1/δ))

Example goal: with probability 1 − δ = 95%, output a hypothesis with error_D(h) < 0.1

m ≥ (1/0.1)(ln 973 + ln(1/0.05)) ≈ 98.8 (see the check after the table below)

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
0        Sunny  Warm     Normal    Strong  Warm   Same      Yes
1        Sunny  Warm     High      Strong  Warm   Same      Yes
2        Rainy  Cold     High      Strong  Warm   Change    No
3        Sunny  Warm     High      Strong  Cool   Change    Yes
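The ≈ 98.8 figure above can be reproduced with a few lines of Python (a minimal check of the arithmetic, nothing more):

```python
import math

# EnjoySport: |H| = 973, epsilon = 0.1, delta = 0.05
m = (1 / 0.1) * (math.log(973) + math.log(1 / 0.05))
print(round(m, 1))  # 98.8, so roughly 99 training examples suffice
```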


    Probably Approximately Correct (PAC) Learning [6]

    Unbiased Learner

Recall: sample complexity bound m ≥ (1/ε)(ln |H| + ln(1/δ))

Sample complexity is not always polynomial.

Example: for an unbiased learner, |H| = 2^|X|

Suppose X consists of n booleans (binary-valued attributes)

|X| = 2^n, so |H| = 2^(2^n)

m ≥ (1/ε)(2^n ln 2 + ln(1/δ))

Sample complexity for this H is exponential in n.

Agnostic learner : A learner that makes no assumption that the target concept is representable by H and that simply finds the hypothesis with minimum training error.

How hard is this? Sample complexity: m ≥ (1/(2ε²))(ln |H| + ln(1/δ))

Derived from Hoeffding bounds: Pr[TrueError_D(h) > TrainingError_D(h) + ε] ≤ e^(−2mε²)
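To see how much the 1/ε² dependence of the agnostic bound costs relative to the consistent-learner bound, here is a short Python comparison (a sketch under the assumption of conjunctions over n = 10 boolean attributes, so |H| = 3^n; the function names are my own):

```python
import math

def m_consistent(H_size, eps, delta):
    """m >= (1/eps)(ln|H| + ln(1/delta)): bound for a consistent learner."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def m_agnostic(H_size, eps, delta):
    """m >= (1/(2 eps^2))(ln|H| + ln(1/delta)): agnostic bound via Hoeffding."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps ** 2))

H_size = 3 ** 10  # conjunctions over n = 10 boolean attributes
print(m_consistent(H_size, 0.1, 0.05))  # 140
print(m_agnostic(H_size, 0.1, 0.05))    # 700 -- the 1/eps^2 dependence is much costlier
```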


    Probably Approximately Correct (PAC) Learning [7]

    Drawbacks of sample complexity

The bound is not tight; when |H| is large, the bound on the probability may be greater than 1.

The bound cannot be applied when |H| is infinite.

Vapnik-Chervonenkis dimension (VC(H)): The VC dimension measures the complexity of hypothesis space H not by the number of distinct hypotheses |H|, but by the number of distinct instances from X that can be completely discriminated using H.

Dichotomies : A dichotomy (concept) of a set S is a partition of S into two subsets S1 and S2.

Shattering : A set of instances S is shattered by hypothesis space H if and only if for every dichotomy (concept) of S, there exists a hypothesis in H consistent with this dichotomy.

Intuition: a rich set of functions shatters a larger instance space.



Vapnik-Chervonenkis Dimension [1]

From Chapter 2, an unbiased hypothesis space is capable of representing every possible concept (dichotomy) defined over the instance space X.

An unbiased hypothesis space H can shatter the instance space X.

If H cannot shatter X, but can shatter some large subset S of X, what happens? This is captured by VC(H).

Vapnik-Chervonenkis dimension (VC(H)):

VC(H) of a hypothesis space H defined over the instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞.

For any finite H, VC(H) ≤ log2 |H| (shattering d instances requires 2^d distinct hypotheses, so 2^d ≤ |H|).
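The definitions of shattering and VC dimension can be checked mechanically for small finite classes. The following Python sketch (the brute-force function and the toy threshold class are my own illustrations, not from the slides) computes the VC dimension over a finite instance pool and confirms the VC(H) ≤ log2 |H| bound.

```python
from itertools import combinations
import math

def vc_dimension(X, H):
    """Brute-force VC dimension of a finite hypothesis class H (each h: x -> bool)
    over a finite instance pool X: size of the largest subset of X shattered by H."""
    def shattered(S):
        labelings = {tuple(h(x) for x in S) for h in H}
        return len(labelings) == 2 ** len(S)
    return max(d for d in range(len(X) + 1)
               if any(shattered(S) for S in combinations(X, d)))

# Toy finite class (illustrative assumption): thresholds h_t(x) = (x >= t).
X = [1, 2, 3, 4]
H = [lambda x, t=t: x >= t for t in range(0, 6)]
print(vc_dimension(X, H), "<=", math.log2(len(H)))  # 1 <= 2.58...
```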


Vapnik-Chervonenkis Dimension [2]

    Example :

    X = R : The set of real numbers.

H : The set of intervals on the real line of the form a < x < b, where a and b may be any real constants.

    What is VC (H)?

Let S = {3.1, 5.7}; can S be shattered by H? Yes — the four dichotomies are realized by, for example:

1 < x < 2 (contains neither point)

1 < x < 4 (contains only 3.1)

4 < x < 7 (contains only 5.7)

1 < x < 7 (contains both points)

Let S = {x0, x1, x2} (x0 < x1 < x2); can S be shattered by H?

The dichotomy that includes x0 and x2 but not x1 cannot be represented by any interval in H. Thus VC(H) = 2.
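The interval example above can be verified by brute force. The sketch below (the helper name and the candidate-endpoint trick are my own; it only checks shattering of the finite set supplied) enumerates every dichotomy of the given points and tests whether some interval a < x < b realizes it.

```python
from itertools import combinations

def interval_shatters(points):
    """Does the class {x : a < x < b} shatter this finite set of reals?
    Brute force: for every dichotomy, search for an interval that realizes it."""
    pts = sorted(points)
    # Only the ordering matters, so endpoints just outside and midway between points suffice.
    cuts = [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]
    for k in range(len(pts) + 1):
        for positive in combinations(pts, k):
            target = set(positive)
            if not any({x for x in pts if a < x < b} == target
                       for a in cuts for b in cuts):
                return False
    return True

print(interval_shatters([3.1, 5.7]))       # True:  VC(H) >= 2
print(interval_shatters([1.0, 2.0, 3.0]))  # False: the {x0, x2}-without-x1 dichotomy fails
```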


Vapnik-Chervonenkis Dimension [3]

Example :

X : The set of instances corresponding to points in the x-y plane.

H : The set of all linear decision surfaces in the plane (such as the decision function of a perceptron).

    What is VC (H)?

Collinear points cannot be shattered! So is VC(H) equal to 2 or 3? (VC(H) requires only that some set of that size be shattered, so three non-collinear points suffice.)

Hence VC(H) ≥ 3. To show VC(H) < d, we must show that no set of size d can be shattered.

In this example, no set of size 4 can be shattered, hence VC(H) = 3.

It can be shown that the VC dimension of linear decision surfaces in r-dimensional space is r + 1.

    Example :

X : The set of instances corresponding to points in the x-y plane.

H : The set of all axis-aligned rectangles in two dimensions.

    What is VC (H)?



Vapnik-Chervonenkis Dimension [4]

Example : Feed-forward neural networks with N free parameters

The VC dimension of a neural network with linear activation functions is O(N).

The VC dimension of a neural network with threshold activation functions is O(N log N).

The VC dimension of a neural network with sigmoid activation functions is O(N²).


    Mistake Bounds

How many mistakes will the learner make in its predictions before it learns the target concept?

    Example : Find-S

Suppose H is the set of conjunctions of up to n Boolean literals and their negations.

Find-S:

Initialize h to the most specific hypothesis l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ … ∧ ln ∧ ¬ln

For each positive training instance x, remove from h any literal that is not satisfied by x

Output hypothesis h

    How Many Mistakes before Converging to Correct h?

    Once a literal is removed, it is never put back

No false positives (we started with the most restrictive h): count false negatives.

The first positive example will remove n candidate literals.

Worst case: every remaining literal is also removed later (incurring one mistake each).

Find-S makes at most n + 1 mistakes.
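The Find-S variant above, together with its mistake count, is easy to state in code. Below is a minimal Python sketch (the literal encoding, the toy data, and the function name are my own assumptions) of Find-S over conjunctions of boolean literals that tallies prediction mistakes as it goes.

```python
def find_s_mistakes(examples, n):
    """Find-S over conjunctions of boolean literals, counting prediction mistakes.

    Each example is (x, label) with x a tuple of n booleans. The hypothesis is a set
    of literals: (i, True) means "x[i] must be True", (i, False) means "x[i] must be
    False". Start with all 2n literals, i.e. the most specific hypothesis."""
    h = {(i, v) for i in range(n) for v in (True, False)}
    mistakes = 0
    for x, label in examples:
        predict = all(x[i] == v for i, v in h)       # does x satisfy the conjunction?
        if predict != label:
            mistakes += 1                            # with consistent data: false negatives only
        if label:                                    # positive example: generalize h
            h = {(i, v) for i, v in h if x[i] == v}  # drop every literal x violates
    return h, mistakes

# Toy data (illustrative assumption): target concept is x[0] AND NOT x[2], n = 3.
data = [((True, True, False), True),
        ((True, False, False), True),
        ((False, True, False), False),
        ((True, True, True), False)]
h, m = find_s_mistakes(data, n=3)
print(sorted(h), m)  # [(0, True), (2, False)] and 2 mistakes, within the n + 1 = 4 bound
```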