
Machine Learning

Computational Learning Theory


    Introduction

    Computational learning theory

Is it possible to identify classes of learning problems that are inherently easy or difficult?

Can we characterize the number of training examples necessary or sufficient to assure successful learning?

How is this number affected if the learner is allowed to pose queries to the trainer?

Can we characterize the number of mistakes that a learner will make before learning the target function?

Can we characterize the inherent computational complexity of classes of learning problems?

General answers to all these questions are not yet known.

This chapter

Sample complexity

Computational complexity

Mistake bounds


    Probably Approximately Correct (PAC) Learning [1]

    Problem Setting for concept learning:

    X : Set of all possible instances over which target functions may be defined.

Training and testing instances are generated from X according to some unknown distribution D.

We assume that D is stationary.

C : Set of target concepts that our learner might be called upon to learn. A target concept is a Boolean function c : X → {0,1}.

    H : Set of all possible hypotheses.

Goal : Produce a hypothesis h ∈ H which is an estimate of c.

    Evaluation: Performance of h measured over new samples drawn randomly using distribution D.

    Error of a Hypothesis :

The training error (sample error) of hypothesis h with respect to target concept c and data sample S of size n is

error_S(h) ≡ (1/n) Σ_{x∈S} [c(x) ≠ h(x)]

The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:

error_D(h) ≡ Pr_{x∈D} [c(x) ≠ h(x)]

error_D(h) depends strongly on D; D(x) is the probability of presenting x.

h is approximately correct if error_D(h) ≤ ε.

(Figure: instance space X, showing the + and − regions where c and h disagree.)

error_D(h) = Σ_{x : h(x) ≠ c(x)} D(x)
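These two definitions translate directly into code. Below is a minimal Python sketch (not from the slides; the toy distribution, concept c, hypothesis h, and function names are illustrative assumptions) that computes the sample error over a finite sample S and estimates the true error by drawing fresh instances from D.

```python
import random

def sample_error(h, c, S):
    """Training (sample) error: fraction of x in S where h(x) != c(x)."""
    return sum(h(x) != c(x) for x in S) / len(S)

def estimate_true_error(h, c, draw_x, trials=100_000):
    """Monte-Carlo estimate of error_D(h) = Pr_{x ~ D}[h(x) != c(x)],
    using fresh instances drawn from D via draw_x()."""
    xs = (draw_x() for _ in range(trials))
    return sum(h(x) != c(x) for x in xs) / trials

# Toy setup (illustrative assumption): D is uniform on [0, 1),
# the target concept is c(x) = [x < 0.5], and h uses a slightly wrong threshold.
c = lambda x: x < 0.5
h = lambda x: x < 0.6
S = [random.random() for _ in range(20)]
print("sample error:", sample_error(h, c, S))
print("true error  :", estimate_true_error(h, c, random.random))  # close to 0.1
```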


    Probably Approximately Correct (PAC) Learning [2]

We are trying to characterize the number of training examples needed to learn a hypothesis h for which error_D(h) = 0.

    Problems :

There may be multiple hypotheses consistent with the training data, and the learner cannot be certain to pick the one corresponding to the target concept.

Since the training set is chosen randomly, the true error may not be zero.

To accommodate these difficulties, we weaken our demands on the learner:

We do not require zero true error; we require only that the true error be bounded by some constant ε.

We do not require that the learner succeed for every training sequence; we require only that its probability of failure be bounded by some constant δ. δ is the confidence parameter.

In short, we require only that the learner probably learns a hypothesis that is approximately correct.

    PAC Learnability :

Definition : C is PAC-learnable by L using H if for all c ∈ C, all distributions D over X, all ε such that 0 < ε < 1/2, and all δ such that 0 < δ < 1/2, learner L will, with probability at least (1 − δ), output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and |C|.

If L requires some minimum processing time per training example, then for C to be PAC-learnable by L, L must learn from a polynomial number of training examples.


    Probably Approximately Correct (PAC) Learning [3]

Sample Complexity : The growth in the number of required training examples with problem size.

The most limiting factor for the success of a learner is the limited availability of training data.

Consistent learner : A learner is consistent if it outputs hypotheses that perfectly fit the training data, whenever possible.

Our concern : Can we bound the true error of h, given the training error of h?

Definition : The version space VS_{H,D} is said to be ε-exhausted with respect to c and D if every h ∈ VS_{H,D} has true error less than ε with respect to c and D (∀h ∈ VS_{H,D} : error_D(h) < ε).

(Figure: hypothesis space H containing the version space VS_{H,D}; each hypothesis is labeled with its true error and training error r, e.g. error = 0.1, r = 0.0; error = 0.2, r = 0.0; error = 0.1, r = 0.2; error = 0.3, r = 0.1; error = 0.2, r = 0.3; error = 0.3, r = 0.4. Here r = training error and error = true error.)


    Probably Approximately Correct (PAC) Learning [4]

    Theorem [Haussler, 1988]

If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than or equal to |H| e^(−εm).

    Important Result!

This bounds the probability that any consistent learner will output a hypothesis h with error(h) ≥ ε.

We want this probability to be below a specified threshold δ:

|H| e^(−εm) ≤ δ

To achieve this, solve the inequality for m:

m ≥ (1/ε) (ln |H| + ln(1/δ))

We need to see at least this many examples.

Note that it is possible that |H| e^(−εm) > 1, in which case the bound tells us nothing.
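As a quick illustration of the bound just derived, here is a small Python sketch (the helper name and the choice of n are my own assumptions, not from the slides) that computes the smallest m satisfying m ≥ (1/ε)(ln |H| + ln(1/δ)).

```python
import math

def sample_complexity(H_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * (ln|H| + ln(1/delta)):
    enough examples to epsilon-exhaust the version space with probability >= 1 - delta."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

# Conjunctions of up to n boolean literals give |H| = 3^n (see the next slide).
print(sample_complexity(3 ** 10, eps=0.1, delta=0.05))  # 140 examples for n = 10
```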


    Probably Approximately Correct (PAC) Learning [5]

    Example :

H : conjunctions of constraints on up to n boolean attributes (n boolean literals)

|H| = 3^n, so m ≥ (1/ε)(ln 3^n + ln(1/δ)) = (1/ε)(n ln 3 + ln(1/δ))

How About EnjoySport?

H as given in EnjoySport (conjunctive concepts with don't cares)

|H| = 973

m ≥ (1/ε)(ln |H| + ln(1/δ))

Example goal: with probability 1 − δ = 95%, output a hypothesis with error_D(h) < 0.1

m ≥ (1/0.1)(ln 973 + ln(1/0.05)) ≈ 98.8 (see the check after the table below)

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
0        Sunny  Warm     Normal    Strong  Warm   Same      Yes
1        Sunny  Warm     High      Strong  Warm   Same      Yes
2        Rainy  Cold     High      Strong  Warm   Change    No
3        Sunny  Warm     High      Strong  Cool   Change    Yes
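The ≈ 98.8 figure above can be reproduced with a few lines of Python (a minimal check of the arithmetic, nothing more):

```python
import math

# EnjoySport: |H| = 973, epsilon = 0.1, delta = 0.05
m = (1 / 0.1) * (math.log(973) + math.log(1 / 0.05))
print(round(m, 1))  # 98.8, so roughly 99 training examples suffice
```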


    Probably Approximately Correct (PAC) Learning [6]

    Unbiased Learner

Recall: sample complexity bound m ≥ (1/ε)(ln |H| + ln(1/δ))

Sample complexity is not always polynomial.

Example: for an unbiased learner, |H| = 2^|X|

Suppose X consists of n booleans (binary-valued attributes)

|X| = 2^n, so |H| = 2^(2^n)

m ≥ (1/ε)(2^n ln 2 + ln(1/δ))

Sample complexity for this H is exponential in n.

Agnostic learner : A learner that makes no assumption that the target concept is representable by H and that simply finds the hypothesis with minimum training error.

How hard is this? Sample complexity: m ≥ (1/(2ε²))(ln |H| + ln(1/δ))

Derived from Hoeffding bounds: Pr[TrueError_D(h) > TrainingError_D(h) + ε] ≤ e^(−2mε²)
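To see how much the 1/ε² dependence of the agnostic bound costs relative to the consistent-learner bound, here is a short Python comparison (a sketch under the assumption of conjunctions over n = 10 boolean attributes, so |H| = 3^n; the function names are my own):

```python
import math

def m_consistent(H_size, eps, delta):
    """m >= (1/eps)(ln|H| + ln(1/delta)): bound for a consistent learner."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def m_agnostic(H_size, eps, delta):
    """m >= (1/(2 eps^2))(ln|H| + ln(1/delta)): agnostic bound via Hoeffding."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps ** 2))

H_size = 3 ** 10  # conjunctions over n = 10 boolean attributes
print(m_consistent(H_size, 0.1, 0.05))  # 140
print(m_agnostic(H_size, 0.1, 0.05))    # 700 -- the 1/eps^2 dependence is much costlier
```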


    Probably Approximately Correct (PAC) Learning [7]

    Drawbacks of sample complexity

The bound is not tight; when |H| is large, the bound on the probability may be greater than 1.

The bound cannot be applied when |H| is infinite.

Vapnik-Chervonenkis dimension (VC(H)): The VC dimension measures the complexity of hypothesis space H not by the number of distinct hypotheses |H|, but by the number of distinct instances from X that can be completely discriminated using H.

Dichotomies : A dichotomy (concept) of a set S is a partition of S into two subsets S1 and S2.

Shattering : A set of instances S is shattered by hypothesis space H if and only if for every dichotomy (concept) of S, there exists a hypothesis in H consistent with this dichotomy.

Intuition: a rich set of functions shatters a larger instance space.



Vapnik-Chervonenkis Dimension [1]

From Chapter 2, an unbiased hypothesis space is capable of representing every possible concept (dichotomy) defined over the instance space X.

An unbiased hypothesis space H can shatter the instance space X.

If H cannot shatter X, but can shatter some large subset S of X, what happens? This is captured by VC(H).

Vapnik-Chervonenkis dimension (VC(H)):

VC(H) of a hypothesis space H defined over the instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞.

For any finite H, VC(H) ≤ log2 |H| (shattering d instances requires 2^d distinct hypotheses, so 2^d ≤ |H|).
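The definitions of shattering and VC dimension can be checked mechanically for small finite classes. The following Python sketch (the brute-force function and the toy threshold class are my own illustrations, not from the slides) computes the VC dimension over a finite instance pool and confirms the VC(H) ≤ log2 |H| bound.

```python
from itertools import combinations
import math

def vc_dimension(X, H):
    """Brute-force VC dimension of a finite hypothesis class H (each h: x -> bool)
    over a finite instance pool X: size of the largest subset of X shattered by H."""
    def shattered(S):
        labelings = {tuple(h(x) for x in S) for h in H}
        return len(labelings) == 2 ** len(S)
    return max(d for d in range(len(X) + 1)
               if any(shattered(S) for S in combinations(X, d)))

# Toy finite class (illustrative assumption): thresholds h_t(x) = (x >= t).
X = [1, 2, 3, 4]
H = [lambda x, t=t: x >= t for t in range(0, 6)]
print(vc_dimension(X, H), "<=", math.log2(len(H)))  # 1 <= 2.58...
```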


Vapnik-Chervonenkis Dimension [2]

    Example :

    X = R : The set of real numbers.

H : The set of intervals on the real line of the form a < x < b, where a and b may be any real constants.

    What is VC (H)?

Let S = {3.1, 5.7}; can S be shattered by H? Yes — the four dichotomies are realized by, for example:

1 < x < 2 (contains neither point)

1 < x < 4 (contains only 3.1)

4 < x < 7 (contains only 5.7)

1 < x < 7 (contains both points)

Let S = {x0, x1, x2} (x0 < x1 < x2); can S be shattered by H?

The dichotomy that includes x0 and x2 but not x1 cannot be represented by any interval in H. Thus VC(H) = 2.
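The interval example above can be verified by brute force. The sketch below (the helper name and the candidate-endpoint trick are my own; it only checks shattering of the finite set supplied) enumerates every dichotomy of the given points and tests whether some interval a < x < b realizes it.

```python
from itertools import combinations

def interval_shatters(points):
    """Does the class {x : a < x < b} shatter this finite set of reals?
    Brute force: for every dichotomy, search for an interval that realizes it."""
    pts = sorted(points)
    # Only the ordering matters, so endpoints just outside and midway between points suffice.
    cuts = [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]
    for k in range(len(pts) + 1):
        for positive in combinations(pts, k):
            target = set(positive)
            if not any({x for x in pts if a < x < b} == target
                       for a in cuts for b in cuts):
                return False
    return True

print(interval_shatters([3.1, 5.7]))       # True:  VC(H) >= 2
print(interval_shatters([1.0, 2.0, 3.0]))  # False: the {x0, x2}-without-x1 dichotomy fails
```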


Vapnik-Chervonenkis Dimension [3]

Example :

X : The set of instances corresponding to points in the x-y plane.

H : The set of all linear decision surfaces in the plane (such as the decision function of a perceptron).

    What is VC (H)?

Collinear points cannot be shattered! So is VC(H) equal to 2 or 3? (VC(H) requires only that some set of that size be shattered, so three non-collinear points suffice.)

Hence VC(H) ≥ 3. To show VC(H) < d, we must show that no set of size d can be shattered.

In this example, no set of size 4 can be shattered, hence VC(H) = 3.

It can be shown that the VC dimension of linear decision surfaces in r-dimensional space is r + 1.

    Example :

X : The set of instances corresponding to points in the x-y plane.

H : The set of all axis-aligned rectangles in two dimensions.

    What is VC (H)?



Vapnik-Chervonenkis Dimension [4]

Example : Feed-forward neural networks with N free parameters

The VC dimension of a neural network with linear activation functions is O(N).

The VC dimension of a neural network with threshold activation functions is O(N log N).

The VC dimension of a neural network with sigmoid activation functions is O(N²).


    Mistake Bounds

How many mistakes will the learner make in its predictions before it learns the target concept?

    Example : Find-S

Suppose H is the set of conjunctions of up to n Boolean literals and their negations.

Find-S:

Initialize h to the most specific hypothesis l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ … ∧ ln ∧ ¬ln

For each positive training instance x, remove from h any literal that is not satisfied by x

Output hypothesis h

    How Many Mistakes before Converging to Correct h?

    Once a literal is removed, it is never put back

No false positives (we started with the most restrictive h): count false negatives.

The first positive example will remove n candidate literals.

Worst case: every remaining literal is also removed later (incurring one mistake each).

Find-S makes at most n + 1 mistakes.
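The Find-S variant above, together with its mistake count, is easy to state in code. Below is a minimal Python sketch (the literal encoding, the toy data, and the function name are my own assumptions) of Find-S over conjunctions of boolean literals that tallies prediction mistakes as it goes.

```python
def find_s_mistakes(examples, n):
    """Find-S over conjunctions of boolean literals, counting prediction mistakes.

    Each example is (x, label) with x a tuple of n booleans. The hypothesis is a set
    of literals: (i, True) means "x[i] must be True", (i, False) means "x[i] must be
    False". Start with all 2n literals, i.e. the most specific hypothesis."""
    h = {(i, v) for i in range(n) for v in (True, False)}
    mistakes = 0
    for x, label in examples:
        predict = all(x[i] == v for i, v in h)       # does x satisfy the conjunction?
        if predict != label:
            mistakes += 1                            # with consistent data: false negatives only
        if label:                                    # positive example: generalize h
            h = {(i, v) for i, v in h if x[i] == v}  # drop every literal x violates
    return h, mistakes

# Toy data (illustrative assumption): target concept is x[0] AND NOT x[2], n = 3.
data = [((True, True, False), True),
        ((True, False, False), True),
        ((False, True, False), False),
        ((True, True, True), False)]
h, m = find_s_mistakes(data, n=3)
print(sorted(h), m)  # [(0, True), (2, False)] and 2 mistakes, within the n + 1 = 4 bound
```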