Perceptron - Kangwon


Page 1: Perceptron - Kangwon

Perceptron

Some slides from CS546 …

Page 2: Perceptron - Kangwon

2

Linear Functions

f(x) = 1 if w1 x1 + w2 x2 + … + wn xn ≥ θ, and 0 otherwise

•  Disjunctions: y = x1 ∨ x3 ∨ x5
   y = (1·x1 + 1·x3 + 1·x5 ≥ 1)

•  At least m of n: y = at least 2 of {x1, x3, x5}
   y = (1·x1 + 1·x3 + 1·x5 ≥ 2)

•  Exclusive-OR: y = (x1 ∧ ¬x2) ∨ (¬x1 ∧ x2)  (not linearly separable)

•  Non-trivial DNF: y = (x1 ∧ x2) ∨ (x3 ∧ x4)  (not linearly separable)
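Not from the slides: a minimal Python sketch of a linear threshold unit computing the first two examples above; the weights and thresholds simply restate the inequalities.

```python
def ltu(x, w, theta):
    """Linear threshold unit: 1 if w·x >= theta, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

# Disjunction y = x1 ∨ x3 ∨ x5 over inputs (x1, x3, x5): all weights 1, threshold 1
print(ltu((1, 0, 0), w=(1, 1, 1), theta=1))  # 1
# "At least 2 of {x1, x3, x5}": same weights, threshold 2
print(ltu((1, 0, 1), w=(1, 1, 1), theta=2))  # 1
print(ltu((1, 0, 0), w=(1, 1, 1), theta=2))  # 0
```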

Page 3: Perceptron - Kangwon

3

Linear Functions

[Figure: positive and negative (-) examples in the plane, with the hyperplanes w • x = 0 and w • x = θ.]

Page 4: Perceptron - Kangwon

Some Biology

•  Very loose inspiration: human neurons

Page 5: Perceptron - Kangwon

Perceptrons abstract from the details of real neurons

•  Conductivity delays are neglected
•  An output signal is either discrete (e.g., 0 or 1) or a real-valued number (e.g., between 0 and 1)
•  Net input is calculated as the weighted sum of the input signals
•  Net input is transformed into an output signal via a simple function (e.g., a threshold function)

Page 6: Perceptron - Kangwon

Different Activation Functions

•  Threshold Activation Function (step)
•  Piecewise Linear Activation Function
•  Sigmoid Activation Function
•  Gaussian Activation Function
   –  Radial Basis Function

Bias unit: x0 = 1
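A small illustrative sketch (not part of the slides) of the four activation families listed above; the particular slopes, output ranges, and the Gaussian width sigma are assumptions.

```python
import math

def step(z, theta=0.0):
    """Threshold (step) activation: 1 if z >= theta, else 0."""
    return 1.0 if z >= theta else 0.0

def piecewise_linear(z):
    """Piecewise linear activation, clipped to the range [0, 1]."""
    return min(1.0, max(0.0, z))

def sigmoid(z):
    """Sigmoid (logistic) activation."""
    return 1.0 / (1.0 + math.exp(-z))

def gaussian(z, sigma=1.0):
    """Gaussian (radial basis) activation centered at 0."""
    return math.exp(-(z * z) / (2.0 * sigma * sigma))
```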

Page 7: Perceptron - Kangwon

Types of Activation functions

Page 8: Perceptron - Kangwon

The Perceptron

[Figure: input features feed a weighted sum whose output goes through an LTU or a sigmoid.]

Page 9: Perceptron - Kangwon

The Binary Perceptron

•  Inputs are features
•  Each feature has a weight
•  The sum is the activation

•  If the activation is:
   –  Positive, output 1
   –  Negative, output 0

[Figure: features f1, f2, f3 with weights w1, w2, w3 feed a sum Σ followed by a ">0?" test.]
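A minimal sketch (mine, not the slides') of this prediction rule; the feature and weight values are made up for illustration.

```python
def perceptron_predict(features, weights):
    """Binary perceptron: output 1 if the activation (weighted sum) is positive, else 0."""
    activation = sum(w * f for w, f in zip(weights, features))
    return 1 if activation > 0 else 0

print(perceptron_predict([1.0, 0.5, -2.0], [0.2, 0.4, 0.1]))  # 1 (activation = 0.2)
```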

Page 10: Perceptron - Kangwon

10

Perceptron learning rule

•  On-line, mistake-driven algorithm.
•  Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change weights and learn to produce the output using the Perceptron learning rule.

Perceptron == Linear Threshold Unit

[Figure: inputs x1, …, x6 with weights w1, …, w6 feed a summation unit with threshold T and output y.]

ŷ = Σi wi xi = wᵀx

Page 11: Perceptron - Kangwon

11

Perceptron learning rule

•  We learn f: X → {-1, +1}, represented as f = sgn(w • x), where X = {0,1}ⁿ or X = Rⁿ and w ∈ Rⁿ.
•  Given labeled examples {(x1, y1), (x2, y2), …, (xm, ym)}:

1.  Initialize w = 0 ∈ Rⁿ
2.  Cycle through all examples:
    a.  Predict the label of instance x to be y' = sgn(w • x)
    b.  If y' ≠ y, update the weight vector: w = w + r y x (r is a constant, the learning rate).
        Otherwise, if y' = y, leave the weights unchanged.
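A compact sketch of the rule above (my code, not the slides'). Labels are in {-1, +1} and r is the learning rate; the toy data is separable by a hyperplane through the origin, since this version has no threshold (see the footnote on the next slide).

```python
def train_perceptron(examples, n_features, r=1.0, epochs=10):
    """Online perceptron: cycle through examples, update the weights only on mistakes."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in examples:  # y in {-1, +1}
            y_pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
            if y_pred != y:
                w = [wi + r * y * xi for wi, xi in zip(w, x)]  # w = w + r*y*x
    return w

# Toy data separable through the origin (e.g., by w = (1, -1))
data = [((1, 0), +1), ((0, 1), -1), ((2, 1), +1), ((1, 3), -1)]
print(train_perceptron(data, n_features=2))  # [3.0, -2.0] separates the data
```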

Page 12: Perceptron - Kangwon

12

Footnote About the Threshold

•  On the previous slide, the Perceptron has no threshold
•  But we don’t lose generality:

   w ⇔ (w, −θ),   x ⇔ (x, 1) for all x

   so that w • x = θ becomes (w, −θ) • (x, 1) = 0

[Figure: the separating line w • x = θ becomes a line through the origin in the augmented (x, x0) space.]
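A one-line sketch (mine) of the transformation above: augment every input with a constant feature 1 and fold −θ into the weight vector.

```python
def augment(x):
    """Append the constant feature x0 = 1 so the threshold can live in the weights."""
    return list(x) + [1.0]

def fold_threshold(w, theta):
    """Replace (w, theta) by the augmented weight vector (w, -theta)."""
    return list(w) + [-theta]

w, theta, x = [2.0, -1.0], 0.5, [1.0, 1.0]
lhs = sum(wi * xi for wi, xi in zip(w, x)) - theta                           # w·x - θ
rhs = sum(wi * xi for wi, xi in zip(fold_threshold(w, theta), augment(x)))   # (w,-θ)·(x,1)
print(lhs == rhs)  # True
```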

Page 13: Perceptron - Kangwon

13

Geometric View

Page 14: Perceptron - Kangwon

14

Page 15: Perceptron - Kangwon

15

Page 16: Perceptron - Kangwon

16

Page 17: Perceptron - Kangwon

Deriving the delta rule

•  Define the error as the squared residuals summed over all training cases:

•  Now differentiate to get error derivatives for weights

•  The batch delta rule changes the weights in proportion to their error derivatives summed over all training cases

E = ½ Σn (yn − ŷn)²

∂E/∂wi = ½ Σn (∂ŷn/∂wi)(∂En/∂ŷn) = − Σn xi,n (yn − ŷn)

Δwi = −ε ∂E/∂wi
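A sketch of the batch delta rule above (my own code; ε and the data are illustrative). It accumulates the error derivative over all training cases, then changes each weight in proportion to it.

```python
def delta_rule_epoch(examples, w, eps=0.1):
    """One batch delta-rule step: dE/dw_i = -sum_n x_{i,n}(y_n - yhat_n); then w_i -= eps * dE/dw_i."""
    grad = [0.0] * len(w)
    for x, y in examples:
        y_hat = sum(wi * xi for wi, xi in zip(w, x))   # linear output
        for i, xi in enumerate(x):
            grad[i] += -xi * (y - y_hat)               # dE/dw_i accumulated over cases
    return [wi - eps * gi for wi, gi in zip(w, grad)]  # delta w = -eps * dE/dw

w = [0.0, 0.0]
data = [((1.0, 0.0), 1.0), ((0.0, 1.0), -1.0)]
for _ in range(50):
    w = delta_rule_epoch(data, w)
print(w)  # approaches [1.0, -1.0]
```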

Page 18: Perceptron - Kangwon

18

Perceptron Learnability

•  Obviously can’t learn what it can’t represent
   –  Only linearly separable functions

•  Minsky and Papert (1969) wrote an influential book demonstrating the Perceptron’s representational limitations
   –  Parity functions can’t be learned (XOR)
   –  In vision, if patterns are represented with local features, can’t represent symmetry, connectivity

•  Research on Neural Networks stopped for years

•  Rosenblatt himself (1959) asked, “What pattern recognition problems can be transformed so as to become linearly separable?”

Page 19: Perceptron - Kangwon

19

Perceptron Convergence

•  Perceptron Convergence Theorem: If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the perceptron learning algorithm will converge.
   –  How long would it take to converge?

•  Perceptron Cycling Theorem: If the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop.
   –  How can we provide robustness and more expressivity?

Page 20: Perceptron - Kangwon

20

Perceptron: Mistake Bound Theorem

•  Maintains a weight vector w ∈ Rᴺ, w0 = (0, …, 0).
•  Upon receiving an example x ∈ Rᴺ, predicts according to the linear threshold function w • x ≥ 0.

Theorem [Novikoff, 1963]: Let (x1, y1), …, (xt, yt) be a sequence of labeled examples with xi ∈ Rᴺ, ||xi|| ≤ R and yi ∈ {-1, +1} for all i. Let u ∈ Rᴺ, γ > 0 be such that ||u|| = 1 and yi u • xi ≥ γ for all i. Then the Perceptron makes at most R² / γ² mistakes on this example sequence. (See additional notes.)

Here γ is the margin and R² / γ² is the complexity parameter.
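An illustrative sketch (mine) that counts perceptron mistakes on a separable toy stream and compares the count against R² / γ²; the data and the unit separator u are assumptions chosen for the demo.

```python
import math

def count_mistakes(stream):
    """Run the perceptron on a labeled stream and count its mistakes."""
    w = [0.0] * len(stream[0][0])
    mistakes = 0
    for x, y in stream:
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # mistake (score 0 counts as one)
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    return mistakes

# Toy stream separable by u = (1/sqrt(2), -1/sqrt(2)) with margin gamma
stream = [((1.0, 0.0), +1), ((0.0, 1.0), -1), ((2.0, 0.5), +1), ((0.5, 2.0), -1)] * 5
R = max(math.hypot(*x) for x, _ in stream)
u = (1 / math.sqrt(2), -1 / math.sqrt(2))
gamma = min(y * (u[0] * x[0] + u[1] * x[1]) for x, y in stream)
print(count_mistakes(stream), "mistakes; bound R^2/gamma^2 =", (R / gamma) ** 2)
```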

Page 21: Perceptron - Kangwon

21

Perceptron: Mistake Bound Proof

Let vk be the hypothesis before the k-th mistake. Assume that the k-th mistake occurs on the input example (xi, yi).

Assumptions: v1 = 0, ||u|| ≤ 1, yi u • xi ≥ γ.

[Proof steps shown on the slide: multiply the update by u; apply the definition of u; proceed by induction; compare with the projection/norm of vk. Conclusion: K < R² / γ².]

Page 22: Perceptron - Kangwon

22

Dual Perceptron

Page 23: Perceptron - Kangwon

23

Dual Perceptron

-  We can replace xi • xj with K(xi, xj), which can be regarded as a dot product in some large (or infinite-dimensional) space.

-  K(x, y) can often be computed efficiently without computing the mapping into this space.
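A sketch of a kernelized (dual) perceptron (my code, not the slides'): the hypothesis is kept as mistake counts on training examples, and only K(xi, xj) is ever evaluated. The polynomial kernel here is an illustrative choice.

```python
def poly_kernel(x, z, d=2):
    """Polynomial kernel K(x, z) = (1 + x·z)^d, an implicit dot product in a larger space."""
    return (1.0 + sum(xi * zi for xi, zi in zip(x, z))) ** d

def train_dual_perceptron(examples, kernel, epochs=20):
    """Dual perceptron: keep a mistake count alpha_i per example instead of an explicit w."""
    alpha = [0] * len(examples)
    for _ in range(epochs):
        for j, (xj, yj) in enumerate(examples):
            score = sum(a * yi * kernel(xi, xj)
                        for a, (xi, yi) in zip(alpha, examples) if a)
            if yj * score <= 0:          # mistake: remember this example
                alpha[j] += 1
    return alpha

def predict(x, alpha, examples, kernel):
    score = sum(a * yi * kernel(xi, x) for a, (xi, yi) in zip(alpha, examples) if a)
    return 1 if score > 0 else -1

# XOR-like data becomes separable under the polynomial kernel
data = [((0.0, 0.0), -1), ((1.0, 1.0), -1), ((1.0, 0.0), +1), ((0.0, 1.0), +1)]
alpha = train_dual_perceptron(data, poly_kernel)
print([predict(x, alpha, data, poly_kernel) for x, _ in data])  # [-1, -1, 1, 1]
```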

Page 24: Perceptron - Kangwon

24

Efficiency

•  Dominated by the size of the feature space

•  Most features are functions (e.g., conjunctions) of raw attributes:
   X = (x1, x2, x3, …, xk)  →  (χ1(X), χ2(X), χ3(X), …, χn(X)),   n >> k

•  Additive algorithms allow the use of kernels; no need to explicitly generate the complex features:
   f(x) = Σi ci K(x, xi)

•  Could be more efficient, since the work is done in the original feature space.

•  In practice: explicit kernels (feature-space blow-up) are often more efficient.

Page 25: Perceptron - Kangwon

Voted-Perceptron

Which vi should we use?

Maybe the last one? Here it’s never gotten any test cases right! (Experimentally, the classifiers move around a lot.)

Maybe the “best one”? But we “improved” it with later mistakes…

Page 26: Perceptron - Kangwon

Voted-Perceptron

Idea two: keep around intermediate hypotheses, and have them “vote” [Freund and Schapire, 1998]

n = 1; w1 = 0; c1 = 0
for k = 1 to K:
    for i = 1 to m:
        if (xi, yi) is misclassified:
            wn+1 = wn + yi xi
            cn+1 = 1
            n = n + 1
        else:
            cn = cn + 1

At the end, a collection of linear separators w0, w1, w2, …, along with survival times: cn = amount of time that wn survived.
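A runnable sketch of this procedure (my translation of the pseudocode above; labels in {-1, +1}), including the weighted-majority prediction described on the next slide.

```python
def train_voted_perceptron(examples, K=10):
    """Voted perceptron: keep every intermediate hypothesis w_n with its survival count c_n."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    c = 0
    hypotheses = []                                             # list of (w_n, c_n)
    for _ in range(K):
        for x, y in examples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # misclassified
                hypotheses.append((w, c))
                w = [wi + y * xi for wi, xi in zip(w, x)]       # w_{n+1} = w_n + y x
                c = 1
            else:
                c += 1
    hypotheses.append((w, c))
    return hypotheses

def vote(x, hypotheses):
    """Weighted majority vote: sign of sum_n c_n * sign(w_n · x)."""
    s = sum(c * (1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1)
            for w, c in hypotheses)
    return 1 if s > 0 else -1
```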

Page 27: Perceptron - Kangwon

Voted-Perceptron – cont’d

Idea two: keep around intermediate hypotheses, and have them “vote” [Freund and Schapire, 1998]

At the end, a collection of linear separators w0, w1, w2, …, along with survival times: cn = amount of time that wn survived.

This cn is a good measure of the reliability of wn. To classify a test point x, use a weighted majority vote:

   ŷ = sign( Σn cn · sign(wn • x) )

Page 28: Perceptron - Kangwon

Voted-Perceptron – cont’d

Problem: need to keep around a lot of wn vectors.

Solutions:
(i)  Find “representatives”
(ii)  Alternative prediction rule: predict with wavg
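A sketch (mine) of the wavg alternative: instead of storing every wn, collapse the (wn, cn) pairs produced by the voted-perceptron sketch above into a single survival-weighted average vector. The weighting is my reading of the "Averaged Perceptron" described on a later slide.

```python
def averaged_weights(hypotheses):
    """Collapse (w_n, c_n) pairs into a single vector w_avg = sum_n c_n * w_n / sum_n c_n."""
    dim = len(hypotheses[0][0])
    total = sum(c for _, c in hypotheses)
    w_avg = [0.0] * dim
    for w, c in hypotheses:
        for i, wi in enumerate(w):
            w_avg[i] += c * wi
    return [wi / total for wi in w_avg]

def predict_avg(x, w_avg):
    """Single linear prediction with the averaged weight vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w_avg, x)) > 0 else -1
```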

Page 29: Perceptron - Kangwon

From Freund & Schapire, 1998: Classifying digits with VP

Page 30: Perceptron - Kangwon

30

Extensions: Regularization

•  In general, regularization is used to bias the learner toward a low-expressivity (low VC-dimension) separator.

•  The two most important extensions of the Perceptron turn out to be the Averaged Perceptron and the Thick Separator.

•  Averaged Perceptron
   –  Returns a weighted average of a number of earlier hypotheses;
   –  The weights are a function of the length of the no-mistake stretch.

Page 31: Perceptron - Kangwon

31

Regularization: Thick Separator

•  Thick Separator (Perceptron)

   –  Promote if: w • x > θ + γ
   –  Demote if: w • x < θ − γ

[Figure: positive and negative examples with the hyperplanes w • x = 0 and w • x = θ; the separator is given a thickness of γ on each side.]
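A hedged sketch of a margin ("thick separator") update in the spirit of the rule above. The exact promote/demote conditions here are one common formulation, conditioned on the true label, and are an assumption rather than necessarily the slides' precise variant; θ is kept fixed for simplicity.

```python
def thick_separator_update(w, theta, x, y, gamma, r=1.0):
    """Update unless the example is already on the correct side of the threshold by more than gamma."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - theta
    if y == +1 and score <= gamma:      # promote: not confidently positive yet
        w = [wi + r * xi for wi, xi in zip(w, x)]
    elif y == -1 and score >= -gamma:   # demote: not confidently negative yet
        w = [wi - r * xi for wi, xi in zip(w, x)]
    return w
```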

Page 32: Perceptron - Kangwon

Multiclass Classification

What if there are k classes?

[Figure: data points from three classes, labeled 1, 2, 3, in the plane.]

Reduce to binary: all-vs-one
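A sketch (mine) of one standard way to handle k classes with perceptron-style updates: keep one weight vector per class, predict the class with the highest score, and on a mistake promote the true class and demote the predicted one. This is the usual multiclass perceptron; the slide's "all-vs-one" reduction (independent binary classifiers) is a closely related alternative.

```python
def train_multiclass_perceptron(examples, num_classes, dim, epochs=10):
    """One weight vector per class; on a mistake, promote the true class and demote the predicted one."""
    W = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for x, y in examples:                          # y is a class index 0..num_classes-1
            scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
            y_pred = max(range(num_classes), key=lambda c: scores[c])
            if y_pred != y:
                W[y] = [wi + xi for wi, xi in zip(W[y], x)]
                W[y_pred] = [wi - xi for wi, xi in zip(W[y_pred], x)]
    return W
```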

Page 33: Perceptron - Kangwon

Winnow Algorithm

33

Page 34: Perceptron - Kangwon

34

SNoW

•  A learning architecture that supports several linear update rules (Perceptron, Winnow, naïve Bayes)
•  Allows regularization; voted Winnow/Perceptron; pruning; many options
•  True multi-class classification
•  Variable-size examples; very good support for large-scale domains in terms of the number of examples and the number of features
•  “Explicit” kernels (blowing up the feature space)
•  Very efficient (1–2 orders of magnitude faster than SVMs)
•  Stand-alone, implemented in LBJ

[Download from: http://L2R.cs.uiuc.edu/~cogcomp ]

Page 35: Perceptron - Kangwon

35

Passive-Aggressive: Motivation

•  Perceptron: no guarantees of margin after the update

•  PA: enforce a minimal non-zero margin after the update

•  In particular:
   §  If the margin is large enough, then do nothing
   §  If the margin is less than unit, update such that the margin after the update is enforced to be unit

Page 36: Perceptron - Kangwon

36

Aggressive Update Step

•  Set the new weight vector to be the solution of the following optimization problem [equation shown on the slide]

•  Closed-form update [equation shown on the slide]
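The slide's equations are images and did not survive extraction. For reference, a sketch of the standard Passive-Aggressive closed-form update (after Crammer et al.), which matches the motivation on the previous slide; treat the exact form here as an assumption about what the slide shows.

```python
def pa_update(w, x, y):
    """Passive-Aggressive update: if the margin y*(w·x) < 1, move w just enough to make it 1."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)                      # hinge loss
    if loss == 0.0:
        return w                                       # passive: margin already >= 1
    tau = loss / sum(xi * xi for xi in x)              # aggressive: step size tau = loss / ||x||^2
    return [wi + tau * y * xi for wi, xi in zip(w, x)]
```

After this step the margin on (x, y) is exactly 1, since y*(w'·x) = y*(w·x) + tau*||x||².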

Page 37: Perceptron - Kangwon

37

Passive-Aggressive Update

Page 38: Perceptron - Kangwon

Online Passive-Aggressive Algorithms

38

Page 39: Perceptron - Kangwon

Online Passive-Aggressive Algorithms – cont’d

39

Page 40: Perceptron - Kangwon

40

Unrealizable Case