Neural Networks, 2nd Edition, Simon Haykin
柯博昌
Chap 3. Single-Layer Perceptrons
2
Adaptive Filtering Problem
Dynamic System. The external behavior of the system: T: {x(i), d(i); i = 1, 2, …, n, …}
where x(i) = [x1(i), x2(i), …, xm(i)]^T
x(i) can arise from: Spatial: x(i) is a snapshot of data. Temporal: x(i) is uniformly spaced in time.
[Figure: signal-flow graph of the adaptive filter]
Filtering Process: y(i) is produced in response to x(i); e(i) = d(i) - y(i).
Adaptive Process: automatic adjustment of the synaptic weights in accordance with e(i).
$$y(i) = v(i) = \sum_{k=1}^{m} w_k(i)\,x_k(i) = \mathbf{x}^T(i)\,\mathbf{w}(i)$$

where $\mathbf{w}(i) = [w_1(i), w_2(i), \ldots, w_m(i)]^T$

$$e(i) = d(i) - y(i)$$
3
Unconstrained Optimization Techniques
Let C(w) be a continuously differentiable function of some unknown weight (parameter) vector w. C(w) maps w into real numbers.
Goal: Find an optimal solution w* that satisfies C(w*) ≤ C(w), i.e., minimize C(w) with respect to w.
Necessary condition for optimality: ∇C(w*) = 0 (∇ is the gradient operator)
$$\mathbf{w} = [w_1, w_2, \ldots, w_m]^T, \qquad \nabla C = \left[\frac{\partial C}{\partial w_1}, \frac{\partial C}{\partial w_2}, \ldots, \frac{\partial C}{\partial w_m}\right]^T$$
A class of unconstrained optimization algorithms: starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), …, such that the cost function C(w) is reduced at each iteration of the algorithm.
4
Method of Steepest Descent
The successive adjustments applied to w are in the direction of steepest descent, that is, in a direction opposite to the gradient vector ∇C(w).
Let g = ∇C(w). The steepest descent algorithm: w(n+1) = w(n) - ηg(n)
η: a positive constant called the step size or learning-rate parameter. Hence Δw(n) = w(n+1) - w(n) = -ηg(n)
Small η: overdamps the transient response.
Large η: underdamps the transient response.
If η exceeds a certain value, the algorithm becomes unstable.
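As a concrete illustration, here is a minimal Python sketch of the steepest-descent update (the quadratic cost, starting point, and step size are illustrative choices, not taken from the text):

```python
import numpy as np

def steepest_descent(grad, w0, eta=0.1, n_iter=100):
    """Minimize a cost C(w) given its gradient: w(n+1) = w(n) - eta * g(n)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iter):
        w = w - eta * grad(w)          # step opposite to the gradient
    return w

# Example: C(w) = ||w||^2 has gradient g(w) = 2w and minimum at w* = 0.
w_star = steepest_descent(lambda w: 2 * w, w0=[4.0, -3.0], eta=0.1)
```

With eta = 0.1 each step shrinks w by the factor 0.8, the overdamped regime; eta above 1.0 would make this example diverge.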
5
Newton’s Method
Apply a second-order Taylor series expansion of C(w) around w(n):

$$C(\mathbf{w}(n+1)) \approx C(\mathbf{w}(n)) + \mathbf{g}^T(n)\,\Delta\mathbf{w}(n) + \frac{1}{2}\,\Delta\mathbf{w}^T(n)\,\mathbf{H}(n)\,\Delta\mathbf{w}(n)$$

where Δw(n) = w(n+1) - w(n), g(n) is the gradient, and H = ∇²C is the m-by-m Hessian matrix evaluated at w(n):

$$\mathbf{H} = \nabla^2 C = \begin{bmatrix} \dfrac{\partial^2 C}{\partial w_1^2} & \dfrac{\partial^2 C}{\partial w_1\,\partial w_2} & \cdots & \dfrac{\partial^2 C}{\partial w_1\,\partial w_m} \\ \dfrac{\partial^2 C}{\partial w_2\,\partial w_1} & \dfrac{\partial^2 C}{\partial w_2^2} & \cdots & \dfrac{\partial^2 C}{\partial w_2\,\partial w_m} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 C}{\partial w_m\,\partial w_1} & \dfrac{\partial^2 C}{\partial w_m\,\partial w_2} & \cdots & \dfrac{\partial^2 C}{\partial w_m^2} \end{bmatrix}$$
C(w) is minimized when

$$\mathbf{g}(n) + \mathbf{H}(n)\,\Delta\mathbf{w}(n) = 0 \quad\Longrightarrow\quad \Delta\mathbf{w}(n) = -\mathbf{H}^{-1}(n)\,\mathbf{g}(n)$$

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \Delta\mathbf{w}(n) = \mathbf{w}(n) - \mathbf{H}^{-1}(n)\,\mathbf{g}(n)$$
Generally speaking, Newton’s method converges quickly: each iteration minimizes the quadratic approximation of the cost function C(w) around the current point w(n).
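A minimal sketch of one Newton iteration, assuming the caller supplies the gradient and the Hessian; the quadratic example is chosen so that a single step lands exactly on the minimum:

```python
import numpy as np

def newton_step(w, grad, hess):
    """One Newton iteration: w(n+1) = w(n) - H^{-1}(n) g(n)."""
    g = grad(w)
    H = hess(w)
    # Solve H * dw = g rather than forming the inverse explicitly.
    return w - np.linalg.solve(H, g)

# Example: for C(w) = 1/2 w^T A w - b^T w with A positive definite,
# g = A w - b and H = A, so one step from any point reaches A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w = newton_step(np.zeros(2), lambda w: A @ w - b, lambda w: A)
```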
6
Gauss-Newton Method
Let

$$C(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{n} e^2(i)$$

Linearize the dependence of e(i) on w around the operating point w(n):

$$e'(i, \mathbf{w}) = e(i) + \left[\frac{\partial e(i)}{\partial \mathbf{w}}\right]^T_{\mathbf{w}=\mathbf{w}(n)} (\mathbf{w} - \mathbf{w}(n)), \qquad i = 1, 2, \ldots, n$$

In matrix form: e'(n, w) = e(n) + J(n)(w - w(n)), where e(n) = [e(1), e(2), …, e(n)]^T and J(n) is the n-by-m Jacobian matrix of e(n):

$$\mathbf{J}(n) = \begin{bmatrix} \dfrac{\partial e(1)}{\partial w_1} & \dfrac{\partial e(1)}{\partial w_2} & \cdots & \dfrac{\partial e(1)}{\partial w_m} \\ \dfrac{\partial e(2)}{\partial w_1} & \dfrac{\partial e(2)}{\partial w_2} & \cdots & \dfrac{\partial e(2)}{\partial w_m} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial e(n)}{\partial w_1} & \dfrac{\partial e(n)}{\partial w_2} & \cdots & \dfrac{\partial e(n)}{\partial w_m} \end{bmatrix}_{\mathbf{w}=\mathbf{w}(n)}$$
Gauss-Newton method is applicable to a cost function C(w) that is the sum of error squares.
The Jacobian J(n) is the transpose of the gradient matrix ∇e(n), where e(n) = [e(1), e(2), …, e(n)]^T.
Goal:

$$\mathbf{w}(n+1) = \arg\min_{\mathbf{w}} \left\{ \frac{1}{2}\,\|\mathbf{e}'(n, \mathbf{w})\|^2 \right\}$$
7
Gauss-Newton Method (Cont.)
$$\frac{1}{2}\,\|\mathbf{e}'(n,\mathbf{w})\|^2 = \frac{1}{2}\,\|\mathbf{e}(n)\|^2 + \mathbf{e}^T(n)\,\mathbf{J}(n)\,(\mathbf{w}-\mathbf{w}(n)) + \frac{1}{2}\,(\mathbf{w}-\mathbf{w}(n))^T\,\mathbf{J}^T(n)\,\mathbf{J}(n)\,(\mathbf{w}-\mathbf{w}(n))$$

(The cross term e^T(n)J(n)(w - w(n)) and its transpose are both scalars, so they are equal and combine into the single middle term.)
Differentiating this expression with respect to w and setting the result to zero:

$$\mathbf{J}^T(n)\,\mathbf{e}(n) + \mathbf{J}^T(n)\,\mathbf{J}(n)\,(\mathbf{w}-\mathbf{w}(n)) = 0$$

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \left(\mathbf{J}^T(n)\,\mathbf{J}(n)\right)^{-1}\mathbf{J}^T(n)\,\mathbf{e}(n)$$

To guard against the possibility that J(n) is rank deficient, add the diagonal term δI (δ a small positive constant):

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \left(\mathbf{J}^T(n)\,\mathbf{J}(n) + \delta\mathbf{I}\right)^{-1}\mathbf{J}^T(n)\,\mathbf{e}(n)$$
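A minimal sketch of the regularized update above; the error vector e and Jacobian J evaluated at the current w are assumed to be supplied by the caller, and delta plays the role of the guard constant δ:

```python
import numpy as np

def gauss_newton_step(w, e, J, delta=1e-6):
    """One Gauss-Newton update for C(w) = 1/2 ||e(w)||^2:
    w(n+1) = w(n) - (J^T J + delta*I)^{-1} J^T e(n),
    where delta*I guards against a rank-deficient Jacobian."""
    JtJ = J.T @ J + delta * np.eye(J.shape[1])
    return w - np.linalg.solve(JtJ, J.T @ e)
```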
8
Linear Least-Squares Filter
Characteristics of the Linear Least-Squares Filter:
– The single neuron around which it is built is linear.
– The cost function C(w) consists of the sum of error squares.
$$\mathbf{e}(n) = \mathbf{d}(n) - [\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(n)]^T\,\mathbf{w}(n) = \mathbf{d}(n) - \mathbf{X}(n)\,\mathbf{w}(n)$$

where d(n) = [d(1), d(2), …, d(n)]^T and X(n) = [x(1), x(2), …, x(n)]^T.

Differentiating e(n) with respect to w(n) gives ∇e(n) = -X^T(n), so the Jacobian is J(n) = -X(n). Substituting this into the update equation derived from the Gauss-Newton method:

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\left(\mathbf{d}(n) - \mathbf{X}(n)\,\mathbf{w}(n)\right) = \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\,\mathbf{d}(n)$$

Let X^+(n) = (X^T(n) X(n))^{-1} X^T(n) denote the pseudoinverse of X(n). Then

$$\mathbf{w}(n+1) = \mathbf{X}^{+}(n)\,\mathbf{d}(n)$$
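A brief numerical sketch of the pseudoinverse solution on synthetic data (the data matrix and true weights here are invented purely for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))      # rows are input vectors x(i)^T
w_true = np.array([0.5, -1.0, 2.0])
d = X @ w_true                          # desired responses d(i)

w = np.linalg.pinv(X) @ d               # w = X^+ d, the pseudoinverse solution
# np.linalg.lstsq(X, d, rcond=None) computes the same solution more stably.
```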
9
Wiener Filter: Limiting Form of the Linear Least-Squares Filter for an Ergodic Environment
Let w0 denote the Wiener solution to the linear optimum filtering problem.
$$\mathbf{w}_0 = \lim_{n\to\infty}\mathbf{w}(n+1) = \lim_{n\to\infty}\left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\,\mathbf{d}(n) = \left[\lim_{n\to\infty}\frac{1}{n}\,\mathbf{X}^T(n)\,\mathbf{X}(n)\right]^{-1}\lim_{n\to\infty}\frac{1}{n}\,\mathbf{X}^T(n)\,\mathbf{d}(n) = \mathbf{R}_x^{-1}\,\mathbf{r}_{xd}$$

Let R_x denote the correlation matrix of the input vector x(i):

$$\mathbf{R}_x = E\!\left[\mathbf{x}(i)\,\mathbf{x}^T(i)\right] = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}(i)\,\mathbf{x}^T(i) = \lim_{n\to\infty}\frac{1}{n}\,\mathbf{X}^T(n)\,\mathbf{X}(n)$$

Let r_xd denote the cross-correlation vector of x(i) and the desired response d(i):

$$\mathbf{r}_{xd} = E\!\left[\mathbf{x}(i)\,d(i)\right] = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}(i)\,d(i) = \lim_{n\to\infty}\frac{1}{n}\,\mathbf{X}^T(n)\,\mathbf{d}(n)$$

Ergodicity justifies replacing these ensemble averages with the corresponding time averages.
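The Wiener solution can be illustrated with sample estimates of Rx and rxd, showing the ergodic limit in action (the data are invented for the example, regenerated here so the sketch is self-contained):

```python
import numpy as np

# For an ergodic environment, time averages over n samples approximate
# the ensemble averages Rx = E[x x^T] and rxd = E[x d].
rng = np.random.default_rng(1)
n = 5000
X = rng.standard_normal((n, 3))
d = X @ np.array([0.5, -1.0, 2.0])

Rx = (X.T @ X) / n                  # (1/n) sum x(i) x(i)^T
rxd = (X.T @ d) / n                 # (1/n) sum x(i) d(i)
w0 = np.linalg.solve(Rx, rxd)       # Wiener solution Rx^{-1} rxd
```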
10
Least-Mean-Square (LMS) Algorithm

LMS is based on instantaneous values for the cost function:

$$C(\mathbf{w}) = \frac{1}{2}\,e^2(n)$$

e(n) is the error signal measured at time n.

$$\frac{\partial C(\mathbf{w})}{\partial \mathbf{w}} = -\mathbf{x}(n)\,e(n) \qquad \text{because } e(n) = d(n) - \mathbf{x}^T(n)\,\mathbf{w}(n)$$

Using ĝ(n) = -x(n)e(n) as an instantaneous estimate of the gradient vector gives the LMS update:

$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \eta\,\mathbf{x}(n)\,e(n)$$

ŵ(n) is used in place of w(n) to emphasize that LMS produces an estimate of the weight vector that would result from the method of steepest descent.

Summary of the LMS Algorithm
Training sample: input signal vector x(n); desired response d(n)
User-selected parameter: η
Initialization: set ŵ(0) = 0
Computation: for n = 1, 2, …, compute
e(n) = d(n) - ŵ^T(n) x(n)
ŵ(n+1) = ŵ(n) + η x(n) e(n)
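A minimal Python sketch of the summarized algorithm (the value of η and the row-per-sample data layout are assumptions of the example):

```python
import numpy as np

def lms(X, d, eta=0.01):
    """LMS algorithm as summarized above: X holds one input vector per row."""
    w = np.zeros(X.shape[1])        # initialization: w_hat(0) = 0
    for x, d_n in zip(X, d):
        e = d_n - w @ x             # e(n) = d(n) - w_hat^T(n) x(n)
        w = w + eta * x * e         # w_hat(n+1) = w_hat(n) + eta x(n) e(n)
    return w
```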
11
Virtues and Limitations of LMS
Virtues:
– Simplicity
Limitations:
– Slow rate of convergence
– Sensitivity to variations in the eigenstructure of the input
12
Learning Curve
13
Learning Rate Annealing
Normal approach: η(n) = η₀ for all n.
Stochastic approximation: η(n) = c/n, where c is a constant. There is a danger of parameter blowup for small n when c is large.
Search-then-converge schedule: η(n) = η₀ / (1 + n/τ), where η₀ and τ are constants.
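The three schedules side by side as simple Python functions (the values of η₀, c, and τ are illustrative, not prescribed by the text):

```python
# Iteration index n runs over 1, 2, ... as in the LMS summary.
eta0, c, tau = 0.1, 1.0, 100.0

def eta_constant(n):                 # normal approach: eta(n) = eta0
    return eta0

def eta_stochastic(n):               # stochastic approximation: eta(n) = c / n
    return c / n                     # large for small n when c is large

def eta_search_then_converge(n):     # eta(n) = eta0 / (1 + n / tau)
    return eta0 / (1.0 + n / tau)    # ~constant early, ~c/n asymptotically
```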
14
Perceptron
$$v = \sum_{i=1}^{m} w_i x_i + b$$

[Figure: signal-flow graph of the perceptron — inputs x1, x2, …, xm weighted by w1, w2, …, wm plus bias b feed a summing junction; the induced local field v passes through a hard limiter to produce the output y]
Let x0 = +1 and b = w0. Then

$$v(n) = \sum_{i=0}^{m} w_i(n)\,x_i(n) = \mathbf{w}^T(n)\,\mathbf{x}(n)$$
The perceptron is the simplest form of a neural network used for the classification of patterns said to be linearly separable.
Goal: Classify the set {x(1), x(2), …, x(n)} into one of two classes, C1 or C2.
Decision Rule: Assign x(i) to class C1 if y=+1 and to class C2 if y=-1.
w^T x > 0 for every input vector x belonging to class C1
w^T x ≤ 0 for every input vector x belonging to class C2
15
Perceptron (Cont.)
Algorithms:
1. w(n+1) = w(n) if w^T x(n) > 0 and x(n) belongs to class C1
   w(n+1) = w(n) if w^T x(n) ≤ 0 and x(n) belongs to class C2
2. w(n+1) = w(n) - η(n)x(n) if w^T x(n) > 0 and x(n) belongs to class C2
   w(n+1) = w(n) + η(n)x(n) if w^T x(n) ≤ 0 and x(n) belongs to class C1
Let

$$d(n) = \begin{cases} +1 & \text{if } \mathbf{x}(n) \text{ belongs to class } C_1 \\ -1 & \text{if } \mathbf{x}(n) \text{ belongs to class } C_2 \end{cases}$$
w(n+1) = w(n) + η[d(n) - y(n)]x(n) (error-correction learning rule form)
A smaller η provides stable weight estimates; a larger η provides fast adaptation.
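A minimal sketch of the error-correction rule on a small linearly separable problem (the data set, η, and epoch count are invented for the example):

```python
import numpy as np

def train_perceptron(X, d, eta=1.0, epochs=10):
    """Error-correction form: w(n+1) = w(n) + eta [d(n) - y(n)] x(n).
    X holds one augmented input per row (x0 = +1 absorbs the bias);
    d holds the targets +1 (class C1) or -1 (class C2)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, d_n in zip(X, d):
            y = 1.0 if w @ x > 0 else -1.0   # hard limiter
            w = w + eta * (d_n - y) * x      # no change when y == d(n)
    return w

# Example: an AND-like, linearly separable two-class problem.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1], dtype=float)
w = train_perceptron(X, d)
```

Because the classes are linearly separable, the updates stop changing w once every sample satisfies the decision rule, in line with the perceptron convergence theorem.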