TRANSCRIPT

Gradient Descent Learning of Neural Networks - Focusing on Information Geometry Theory and the Natural Gradient -
Hyeyoung Park, Lab. for Mathematical Neuroscience, Brain Science Institute, RIKEN, Japan
Korean Information Science Society (KISS) Spring Conference Tutorial, 2001-04-28
Overview (1/2)
- Introduction: feed-forward neural networks; learning of neural networks; the plateau problem
- Geometrical approach to learning: geometry of neural networks; information geometry; information geometry for neural networks
- Natural gradient: superiority of the natural gradient; the natural gradient and plateaus; problems of natural gradient learning
Overview (2/2)
- Adaptive natural gradient learning (ANGL): basic formula; ANGL for regression problems; ANGL for classification problems; computational experiments
- Comparison with second order methods: Newton method; Gauss-Newton method and Levenberg-Marquardt method; natural gradient vs. Gauss-Newton method
- Conclusions
Feed Forward Neural Networks

A network model (figure: input nodes $x_1, \dots, x_i, \dots, x_N$; hidden nodes $z_1, \dots, z_M$; output nodes $y_1, \dots, y_k, \dots, y_L$; input-to-hidden weights $w_{ij}$ with biases $b_j$; hidden-to-output weights $v_{jk}$ with biases $b_k^o$):

$$z_j = h\!\left(\sum_{i=1}^{N} w_{ij}\, x_i + b_j\right), \qquad j = 1, 2, \dots, M$$

$$y_k = f_k(\mathbf{x}, \boldsymbol{\theta}) = \varphi\!\left(\sum_{j=1}^{M} v_{jk}\, z_j + b_k^{o}\right), \qquad k = 1, 2, \dots, L$$

where $h$ and $\varphi$ are activation functions ($\varphi$ may be the identity for regression). In vector form, $\mathbf{y} = \mathbf{f}(\mathbf{x}, \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ collects all weights and biases.
Learning of Neural Networks

Data set: $D = \{(\mathbf{x}_n, \mathbf{y}_n^*)\}_{n=1}^{N}$

Error function $e(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta})$:
- Squared error function: $s(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta}) = \|\mathbf{y} - \mathbf{f}(\mathbf{x}, \boldsymbol{\theta})\|^2$
- Negative log likelihood: $l(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta}) = -\log p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta})$
- Training error: $E(\boldsymbol{\theta}) = \frac{1}{N}\sum_{n=1}^{N} e(\mathbf{x}_n, \mathbf{y}_n^*, \boldsymbol{\theta})$

Learning (gradient descent learning):
- Goal: find an optimal parameter $\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} E(\boldsymbol{\theta})$
- Search for an estimate $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}^*$ step by step:
  - on-line mode: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \,\partial e(\mathbf{x}_t, \mathbf{y}_t^*, \boldsymbol{\theta}_t) / \partial \boldsymbol{\theta}$
  - batch mode: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \,\partial E(\boldsymbol{\theta}_t) / \partial \boldsymbol{\theta}$

(A minimal sketch of both modes follows below.)
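To make the two update modes concrete, here is a minimal sketch for a one-hidden-layer network with squared error. The network shape, the tanh activation, and the learning rate are illustrative assumptions, not values from the tutorial.

```python
import numpy as np

def forward(theta, x):
    """One-hidden-layer network y = f(x, theta): z = tanh(W x + b), y = V z + c."""
    W, b, V, c = theta
    z = np.tanh(W @ x + b)
    return V @ z + c, z

def grad_squared_error(theta, x, y_star):
    """Gradient of e = ||y* - f(x, theta)||^2 w.r.t. each parameter block."""
    W, b, V, c = theta
    y, z = forward(theta, x)
    d_y = -2.0 * (y_star - y)          # de/dy
    d_z = (V.T @ d_y) * (1 - z**2)     # backpropagate through tanh
    return [np.outer(d_z, x), d_z, np.outer(d_y, z), d_y]

def online_step(theta, x, y_star, eta):
    """On-line mode: update after every single example (x, y*)."""
    return [p - eta * g for p, g in zip(theta, grad_squared_error(theta, x, y_star))]

def batch_step(theta, X, Y, eta):
    """Batch mode: update with the gradient of the mean training error E(theta)."""
    grads = [grad_squared_error(theta, x, y) for x, y in zip(X, Y)]
    mean = [sum(gs) / len(grads) for gs in zip(*grads)]
    return [p - eta * g for p, g in zip(theta, mean)]
```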
Plateau Problem

(Figure: typical learning curve of a neural network, with long flat plateaus between drops in the error.)

Plateaus make learning extremely slow.
Why Do Plateaus Appear? (1/2) [Saad and Solla 1995]
- Analyzed the dynamics of the parameters during learning using statistical mechanics.
- In the early stage of learning, the network is drawn into a suboptimal symmetric phase, in which all hidden nodes have the same weight values.
- Breaking this symmetry takes the dominant share of learning time → plateau.
(Figure: a network with inputs $x_1, \dots, x_N$ and hidden nodes $z_1, z_2, z_3$; the learning trajectory in weight space $(\mathbf{w}_1, \mathbf{w}_2, \mathbf{w}_3)$ runs from the starting point through the suboptimal symmetric phase before reaching the optimal point.)
Why Do Plateaus Appear? (2/2) [Fukumizu and Amari 1999]
- The parameter spaces of smaller networks are subspaces of those of larger networks → critical subspace.
- Global/local minima of a smaller network can be local minima or saddle points of the larger network.
- These saddle points are a main cause of plateaus.
(Figure: a network with one hidden unit and parameters $(w, v)$ can be embedded in a network with two hidden units and parameters $(w_1, v_1, w_2, v_2)$ in several ways, e.g. $w_1 = w,\ v_1 = v,\ v_2 = 0$, or $w_1 = w_2 = w,\ v_1 + v_2 = v$; these embeddings make up the critical subspace.)
Hierarchical Structure of the Space of NN

(Figure: the space of the smaller network sits inside the space of the larger network as the critical subspace; a minimum of the smaller network appears there as saddle points or local minima of the larger network.)
Geometrical Structure of the Neural Manifold
- Which is the fastest way to the optimal point?
- How can an efficient path be found?

Neural manifold: the error surface over the parameter space.
Information Geometry

The study of spaces of probability density functions specified by a parameter $\boldsymbol{\theta}$: $p(\mathbf{x}; \boldsymbol{\theta})$.

Basic characteristics:
- A Riemannian space: a local metric is needed to measure distance.
- The corresponding metric is given by the Fisher information matrix:
$$d^2\big(p(\boldsymbol{\theta}),\, p(\boldsymbol{\theta} + d\boldsymbol{\theta})\big) = d\boldsymbol{\theta}^T G(\boldsymbol{\theta})\, d\boldsymbol{\theta}$$
- The steepest descent direction of a function $e(\boldsymbol{\theta})$ on this space is given by the natural gradient:
$$\tilde{\nabla} e(\boldsymbol{\theta}) = G^{-1}(\boldsymbol{\theta})\, \nabla e(\boldsymbol{\theta})$$
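As a standard worked example (not from the slides): for the univariate Gaussian family the Fisher metric can be computed in closed form.

```latex
% Fisher information metric of the Gaussian family p(x; \theta), \theta = (\mu, \sigma),
% with G_{ab}(\theta) = E[\partial_a \log p \; \partial_b \log p]:
\[
\log p(x; \mu, \sigma) = -\frac{(x-\mu)^2}{2\sigma^2} - \log\sigma - \tfrac{1}{2}\log 2\pi,
\qquad
G(\mu, \sigma) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix}.
\]
% Hence d^2 = (d\mu^2 + 2\,d\sigma^2)/\sigma^2: the same parameter step d\mu is a
% larger move in distribution space when \sigma is small.
```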
An Example of Riemannian Space

Curved space and locally Euclidean space: consider polar coordinates $\boldsymbol{\xi} = (\xi_1, \xi_2)$ mapped to the plane by $\mathbf{x}(\boldsymbol{\xi}) = (x_1, x_2) = (\xi_1 \cos \xi_2,\ \xi_1 \sin \xi_2)$.

Metric for the space:
- In Cartesian coordinates the metric is Euclidean:
$$G(\mathbf{x}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad d^2(\mathbf{x}, \mathbf{x} + d\mathbf{x}) = (dx_1)^2 + (dx_2)^2$$
- In polar coordinates, writing $J$ for the Jacobian of $\mathbf{x}(\boldsymbol{\xi})$ (whose entries are the $\sin$/$\cos$ terms of the map above),
$$d^2 = d\mathbf{x}^T d\mathbf{x} = d\boldsymbol{\xi}^T J^T J\, d\boldsymbol{\xi} = d\boldsymbol{\xi}^T G(\boldsymbol{\xi})\, d\boldsymbol{\xi}, \qquad G(\boldsymbol{\xi}) = \begin{pmatrix} 1 & 0 \\ 0 & \xi_1^2 \end{pmatrix}$$
so $d^2 = (d\xi_1)^2 + \xi_1^2 (d\xi_2)^2$: the same coordinate step corresponds to different distances at different points of the space.
Information Geometry for Neural Networks

Stochastic neural networks: a neural network can be considered as a probability density function,
$$p(\mathbf{x}, \mathbf{y}; \boldsymbol{\theta}) = q(\mathbf{x})\, p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta})$$

Gradient in the space of neural networks:
$$\tilde{\nabla} e(\boldsymbol{\theta}) = G^{-1}(\boldsymbol{\theta})\, \nabla e(\boldsymbol{\theta})$$

Natural gradient learning:
$$\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n - \eta_n\, \tilde{\nabla} e(\boldsymbol{\theta}_n)$$
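A minimal sketch of the natural gradient update, assuming the Fisher matrix and the ordinary gradient are supplied as functions (fisher_matrix and grad_error are hypothetical placeholders; later slides discuss how G is actually estimated):

```python
import numpy as np

def natural_gradient_step(theta, x, y_star, eta, fisher_matrix, grad_error):
    """theta_{t+1} = theta_t - eta * G^{-1}(theta_t) grad e(x, y*, theta_t).

    fisher_matrix(theta)          -> (P, P) Fisher information matrix G(theta)
    grad_error(theta, x, y_star)  -> (P,) ordinary gradient of the loss
    """
    G = fisher_matrix(theta)
    g = grad_error(theta, x, y_star)
    # Solve G d = g instead of forming G^{-1} explicitly (cheaper, more stable).
    d = np.linalg.solve(G, g)
    return theta - eta * d
```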
Why Natural Gradient? - Related Researches -

[Amari 1998]
- Showed that the natural gradient gives the steepest descent direction of a loss function at an arbitrary point of the manifold of probability distributions.
- Natural gradient learning achieves the best asymptotic performance that any unbiased learning algorithm can achieve.

[Park et al. 1999]
- Suggested the possibility of avoiding plateaus, or quickly escaping from them, using natural gradient learning.
- Showed experimental evidence of avoiding plateaus.

[Rattray and Saad, 1999]
- Confirmed the possibility of avoiding plateaus through statistical-mechanical analysis.
Why Natural Gradient? - Intuitive Explanations -

Consider the movement of the parameter around the critical subspace.

Standard gradient descent learning method:
$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) - \eta_t\, \frac{\partial e(\boldsymbol{\theta}(t))}{\partial \mathbf{w}_i}, \qquad \frac{\partial e}{\partial \mathbf{w}_i} = \{y - f(\mathbf{x}, \boldsymbol{\theta})\}\, v_i\, \varphi'(\mathbf{w}_i \cdot \mathbf{x})\, \mathbf{x}$$
If $\mathbf{w}_i = \mathbf{w}_j$, then $\partial e / \partial \mathbf{w}_i = \partial e / \partial \mathbf{w}_j$: the updates preserve the hidden-node symmetry, so the parameter cannot leave the critical subspace quickly → plateau.

Natural gradient method:
$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) - \eta_t \big[ G^{-1}(\boldsymbol{\theta}(t))\, \nabla e(\boldsymbol{\theta}(t)) \big]_i$$
If $\mathbf{w}_i = \mathbf{w}_j$, then $|G(\boldsymbol{\theta})| \to 0$: the Fisher information matrix becomes singular on the critical subspace, so near it $G^{-1}$ strongly magnifies the update and the parameter is pushed away from the subspace.
Why Natural Gradient? - Experimental Evidence - (1/3)

Toy model with a 2-dimensional parameter space:
- Model assumptions: input x ~ N(0, I), noise ~ N(0, 0.1)
- The number of parameters is reduced to two, $(\theta_1, \theta_2)$.
- Training data: generated by a teacher network of the same structure with true parameter $(\theta_1^*, \theta_2^*)$.

(Figure: the parameter space with the true point $(\theta_1^*, \theta_2^*)$, the initial point $(\theta_1^o, \theta_2^o)$, and the critical subspace $\theta_1 = \theta_2$.)
Why Natural Gradient? - Experimental Evidence - (2/3)

(Figure: dynamics of ordinary gradient learning.)
Why Natural Gradient? - Experimental Evidence - (3/3)

(Figure: dynamics of natural gradient learning.)
Problem of Natural Gradient Learning

Updating rule:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \tilde{\nabla} e(\boldsymbol{\theta}_t) = \boldsymbol{\theta}_t - \eta_t\, G^{-1}(\boldsymbol{\theta}_t)\, \nabla e(\boldsymbol{\theta}_t)$$

Calculation of the Fisher information matrix:
- Needs the input distribution, which is unknown; it is estimated by a sample mean:
$$\hat{G}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{n=1}^{N} \int p(\mathbf{y}|\mathbf{x}_n, \boldsymbol{\theta})\, \frac{\partial \log p(\mathbf{y}|\mathbf{x}_n, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\, \frac{\partial \log p(\mathbf{y}|\mathbf{x}_n, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}^T d\mathbf{y}$$

Calculation of the inverse of the Fisher information matrix:
- High computational cost → an adaptive estimation method is needed (a sketch of the sample-mean estimate follows below).
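A sketch of the sample-mean estimate of G(θ). The expectation over y is approximated here by sampling from the model density; grad_log_p and sample_y are hypothetical model-supplied functions (for Gaussian additive noise the inner expectation can instead be done analytically):

```python
import numpy as np

def estimate_fisher(theta, X, grad_log_p, sample_y, n_mc=10):
    """G_hat = (1/N) sum_n E_{y ~ p(y|x_n, theta)}[ s s^T ],
    where s = grad_theta log p(y | x_n, theta) is the score vector.

    grad_log_p(theta, x, y) -> (P,) score vector
    sample_y(theta, x)      -> one sample y ~ p(y | x, theta)
    """
    P = theta.size
    G = np.zeros((P, P))
    for x in X:
        for _ in range(n_mc):                    # small Monte Carlo average over y
            y = sample_y(theta, x)
            s = grad_log_p(theta, x, y)
            G += np.outer(s, s)
    return G / (len(X) * n_mc)
```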
Adaptive Natural Gradient Learning (1/2)

Stochastic neural networks:
$$\mathbf{y} = \mathbf{f}(\mathbf{x}, \boldsymbol{\theta}) + \boldsymbol{\xi}, \qquad \boldsymbol{\xi} \sim r(\boldsymbol{\xi}), \qquad p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta}) = r\big(\mathbf{y} - \mathbf{f}(\mathbf{x}, \boldsymbol{\theta})\big)$$

Fisher information matrix, writing $F(\mathbf{x}, \boldsymbol{\theta}) = \big(\nabla f_1(\mathbf{x}, \boldsymbol{\theta}), \dots, \nabla f_L(\mathbf{x}, \boldsymbol{\theta})\big)$:
$$G(\boldsymbol{\theta}) = \mathrm{E}_{\mathbf{x}} \mathrm{E}_{\mathbf{y}|\mathbf{x}} \left[ \frac{\partial \log p}{\partial \boldsymbol{\theta}} \frac{\partial \log p}{\partial \boldsymbol{\theta}}^T \right] = \mathrm{E}_{\mathbf{x}} \big[ F(\mathbf{x}, \boldsymbol{\theta})\, R\, F(\mathbf{x}, \boldsymbol{\theta})^T \big], \qquad R = \mathrm{E}_{\boldsymbol{\xi}} \big[ \nabla_{\boldsymbol{\xi}} \log r(\boldsymbol{\xi})\, \nabla_{\boldsymbol{\xi}} \log r(\boldsymbol{\xi})^T \big]$$
Adaptive Natural Gradient Learning (2/2)

Adaptive estimation of the Fisher information matrix (with $\hat{F}_t = F(\mathbf{x}_t, \boldsymbol{\theta}_t)\, R^{1/2}$):
$$\hat{G}_{t+1} = (1 - \varepsilon_t)\, \hat{G}_t + \varepsilon_t\, \hat{F}_t \hat{F}_t^T, \qquad \varepsilon_t = \frac{1}{t} \ \text{or a small constant}$$

Inverse of the Fisher information matrix, obtained directly by the matrix inversion lemma (no explicit matrix inversion needed):
$$\hat{G}_{t+1}^{-1} = \frac{1}{1 - \varepsilon_t} \left[ \hat{G}_t^{-1} - \varepsilon_t\, \hat{G}_t^{-1} \hat{F}_t \big( (1 - \varepsilon_t) I + \varepsilon_t\, \hat{F}_t^T \hat{G}_t^{-1} \hat{F}_t \big)^{-1} \hat{F}_t^T \hat{G}_t^{-1} \right]$$

Adaptive natural gradient learning:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \tilde{\nabla} e(\boldsymbol{\theta}_t) = \boldsymbol{\theta}_t - \eta_t\, \hat{G}_{t+1}^{-1}\, \nabla e(\mathbf{x}_t, \mathbf{y}_t^*, \boldsymbol{\theta}_t)$$
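A sketch of the inverse-Fisher recursion above as code; one can verify by the Woodbury identity that it is the exact inverse of the rank-L update of G.

```python
import numpy as np

def update_inverse_fisher(G_inv, F_hat, eps):
    """One step of the adaptive estimate  G_{t+1} = (1-eps) G_t + eps F F^T,
    computed directly on the inverse via the matrix inversion lemma:
      G_{t+1}^{-1} = [G^{-1} - eps G^{-1} F ((1-eps)I + eps F^T G^{-1} F)^{-1} F^T G^{-1}] / (1-eps)

    F_hat has shape (P, L) with L = number of outputs, so only an L x L
    system is solved per step instead of a P x P inversion.
    """
    GF = G_inv @ F_hat                                    # (P, L)
    L = F_hat.shape[1]
    S = (1.0 - eps) * np.eye(L) + eps * (F_hat.T @ GF)    # (L, L)
    correction = eps * GF @ np.linalg.solve(S, GF.T)
    return (G_inv - correction) / (1.0 - eps)
```

With ε_t = 1/t the recursion averages over all past steps; with a small constant ε it tracks the current parameter region more closely.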
Implementation of ANGL

Consider two types of practical applications:
- Regression problems: predict output values for a given input (time series prediction, nonlinear system identification); generally continuous outputs.
- Classification problems: assign a given input to one of several classes (pattern recognition, data mining); binary outputs.

Use a different stochastic model for each type:
- Regression problem → additive noise model (squared error function)
- Classification problem → coin-flipping model (cross-entropy error)
ANGL for Regression Problems (1/3)

Stochastic model of neural networks: additive noise $\boldsymbol{\xi}$ subject to a probability distribution $r(\boldsymbol{\xi})$,
$$\mathbf{y} = \mathbf{f}(\mathbf{x}, \boldsymbol{\theta}) + \boldsymbol{\xi}$$

Error function (negative log likelihood):
$$e(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta}) = -\log p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta}) = -\sum_{i=1}^{L} \log r\big(y_i - f_i(\mathbf{x}, \boldsymbol{\theta})\big)$$

For noise subject to a Gaussian with scalar variance $\sigma^2$:
$$p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta}) \propto \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{L} \big(y_i - f_i(\mathbf{x}, \boldsymbol{\theta})\big)^2 \right\}, \qquad e(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta}) = \frac{1}{2\sigma^2} \sum_{i=1}^{L} \big(y_i - f_i(\mathbf{x}, \boldsymbol{\theta})\big)^2 + \text{const.}$$
ANGL for Regression Problems (2/3)

Estimation of the Fisher information matrix and the adaptive natural gradient learning algorithm:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \tilde{\nabla} e(\boldsymbol{\theta}_t) = \boldsymbol{\theta}_t - \eta_t\, \hat{G}_{t+1}^{-1}\, \nabla e(\mathbf{x}_t, \mathbf{y}_t^*, \boldsymbol{\theta}_t)$$
ANGL for Regression Problems (3/3)

Case of Gaussian additive noise with scalar variance: $R = \frac{1}{\sigma^2} I$, so the Fisher estimate is built from the output gradients alone,
$$\tilde{F}_t = \left( \frac{\partial f_1(\mathbf{x}_t, \boldsymbol{\theta}_t)}{\partial \boldsymbol{\theta}},\ \frac{\partial f_2(\mathbf{x}_t, \boldsymbol{\theta}_t)}{\partial \boldsymbol{\theta}},\ \dots,\ \frac{\partial f_L(\mathbf{x}_t, \boldsymbol{\theta}_t)}{\partial \boldsymbol{\theta}} \right)$$
with the same updating rule
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \hat{G}_{t+1}^{-1}\, \nabla e(\mathbf{x}_t, \mathbf{y}_t^*, \boldsymbol{\theta}_t)$$
(a sketch combining these steps follows below).
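Putting the pieces together, a sketch of one ANGL iteration for regression with scalar-variance Gaussian noise. It reuses update_inverse_fisher from the earlier sketch; f and jacobian_f are hypothetical model functions, and the variance factor is folded into the learning rate for simplicity.

```python
def angl_regression_step(theta, G_inv, x, y_star, t, f, jacobian_f, eta=0.005):
    """One ANGL iteration for regression with squared error.

    f(theta, x)          -> model output, shape (L,)
    jacobian_f(theta, x) -> F_t = (grad f_1, ..., grad f_L), shape (P, L)
    """
    eps = 1.0 / t                                   # eps_t = 1/t, as in the experiments
    F = jacobian_f(theta, x)
    G_inv = update_inverse_fisher(G_inv, F, eps)    # recursion from the previous sketch
    # Gradient of e = ||y* - f||^2 / 2 is  grad_e = -F (y* - f).
    grad_e = -F @ (y_star - f(theta, x))
    theta = theta - eta * (G_inv @ grad_e)
    return theta, G_inv
```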
ANGL for Classification Problems (1/4)

Classification problems:
- An output node represents a class (binary values).
- A stochastic model different from the one for regression is needed.

Stochastic model I (case of 2 classes): output 1 for class 1, output 0 for class 2,
$$p(y|\mathbf{x}, \boldsymbol{\theta}) = f(\mathbf{x}, \boldsymbol{\theta})^{y}\, \big(1 - f(\mathbf{x}, \boldsymbol{\theta})\big)^{1-y}$$

Error function (cross-entropy error function):
$$e(\mathbf{x}, y, \boldsymbol{\theta}) = -\log p(y|\mathbf{x}, \boldsymbol{\theta}) = -y \log f(\mathbf{x}, \boldsymbol{\theta}) - (1 - y) \log\big(1 - f(\mathbf{x}, \boldsymbol{\theta})\big)$$
ANGL for Classification Problems (2/4)

Estimation of the Fisher information matrix and the adaptive natural gradient learning algorithm:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \hat{G}_{t+1}^{-1}\, \nabla e(\mathbf{x}_t, y_t^*, \boldsymbol{\theta}_t)$$
ANGL for Classification Problems (3/4)

Stochastic model II (case of multiple classes, $L$): $L$ output nodes are needed, so that each output node represents one class.

Error function (cross-entropy error function):
$$e(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta}) = -\log p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta}) = -\sum_{i=1}^{L} y_i \log f_i(\mathbf{x}, \boldsymbol{\theta})$$
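For reference, the two cross-entropy error functions as code (a sketch; f_out is the hypothetical network output, a sigmoid value for model I and a normalized vector for model II):

```python
import numpy as np

def cross_entropy_2class(y, f_out, eps=1e-12):
    """Stochastic model I: p(y|x) = f^y (1-f)^(1-y), y in {0, 1}.
    e = -y log f - (1 - y) log(1 - f)."""
    f_out = np.clip(f_out, eps, 1.0 - eps)   # guard against log(0)
    return -(y * np.log(f_out) + (1 - y) * np.log(1 - f_out))

def cross_entropy_multiclass(y, f_out, eps=1e-12):
    """Stochastic model II: L output nodes, y is one-hot over L classes.
    e = -sum_i y_i log f_i."""
    f_out = np.clip(f_out, eps, None)
    return -np.sum(y * np.log(f_out))
```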
ANGL for Classification Problems (4/4)

Estimation of the Fisher information matrix and the adaptive natural gradient learning algorithm:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \hat{G}_{t+1}^{-1}\, \nabla e(\mathbf{x}_t, \mathbf{y}_t^*, \boldsymbol{\theta}_t)$$
Experiments on Regression Problems (1/3)

Mackey-Glass time series prediction:
- Generation of the time series (a generation sketch follows below)
- Input: 4 previous values x(t-18), x(t-12), x(t-6), x(t)
- Output: 1 future value x(t+6)
- Learning data: 500 (t = 200, ..., 700); test data: 500 (t = 5000, ..., 5500)
- Output noise: subject to a Gaussian distribution
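The slide does not reproduce the generating equation; the standard Mackey-Glass delay differential equation with the usual benchmark coefficients and delay τ = 17 (assumed here) can be integrated with a simple Euler scheme:

```python
import numpy as np

def mackey_glass(n_steps, tau=17, a=0.2, b=0.1, dt=1.0, x0=1.2):
    """Euler integration of dx/dt = a*x(t-tau)/(1 + x(t-tau)^10) - b*x(t).

    tau = 17 and the coefficients are the common benchmark values,
    assumed here since the slide omits the equation.
    """
    delay = int(tau / dt)
    x = np.zeros(n_steps + delay)
    x[:delay] = x0                        # constant history before t = 0
    for t in range(delay, n_steps + delay - 1):
        x_tau = x[t - delay]
        x[t + 1] = x[t] + dt * (a * x_tau / (1 + x_tau**10) - b * x[t])
    return x[delay:]

# Build (input, target) pairs as in the experiment:
# inputs x(t-18), x(t-12), x(t-6), x(t); target x(t+6).
series = mackey_glass(6000)
ts = np.arange(200, 700)
X = np.stack([series[ts - 18], series[ts - 12], series[ts - 6], series[ts]], axis=1)
Y = series[ts + 6]
```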
Experiments on Regression Problems (2/3)

Experimental results (average over 10 trials):

                                    OGL (BP with momentum)   ANGL (adaptive natural gradient)
User-defined parameters             η = 0.1, α = 0.1         η = 0.005, ε_t = 1/t
Network architecture                4-10-1                   4-10-1
Success rate                        10/10                    10/10
Learning cycles (MSE < 2×10⁻⁵)      836,480                  502
Prediction error (test data)        7.6265×10⁻⁵              2.4716×10⁻⁵
Relative processing time            1.0                      0.064
Experiments on Regression Problems (3/3)

(Figure: learning curves.)
Experiments on Classification Problems - case of two classes (1/3)

Extended XOR problem:
- 2 classes; use stochastic model I
- Learning data: 1800; test data: 900
Experiments on Classification Problems - case of two classes (2/3)

Experimental results (average over 10 trials):

                                    OGL (ordinary gradient)   ANGL (adaptive natural gradient)
User-defined parameters             η = 0.005, α = 0          η = 0.00002, ε_t = 1/t
Network architecture                2-8-1                     2-8-1
Success rate                        9/10                      10/10
Learning cycles (MSE < 0.03)        182,440                   686
Classification rate (test data)     94.77%                    94.71%
Relative processing time            1.0                       0.086
Experiments on Classification Problems - case of two classes (3/3)

(Figure: learning curves.)
Experiments on Classification Problems - case of multiple classes (1/3)

IRIS classification problem:
- Classify three different species of iris flower
- Input: 4 attributes describing the shape of the plant (4 input nodes)
- Output: 3 classes of flower (3 output nodes)
- Use stochastic model II
- Learning data: 90 (30 for each class); test data: 60 (20 for each class)
Experiments on Classification Problems - case of multiple classes (2/3)

Experimental results (average over 10 trials):

                                    OGL (ordinary gradient)   ANGL (adaptive natural gradient)
User-defined parameters             η = 0.02, α = 0           η = 0.005, ε_t = 1/t
Network architecture                4-4-3                     4-4-3
Success rate                        10/10                     10/10
Learning cycles (MSE < 0.03)        83,586                    108
Classification rate (test data)     94.38%                    94.99%
Relative processing time            1.0                       0.097
Experiments on Classification Problems - case of multiple classes (3/3)

(Figure: learning curves.)
Comparison with Second Order Methods (1/3)

Newton Method: use the second order Taylor expansion of the error function around the optimal point $\boldsymbol{\theta}^*$:
$$E(\boldsymbol{\theta}) \approx E(\boldsymbol{\theta}^*) + \nabla E(\boldsymbol{\theta}^*)^T (\boldsymbol{\theta} - \boldsymbol{\theta}^*) + \tfrac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}^*)^T H(\boldsymbol{\theta}^*) (\boldsymbol{\theta} - \boldsymbol{\theta}^*)$$
Since $\nabla E(\boldsymbol{\theta}^*) = 0$,
$$\nabla E(\boldsymbol{\theta}) \approx H(\boldsymbol{\theta}^*) (\boldsymbol{\theta} - \boldsymbol{\theta}^*) \quad \Rightarrow \quad \boldsymbol{\theta}^* \approx \boldsymbol{\theta} - H^{-1}(\boldsymbol{\theta}^*)\, \nabla E(\boldsymbol{\theta})$$

Updating rule:
$$\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n - H^{-1}(\boldsymbol{\theta}_n)\, \nabla E(\boldsymbol{\theta}_n)$$

- Effective only around the optimal point.
- Can be unstable, depending on the condition of the Hessian matrix.
Comparison with Second Order Methods (2/3)

Gauss-Newton Method: consider the sum of squared errors
$$E(\boldsymbol{\theta}) = \frac{1}{2} \sum_{n=1}^{N} \{ e(\mathbf{x}_n, \mathbf{y}_n^*, \boldsymbol{\theta}) \}^2 = \frac{1}{2} \| \mathbf{e}(\boldsymbol{\theta}) \|^2, \qquad \mathbf{e}(\boldsymbol{\theta}) = \big( e(\mathbf{x}_1, \mathbf{y}_1^*, \boldsymbol{\theta}), \dots, e(\mathbf{x}_N, \mathbf{y}_N^*, \boldsymbol{\theta}) \big)$$

Gauss-Newton approximation of the Hessian (the second-derivative terms are dropped):
$$H = \nabla^2 E = \sum_{n=1}^{N} \left\{ \nabla e_n \nabla e_n^T + e_n \nabla^2 e_n \right\} \approx \tilde{H} = \sum_{n=1}^{N} \nabla e_n \nabla e_n^T$$

Updating rule:
$$\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n - \tilde{H}^{-1}(\boldsymbol{\theta}_n)\, \nabla E(\boldsymbol{\theta}_n)$$

Levenberg-Marquardt Method:
$$\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n - \big( \tilde{H}(\boldsymbol{\theta}_n) + \lambda I \big)^{-1} \nabla E(\boldsymbol{\theta}_n)$$
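A compact sketch of the Gauss-Newton/Levenberg-Marquardt step for E(θ) = ½ Σ e_n(θ)²; residuals and jacobian are hypothetical user-supplied functions returning the per-example errors and their stacked gradients.

```python
import numpy as np

def levenberg_marquardt_step(theta, residuals, jacobian, lam=1e-3):
    """theta_next = theta - (H_tilde + lam*I)^{-1} grad E, with the
    Gauss-Newton approximation H_tilde = J^T J (second-derivative term dropped).

    residuals(theta) -> e, shape (N,)      per-example errors e_n
    jacobian(theta)  -> J, shape (N, P)    row n is grad e_n(theta)^T
    lam = 0 recovers the pure Gauss-Newton step.
    """
    e = residuals(theta)
    J = jacobian(theta)
    H_tilde = J.T @ J                    # Gauss-Newton Hessian approximation
    grad_E = J.T @ e                     # gradient of E = 0.5 * sum e_n^2
    step = np.linalg.solve(H_tilde + lam * np.eye(theta.size), grad_E)
    return theta - step
```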
Comparison with Second Order Methods (3/3)

Natural gradient learning:
- Updating rule: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \hat{G}^{-1}(\boldsymbol{\theta}_t)\, \nabla e(\boldsymbol{\theta}_t)$
- On-line/batch learning; uses a general error function; considers the geometrical characteristics of the space of neural networks.

Gauss-Newton method:
- Updating rule: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \tilde{H}^{-1}(\boldsymbol{\theta}_t)\, \nabla E(\boldsymbol{\theta}_t)$
- Batch learning; assumes a sum of squared errors; uses a quadratic approximation and an approximation of the Hessian.

Under the assumption of additive Gaussian noise with scalar variance, $\hat{G}(\boldsymbol{\theta}_t)$ coincides (up to a constant factor) with $\tilde{H}(\boldsymbol{\theta}_t)$ in batch mode, so:
- The natural gradient method gives a theoretical justification of the Gauss-Newton approximation.
- The natural gradient method is a generalization of the Gauss-Newton method.
Conclusions

A study on an efficient learning method:
- Considered the plateau problem in learning.
- Considered the geometrical structure of the space of neural networks.
- Took an information-geometrical approach to the plateau problem.
- Presented the natural gradient learning method as a solution to the plateau problem.
- Presented adaptive natural gradient learning as a way of realizing the natural gradient in the field of neural networks.
- Showed practical advantages of adaptive natural gradient learning.
- Compared it with second order methods.

When is natural gradient learning good?
- Problems with large data sets and small network sizes.
- Problems requiring fine approximation accuracy.
References

Plateau Problem
- Fukumizu, K. & Amari, S. (2000). Local Minima and Plateaus in Hierarchical Structures of Multilayer Perceptrons, Neural Networks, 13, 317-328.
- Saad, D. & Solla, S. A. (1995). On-line Learning in Soft Committee Machines, Physical Review E, 52, 4225-4243.

Second Order Methods and Learning Theory
- Bishop, C. (1995). Neural Networks for Pattern Recognition, Oxford University Press.
- LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). In G. B. Orr & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade, Springer Lecture Notes in Computer Science, vol. 1524, Heidelberg: Springer.

Information Geometry
- Amari, S. & Nagaoka, H. (1999). Information Geometry, AMS and Oxford University Press.
Basic Concept of Natural Gradient
- Amari, S. (1998). Natural Gradient Works Efficiently in Learning, Neural Computation, 10, 251-276.

Natural Gradient for Neural Networks
- Amari, S., Park, H., & Fukumizu, K. (2000). Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons, Neural Computation, 12, 1399-1409.
- Park, H., Amari, S., & Lee, Y. (1999). An Information Geometrical Approach on Plateau Problems in Multilayer Perceptron Learning, Journal of KISS(B): Software and Applications, 26(4), 546-556. (in Korean)
- Park, H., Amari, S., & Fukumizu, K. (2000). Adaptive Natural Gradient Learning Algorithms for Various Stochastic Models, Neural Networks, 13, 755-764.
- Rattray, M., Saad, D., & Amari, S. (1998). Natural Gradient Descent for On-line Learning, Physical Review Letters, 81, 5461-5464.