TRANSCRIPT

Gradient Descent Learning of Neural Networks - Focusing on Information Geometry Theory and the Natural Gradient -
Hyeyoung Park, Lab. for Mathematical Neuroscience, Brain Science Institute, RIKEN, Japan
Korean Information Science Society (KISS) Spring Conference Tutorial, 2001-04-28
Overview (1/2)
- Introduction: feed-forward neural networks; learning of neural networks; the plateau problem
- Geometrical approach to learning: geometry of neural networks; information geometry; information geometry for neural networks
- Natural gradient: superiority of the natural gradient; the natural gradient and plateaus; problems of natural gradient learning
Overview (2/2)
- Adaptive natural gradient learning (ANGL): basic formula; ANGL for regression problems; ANGL for classification problems; computational experiments
- Comparison with second order methods: Newton method; Gauss-Newton method and Levenberg-Marquardt method; natural gradient vs. Gauss-Newton method
- Conclusions
Feed Forward Neural Networks

A network model (figure: input nodes $x_1, \dots, x_i, \dots, x_N$; hidden nodes $z_1, \dots, z_M$; output nodes $y_1, \dots, y_k, \dots, y_L$; input-to-hidden weights $w_{ij}$ with biases $b_j$; hidden-to-output weights $v_{jk}$ with biases $b_k^o$):

$$z_j = h\!\left(\sum_{i=1}^{N} w_{ij}\, x_i + b_j\right), \qquad j = 1, 2, \dots, M$$

$$y_k = f_k(\mathbf{x}, \boldsymbol{\theta}) = \varphi\!\left(\sum_{j=1}^{M} v_{jk}\, z_j + b_k^{o}\right), \qquad k = 1, 2, \dots, L$$

where $h$ and $\varphi$ are activation functions ($\varphi$ may be the identity for regression). In vector form, $\mathbf{y} = \mathbf{f}(\mathbf{x}, \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ collects all weights and biases.
Learning of Neural Networks

Data set: $D = \{(\mathbf{x}_n, \mathbf{y}_n^*)\}_{n=1}^{N}$

Error function $e(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta})$:
- Squared error function: $s(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta}) = \|\mathbf{y} - \mathbf{f}(\mathbf{x}, \boldsymbol{\theta})\|^2$
- Negative log likelihood: $l(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta}) = -\log p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta})$
- Training error: $E(\boldsymbol{\theta}) = \frac{1}{N}\sum_{n=1}^{N} e(\mathbf{x}_n, \mathbf{y}_n^*, \boldsymbol{\theta})$

Learning (gradient descent learning):
- Goal: find an optimal parameter $\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} E(\boldsymbol{\theta})$
- Search for an estimate $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}^*$ step by step:
  - on-line mode: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \,\partial e(\mathbf{x}_t, \mathbf{y}_t^*, \boldsymbol{\theta}_t) / \partial \boldsymbol{\theta}$
  - batch mode: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \,\partial E(\boldsymbol{\theta}_t) / \partial \boldsymbol{\theta}$

(A minimal sketch of both modes follows below.)
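To make the two update modes concrete, here is a minimal sketch for a one-hidden-layer network with squared error. The network shape, the tanh activation, and the learning rate are illustrative assumptions, not values from the tutorial.

```python
import numpy as np

def forward(theta, x):
    """One-hidden-layer network y = f(x, theta): z = tanh(W x + b), y = V z + c."""
    W, b, V, c = theta
    z = np.tanh(W @ x + b)
    return V @ z + c, z

def grad_squared_error(theta, x, y_star):
    """Gradient of e = ||y* - f(x, theta)||^2 w.r.t. each parameter block."""
    W, b, V, c = theta
    y, z = forward(theta, x)
    d_y = -2.0 * (y_star - y)          # de/dy
    d_z = (V.T @ d_y) * (1 - z**2)     # backpropagate through tanh
    return [np.outer(d_z, x), d_z, np.outer(d_y, z), d_y]

def online_step(theta, x, y_star, eta):
    """On-line mode: update after every single example (x, y*)."""
    return [p - eta * g for p, g in zip(theta, grad_squared_error(theta, x, y_star))]

def batch_step(theta, X, Y, eta):
    """Batch mode: update with the gradient of the mean training error E(theta)."""
    grads = [grad_squared_error(theta, x, y) for x, y in zip(X, Y)]
    mean = [sum(gs) / len(grads) for gs in zip(*grads)]
    return [p - eta * g for p, g in zip(theta, mean)]
```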
Plateau Problem

(Figure: typical learning curve of a neural network, with long flat plateaus between drops in the error.)

Plateaus make learning extremely slow.
Why Do Plateaus Appear? (1/2) [Saad and Solla 1995]
- Analyzed the dynamics of the parameters during learning using statistical mechanics.
- In the early stage of learning, the network is drawn into a suboptimal symmetric phase, in which all hidden nodes have the same weight values.
- Breaking this symmetry takes the dominant share of learning time → plateau.
(Figure: a network with inputs $x_1, \dots, x_N$ and hidden nodes $z_1, z_2, z_3$; the learning trajectory in weight space $(\mathbf{w}_1, \mathbf{w}_2, \mathbf{w}_3)$ runs from the starting point through the suboptimal symmetric phase before reaching the optimal point.)
Why Do Plateaus Appear? (2/2) [Fukumizu and Amari 1999]
- The parameter spaces of smaller networks are subspaces of those of larger networks → critical subspace.
- Global/local minima of a smaller network can be local minima or saddle points of the larger network.
- These saddle points are a main cause of plateaus.
(Figure: a network with one hidden unit and parameters $(w, v)$ can be embedded in a network with two hidden units and parameters $(w_1, v_1, w_2, v_2)$ in several ways, e.g. $w_1 = w,\ v_1 = v,\ v_2 = 0$, or $w_1 = w_2 = w,\ v_1 + v_2 = v$; these embeddings make up the critical subspace.)
Hierarchical Structure of the Space of NN

(Figure: the space of the smaller network sits inside the space of the larger network as the critical subspace; a minimum of the smaller network appears there as saddle points or local minima of the larger network.)
Geometrical Structure of the Neural Manifold
- Which is the fastest way to the optimal point?
- How can an efficient path be found?

Neural manifold: the error surface over the parameter space.
Information Geometry

The study of spaces of probability density functions specified by a parameter $\boldsymbol{\theta}$: $p(\mathbf{x}; \boldsymbol{\theta})$.

Basic characteristics:
- A Riemannian space: a local metric is needed to measure distance.
- The corresponding metric is given by the Fisher information matrix:
$$d^2\big(p(\boldsymbol{\theta}),\, p(\boldsymbol{\theta} + d\boldsymbol{\theta})\big) = d\boldsymbol{\theta}^T G(\boldsymbol{\theta})\, d\boldsymbol{\theta}$$
- The steepest descent direction of a function $e(\boldsymbol{\theta})$ on this space is given by the natural gradient:
$$\tilde{\nabla} e(\boldsymbol{\theta}) = G^{-1}(\boldsymbol{\theta})\, \nabla e(\boldsymbol{\theta})$$
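As a standard worked example (not from the slides): for the univariate Gaussian family the Fisher metric can be computed in closed form.

```latex
% Fisher information metric of the Gaussian family p(x; \theta), \theta = (\mu, \sigma),
% with G_{ab}(\theta) = E[\partial_a \log p \; \partial_b \log p]:
\[
\log p(x; \mu, \sigma) = -\frac{(x-\mu)^2}{2\sigma^2} - \log\sigma - \tfrac{1}{2}\log 2\pi,
\qquad
G(\mu, \sigma) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix}.
\]
% Hence d^2 = (d\mu^2 + 2\,d\sigma^2)/\sigma^2: the same parameter step d\mu is a
% larger move in distribution space when \sigma is small.
```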
An Example of Riemannian Space

Curved space and locally Euclidean space: consider polar coordinates $\boldsymbol{\xi} = (\xi_1, \xi_2)$ mapped to the plane by $\mathbf{x}(\boldsymbol{\xi}) = (x_1, x_2) = (\xi_1 \cos \xi_2,\ \xi_1 \sin \xi_2)$.

Metric for the space:
- In Cartesian coordinates the metric is Euclidean:
$$G(\mathbf{x}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad d^2(\mathbf{x}, \mathbf{x} + d\mathbf{x}) = (dx_1)^2 + (dx_2)^2$$
- In polar coordinates, writing $J$ for the Jacobian of $\mathbf{x}(\boldsymbol{\xi})$ (whose entries are the $\sin$/$\cos$ terms of the map above),
$$d^2 = d\mathbf{x}^T d\mathbf{x} = d\boldsymbol{\xi}^T J^T J\, d\boldsymbol{\xi} = d\boldsymbol{\xi}^T G(\boldsymbol{\xi})\, d\boldsymbol{\xi}, \qquad G(\boldsymbol{\xi}) = \begin{pmatrix} 1 & 0 \\ 0 & \xi_1^2 \end{pmatrix}$$
so $d^2 = (d\xi_1)^2 + \xi_1^2 (d\xi_2)^2$: the same coordinate step corresponds to different distances at different points of the space.
Information Geometry for Neural Networks

Stochastic neural networks: a neural network can be considered as a probability density function,
$$p(\mathbf{x}, \mathbf{y}; \boldsymbol{\theta}) = q(\mathbf{x})\, p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta})$$

Gradient in the space of neural networks:
$$\tilde{\nabla} e(\boldsymbol{\theta}) = G^{-1}(\boldsymbol{\theta})\, \nabla e(\boldsymbol{\theta})$$

Natural gradient learning:
$$\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n - \eta_n\, \tilde{\nabla} e(\boldsymbol{\theta}_n)$$
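A minimal sketch of the natural gradient update, assuming the Fisher matrix and the ordinary gradient are supplied as functions (fisher_matrix and grad_error are hypothetical placeholders; later slides discuss how G is actually estimated):

```python
import numpy as np

def natural_gradient_step(theta, x, y_star, eta, fisher_matrix, grad_error):
    """theta_{t+1} = theta_t - eta * G^{-1}(theta_t) grad e(x, y*, theta_t).

    fisher_matrix(theta)          -> (P, P) Fisher information matrix G(theta)
    grad_error(theta, x, y_star)  -> (P,) ordinary gradient of the loss
    """
    G = fisher_matrix(theta)
    g = grad_error(theta, x, y_star)
    # Solve G d = g instead of forming G^{-1} explicitly (cheaper, more stable).
    d = np.linalg.solve(G, g)
    return theta - eta * d
```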
Why Natural Gradient? - Related Researches -

[Amari 1998]
- Showed that the natural gradient gives the steepest descent direction of a loss function at an arbitrary point of the manifold of probability distributions.
- Natural gradient learning achieves the best asymptotic performance that any unbiased learning algorithm can achieve.

[Park et al. 1999]
- Suggested the possibility of avoiding plateaus, or quickly escaping from them, using natural gradient learning.
- Showed experimental evidence of avoiding plateaus.

[Rattray and Saad, 1999]
- Confirmed the possibility of avoiding plateaus through statistical-mechanical analysis.
Why Natural Gradient? - Intuitive Explanations -

Consider the movement of the parameter around the critical subspace.

Standard gradient descent learning method:
$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) - \eta_t\, \frac{\partial e(\boldsymbol{\theta}(t))}{\partial \mathbf{w}_i}, \qquad \frac{\partial e}{\partial \mathbf{w}_i} = \{y - f(\mathbf{x}, \boldsymbol{\theta})\}\, v_i\, \varphi'(\mathbf{w}_i \cdot \mathbf{x})\, \mathbf{x}$$
If $\mathbf{w}_i = \mathbf{w}_j$, then $\partial e / \partial \mathbf{w}_i = \partial e / \partial \mathbf{w}_j$: the updates preserve the hidden-node symmetry, so the parameter cannot leave the critical subspace quickly → plateau.

Natural gradient method:
$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) - \eta_t \big[ G^{-1}(\boldsymbol{\theta}(t))\, \nabla e(\boldsymbol{\theta}(t)) \big]_i$$
If $\mathbf{w}_i = \mathbf{w}_j$, then $|G(\boldsymbol{\theta})| \to 0$: the Fisher information matrix becomes singular on the critical subspace, so near it $G^{-1}$ strongly magnifies the update and the parameter is pushed away from the subspace.
Why Natural Gradient? - Experimental Evidence - (1/3)

Toy model with a 2-dimensional parameter space:
- Model assumptions: input x ~ N(0, I), noise ~ N(0, 0.1)
- The number of parameters is reduced to two, $(\theta_1, \theta_2)$.
- Training data: generated by a teacher network of the same structure with true parameter $(\theta_1^*, \theta_2^*)$.

(Figure: the parameter space with the true point $(\theta_1^*, \theta_2^*)$, the initial point $(\theta_1^o, \theta_2^o)$, and the critical subspace $\theta_1 = \theta_2$.)
Why Natural Gradient? - Experimental Evidence - (2/3)

(Figure: dynamics of ordinary gradient learning.)
Why Natural Gradient? - Experimental Evidence - (3/3)

(Figure: dynamics of natural gradient learning.)
Problem of Natural Gradient Learning

Updating rule:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \tilde{\nabla} e(\boldsymbol{\theta}_t) = \boldsymbol{\theta}_t - \eta_t\, G^{-1}(\boldsymbol{\theta}_t)\, \nabla e(\boldsymbol{\theta}_t)$$

Calculation of the Fisher information matrix:
- Needs the input distribution, which is unknown; it is estimated by a sample mean:
$$\hat{G}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{n=1}^{N} \int p(\mathbf{y}|\mathbf{x}_n, \boldsymbol{\theta})\, \frac{\partial \log p(\mathbf{y}|\mathbf{x}_n, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\, \frac{\partial \log p(\mathbf{y}|\mathbf{x}_n, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}^T d\mathbf{y}$$

Calculation of the inverse of the Fisher information matrix:
- High computational cost → an adaptive estimation method is needed (a sketch of the sample-mean estimate follows below).
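A sketch of the sample-mean estimate of G(θ). The expectation over y is approximated here by sampling from the model density; grad_log_p and sample_y are hypothetical model-supplied functions (for Gaussian additive noise the inner expectation can instead be done analytically):

```python
import numpy as np

def estimate_fisher(theta, X, grad_log_p, sample_y, n_mc=10):
    """G_hat = (1/N) sum_n E_{y ~ p(y|x_n, theta)}[ s s^T ],
    where s = grad_theta log p(y | x_n, theta) is the score vector.

    grad_log_p(theta, x, y) -> (P,) score vector
    sample_y(theta, x)      -> one sample y ~ p(y | x, theta)
    """
    P = theta.size
    G = np.zeros((P, P))
    for x in X:
        for _ in range(n_mc):                    # small Monte Carlo average over y
            y = sample_y(theta, x)
            s = grad_log_p(theta, x, y)
            G += np.outer(s, s)
    return G / (len(X) * n_mc)
```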
Adaptive Natural Gradient Learning (1/2)

Stochastic neural networks:
$$\mathbf{y} = \mathbf{f}(\mathbf{x}, \boldsymbol{\theta}) + \boldsymbol{\xi}, \qquad \boldsymbol{\xi} \sim r(\boldsymbol{\xi}), \qquad p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta}) = r\big(\mathbf{y} - \mathbf{f}(\mathbf{x}, \boldsymbol{\theta})\big)$$

Fisher information matrix, writing $F(\mathbf{x}, \boldsymbol{\theta}) = \big(\nabla f_1(\mathbf{x}, \boldsymbol{\theta}), \dots, \nabla f_L(\mathbf{x}, \boldsymbol{\theta})\big)$:
$$G(\boldsymbol{\theta}) = \mathrm{E}_{\mathbf{x}} \mathrm{E}_{\mathbf{y}|\mathbf{x}} \left[ \frac{\partial \log p}{\partial \boldsymbol{\theta}} \frac{\partial \log p}{\partial \boldsymbol{\theta}}^T \right] = \mathrm{E}_{\mathbf{x}} \big[ F(\mathbf{x}, \boldsymbol{\theta})\, R\, F(\mathbf{x}, \boldsymbol{\theta})^T \big], \qquad R = \mathrm{E}_{\boldsymbol{\xi}} \big[ \nabla_{\boldsymbol{\xi}} \log r(\boldsymbol{\xi})\, \nabla_{\boldsymbol{\xi}} \log r(\boldsymbol{\xi})^T \big]$$
Adaptive Natural Gradient Learning (2/2)

Adaptive estimation of the Fisher information matrix (with $\hat{F}_t = F(\mathbf{x}_t, \boldsymbol{\theta}_t)\, R^{1/2}$):
$$\hat{G}_{t+1} = (1 - \varepsilon_t)\, \hat{G}_t + \varepsilon_t\, \hat{F}_t \hat{F}_t^T, \qquad \varepsilon_t = \frac{1}{t} \ \text{or a small constant}$$

Inverse of the Fisher information matrix, obtained directly by the matrix inversion lemma (no explicit matrix inversion needed):
$$\hat{G}_{t+1}^{-1} = \frac{1}{1 - \varepsilon_t} \left[ \hat{G}_t^{-1} - \varepsilon_t\, \hat{G}_t^{-1} \hat{F}_t \big( (1 - \varepsilon_t) I + \varepsilon_t\, \hat{F}_t^T \hat{G}_t^{-1} \hat{F}_t \big)^{-1} \hat{F}_t^T \hat{G}_t^{-1} \right]$$

Adaptive natural gradient learning:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \tilde{\nabla} e(\boldsymbol{\theta}_t) = \boldsymbol{\theta}_t - \eta_t\, \hat{G}_{t+1}^{-1}\, \nabla e(\mathbf{x}_t, \mathbf{y}_t^*, \boldsymbol{\theta}_t)$$
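A sketch of the inverse-Fisher recursion above as code; one can verify by the Woodbury identity that it is the exact inverse of the rank-L update of G.

```python
import numpy as np

def update_inverse_fisher(G_inv, F_hat, eps):
    """One step of the adaptive estimate  G_{t+1} = (1-eps) G_t + eps F F^T,
    computed directly on the inverse via the matrix inversion lemma:
      G_{t+1}^{-1} = [G^{-1} - eps G^{-1} F ((1-eps)I + eps F^T G^{-1} F)^{-1} F^T G^{-1}] / (1-eps)

    F_hat has shape (P, L) with L = number of outputs, so only an L x L
    system is solved per step instead of a P x P inversion.
    """
    GF = G_inv @ F_hat                                    # (P, L)
    L = F_hat.shape[1]
    S = (1.0 - eps) * np.eye(L) + eps * (F_hat.T @ GF)    # (L, L)
    correction = eps * GF @ np.linalg.solve(S, GF.T)
    return (G_inv - correction) / (1.0 - eps)
```

With ε_t = 1/t the recursion averages over all past steps; with a small constant ε it tracks the current parameter region more closely.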
Implementation of ANGL

Consider two types of practical applications:
- Regression problems: predict output values for a given input (time series prediction, nonlinear system identification); generally continuous outputs.
- Classification problems: assign a given input to one of several classes (pattern recognition, data mining); binary outputs.

Use a different stochastic model for each type:
- Regression problem → additive noise model (squared error function)
- Classification problem → coin-flipping model (cross-entropy error)
ANGL for Regression Problems (1/3)

Stochastic model of neural networks: additive noise $\boldsymbol{\xi}$ subject to a probability distribution $r(\boldsymbol{\xi})$,
$$\mathbf{y} = \mathbf{f}(\mathbf{x}, \boldsymbol{\theta}) + \boldsymbol{\xi}$$

Error function (negative log likelihood):
$$e(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta}) = -\log p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta}) = -\sum_{i=1}^{L} \log r\big(y_i - f_i(\mathbf{x}, \boldsymbol{\theta})\big)$$

For noise subject to a Gaussian with scalar variance $\sigma^2$:
$$p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta}) \propto \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{L} \big(y_i - f_i(\mathbf{x}, \boldsymbol{\theta})\big)^2 \right\}, \qquad e(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta}) = \frac{1}{2\sigma^2} \sum_{i=1}^{L} \big(y_i - f_i(\mathbf{x}, \boldsymbol{\theta})\big)^2 + \text{const.}$$
ANGL for Regression Problems (2/3)

Estimation of the Fisher information matrix and the adaptive natural gradient learning algorithm:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \tilde{\nabla} e(\boldsymbol{\theta}_t) = \boldsymbol{\theta}_t - \eta_t\, \hat{G}_{t+1}^{-1}\, \nabla e(\mathbf{x}_t, \mathbf{y}_t^*, \boldsymbol{\theta}_t)$$
ANGL for Regression Problems (3/3)

Case of Gaussian additive noise with scalar variance: $R = \frac{1}{\sigma^2} I$, so the Fisher estimate is built from the output gradients alone,
$$\tilde{F}_t = \left( \frac{\partial f_1(\mathbf{x}_t, \boldsymbol{\theta}_t)}{\partial \boldsymbol{\theta}},\ \frac{\partial f_2(\mathbf{x}_t, \boldsymbol{\theta}_t)}{\partial \boldsymbol{\theta}},\ \dots,\ \frac{\partial f_L(\mathbf{x}_t, \boldsymbol{\theta}_t)}{\partial \boldsymbol{\theta}} \right)$$
with the same updating rule
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \hat{G}_{t+1}^{-1}\, \nabla e(\mathbf{x}_t, \mathbf{y}_t^*, \boldsymbol{\theta}_t)$$
(a sketch combining these steps follows below).
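Putting the pieces together, a sketch of one ANGL iteration for regression with scalar-variance Gaussian noise. It reuses update_inverse_fisher from the earlier sketch; f and jacobian_f are hypothetical model functions, and the variance factor is folded into the learning rate for simplicity.

```python
def angl_regression_step(theta, G_inv, x, y_star, t, f, jacobian_f, eta=0.005):
    """One ANGL iteration for regression with squared error.

    f(theta, x)          -> model output, shape (L,)
    jacobian_f(theta, x) -> F_t = (grad f_1, ..., grad f_L), shape (P, L)
    """
    eps = 1.0 / t                                   # eps_t = 1/t, as in the experiments
    F = jacobian_f(theta, x)
    G_inv = update_inverse_fisher(G_inv, F, eps)    # recursion from the previous sketch
    # Gradient of e = ||y* - f||^2 / 2 is  grad_e = -F (y* - f).
    grad_e = -F @ (y_star - f(theta, x))
    theta = theta - eta * (G_inv @ grad_e)
    return theta, G_inv
```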
ANGL for Classification Problems (1/4)

Classification problems:
- An output node represents a class (binary values).
- A stochastic model different from the one for regression is needed.

Stochastic model I (case of 2 classes): output 1 for class 1, output 0 for class 2,
$$p(y|\mathbf{x}, \boldsymbol{\theta}) = f(\mathbf{x}, \boldsymbol{\theta})^{y}\, \big(1 - f(\mathbf{x}, \boldsymbol{\theta})\big)^{1-y}$$

Error function (cross-entropy error function):
$$e(\mathbf{x}, y, \boldsymbol{\theta}) = -\log p(y|\mathbf{x}, \boldsymbol{\theta}) = -y \log f(\mathbf{x}, \boldsymbol{\theta}) - (1 - y) \log\big(1 - f(\mathbf{x}, \boldsymbol{\theta})\big)$$
ANGL for Classification Problems (2/4)

Estimation of the Fisher information matrix and the adaptive natural gradient learning algorithm:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \hat{G}_{t+1}^{-1}\, \nabla e(\mathbf{x}_t, y_t^*, \boldsymbol{\theta}_t)$$
ANGL for Classification Problems (3/4)

Stochastic model II (case of multiple classes, $L$): $L$ output nodes are needed, so that each output node represents one class.

Error function (cross-entropy error function):
$$e(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta}) = -\log p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta}) = -\sum_{i=1}^{L} y_i \log f_i(\mathbf{x}, \boldsymbol{\theta})$$
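For reference, the two cross-entropy error functions as code (a sketch; f_out is the hypothetical network output, a sigmoid value for model I and a normalized vector for model II):

```python
import numpy as np

def cross_entropy_2class(y, f_out, eps=1e-12):
    """Stochastic model I: p(y|x) = f^y (1-f)^(1-y), y in {0, 1}.
    e = -y log f - (1 - y) log(1 - f)."""
    f_out = np.clip(f_out, eps, 1.0 - eps)   # guard against log(0)
    return -(y * np.log(f_out) + (1 - y) * np.log(1 - f_out))

def cross_entropy_multiclass(y, f_out, eps=1e-12):
    """Stochastic model II: L output nodes, y is one-hot over L classes.
    e = -sum_i y_i log f_i."""
    f_out = np.clip(f_out, eps, None)
    return -np.sum(y * np.log(f_out))
```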
ANGL for Classification Problems (4/4)

Estimation of the Fisher information matrix and the adaptive natural gradient learning algorithm:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \hat{G}_{t+1}^{-1}\, \nabla e(\mathbf{x}_t, \mathbf{y}_t^*, \boldsymbol{\theta}_t)$$
Experiments on Regression Problems (1/3)

Mackey-Glass time series prediction:
- Generation of the time series (a generation sketch follows below)
- Input: 4 previous values x(t-18), x(t-12), x(t-6), x(t)
- Output: 1 future value x(t+6)
- Learning data: 500 (t = 200, ..., 700); test data: 500 (t = 5000, ..., 5500)
- Output noise: subject to a Gaussian distribution
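The slide does not reproduce the generating equation; the standard Mackey-Glass delay differential equation with the usual benchmark coefficients and delay τ = 17 (assumed here) can be integrated with a simple Euler scheme:

```python
import numpy as np

def mackey_glass(n_steps, tau=17, a=0.2, b=0.1, dt=1.0, x0=1.2):
    """Euler integration of dx/dt = a*x(t-tau)/(1 + x(t-tau)^10) - b*x(t).

    tau = 17 and the coefficients are the common benchmark values,
    assumed here since the slide omits the equation.
    """
    delay = int(tau / dt)
    x = np.zeros(n_steps + delay)
    x[:delay] = x0                        # constant history before t = 0
    for t in range(delay, n_steps + delay - 1):
        x_tau = x[t - delay]
        x[t + 1] = x[t] + dt * (a * x_tau / (1 + x_tau**10) - b * x[t])
    return x[delay:]

# Build (input, target) pairs as in the experiment:
# inputs x(t-18), x(t-12), x(t-6), x(t); target x(t+6).
series = mackey_glass(6000)
ts = np.arange(200, 700)
X = np.stack([series[ts - 18], series[ts - 12], series[ts - 6], series[ts]], axis=1)
Y = series[ts + 6]
```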
Experiments on Regression Problems (2/3)

Experimental results (average over 10 trials):

                                    OGL (BP with momentum)   ANGL (adaptive natural gradient)
User-defined parameters             η = 0.1, α = 0.1         η = 0.005, ε_t = 1/t
Network architecture                4-10-1                   4-10-1
Success rate                        10/10                    10/10
Learning cycles (MSE < 2×10⁻⁵)      836,480                  502
Prediction error (test data)        7.6265×10⁻⁵              2.4716×10⁻⁵
Relative processing time            1.0                      0.064
Experiments on Regression Problems (3/3)

(Figure: learning curves.)
Experiments on Classification Problems - case of two classes (1/3)

Extended XOR problem:
- 2 classes; use stochastic model I
- Learning data: 1800; test data: 900
Experiments on Classification Problems - case of two classes (2/3)

Experimental results (average over 10 trials):

                                    OGL (ordinary gradient)   ANGL (adaptive natural gradient)
User-defined parameters             η = 0.005, α = 0          η = 0.00002, ε_t = 1/t
Network architecture                2-8-1                     2-8-1
Success rate                        9/10                      10/10
Learning cycles (MSE < 0.03)        182,440                   686
Classification rate (test data)     94.77%                    94.71%
Relative processing time            1.0                       0.086
Experiments on Classification Problems - case of two classes (3/3)

(Figure: learning curves.)
Experiments on Classification Problems - case of multiple classes (1/3)

IRIS classification problem:
- Classify three different species of iris flower
- Input: 4 attributes describing the shape of the plant (4 input nodes)
- Output: 3 classes of flower (3 output nodes)
- Use stochastic model II
- Learning data: 90 (30 for each class); test data: 60 (20 for each class)
Experiments on Classification Problems - case of multiple classes (2/3)

Experimental results (average over 10 trials):

                                    OGL (ordinary gradient)   ANGL (adaptive natural gradient)
User-defined parameters             η = 0.02, α = 0           η = 0.005, ε_t = 1/t
Network architecture                4-4-3                     4-4-3
Success rate                        10/10                     10/10
Learning cycles (MSE < 0.03)        83,586                    108
Classification rate (test data)     94.38%                    94.99%
Relative processing time            1.0                       0.097
Experiments on Classification Problems - case of multiple classes (3/3)

(Figure: learning curves.)
Comparison with Second Order Methods (1/3)

Newton Method: use the second order Taylor expansion of the error function around the optimal point $\boldsymbol{\theta}^*$:
$$E(\boldsymbol{\theta}) \approx E(\boldsymbol{\theta}^*) + \nabla E(\boldsymbol{\theta}^*)^T (\boldsymbol{\theta} - \boldsymbol{\theta}^*) + \tfrac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}^*)^T H(\boldsymbol{\theta}^*) (\boldsymbol{\theta} - \boldsymbol{\theta}^*)$$
Since $\nabla E(\boldsymbol{\theta}^*) = 0$,
$$\nabla E(\boldsymbol{\theta}) \approx H(\boldsymbol{\theta}^*) (\boldsymbol{\theta} - \boldsymbol{\theta}^*) \quad \Rightarrow \quad \boldsymbol{\theta}^* \approx \boldsymbol{\theta} - H^{-1}(\boldsymbol{\theta}^*)\, \nabla E(\boldsymbol{\theta})$$

Updating rule:
$$\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n - H^{-1}(\boldsymbol{\theta}_n)\, \nabla E(\boldsymbol{\theta}_n)$$

- Effective only around the optimal point.
- Can be unstable, depending on the condition of the Hessian matrix.
Comparison with Second Order Methods (2/3)

Gauss-Newton Method: consider the sum of squared errors
$$E(\boldsymbol{\theta}) = \frac{1}{2} \sum_{n=1}^{N} \{ e(\mathbf{x}_n, \mathbf{y}_n^*, \boldsymbol{\theta}) \}^2 = \frac{1}{2} \| \mathbf{e}(\boldsymbol{\theta}) \|^2, \qquad \mathbf{e}(\boldsymbol{\theta}) = \big( e(\mathbf{x}_1, \mathbf{y}_1^*, \boldsymbol{\theta}), \dots, e(\mathbf{x}_N, \mathbf{y}_N^*, \boldsymbol{\theta}) \big)$$

Gauss-Newton approximation of the Hessian (the second-derivative terms are dropped):
$$H = \nabla^2 E = \sum_{n=1}^{N} \left\{ \nabla e_n \nabla e_n^T + e_n \nabla^2 e_n \right\} \approx \tilde{H} = \sum_{n=1}^{N} \nabla e_n \nabla e_n^T$$

Updating rule:
$$\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n - \tilde{H}^{-1}(\boldsymbol{\theta}_n)\, \nabla E(\boldsymbol{\theta}_n)$$

Levenberg-Marquardt Method:
$$\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n - \big( \tilde{H}(\boldsymbol{\theta}_n) + \lambda I \big)^{-1} \nabla E(\boldsymbol{\theta}_n)$$
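A compact sketch of the Gauss-Newton/Levenberg-Marquardt step for E(θ) = ½ Σ e_n(θ)²; residuals and jacobian are hypothetical user-supplied functions returning the per-example errors and their stacked gradients.

```python
import numpy as np

def levenberg_marquardt_step(theta, residuals, jacobian, lam=1e-3):
    """theta_next = theta - (H_tilde + lam*I)^{-1} grad E, with the
    Gauss-Newton approximation H_tilde = J^T J (second-derivative term dropped).

    residuals(theta) -> e, shape (N,)      per-example errors e_n
    jacobian(theta)  -> J, shape (N, P)    row n is grad e_n(theta)^T
    lam = 0 recovers the pure Gauss-Newton step.
    """
    e = residuals(theta)
    J = jacobian(theta)
    H_tilde = J.T @ J                    # Gauss-Newton Hessian approximation
    grad_E = J.T @ e                     # gradient of E = 0.5 * sum e_n^2
    step = np.linalg.solve(H_tilde + lam * np.eye(theta.size), grad_E)
    return theta - step
```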
Comparison with Second Order Methods (3/3)

Natural gradient learning:
- Updating rule: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \hat{G}^{-1}(\boldsymbol{\theta}_t)\, \nabla e(\boldsymbol{\theta}_t)$
- On-line/batch learning; uses a general error function; considers the geometrical characteristics of the space of neural networks.

Gauss-Newton method:
- Updating rule: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \tilde{H}^{-1}(\boldsymbol{\theta}_t)\, \nabla E(\boldsymbol{\theta}_t)$
- Batch learning; assumes a sum of squared errors; uses a quadratic approximation and an approximation of the Hessian.

Under the assumption of additive Gaussian noise with scalar variance, $\hat{G}(\boldsymbol{\theta}_t)$ coincides (up to a constant factor) with $\tilde{H}(\boldsymbol{\theta}_t)$ in batch mode, so:
- The natural gradient method gives a theoretical justification of the Gauss-Newton approximation.
- The natural gradient method is a generalization of the Gauss-Newton method.
Conclusions

A study on an efficient learning method:
- Considered the plateau problem in learning.
- Considered the geometrical structure of the space of neural networks.
- Took an information-geometrical approach to the plateau problem.
- Presented the natural gradient learning method as a solution to the plateau problem.
- Presented adaptive natural gradient learning as a way of realizing the natural gradient in the field of neural networks.
- Showed practical advantages of adaptive natural gradient learning.
- Compared it with second order methods.

When is natural gradient learning good?
- Problems with large data sets and small network sizes.
- Problems requiring fine approximation accuracy.
References

Plateau Problem
- Fukumizu, K. & Amari, S. (2000). Local Minima and Plateaus in Hierarchical Structures of Multilayer Perceptrons, Neural Networks, 13, 317-328.
- Saad, D. & Solla, S. A. (1995). On-line Learning in Soft Committee Machines, Physical Review E, 52, 4225-4243.

Second Order Methods and Learning Theory
- Bishop, C. (1995). Neural Networks for Pattern Recognition, Oxford University Press.
- LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). In G. B. Orr & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade, Springer Lecture Notes in Computer Science, vol. 1524, Heidelberg: Springer.

Information Geometry
- Amari, S. & Nagaoka, H. (1999). Information Geometry, AMS and Oxford University Press.
Basic Concept of Natural Gradient
- Amari, S. (1998). Natural Gradient Works Efficiently in Learning, Neural Computation, 10, 251-276.

Natural Gradient for Neural Networks
- Amari, S., Park, H., & Fukumizu, K. (2000). Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons, Neural Computation, 12, 1399-1409.
- Park, H., Amari, S., & Lee, Y. (1999). An Information Geometrical Approach on Plateau Problems in Multilayer Perceptron Learning, Journal of KISS(B): Software and Applications, 26(4), 546-556. (in Korean)
- Park, H., Amari, S., & Fukumizu, K. (2000). Adaptive Natural Gradient Learning Algorithms for Various Stochastic Models, Neural Networks, 13, 755-764.
- Rattray, M., Saad, D., & Amari, S. (1998). Natural Gradient Descent for On-line Learning, Physical Review Letters, 81, 5461-5464.