11. neural nets - kocwcontents.kocw.net/kocw/document/2014/hanyang/huhseon/8.pdf · 2016. 9. 9. ·...

2014-10-27

Hanyang University, Quest Lab.

Chapter 11

Neural nets

Fall 2014, Data Mining

11.1 Introduction

• Mimics and generalizes the characteristics of human learning and memory

• Financial applications (bankruptcy prediction, currency market trading, picking stocks, commodity trading, detecting fraud in credit card and monetary transactions, CRM, etc.)

• Engineering applications (autonomous vehicle driving, etc.)

• High predictive performance


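Idea and development

• Comparing the computer and the brain

– von Neumann computer
  • Sequential instruction processor

– Brain
  • Made up of neurons (about 10^11 neurons, about 10^14 connections (synapses))
  • Highly parallel instruction processor

[Figure: real neural networks; panels: neuron, neural network]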


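Idea and development

• Brief history

– 1943: McCulloch and Pitts propose the first neural network
– 1958: Rosenblatt, the perceptron
– Widrow and Hoff: Adaline and Madaline
– 1960s: neural networks are overhyped
– 1969: Minsky and Papert point out the limits of the perceptron in their book Perceptrons
  • The perceptron is only a linear classifier and cannot even solve XOR
  • Neural network research declines afterwards
– 1986: Rumelhart, Hinton, and Williams, the multilayer perceptron and the error backpropagation learning algorithm
  • High performance on complex, practical problems such as handwritten digit recognition
  • Neural network research regains momentum
  • Currently the most widely used problem-solving tool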


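2. Neural networks as a mathematical model

• Characteristics of neural networks
– Able to learn
– Strong generalization ability
– Can be processed in parallel
– Good performance on real-world problems
– A tool for many kinds of problems (classification, prediction, function approximation, synthesis, evaluation, …)

• A partial success
– Has not produced a computer that matches human intelligence
– Has contributed greatly to building practical systems in restricted environments (established as a practical mathematical model)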


11.2 Concepts and structure of a neural network

Idea:

• Combine the input information to capture complicated relationships among the predictors and between them and the response variable.

• Linear regression assumes a linear relationship, but in most cases the relationship is not linear or is not known

• The form of the relationship is learned from the data; the user does not have to specify it

• Linear regression and logistic regression are special cases of a NN with no hidden layer

• The most widely used type is the multilayer feedforward network

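Perceptron

• Structure
– Input layer: d+1 nodes (feature vector $x = (x_1, \ldots, x_d)^T$ plus a constant bias input)
– Output layer: one node (hence a two-class classifier)
– Edges and weights

$$y = \begin{cases} 1 & \text{if } net \ge 0 \\ -1 & \text{otherwise} \end{cases}$$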

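Boolean OR

x1  x2  output
0   0   0
0   1   1
1   0   1
1   1   1

Weights: w1 = 1, w2 = 1, w0 = -0.5 (w0 is applied to a constant input of 1)

$$net = \sum_{i=0}^{n} w_i x_i = x_1 + x_2 - 0.5$$

[Figure: OR in the (x1, x2) plane. The line $x_1 + x_2 - 0.5 = 0$ separates the + region ($x_1 + x_2 - 0.5 > 0$) from the − region ($x_1 + x_2 - 0.5 < 0$).]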

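Boolean AND

x1  x2  output
0   0   0
0   1   0
1   0   0
1   1   1

Weights: w1 = 1, w2 = 1, w0 = -1.5 (w0 is applied to a constant input of 1)

$$net = \sum_{i=0}^{n} w_i x_i = x_1 + x_2 - 1.5$$

[Figure: AND in the (x1, x2) plane. The line $x_1 + x_2 - 1.5 = 0$ separates the + region ($x_1 + x_2 - 1.5 > 0$) from the − region ($x_1 + x_2 - 1.5 < 0$).]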
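The OR and AND perceptrons can be checked directly. This is a minimal Python sketch, assuming the threshold rule y = 1 if net ≥ 0 and −1 otherwise, with the 0 entries of the truth tables read as the −1 class; the function name `perceptron` is illustrative.

```python
def perceptron(x1, x2, w0, w1=1.0, w2=1.0):
    """Threshold unit: +1 if net >= 0, otherwise -1 (bias input fixed at 1)."""
    net = w0 * 1 + w1 * x1 + w2 * x2
    return 1 if net >= 0 else -1

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# w0 = -0.5 gives OR, w0 = -1.5 gives AND (weights taken from the slides)
or_out = [perceptron(x1, x2, w0=-0.5) for x1, x2 in inputs]
and_out = [perceptron(x1, x2, w0=-1.5) for x1, x2 in inputs]

print(or_out)   # [-1, 1, 1, 1]
print(and_out)  # [-1, -1, -1, 1]
```

Only the bias weight differs between the two units: moving w0 from -0.5 to -1.5 shifts the same decision line further from the origin.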

Multi-layer Perceptron

• Cases that are not linearly separable
– A limitation of the perceptron
– In Figure 4.5(b), how many points at most can a perceptron classify correctly?

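Boolean XOR

Value = 1 iff x1 ≠ x2.

x1  x2  output
0   0   0
0   1   1
1   0   1
1   1   0

Linear separability: XOR is not linearly separable.

[Figure: XOR in the (x1, x2) plane: (0,1) and (1,0) are +, (0,0) and (1,1) are −; no single line separates them.]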

Multi-layer Perceptron

• The XOR problem
– A perceptron can reach at most 75% correct classification
– How can this limit be overcome?

• Use two perceptrons (two decision lines)
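One way to combine two decision lines, sketched in Python. The two hidden units reuse the OR and AND weights from the earlier slides; the output unit's weights (1, −1, bias −1) are an assumption chosen for this illustration and do not come from the slides.

```python
def step(net):
    """Perceptron threshold: +1 if net >= 0, otherwise -1."""
    return 1 if net >= 0 else -1

def xor_mlp(x1, x2):
    h_or = step(x1 + x2 - 0.5)    # first decision line (OR)
    h_and = step(x1 + x2 - 1.5)   # second decision line (AND)
    # Output unit (assumed weights): +1 only when OR is +1 and AND is -1
    return step(1.0 * h_or - 1.0 * h_and - 1.0)

xor_out = [xor_mlp(x1, x2) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_out)  # [-1, 1, 1, -1]
```

The points between the two lines are the + class, which is exactly the region a single line cannot isolate.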



MLP: structure and principle

• Multilayer perceptron (MLP)

• Architecture of the multilayer perceptron
– Input layer, hidden layer, output layer
– Weights: u and v

Different Activation Functions

Sigmoid Function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$

[Figure: plot of $\sigma(x)$ against $x$]
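The derivative identity for the sigmoid can be verified numerically. A minimal Python sketch, assuming a central finite difference as the reference; the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Closed form: sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_derivative(f, x, h=1e-6):
    """Central finite difference, used only as a check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 1.5):
    print(x, sigmoid_prime(x), numeric_derivative(sigmoid, x))
```

The closed form is what makes backpropagation cheap: the derivative is obtained from the node output that the forward pass has already computed.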



11.2 Concepts and structure of a neural network

- input layer: consists of nodes that only receive the input values

- output layer: the last layer

- hidden layer: the layers between the input and output layers

- no cycles: connections run one way, from one layer to the next


4.3.1 Structure and principle

• Architecture of the FFMLP (Feed-Forward MLP)
– How many hidden layers?
– How should the layers be connected?
– How many nodes in each layer?
– Which activation function should be used?


11.3 Fitting a network to data

Example 1: Tasting score for cheese

Predictors: fat score, salt score (scaled 0 ~ 1)

Response: consumer's taste acceptance (1 = OK, 0 = not OK)

Nodes 1, 2 = input layer; 3~5 = hidden layer; 6 = output layer

Weights: numbers on the arrows, denoted by $w_{ij}$ (from node $i$ to node $j$)

Bias: intercept for the output from node $j$, denoted by $\theta_j$



Computing output of nodes

• Input layer node:

- takes the value of a predictor as input; its output is identical to its input

- no. of nodes in input layer = no. of predictors

e.g., for record #1:

Fat input = output = x1 = 0.2

Salt input = output = x2 = 0.9

• Hidden layer node:

- receives its inputs from the input layer nodes

- its output is a function of the weighted sum of its inputs

- that is, the output from node $j$ is

$$\text{output}_j = g\left(\theta_j + \sum_{i=1}^{p} w_{ij} x_i\right)$$




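• $w_{ij}$ and $\theta_j$ are weights that are set at random initially and are updated as the network "learns"

• $w_{ij}$ = "weights" (like coefficients, subject to iterative adjustment)

• $\theta_j$ = "bias", a constant that controls the contribution level of node $j$

• $g(\cdot)$: "transfer function"; the logistic (sigmoidal) function is a popular choice

Note: for the sigmoidal function,

$$g(x) = \frac{1}{1 + e^{-x}}$$

$$\text{output}_j = g\left(\theta_j + \sum_{i=1}^{p} w_{ij} x_i\right) = \frac{1}{1 + e^{-\left(\theta_j + \sum_{i=1}^{p} w_{ij} x_i\right)}}$$

$$g'(x) = g(x)\,(1 - g(x))$$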




Initializing the weights:

• Initially, $\theta_j$ and $w_{ij}$ are typically set to random values in the range -0.05 to +0.05 → represents "no knowledge"

• These initial weights are used in the first round of training

Example of weights:




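Computing output of nodes

$$\text{output}_j = g\left(\theta_j + \sum_{i=1}^{p} w_{ij} x_i\right) = \frac{1}{1 + e^{-\left(\theta_j + \sum_{i=1}^{p} w_{ij} x_i\right)}}$$

$$\text{output}_3 = \frac{1}{1 + e^{-[-0.3 + (0.05)(0.2) + (0.01)(0.9)]}} = 0.43$$

$$\text{output}_6 = \frac{1}{1 + e^{-[-0.015 + (0.01)(0.43) + (0.05)(0.507) + (0.015)(0.511)]}} = 0.506$$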
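The outputs of nodes 3 and 6 can be reproduced numerically. A minimal Python sketch; the helper names `logistic` and `node_output` are illustrative, and the hidden-node outputs 0.43, 0.507 and 0.511 fed into node 6 are the values quoted in the slide rather than recomputed here.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def node_output(bias, weights, inputs):
    """output_j = g(theta_j + sum_i w_ij * x_i) with a logistic g."""
    return logistic(bias + sum(w * x for w, x in zip(weights, inputs)))

# Record #1: fat = 0.2, salt = 0.9
out3 = node_output(-0.3, [0.05, 0.01], [0.2, 0.9])
# Node 6 takes the three hidden-node outputs as its inputs
out6 = node_output(-0.015, [0.01, 0.05, 0.015], [0.43, 0.507, 0.511])

print(round(out3, 2), round(out6, 3))  # 0.43 0.506
```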

• With more hidden layers, the computation is carried out in the same way

• Output layer node: the output value determines the class (e.g., with a cutoff value of 0.5, output = 0.506, so the record is assigned to class 1)




Relation to linear and logistic regression:

To make it simpler, assume no hidden layer and one output node.

Then the output is $g\left(\theta + \sum_i w_i x_i\right)$.

If $g(s) = s$ (identity function) → multiple linear regression:

$$\hat{y} = \theta + \sum_i w_i x_i$$

If $g$ is the logistic function → logistic regression:

$$\hat{P}(Y = 1) = \frac{1}{1 + e^{-\left(\theta + \sum_i w_i x_i\right)}}$$
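The two special cases differ only in the transfer function. A minimal Python sketch of a network with no hidden layer; the coefficients and inputs are hypothetical values chosen for illustration, and the function names are not from the slides.

```python
import math

def single_node_network(x, w, theta, g):
    """Network with no hidden layer: output = g(theta + sum_i w_i * x_i)."""
    return g(theta + sum(wi * xi for wi, xi in zip(w, x)))

def identity(s):
    return s

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

# Hypothetical coefficients and inputs, for illustration only
w, theta, x = [0.4, -0.2], 0.1, [1.0, 2.0]

y_hat = single_node_network(x, w, theta, identity)   # linear regression form
p_hat = single_node_network(x, w, theta, logistic)   # logistic regression form
print(y_hat, p_hat)
```

The difference from fitting a regression lies in how the coefficients are estimated (iterative weight updates rather than least squares or maximum likelihood), not in the functional form.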




Preprocessing the data

• NNs perform best when the values of the predictors and the response variable are in [0,1]

• Therefore all variables need to be rescaled to [0,1] as follows. For a numerical variable $X$ with range $a < X < b$: $X \leftarrow \frac{X - a}{b - a}$

• Binary variable: OK

• Ordinal categorical variable with $m$ categories: map them to $0, \frac{1}{m-1}, \ldots, \frac{m-2}{m-1}, 1$

• Nominal categorical variable with $m$ categories: transform to $m - 1$ dummy variables
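The three rescaling rules can be sketched in a few lines of Python. The function names, the example values, and the choice of the first category as the dummy baseline are assumptions made for illustration.

```python
def rescale(x, a, b):
    """Map a numerical value from the range [a, b] to [0, 1]."""
    return (x - a) / (b - a)

def ordinal_to_unit(rank, m):
    """Map the rank-th of m ordered categories (rank = 0..m-1) to [0, 1]."""
    return rank / (m - 1)

def dummies(value, categories):
    """Encode a nominal value as m - 1 dummy variables (first category = baseline)."""
    return [1 if value == c else 0 for c in categories[1:]]

print(rescale(30, a=20, b=60))                     # 0.25
print([ordinal_to_unit(r, 4) for r in range(4)])   # 0, 1/3, 2/3, 1
print(dummies("red", ["blue", "green", "red"]))    # [0, 1]
```

In practice a and b should come from the training data, so that validation and new records are scaled with the same constants.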



Training the model

• Estimate $\theta_j$ and $w_{ij}$ so as to obtain the best predictive result

• Compute the output for every observation and compare it with the actual response value

• This error is distributed back to the hidden layers, and each weight is updated iteratively

→ Back propagation of error



Back propagation of error

(1) Updating the weights at output node $k$

Let $\hat{y}_k \equiv$ output from node $k$ $= g(\text{input to node } k)$, and

$$\text{err}_k = \hat{y}_k (1 - \hat{y}_k)(y_k - \hat{y}_k),$$

where $y_k$ is the actual value.

Recall that for the sigmoidal function we have

$$g(x) = \frac{1}{1 + e^{-x}}, \qquad g'(x) = g(x)\,(1 - g(x))$$

Therefore,

$$g'(\text{input to node } k) = \hat{y}_k (1 - \hat{y}_k)$$



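$$\text{err}_k = \hat{y}_k (1 - \hat{y}_k)(y_k - \hat{y}_k)$$

• $(y_k - \hat{y}_k)$: the difference between the actual and predicted values (the error)

• $\hat{y}_k (1 - \hat{y}_k) = g'(\text{input to node } k)$: the direction of the error

Remember this is the error at output node $k$.

Ex. Error associated with the output node (node 6) for the 1st obs.:

$\text{err}_6 = (0.506)(1 - 0.506)(1 - 0.506) = 0.123$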
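The output-node error for the first observation can be checked in Python. A minimal sketch; `output_error` is an illustrative name, and the actual response is y = 1 for record #1.

```python
def output_error(y_hat, y):
    """err_k = y_hat * (1 - y_hat) * (y - y_hat) for a logistic output node."""
    return y_hat * (1.0 - y_hat) * (y - y_hat)

err6 = output_error(y_hat=0.506, y=1)
print(round(err6, 3))  # 0.123
```

The factor y_hat * (1 - y_hat) is largest at y_hat = 0.5, so updates are strongest when the output node is least decided.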




Updating the weights at output node $k$:

$$\theta_k^{new} = \theta_k^{old} + \ell \cdot \text{err}_k$$

$$w_{jk}^{new} = w_{jk}^{old} + \ell \cdot \text{err}_k \cdot \hat{x}_j$$

$\ell$ = learning rate, or weight decay ($0 < \ell < 1$); $\ell$ controls the amount of change

$\hat{x}_j$ = output from hidden node $j$

Ex. Since $\text{err}_6 = 0.123$, if we let $\ell = 0.5$:

$\theta_6 = -0.015 + (0.5)(0.123) = 0.047$

$w_{3,6} = 0.01 + (0.5)(0.123)(0.43) = 0.036$

$w_{4,6} = 0.05 + (0.5)(0.123)(0.51) = 0.081$

$w_{5,6} = 0.015 + (0.5)(0.123)(0.51) = 0.046$
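The update arithmetic for the output node can be carried out in Python. A minimal sketch using the rule new = old + ℓ · err · (input to the weight); the helper name `update` is illustrative, and the hidden-node outputs 0.43, 0.51, 0.51 are the rounded values used in the example.

```python
def update(old, lr, err, x_hat=1.0):
    """new = old + lr * err * x_hat (x_hat = 1 for the bias term)."""
    return old + lr * err * x_hat

lr, err6 = 0.5, 0.123

theta6 = update(-0.015, lr, err6)
w36 = update(0.01, lr, err6, x_hat=0.43)
w46 = update(0.05, lr, err6, x_hat=0.51)
w56 = update(0.015, lr, err6, x_hat=0.51)

print(f"{theta6:.4f} {w36:.4f} {w46:.4f} {w56:.4f}")
```

Each weight moves in proportion to the output of the node it comes from, so connections from more active hidden nodes receive larger corrections.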




(2) Updating the weights at hidden node $j$

Let $\hat{y}_j \equiv$ output from node $j$.

Then the error associated with hidden node $j$ is

$$\text{err}_j = \hat{y}_j (1 - \hat{y}_j) \sum_{k \in \text{output}} \text{err}_k \cdot w_{jk}$$

The error at each output node $k$ is distributed to hidden node $j$ according to the weight $w_{jk}$.

Remember this is the error at hidden node $j$.



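Updating the weights at hidden node $j$:

$$\theta_j^{new} = \theta_j^{old} + \ell \cdot \text{err}_j$$

$$w_{ij}^{new} = w_{ij}^{old} + \ell \cdot \text{err}_j \cdot \hat{x}_i$$

$\hat{x}_i$ = output from input node $i$

Ex.

$\text{err}_3 = \hat{y}_3 (1 - \hat{y}_3) \cdot \text{err}_6 \cdot w_{36} = 0.123 \times 0.036 \times 0.43 \times (1 - 0.43)$

$\text{err}_4 = \hat{y}_4 (1 - \hat{y}_4) \cdot \text{err}_6 \cdot w_{46} = 0.123 \times 0.081 \times 0.51 \times (1 - 0.51)$

$\text{err}_5 =$

• $\theta_3 = -0.3 + (0.5)(\text{err}_3) =$
• $\theta_4 = 0.2 + (0.5)(\text{err}_4) =$
• $\theta_5 =$
• $w_{1,3} = 0.05 + (0.5)(\text{err}_3)(0.2) =$
• $w_{2,4} = 0.03 + (0.5)(\text{err}_4)(0.9) =$
• $\cdots$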
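The hidden-node errors and a few of the resulting updates can be evaluated in Python. A minimal sketch that follows the example's numbers, including its use of the already-updated weights 0.036 and 0.081 (many implementations use the pre-update weights instead); the function name `hidden_error` and the variable names are illustrative.

```python
def hidden_error(y_hat_j, downstream):
    """err_j = y_hat_j * (1 - y_hat_j) * sum_k err_k * w_jk.

    downstream: list of (err_k, w_jk) pairs, one per output node k.
    """
    return y_hat_j * (1.0 - y_hat_j) * sum(e * w for e, w in downstream)

lr, err6 = 0.5, 0.123

# Weights 0.036 and 0.081 are the updated w36 and w46 used in the example
err3 = hidden_error(0.43, [(err6, 0.036)])
err4 = hidden_error(0.51, [(err6, 0.081)])

theta3_new = -0.3 + lr * err3        # bias of hidden node 3
w13_new = 0.05 + lr * err3 * 0.2     # from input node 1 (fat = 0.2)
w24_new = 0.03 + lr * err4 * 0.9     # from input node 2 (salt = 0.9)

print(err3, err4, theta3_new, w13_new, w24_new)
```

The hidden-node errors are roughly two orders of magnitude smaller than err6, so one record moves the input-to-hidden weights only slightly.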




Case updating (or pattern updating):

The weights are updated after each observation is run through the network (one trial)

→ update until all obs. are used.

→ more accurate but requires a longer run time

• Completion of all records through the network is one epoch (also called sweep or iteration)

• After one epoch is completed, return to the first record and repeat the process



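Batch updating:

The entire training data is run through the network before the weights are updated

In this case, $\text{err}_k$ is the sum of the errors over all observations

When to stop updating the weights:

(1) when the newly updated weights differ little from the previous ones,

(2) when the misclassification rate reaches a threshold,

(3) when the number of runs reaches a preset limit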
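Case updating, batch updating and the three stopping rules can be put together in one small training loop. The following is a sketch under stated assumptions, not the implementation used for the slides: the six records are hypothetical, the flat weight layout and all function names are illustrative, the tolerance and limits are arbitrary, and the hidden-node error uses the output weights from before the update.

```python
import math
import random

def g(x):
    """Logistic transfer function."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, x, n_hidden):
    """Return the hidden outputs and the output-node value for one record."""
    p = len(x)
    h = []
    for j in range(n_hidden):
        base = j * (p + 1)                     # [theta_j, w_1j, ..., w_pj]
        h.append(g(w[base] + sum(w[base + 1 + i] * x[i] for i in range(p))))
    out = n_hidden * (p + 1)                   # [theta_out, w_1,out, ...]
    return h, g(w[out] + sum(w[out + 1 + j] * h[j] for j in range(n_hidden)))

def weight_changes(w, x, y, n_hidden, lr):
    """Back propagation of error for one record: the change for every weight."""
    p, out = len(x), n_hidden * (len(x) + 1)
    h, y_hat = forward(w, x, n_hidden)
    err_out = y_hat * (1 - y_hat) * (y - y_hat)
    d = [0.0] * len(w)
    d[out] = lr * err_out
    for j in range(n_hidden):
        d[out + 1 + j] = lr * err_out * h[j]
        err_j = h[j] * (1 - h[j]) * err_out * w[out + 1 + j]  # pre-update w_jk
        base = j * (p + 1)
        d[base] = lr * err_j
        for i in range(p):
            d[base + 1 + i] = lr * err_j * x[i]
    return d

def train(records, n_hidden=3, lr=0.5, batch=False,
          max_epochs=200, tol=1e-5, max_error_rate=0.0, seed=0):
    rng = random.Random(seed)
    p = len(records[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n_hidden * (p + 1) + n_hidden + 1)]
    for epoch in range(1, max_epochs + 1):
        old, total = list(w), [0.0] * len(w)
        for x, y in records:
            d = weight_changes(w, x, y, n_hidden, lr)
            if batch:   # accumulate the changes, apply once per epoch
                total = [t + di for t, di in zip(total, d)]
            else:       # case updating: apply after every record
                w = [wi + di for wi, di in zip(w, d)]
        if batch:
            w = [wi + ti for wi, ti in zip(w, total)]
        if max(abs(a - b) for a, b in zip(w, old)) < tol:
            return w, epoch, "weights stopped changing"             # rule (1)
        wrong = sum((forward(w, x, n_hidden)[1] >= 0.5) != (y == 1) for x, y in records)
        if wrong / len(records) <= max_error_rate:
            return w, epoch, "misclassification threshold reached"  # rule (2)
    return w, max_epochs, "epoch limit reached"                     # rule (3)

# Hypothetical (fat, salt) -> acceptance records
records = [((0.2, 0.9), 1), ((0.1, 0.1), 0), ((0.2, 0.4), 0),
           ((0.2, 0.5), 0), ((0.4, 0.5), 1), ((0.3, 0.8), 1)]

for mode in (False, True):
    w, epochs, reason = train(records, batch=mode)
    print("batch" if mode else "case", epochs, reason)
```

The only difference between the two modes is where the accumulated changes are applied: inside the loop over records (case updating) or once after it (batch updating).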

Typo: all of these values are 0

Fat/Salt Example: Final Weights




Example 2: Classifying accident severity

999 records, 600 training records, 4 predictors, 3 classes (no injury, injury, fatality)

• Input layer with 4 nodes (= 4 predictors), output layer with 3 neurons (= 3 classes)

• A single hidden layer, with the number of nodes increased from 4 to 8 while examining the confusion matrices

• 5 nodes gave a good balance between performance on the training and validation sets

• Networks with more than five nodes performed about as well as the one with five nodes

• 4×5 = 20 connections between the input and hidden layers, and 5×3 = 15 between the hidden and output layers



• For every record, carry out the following steps (= one iteration)

√ present the record's predictors to the input layer

√ compute the outputs of the hidden nodes and output nodes using the weights

√ compute the error from the output values of the output nodes

√ adjust the weights of all connections using back propagation

• 600 iterations = one epoch; training for 30 epochs → 18,000 iterations
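The counts for Example 2 follow from the layer sizes. A short Python check; the variable names are illustrative, and bias terms are left out of the connection count, as in the slide.

```python
n_inputs, n_hidden, n_outputs = 4, 5, 3

input_hidden = n_inputs * n_hidden      # connections, input -> hidden
hidden_output = n_hidden * n_outputs    # connections, hidden -> output

n_training, n_epochs = 600, 30
iterations = n_training * n_epochs      # one iteration per record per epoch

print(input_hidden, hidden_output, iterations)  # 20 15 18000
```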




Avoiding overfitting

• A weakness of NNs is overfitting → the error rate is high on the validation data and on new data

• Control the number of training epochs so that the network is not overtrained

• Check performance on the validation data periodically during training

• The point at which the validation error rate stops decreasing and starts to increase again gives the best number of epochs

• To avoid overfitting:

Track error in validation data

Limit iterations

Limit complexity of network
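Picking the best number of epochs from tracked validation errors can be sketched as follows. The error rates are hypothetical numbers made up for illustration, and `best_epoch` is an illustrative name.

```python
def best_epoch(validation_errors):
    """Return the 1-based epoch with the lowest validation error (first on ties)."""
    best = min(range(len(validation_errors)), key=lambda e: validation_errors[e])
    return best + 1

# Hypothetical validation error rates recorded after each epoch
val_err = [0.40, 0.31, 0.25, 0.22, 0.24, 0.27, 0.30]

print(best_epoch(val_err))  # 4
```

Here the validation error falls until epoch 4 and rises afterwards, so training beyond epoch 4 would only fit the training data more closely.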



11.4 Required user input

① Number of hidden layers and nodes in each layer:

• Many algorithms exist (methods that grow or shrink the number of nodes during training) and more are being developed

• Using past experience or trial-and-error runs on different structures is currently the best approach

• Guidelines for choosing the structure:

- Number of hidden layers → "one" layer is the most popular (and usually enough)

- Size of hidden layer → too few nodes leads to underfitting (too simple), too many to overfitting → start with $p$ (the number of predictors) and decrease or increase while checking for overfitting (rule of thumb)

- Number of output nodes → one if binary, $m$ if $m$ classes, one if numerical response


② Choice of predictors:

- choose carefully, using domain knowledge, variable selection, and dimension reduction techniques before using the NN

③ Learning rate (a.k.a. weight decay):

- helps avoid overfitting (by downweighting new information)

- reduces the influence of outliers and helps avoid getting stuck in local optima

- many different suggestions exist for the starting value of ℓ

e.g., start with a large value and slowly decrease it as the iterations proceed,

ℓ = 1/(current number of iterations), XLMiner default is ℓ = 0
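The decreasing schedule ℓ = 1/(current number of iterations) is simple to write down. A minimal Python sketch; `learning_rate` is an illustrative name, and iterations are assumed to be counted from 1.

```python
def learning_rate(iteration):
    """Decaying schedule: l = 1 / (current number of iterations), iteration >= 1."""
    return 1.0 / iteration

print([learning_rate(t) for t in (1, 2, 4, 10)])  # [1.0, 0.5, 0.25, 0.1]
```

Early records can move the weights a lot, while later records (including late outliers) have progressively less influence.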



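Additional remarks

• Setting the parameters

– There is no universal rule that applies in general.

– They have to be set through experience and experimentation.

– Because neural network performance is not extremely sensitive to the parameters, they can be set with a moderate amount of experimentation and experience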


11.6 Advantages and weaknesses of neural networks

• Advantages: NNs are known for their good predictive performance.

• Disadvantages: a NN is a "black box"; its output reveals no pattern in the data, and it is criticized for this

• Cautions:

- other methods (PCA, etc.) have to be used to select the predictors

- there must be enough data for training

- the weights may not converge to the values that give the best fit (local optima)

- computation takes more time than with other methods


Ch. 11 Problems

• 11.1 (Credit Card Use)
– Train a simple NN

• 11.4 (direct mailing)
