
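Lecture 7: Machines That Learn

<Introduction to Artificial Intelligence> Lecture Notes

Byoung-Tak Zhang
Department of Computer Science and Engineering & Interdisciplinary Programs in Cognitive Science and Brain Science, Seoul National University
http://bi.snu.ac.kr/~btzhang/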

Version: 20180423

Contents

Machine Learning
Types of Machine Learning
Learning System Design Procedure
Supervised Learning
Unsupervised Learning
Overfitting and Regularization
Reading Assignments

© 2018 Byoung-Tak Zhang, Seoul National University

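Machine Learning

• Learning system: "a system that automatically builds a model M from empirical data D, acquired through interaction with an environment E, and thereby improves its own performance P"
– Environment E
– Data D
– Model M
– Performance P

Source: 장교수의 딥러닝 [Professor Zhang's Deep Learning], 홍릉과학출판사, 2017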

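Machine Learning and Artificial Intelligence

• Artificial intelligence, machine learning, deep learning
– Artificial intelligence: research on building machines that think and act like humans
– Machine learning: a subfield of artificial intelligence research concerned with enabling machines to learn
– Deep learning: machine learning based on deep neural network architectures

[Figure: nested regions, Deep Learning inside Machine Learning inside Artificial Intelligence]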

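How Machine Learning Differs from Conventional Programming

• Conventional computer program
– A human designs the algorithm and writes the code
– Outputs an answer to the given problem (data)

• Machine learning program
– The machine designs the algorithm automatically (Automatic Programming)
– Outputs a program that gives answers to the given problem (data)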


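The Importance of Machine Learning

• Problems that call for machine learning
– No explicit problem-solving knowledge is available (no known algorithm)
– Problems that are hard to program (e.g., speech recognition)
– Problems that change continually (e.g., autonomous mobile robots)

• Why machine learning is becoming more important
– Availability of big data (the raw material for learning)
– Improved computing performance (makes demanding learning feasible)
– Direct connection to services (business impact)
– Business value creation (raising company value)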


Use Cases

• Machine learning is applied in a wide variety of fields


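History and Development Trends

[Figure: timeline from 1980 to 2015, marked every five years, with four rows]
– Data: MNIST, PASCAL, ImageNet
– IT Infra: spread of the PC; the Web, data mining, information retrieval, and e-commerce; smartphones; autonomous vehicles
– Model: neural network models, probabilistic and statistical models, deep learning models
– Algorithm: MLP, DT, SVM, PGM, CNN

MLP = Multilayer Perceptron, DT = Decision Tree, SVM = Support Vector Machine, PGM = Probabilistic Graphical Model, CNN = Convolutional Neural Network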

2012 (c) SNU Biointelligence Lab, http://bi.snu.ac.kr/

Types of Machine Learning

• Supervised Learning
– Estimate an unknown mapping from known input and target output pairs
– Learn $f_{\mathbf{w}}$ from training set $D = \{(\mathbf{x}, y)\}$ s.t. $f_{\mathbf{w}}(\mathbf{x}) = y = f(\mathbf{x})$
– Classification: y is discrete
– Regression: y is continuous

• Unsupervised Learning
– Only input values are provided
– Learn $f_{\mathbf{w}}$ from $D = \{(\mathbf{x})\}$ s.t. $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{x}$
– Density estimation and compression
– Clustering, dimension reduction

• Sequential (Reinforcement) Learning
– No targets, but rewards (critiques) are provided "sequentially"
– Learn a heuristic function $f_{\mathbf{w}}$ from $D_t = \{(s_t, a_t, r_t) \mid t = 1, 2, \ldots\}$ s.t. $f_{\mathbf{w}}(s_t, a_t, r_t)$
– With respect to the future, not just the past
– Sequential decision-making
– Action selection and policy learning

Zhang, B.-T., Next-Generation Machine Learning Technologies, Communications of KIISE, 25(3), 2007

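Machine Learning Models

| Model structure | Representation | Example machine learning models |
| Logic formulas | Propositional logic, predicate logic, Prolog programs | Version Space, Inductive Logic Programming (ILP) |
| Rules | If-Then rules, decision rules | AQ |
| Functions | Sigmoid, polynomial, kernel | Neural networks, RBF networks, SVM, kernel machines |
| Trees | Genetic programs, Lisp programs | Decision trees, genetic programming, neural trees |
| Graphs | Directed/undirected graphs, networks | Probabilistic graphical models, Bayesian networks, HMM |

| Learning method | Example learning problems |
| Supervised learning | Recognition, classification, diagnosis, prediction, regression analysis |
| Unsupervised learning | Clustering, density estimation, dimension reduction, feature extraction |
| Reinforcement learning | Trial and error, reward functions, dynamic programming |

Zhang, B.-T., Next-Generation Machine Learning Technologies, Communications of KIISE, 25(3), 2007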

Learning System Design Procedure

1. Problem Description
2. Choosing the Training Experience (Data)
3. Choosing the Target Function (Objective Function)
4. Choosing a Representation for the Target Function (Learning Architecture)
5. Choosing a Function Approximation Algorithm (Learning Algorithm)
6. Final Design

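Objective Functions (Evaluation)

• Squared error
• Classification error
• Margin
• Accuracy
• Precision and recall
• Likelihood
• Posterior probability
• Cost, utility, value
• Risk
• Entropy
• Cross entropy
• Information gain
• Mutual information
• KL divergence

$$\theta_{MAP} = \arg\max_{\theta} \frac{p(X_N \mid \theta)\, p(\theta)}{p(X_N)} = \arg\max_{\theta}\, p(X_N \mid \theta)\, p(\theta)$$

$$\underbrace{R_{pen}(\theta)}_{\text{penalized risk}} = \underbrace{R_{emp}(\theta)}_{\text{empirical risk}} + \underbrace{\alpha\, \Phi[f(x, \theta)]}_{\text{penalty}}$$

$$KL(p \,\|\, q) = \sum_{i=1}^{N} p(x_i) \log_2 \frac{p(x_i)}{q(x_i)}$$

$$E(D_N; \mathbf{w}) = \frac{1}{2} \sum_{d=1}^{N} \left(t_d - f(\mathbf{x}_d, \mathbf{w})\right)^2$$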
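A few of the objective functions listed above can be sketched in plain Python. This is only an illustration: the function names are chosen here, distributions are assumed to be lists of probabilities over the same outcomes, and logarithms are base 2 to match the KL formula.

```python
import math

def squared_error(targets, outputs):
    """E = 1/2 * sum_d (t_d - o_d)^2 over the training examples."""
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))

def entropy(p):
    """H(p) = -sum_i p_i log2 p_i, in bits (terms with p_i = 0 contribute 0)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log2 q_i, in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i log2 (p_i / q_i), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.25, 0.75]
    print(squared_error([1, 0], [0.5, 0.5]))
    print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```

The three information measures are linked by H(p, q) = H(p) + KL(p || q), so for a fixed target distribution p, minimizing cross entropy and minimizing KL divergence select the same q.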

Learning Architectures (Representation)

• Instances
• Rules
• Equations
• Functions
• Decision trees
• Neural networks
• Graphical models
• Lattice models
• Hierarchical models
• Complex networks
• Model ensembles

Learning Algorithms (Optimization)

• Parameter vs. Structure
• Deterministic vs. Stochastic
• Discriminative vs. Generative

• Error minimization
– E.g.: Least mean squares
• Convex optimization
– E.g.: Gradient descent
• Stochastic search
– E.g.: MCMC
• Combinatorial optimization
– E.g.: Genetic algorithms
• Constrained optimization
– E.g.: Linear programming

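Supervised Learning (Classification)

• The problem of finding the label y that corresponds to given data x
– ex1) x: an image of a person's face, y: the person's name
– ex2) x: blood sugar level, blood pressure, heart rate, y: whether the person has diabetes
– ex3) x: a person's voice, y: the sentence that was spoken

• x: n-dimensional vector, y: integer (discrete)

• Representative supervised learning (classification) algorithms
– Support Vector Machine
– Decision Tree
– K-Nearest Neighbor
– Multi-Layer Perceptron (Artificial Neural Network)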
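To make the "vector x in, discrete label y out" setting concrete, here is a minimal sketch of one of the listed algorithms, K-Nearest Neighbor. The function name `knn_predict`, the Euclidean distance, the majority vote, and the toy points are all assumptions made for illustration; they are not taken from the lecture.

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points.

    train: list of (vector, label) pairs; x: vector of the same dimension.
    """
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    # Made-up 2-D toy data: label 0 near the origin, label 1 near (5, 5).
    train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
             ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]
    print(knn_predict(train, (0.2, 0.3)))
    print(knn_predict(train, (5.5, 5.2)))
```

K-Nearest Neighbor has no training phase: the "model" is the stored training set, which corresponds to the "Instances" entry among the learning architectures.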

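Perceptron

[Figure: a unit with inputs x1 and x2 weighted by w1 and w2, plus a constant input 1 weighted by the bias b; the (x1, x2) plane is split in two by the decision line]

Decision boundary: w1*x1 + w2*x2 + b = 0
– w1*x1 + w2*x2 + b > 0: one class
– w1*x1 + w2*x2 + b < 0: the other class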

Parameter Learning in Perceptron

P and N denote the sets of positive and negative training vectors.

start: The weight vector w is generated randomly.
test: A vector x ∈ P ∪ N is selected randomly.
– If x ∈ P and w·x > 0, go to test.
– If x ∈ P and w·x ≤ 0, go to add.
– If x ∈ N and w·x < 0, go to test.
– If x ∈ N and w·x ≥ 0, go to subtract.
add: Set w = w + x, go to test.
subtract: Set w = w − x, go to test.
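The start/test/add/subtract procedure above can be sketched as follows. Two things are assumptions added here, not part of the slide: a stopping check (the procedure as written loops forever) and a `max_steps` guard. Input vectors are assumed to carry a leading constant 1 so that the bias is learned as an ordinary weight.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_perceptron(P, N, max_steps=10_000, seed=0):
    """Return w with w.x > 0 for all x in P and w.x < 0 for all x in N."""
    rng = random.Random(seed)
    examples = [(x, +1) for x in P] + [(x, -1) for x in N]
    dim = len(examples[0][0])
    w = [rng.uniform(-1, 1) for _ in range(dim)]          # start
    for _ in range(max_steps):
        # Added stopping check: quit once every vector is on the correct side.
        if all(dot(w, x) > 0 for x in P) and all(dot(w, x) < 0 for x in N):
            return w
        x, label = rng.choice(examples)                    # test
        s = dot(w, x)
        if label == +1 and s <= 0:
            w = [wi + xi for wi, xi in zip(w, x)]          # add
        elif label == -1 and s >= 0:
            w = [wi - xi for wi, xi in zip(w, x)]          # subtract
    raise RuntimeError("no separating weight vector found within max_steps")

if __name__ == "__main__":
    # Logical AND with a leading bias input of 1.
    P = [(1, 1, 1)]
    N = [(1, 0, 0), (1, 0, 1), (1, 1, 0)]
    print(train_perceptron(P, N))
```

For data that is not linearly separable, the loop never satisfies the stopping check, which is why the sketch raises an error after `max_steps` instead of running indefinitely.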

Perceptron Learning (worked example)

[Figure sequence: a threshold unit Σ with inputs x0 = 1 (bias), x1, and x2 learning the function below with learning rate η = .01]

x1 x2 y
0  0  0
0  1  0
1  0  0
1  1  1

Initial weights: w0 = -.06, w1 = -.1, w2 = .05

1. Input (x1, x2) = (0, 0): Σ = -.06, output 0. RIGHT
2. Input (0, 1): Σ = -.06 + .05 = -.01, output 0. RIGHT
3. Input (1, 0): Σ = -.06 - .1 = -.16, output 0. RIGHT
4. Input (1, 1): Σ = -.06 - .1 + .05 = -.11, output 0. WRONG
   Fails to fire, so add a proportion η of each input to its weight: w0 = -.06 + .01×1 = -.05, w1 = -.1 + .01×1 = -.09, w2 = .05 + .01×1 = .06
5. Input (0, 1): Σ = -.05 + .06 = .01, output 1, but the target is 0. Decrease!
   w0 = -.05 - .01×1 = -.06, w1 = -.09 (unchanged, since x1 = 0), w2 = .06 - .01×1 = .05
6. Input (1, 1): Σ is negative again, so the unit fails to fire and η×1 is added to every weight once more.

© 2017, SNU BioIntelligence Lab, http://bi.snu.ac.kr/
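The walkthrough can be replayed in code. The sketch below assumes that the unit outputs 1 when the weighted sum is strictly positive and 0 otherwise, that the four patterns are presented repeatedly in truth-table order, and that each error triggers the update w_i ← w_i + η (t − o) x_i. Exact fractions are used so the hundredths do not pick up floating-point rounding. The function names are hypothetical.

```python
from fractions import Fraction

# Truth table of the target function: ((x1, x2), y)
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def fire(w, x):
    """Threshold unit: 1 if w0*1 + w1*x1 + w2*x2 > 0, else 0."""
    total = w[0] + w[1] * x[0] + w[2] * x[1]
    return 1 if total > 0 else 0

def train_and(w, eta, max_epochs=100):
    """Cycle through the patterns until one full pass has no errors.

    Returns (final weights, number of passes, weights after each pass).
    """
    w = list(w)
    history = []
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, target in AND_DATA:
            delta = target - fire(w, x)        # +1: failed to fire, -1: fired wrongly
            if delta != 0:
                errors += 1
                inputs = (1, x[0], x[1])       # x0 = 1 is the bias input
                w = [wi + eta * delta * xi for wi, xi in zip(w, inputs)]
        history.append(tuple(w))
        if errors == 0:
            return w, epoch, history
    raise RuntimeError("did not converge within max_epochs")

if __name__ == "__main__":
    start = [Fraction(-6, 100), Fraction(-10, 100), Fraction(5, 100)]
    w, epochs, history = train_and(start, Fraction(1, 100))
    print([float(wi) for wi in w], epochs)
```

Under these assumptions the first pass ends with the weights (-.05, -.09, .06) reached in step 4 of the walkthrough, and training stops once a full pass classifies all four patterns correctly.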

Supervised Learning Algorithms

• Perceptron = a linear threshold unit (LTU)
– Note: Linear perceptron = linear unit (see below)

• Input: a vector of real values
• Output: 1 or -1 (binary)
• Activation function: threshold function

Linearly Separable vs. Linearly Nonseparable

(a) Decision surface for a linearly separable set of examples (correctly classified by a straight line)

(b) A set of training examples that is not linearly separable.


Delta Rule: Least Mean Square (LMS) Error

• Linear unit (linear perceptron)

$$o_d = \sum_i w_i x_{id}$$

• Note: output value o is a real value (not binary)

• Delta rule: learning rule for an unthresholded perceptron (i.e. linear unit).
– Delta rule is a gradient-descent rule.

Perceptron Training Rule

• Note: output value o is +1 or -1 (not a real value)

• Perceptron rule: a learning rule for a threshold unit.
• Conditions for convergence
– Training examples are linearly separable.
– Learning rate is sufficiently small.

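Gradient Descent Method

Delta Rule for Error Minimization

$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}$$

$$o_d = \sum_i w_i x_{id} = \mathbf{w} \cdot \mathbf{x}_d$$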
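As a sketch of the delta rule, the batch update Δw_i = η Σ_d (t_d − o_d) x_id can be applied repeatedly to a linear unit. The function name, the zero initialization, the fixed number of passes, and the toy data are assumptions made for illustration. The update is the negative gradient of the squared error E = 1/2 Σ_d (t_d − o_d)² scaled by η.

```python
def train_linear_unit(data, eta=0.05, epochs=2000):
    """Batch delta rule for a linear unit o_d = w . x_d.

    data: list of (x, t) pairs, where x is a tuple of inputs and t the target.
    """
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        delta_w = [0.0] * n
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))     # unthresholded output
            for i in range(n):
                delta_w[i] += eta * (t - o) * x[i]       # accumulate over d in D
        w = [wi + dwi for wi, dwi in zip(w, delta_w)]    # w_i <- w_i + delta w_i
    return w

if __name__ == "__main__":
    # Toy targets generated by t = 1 + 2*x; each input has a leading bias term 1.
    data = [((1, 0), 1), ((1, 1), 3), ((1, 2), 5), ((1, 3), 7)]
    print(train_linear_unit(data))
```

The learning rate matters: if η is too large for the data, the updates overshoot and the error grows instead of shrinking, which is the practical reason the rate has to be kept sufficiently small.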

Perceptron Learning Algorithm

Unsupervised Learning

(c) 2010-2012 SNU Biointelligence Laboratory, http://bi.snu.ac.kr/

K-Means

• Given a data point v and a set of points X, define the distance from v to X, d(v, X), as the (Euclidean) distance from v to the closest point in X.

• Given a set of n data points V = {v1, …, vn} and a set of k points X, define the Squared Error Distortion

$$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$$

• Input: A set V consisting of n points and a parameter k
• Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V, X) over all possible choices of X

• K-means clustering algorithm
1) Pick a number (k) of cluster centers
2) Assign every data point (e.g., gene) to its nearest cluster center
3) Move each cluster center to the mean of its assigned data points (e.g., genes)
4) Repeat 2-3 until convergence
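Steps 1) to 4) and the squared error distortion can be sketched as below. The initial centers are passed in by the caller, and "convergence" is taken to mean that the centers stop moving; both are assumptions, since the slide does not say how the centers are picked or when to stop. A center that loses all its points is left where it is.

```python
import math

def distortion(points, centers):
    """Squared error distortion: mean squared distance to the closest center."""
    return sum(min(math.dist(v, c) for c in centers) ** 2 for v in points) / len(points)

def k_means(points, centers, max_iter=100):
    """Lloyd-style iteration of steps 2) and 3) from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        clusters = [[] for _ in centers]
        for v in points:
            j = min(range(len(centers)), key=lambda idx: math.dist(v, centers[idx]))
            clusters[j].append(v)
        # Step 3: move each center to the mean of its assigned points.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centers == centers:      # Step 4: stop when nothing moves
            break
        centers = new_centers
    return centers

if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    start = [(0, 0), (5, 5)]
    final = k_means(points, start)
    print(final, distortion(points, start), distortion(points, final))
```

Neither step increases the distortion, so the iteration settles, but only at a local optimum: a different choice of initial centers can lead to a different clustering.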


[Figure: K-means on two-dimensional data, both axes from 0 to 5 (horizontal axis: expression in condition 1), showing data points v and cluster centers x1, x2, x3 at Iteration 0, Iteration 1, Iteration 2, and Iteration 3]

Example: 4-cluster data and 4 iterations

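Overfitting and Regularization

• Model complexity
– When the model complexity is high relative to the complexity of the data or the problem, accuracy on the training data is excellent
– but this causes overfitting, which degrades generalization performance

© 2017, 장교수의 딥러닝 [Professor Zhang's Deep Learning]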

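• One systematic approach to the model selection problem, that is, the problem of finding a learning model with optimal complexity, is to use regularization theory

• Structural Risk Minimization (SRM)
– Modify the objective function

$$R_{reg}[f, \gamma] = R_{train}[f] + \gamma \|\mathbf{w}\|^2$$

– As model complexity increases, the penalized error increases
– An optimal model can be found that accounts for both complexity and training error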
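The effect of the penalty γ‖w‖² can be seen in the smallest possible case. The sketch below assumes a one-parameter model f(x) = w·x and takes the training risk to be the sum of squared errors; the function names and the toy data are made up for illustration. Setting the derivative of the regularized risk to zero gives the closed form w = Σ x t / (Σ x² + γ).

```python
def training_risk(w, data):
    """Sum of squared errors of f(x) = w * x on the training data."""
    return sum((t - w * x) ** 2 for x, t in data)

def regularized_risk(w, data, gamma):
    """R_reg = R_train + gamma * w^2 for the one-parameter model."""
    return training_risk(w, data) + gamma * w ** 2

def regularized_weight(data, gamma):
    """Minimizer of regularized_risk: w = sum(x*t) / (sum(x*x) + gamma)."""
    sxt = sum(x * t for x, t in data)
    sxx = sum(x * x for x, _ in data)
    return sxt / (sxx + gamma)

if __name__ == "__main__":
    data = [(1, 2.1), (2, 3.9), (3, 6.2)]   # made-up points, roughly t = 2x
    for gamma in (0, 1, 10):
        w = regularized_weight(data, gamma)
        print(gamma, w, training_risk(w, data))
```

A larger γ pulls the weight toward zero and raises the training error, which is the trade-off between complexity and training error that the objective balances.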


Reading (Watching) Assignments

• Learning Machine Learning in 3 Months, Video Lecture, 2018.
• Q: What Internet video courses and information sources are available for studying machine learning? Survey and describe the related information presented in this video.
