Machine Learning, Day 1
Samsung Electronics, Samsung Advanced Institute of Technology (SAIT), Advanced Technology Training Institute
October 6-8, 2010
Byoung-Tak Zhang (장병탁)
Center for Bioinformation Technology (CBIT), Seoul National University
School of Electrical and Computer Engineering; joint appointments in the interdisciplinary programs in Cognitive Science, Brain Science, and Bioinformatics
http://bi.snu.ac.kr/
Course Overview
Day 1 (10/6): Supervised Learning (Neural Nets, Decision Trees, Support Vector Machines)
Day 2 (10/7): Unsupervised Learning (Self-Organizing Maps, Clustering Algorithms, Reinforcement Learning, Evolutionary Learning)
Day 3 (10/8): Probabilistic Graphical Models (Bayesian Networks, Markov Random Fields, Particle Filters)
Questions (Day 1)
What are the differences among supervised, unsupervised, and reinforcement learning?
Explain learning by error correction in a Neural Net (NN).
Explain entropy-based learning in a Decision Tree (DT).
What is the learning principle of the k-Nearest Neighbor (kNN) method?
What is the learning principle of the Support Vector Machine (SVM)?
Compare the strengths and weaknesses of NN, DT, kNN, and SVM. Which applications is each suited to?
Approaches to Artificial Intelligence
Symbolic AI Rule-Based Systems
Connectionist AI Neural Networks
Evolutionary AI Genetic Algorithms
Molecular AI: DNA Computing
Research Areas and Approaches

[Figure: Artificial Intelligence research organized by paradigm, research area, and application.]
Paradigms: Rationalism (Logical), Empiricism (Statistical), Connectionism (Neural), Evolutionary (Genetic), Biological (Molecular)
Research areas: Learning Algorithms, Inference Mechanisms, Knowledge Representation, Intelligent System Architecture
Applications: Intelligent Agents, Information Retrieval, Electronic Commerce, Data Mining, Bioinformatics, Natural Language Processing, Expert Systems
Day 1, Part 1: Concept of Machine Learning
Learning: Definition
Learning is the improvement of performance in some environment through the acquisition of knowledge resulting from experience in that environment.
[Figure: the definition broken into its elements: the improvement of behavior / on some performance task / through acquisition of knowledge / based on partial task experience.]
Neural Network (MLP)

[Figure: a multilayer perceptron with inputs x1, x2, x3 feeding an input layer, a hidden layer, and an output layer through weighted connections; each unit applies an activation (scaling) function. Information propagates forward; after the output is compared with the target, the error is backpropagated.]

$$E_d(\mathbf{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$$

$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$

$$o = f(\mathbf{x})$$
Application Example: Autonomous Land Vehicle (ALV)
An NN learns to steer an autonomous vehicle: 960 input units, 4 hidden units, 30 output units; driving at speeds up to 70 miles per hour.
[Figure: weight values for one of the hidden units; image from a forward-mounted camera.]
ALVINN System
DARPA Grand Challenge: Machine-Learning-Based Autonomous Driving
The Stanford team applied machine learning to autonomous driving, winning the 2005 Grand Challenge (a $2 million prize) and finishing second in the 2007 Urban Challenge.
Video. 2005 mission: drive 175 miles of desert autonomously within 10 hours.
2007 mission: drive 96 km in an urban environment autonomously within 6 hours.
[Sebastian Thrun, Stanley & Junior, Stanford Univ.]
DARPA Grand Challenge: Final Part 1
Autonomous Helicopter Flight
Using reinforcement learning, an RC helicopter was controlled autonomously and successfully performed a variety of high-difficulty maneuvers.
Autonomous control via reinforcement learning (RL)
An RC helicopter fitted with acceleration and velocity sensors
High-difficulty flight under autonomous control
(Reference: Andrew Ng, Stanford Univ.) Stanford Autonomous Helicopter - Airshow #2: http://www.youtube.com/watch?v=VCdxqn0fcnE
Home/Office Assistant Robots
http://www.youtube.com/watch?v=mgHUNfqIhAc&feature=related
PR2 Robot Plays Pool
PR2 Robot Cleans Up
http://www.youtube.com/watch?v=gYqfa-YtvW4&feature=related
PR2 Robot of Willow Garage
(Reference: Willow Garage)
Web Browsing on Mobile Devices
To let users zoom in on segments of a web page on a small phone screen, a Decision Tree was used to implement content-aware segmentation.
[Figure: uniform segmentation of the screen into a 3x3 grid (regions 1-9) makes the content hard to follow; Decision-Tree-based, content-aware segmentation keeps each sentence or image within a single region.]
(Reference: Baluja, WWW2006)
Drivatar: AI for a Racing Game
In Forza 2, a racing game for the MS XBOX 360, the player's driving patterns are modeled with machine learning and then imitated probabilistically, achieving human-level play.
Probability-based modeling of the driver's patterns: position on the road, driving lane, speed per course segment, brake/accelerator.
Every route is segmented, and the optimal racing line chosen by the gamer is learned (Imitation Approach).
The Future of Racing Games: http://www.youtube.com/watch?v=TaUyzlKKu-E
Because the model is probabilistic, it can generate unlimited driving variations at the same skill level.
Microsoft Research in Cambridge, UK
Driving patterns observed during game play
(Reference: Thore Graepel, MS Research Cambridge) Whole-audience Control of a Racing Game: http://www.youtube.com/watch?v=NS_L3Yyv2RI
Reading Thoughts by Analyzing Brain Activity Signals
Application example: a lie detector.
By analyzing brain activity signals with machine learning, a person's mental states (thoughts, percepts) can be inferred.
(Reference: Nature Reviews Neuroscience, 2006)
Machine Learning: Three Tasks

Supervised Learning: estimate an unknown mapping from known input-target output pairs. Learn $f_w$ from a training set $D = \{(x, y)\}$ such that $f_w(x) = y \approx f(x)$. Classification: y is discrete. Regression: y is continuous.

Unsupervised Learning: only input values are provided. Learn $f_w$ from $D = \{x\}$ such that $f_w(x) \approx x$. Compression; clustering.

Reinforcement Learning: not targets but rewards (critiques) are provided. Learn a heuristic function $h_w(x, a, c)$ from $D = \{(x, a, c)\}$, for action selection and policy learning.
A Catalog of Industrial Application Examples
Robotics: DARPA Urban Challenge, autonomous helicopter flight, home/office assistant robots
Mobile devices: web browsing, activity-recognition phones
Web services: spam filtering, Amazon's recommendation service, opinion polling from blogs and news
Computer vision: portrait search and matching, automatic analysis of astronomical images, automatic matching of archaeological artifacts
Bioinformatics: genome structure prediction, biological network analysis
Medicine: HIV vaccine design
Finance: credit scoring of loan customers
Architecture: search for optimal truss structures
Computer games: Drivatar, AI for a racing game
Neuroscience: reading mental states from brain signals; connectomics (mapping neuronal structure)
Social networks / crime prevention: the NYPD's real-time crime center
Machine Learning: Types and Models

Model structure | Representation | Example ML models
Logic formulas | propositional logic, predicate logic, Prolog programs | Version Space, Inductive Logic Programming (ILP)
Rules | If-Then rules, decision rules | AQ
Functions | sigmoid, polynomial, kernel | neural networks, RBF networks, SVM, kernel machines
Trees | genetic programs, Lisp programs | decision trees, genetic programming, neural trees
Graphs | directed/undirected graphs, networks | probabilistic graphical models, Bayesian networks, HMM

Learning method | Example learning problems
Supervised learning | recognition, classification, diagnosis, prediction, regression
Unsupervised learning | clustering, density estimation, dimensionality reduction, feature extraction
Reinforcement learning | trial and error, reward functions, dynamic programming
Machine Learning Techniques: Representative Algorithms
Symbolic Learning: Version Space Learning, Case-Based Learning
Neural Learning: Multilayer Perceptrons, Self-Organizing Maps, Support Vector Machines, Kernel Machines
Evolutionary Learning: Evolution Strategies, Evolutionary Programming, Genetic Algorithms, Genetic Programming, Molecular Programming
Probabilistic Learning: Bayesian Networks, Helmholtz Machines, Markov Random Fields, Hypernetworks, Latent Variable Models, Generative Topographic Mapping
Other Methods: Decision Trees, Reinforcement Learning, Boosting Algorithms, Mixture of Experts, Independent Component Analysis
Application Areas of Machine Learning

Application area | Example applications
Internet information retrieval | text mining, web-log analysis, spam filtering, document classification, filtering, extraction, summarization, recommendation
Computer vision | character recognition, pattern recognition, object recognition, face recognition, scene-change detection, image restoration
Speech recognition / language processing | speech recognition, word-sense disambiguation, translation word selection, grammar learning, dialogue pattern analysis
Mobile HCI | motion recognition, gesture recognition, recognition from mobile-device sensors, shake reduction
Bioinformatics | gene finding, protein classification, gene regulatory network analysis, DNA chip analysis, disease diagnosis
Biometrics | iris recognition, heart-rate measurement, blood-pressure measurement, blood-glucose measurement, fingerprint recognition
Computer graphics | data-driven animation, character motion control, inverse kinematics, behavior evolution, virtual reality
Robotics | obstacle recognition, object classification, map building, autonomous driving, path planning, motor control
Services | customer analysis, market cluster analysis, customer relationship management (CRM), marketing, product recommendation
Manufacturing | anomaly detection, energy consumption prediction, process analysis and planning, fault prediction and classification

[Byoung-Tak Zhang, 정보과학회지 (Communications of KIISE), special issue, March 2007]
Day 1, Part 2: Neural Network Learning
From Biological Neuron to Artificial Neuron
[Figure: a biological neuron with dendrite, cell body, and axon.]
From Biology to Artificial Neural Nets
Properties of Artificial Neural Networks
A network of artificial neurons.
Characteristics: nonlinear I/O mapping, adaptivity, generalization ability, fault tolerance (graceful degradation), biological analogy.
<Multilayer Perceptron Network>
Problems Appropriate for Neural Networks
Many training examples are available.
Outputs can be discrete or continuous-valued, or vectors of these.
The training examples may contain noise.
Long training times are tolerable.
Fast execution time is required.
It is not necessary to explain the prediction results.
Example Applications
NETtalk [Sejnowski]: inputs are English text, outputs are spoken phonemes.
Phoneme recognition [Waibel]: inputs are waveform features, outputs are b, c, d, ...
Robot control [Pomerleau]: inputs are perceived features, outputs are steering controls.
Application: Autonomous Land Vehicle (ALV)
An NN learns to steer an autonomous vehicle: 960 input units, 4 hidden units, 30 output units; driving at speeds up to 70 miles per hour.
[Figure: weight values for one of the hidden units; image from a forward-mounted camera.]
ALVINN System
Application: Data Restoration by a Hopfield Network
[Figure: original target data; corrupted input data; restored data after 10 iterations; after 20 iterations; fully restored data after 35 iterations.]
Perceptron and Gradient Descent Algorithm
(Simple) Perceptron: A Neural Net with a Single Neuron
Perceptron = a linear threshold unit (LTU). Note: a linear perceptron = a linear unit (see below).
Input: a vector of real values. Output: 1 or -1 (binary). Activation function: threshold function.
Linearly Separable vs. Linearly Nonseparable
(a) Decision surface for a linearly separable set of examples (correctly classified by a straight line).
(b) A set of training examples that is not linearly separable.
Perceptron Training Rule
Note: the output value o is +1 or -1 (not a real value).
The perceptron rule is a learning rule for a threshold unit. Conditions for convergence:
The training examples are linearly separable.
The learning rate is sufficiently small.
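To make the rule concrete, here is a minimal Python sketch of perceptron training. The update $w_i \leftarrow w_i + \eta\,(t - o)\,x_i$ is the standard perceptron rule; the AND-function data, function names, and hyperparameters are illustrative assumptions, not from the slides.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20):
    """Perceptron rule: w_i <- w_i + eta * (t - o) * x_i."""
    X = np.hstack([np.ones((len(X), 1)), X])      # prepend a bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if w @ x > 0 else -1            # linear threshold unit
            w += eta * (target - o) * x           # update only on mistakes
    return w

# Toy task: the AND function, with targets in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([-1, -1, -1, 1])
print(train_perceptron(X, t))
```

On this linearly separable toy set the weight vector stops changing once every example is classified correctly, which is exactly the convergence condition stated above.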
Delta Rule: Least Mean Square (LMS) Error
Linear unit (linear perceptron). Note: the output value o is a real value (not binary).
The delta rule is a learning rule for an unthresholded perceptron (i.e., a linear unit), and it is a gradient-descent rule.
Gradient Descent Method
Delta Rule for Error Minimization

$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}$$
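A similarly small sketch of the batch delta (LMS) rule, assuming a toy linear-regression dataset and illustrative names; unlike the perceptron rule, it updates an unthresholded linear unit using the summed gradient.

```python
import numpy as np

def delta_rule(X, t, eta=0.1, epochs=100):
    """Batch LMS: delta_w_i = eta * sum_{d in D} (t_d - o_d) * x_id."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                      # real-valued outputs (no threshold)
        w += eta * X.T @ (t - o)       # gradient descent on E = 1/2 sum (t - o)^2
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # first column is a bias input
t = np.array([1.0, 2.0, 3.0])                        # targets follow t = 1 + x
print(delta_rule(X, t))                              # approaches [1.0, 1.0]
```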
Perceptron Learning Algorithm
Multilayer Perceptron

[Figure: a multilayer perceptron with an input layer (x1, ..., xn), a hidden layer, and an output layer (y1, ..., ym); x is the input vector, y the output vector.]

Supervised learning; gradient search; noise immunity.
Learning algorithm: Backpropagation.
Multilayer Perceptrons

$$\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}, \qquad E_d(\mathbf{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$$
Multilayer Networks and their Decision Boundaries
Decision regions of a multilayer feedforward network. The network was trained to recognize 1 of 10 vowel sounds occurring in the context "h_d".
The network input consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound.
The 10 network outputs correspond to the 10 possible vowel sounds.
Differentiable Threshold Unit
Sigmoid function: nonlinear, differentiable.
Error function for BP: E is defined as a sum of the squared errors over all the output units k, for all the training examples d:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$$

The error surface can have multiple local minima. Convergence toward some local minimum is guaranteed, but there is no guarantee of reaching the global minimum.
Backpropagation Learning Algorithm for MLP
Original weight update rule for BP:

$$\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji}$$

Adding momentum:

$$\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji} + \alpha\, \Delta w_{ji}(n-1), \qquad 0 \le \alpha < 1$$

Momentum helps escape small local minima in the error surface and speeds up convergence.
Derivation of the BP Rule
Notations:
$x_{ij}$: the ith input to unit j
$w_{ij}$: the weight associated with the ith input to unit j
$net_j$: the weighted sum of inputs for unit j
$o_j$: the output computed by unit j
$t_j$: the target output for unit j
$\sigma$: the sigmoid function
outputs: the set of units in the final layer of the network
Downstream(j): the set of units whose immediate inputs include the output of unit j
Error measure: $E_d(\mathbf{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$

Gradient descent: $\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}$

Chain rule: $\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_{ji}$
Case 1: Rule for Output Unit Weights

Step 1: $net_j = \sum_i w_{ji} x_{ji}$, so $\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j}$

Step 2: $\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = -(t_j - o_j)$

Step 3: $\frac{\partial o_j}{\partial net_j} = \frac{\partial \sigma(net_j)}{\partial net_j} = o_j (1 - o_j)$

All together: $\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} = \eta\, (t_j - o_j)\, o_j (1 - o_j)\, x_{ji}$

Case 2: Rule for Hidden Unit Weights

Step 1:
$$\frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, o_j (1 - o_j)$$

Thus: $\Delta w_{ji} = \eta\, \delta_j\, x_{ji}$, where $\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k w_{kj}$
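Putting the two cases together, here is a compact Python sketch of batch backpropagation with the momentum term from the earlier slide; the XOR task, network size, random seed, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)        # XOR targets

W1 = rng.normal(0.0, 0.5, (2, 3)); b1 = np.zeros(3)    # input -> 3 hidden units
W2 = rng.normal(0.0, 0.5, (3, 1)); b2 = np.zeros(1)    # hidden -> 1 output unit
eta, alpha = 0.5, 0.5                                  # learning rate, momentum
dW1 = np.zeros_like(W1); dW2 = np.zeros_like(W2)

for epoch in range(10000):
    h = sigmoid(X @ W1 + b1)                     # forward pass
    o = sigmoid(h @ W2 + b2)
    d_out = (T - o) * o * (1 - o)                # Case 1: (t - o) o (1 - o)
    d_hid = h * (1 - h) * (d_out @ W2.T)         # Case 2: o(1-o) sum_k delta_k w_kj
    dW2 = eta * h.T @ d_out + alpha * dW2        # momentum update rule
    dW1 = eta * X.T @ d_hid + alpha * dW1
    W2 += dW2; W1 += dW1
    b2 += eta * d_out.sum(0); b1 += eta * d_hid.sum(0)

print(o.ravel().round(2))    # typically close to [0, 1, 1, 0]
```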
BP for MLP: Revisited
BP has the ability to discover useful intermediate representations at the hidden unit layers inside the network, which capture the properties of the input space that are most relevant to learning the target function.
When more layers of units are used in the network, more complex features can be invented.
But the representations of the hidden layers are very hard for humans to understand.
Hidden Layer Representations

Hidden Layer Representation for the Identity Function
[Figure: the evolving sum of squared errors for each of the eight output units as the number of training iterations (epochs) increases.]
[Figure: the evolving hidden layer representation for the input string "01000000".]
[Figure: the evolving weights for one of the three hidden units.]
Generalization and Overfitting
Continuing training until the training error falls below some predetermined threshold is a poor strategy, since BP is susceptible to overfitting. We need to measure the generalization accuracy over a validation set (distinct from the training set).
Two different types of overfitting:
Generalization error first decreases, then increases, even as the training error continues to decrease.
Generalization error decreases, then increases, then decreases again, while the training error continues to decrease.

Two Kinds of Overfitting Phenomena
[Figure: error curves over training iterations illustrating the two phenomena described above.]
Techniques for Overcoming the Overfitting Problem
Weight decay: decrease each weight by some small factor during each iteration. This is equivalent to modifying the definition of E to include a penalty term corresponding to the total magnitude of the network weights. The motivation is to keep weight values small, biasing learning against complex decision surfaces.
k-fold cross-validation: cross-validation is performed k different times, each time using a different partitioning of the data into training and validation sets, and the results are averaged over the k runs (see the sketch below).
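As a sketch of the k-fold procedure, assuming generic train/accuracy callables supplied by the caller (hypothetical names, not any particular library):

```python
import numpy as np

def k_fold_cv(X, y, train, accuracy, k=5, seed=0):
    """Average validation accuracy over k different train/validation splits."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]                                 # held-out validation fold
        tr = np.concatenate(folds[:i] + folds[i+1:])   # remaining k-1 folds
        model = train(X[tr], y[tr])
        scores.append(accuracy(model, X[val], y[val]))
    return np.mean(scores)                             # averaged over k runs
```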
Designing an Artificial Neural Network for a Face Recognition Application

ANN for Face Recognition
A 960 x 3 x 4 network is trained on gray-level images of faces to predict whether a person is looking to their left, right, ahead, or up.

Problem Definition
Possible learning tasks: classifying camera images of faces of people in various poses; direction, identity, gender, ...
Data: 624 grayscale images of 20 different people, 32 images per person, varying the person's expression (happy, sad, angry, neutral), the direction (left, right, straight ahead, up), and with and without sunglasses. Image resolution: 120 x 128, each pixel with a grayscale intensity between 0 (black) and 255 (white).
Task: learning the direction in which the person is facing.
Factors for ANN Design in the Face Recognition Task
Input encoding
Output encoding
Network graph structure
Other learning algorithm parameters
Input Coding for Face Recognition
Possible solutions: extract key features by preprocessing, or use a coarse-resolution encoding.
Feature extraction: edges, regions of uniform intensity, other local image features. Drawback: high preprocessing cost and a variable number of features.
Coarse resolution: encode the image as a fixed set of 30 x 32 pixel intensity values, with one network input per pixel. The 30 x 32 pixel image is a coarse-resolution summary of the original 120 x 128 pixel image. Coarse resolution reduces the number of inputs and weights to a much more manageable size, thereby reducing computational demands.
Output Coding for Face Recognition
Possible coding schemes: one output unit with multiple threshold values, or multiple output units with a single threshold value.
One-unit scheme: assign 0.2, 0.4, 0.6, and 0.8 to encode the four-way classification.
Multiple-units scheme (1-of-n output encoding): use four distinct output units, each representing one of the four possible face directions, with the highest-valued output taken as the network prediction.
Advantages of the 1-of-n output encoding scheme: it provides more degrees of freedom to the network for representing the target function, and the difference between the highest-valued and the second-highest output can be used as a measure of confidence in the network prediction.
Target values for the output units in the 1-of-n encoding scheme: <1, 0, 0, 0> vs. <0.9, 0.1, 0.1, 0.1>. The former forces the weights to grow without bound; with the latter, the network will have finite weights. A small sketch of this encoding follows.
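A tiny sketch of the 1-of-n encoding with the soft targets 0.9/0.1 recommended above; the label convention (integers 0-3 for the four face directions) is an assumption of this sketch.

```python
import numpy as np

def one_of_n(labels, n=4, on=0.9, off=0.1):
    """Soft 1-of-n targets: 0.9 for the true class, 0.1 elsewhere."""
    T = np.full((len(labels), n), off)
    T[np.arange(len(labels)), labels] = on
    return T

print(one_of_n([2, 0]))  # rows: [0.1 0.1 0.9 0.1] and [0.9 0.1 0.1 0.1]
```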
Network Structure for Face Recognition
One hidden layer vs. more hidden layers; how many hidden nodes should be used?
Using 3 hidden units: test accuracy on the face data = 90%; training time = 5 min on a Sun SPARC 5.
Using 30 hidden units: test accuracy on the face data = 91.5%; training time = 1 hour on a Sun SPARC 5.
Other Parameters for Face Recognition
Learning rate = 0.3; momentum = 0.3; weight initialization: small random values near 0.
Number of iterations: determined by cross-validation. After every 50 iterations, the performance of the network was evaluated over the validation set. The final selected network is the one with the highest accuracy over the validation set.
Day 1, Part 3: Decision Tree Learning
Decision Trees

[Figure: an example decision tree over word attributes (quicktime, unix, computer, clinton, space); edges are labeled yes/no and leaves carry the class labels YES/NO.]

Nodes: attributes. Edges: attribute values. Terminal nodes: class labels.
Learning algorithm: C4.5

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$
Main Idea
Classification by partitioning the example space. Goal: approximating discrete-valued target functions.
Appropriate problems: examples are represented by attribute-value pairs; the target function has discrete output values; disjunctive descriptions may be required; the training data may contain missing attribute values.
Example Problem (PlayTennis)

Day | Outlook | Temperature | Humidity | Wind | PlayTennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Strong | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
Example Space
Yes (Outlook = Overcast)
No (Outlook = Sunny & Humidity = High)
Yes (Outlook = Sunny & Humidity = Normal)
Yes (Outlook = Rain & Wind = Weak)
No (Outlook = Rain & Wind = Strong)
Decision Tree RepresentationDecision Tree Representation
OutlookOutlook
HumidityHumidity WindWind
SunnyOvercast
Rain
YES
High Normal
YESNO NO YES
Basic Decision Tree Learning
Which attribute is best? Select the attribute that is most useful for classifying examples.
Quantitative measure: the information gain of attribute A, relative to a collection of data D, is the expected reduction of entropy:

$$Gain(D, A) = Entropy(D) - \sum_{v \in Values(A)} \frac{|D_v|}{|D|} Entropy(D_v)$$
Entropy
The impurity of an arbitrary collection of examples; the minimum number of bits of information needed to encode the classification of an arbitrary member of D:

$$Entropy(D) = \sum_{i=1}^{c} -p_i \log_2 p_i$$

[Figure: Entropy(S) as a function of the proportion of positive examples, rising from 0.0 to 1.0 at p = 0.5 and falling back to 0.0 at p = 1.0.]
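These two definitions translate directly into code. The following sketch recomputes the PlayTennis numbers used in the worked example below; the column encodings are assumptions of the sketch.

```python
import numpy as np

def entropy(labels):
    """Entropy(D) = sum_i -p_i log2 p_i over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain(values, labels):
    """Gain(D, A) = Entropy(D) - sum_v |D_v|/|D| * Entropy(D_v)."""
    g = entropy(labels)
    for v in np.unique(values):
        mask = values == v
        g -= mask.mean() * entropy(labels[mask])   # weighted child entropy
    return g

# Wind and PlayTennis columns of the 14-example table above
wind = np.array("Weak Strong Weak Weak Weak Strong Strong Weak "
                "Weak Weak Strong Strong Weak Strong".split())
play = np.array("No No Yes Yes Yes No Yes No Yes Yes Yes Yes Yes No".split())
print(round(entropy(play), 3))      # 0.940
print(round(gain(wind, play), 3))   # 0.048
```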
Constructing the Decision Tree
Example: PlayTennis (1)
Entropy of D:

$$Entropy(D) = Entropy([9+, 5-]) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$
Example: PlayTennis (2)
Attribute Wind: D = [9+, 5-], D_weak = [6+, 2-], D_strong = [3+, 3-].

$$Gain(D, Wind) = Entropy(D) - \sum_{v \in \{weak,\, strong\}} \frac{|D_v|}{|D|} Entropy(D_v) = 0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1.00) = 0.048$$

[Figure: Wind splits [9+, 5-] (E = 0.940) into Weak [6+, 2-] (E = 0.811) and Strong [3+, 3-] (E = 1.0).]
Example: PlayTennis (3)
Attribute Humidity: D_high = [3+, 4-], D_normal = [6+, 1-].

$$Gain(D, Humidity) = Entropy(D) - \sum_{v \in \{high,\, normal\}} \frac{|D_v|}{|D|} Entropy(D_v) = 0.940 - \frac{7}{14}(0.985) - \frac{7}{14}(0.592) = 0.151$$

[Figure: Humidity splits [9+, 5-] (E = 0.940) into High [3+, 4-] (E = 0.985) and Normal [6+, 1-] (E = 0.592).]
Example: PlayTennis (4)
Best attribute? Gain(D, Outlook) = 0.246; Gain(D, Humidity) = 0.151; Gain(D, Wind) = 0.048; Gain(D, Temperature) = 0.029.

[Figure: Outlook splits [9+, 5-] (E = 0.940) into Sunny [2+, 3-] (D1, D2, D8, D9, D11), Overcast [4+, 0-] (D3, D7, D12, D13), labeled YES, and Rain [3+, 2-] (D4, D5, D6, D10, D14).]
Example: PlayTennis (5)
Entropy of D_sunny:

$$Entropy(D_{sunny}) = Entropy([2+, 3-]) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$$

Day | Outlook | Temperature | Humidity | Wind | PlayTennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
Example: PlayTennis (6)
Attribute Wind: D_weak = [1+, 2-], D_strong = [1+, 1-].

$$Gain(D_{sunny}, Wind) = Entropy(D_{sunny}) - \sum_{v \in \{weak,\, strong\}} \frac{|D_v|}{|D_{sunny}|} Entropy(D_v) = 0.971 - \frac{3}{5}(0.918) - \frac{2}{5}(1.00) = 0.020$$

[Figure: Wind splits [2+, 3-] (E = 0.971) into Weak [1+, 2-] (E = 0.918) and Strong [1+, 1-] (E = 1.0).]
Example: PlayTennis (7)
Attribute Humidity: D_high = [0+, 3-], D_normal = [2+, 0-].

$$Gain(D_{sunny}, Humidity) = Entropy(D_{sunny}) - \sum_{v \in \{high,\, normal\}} \frac{|D_v|}{|D_{sunny}|} Entropy(D_v) = 0.971 - \frac{3}{5}(0.00) - \frac{2}{5}(0.00) = 0.971$$

[Figure: Humidity splits [2+, 3-] (E = 0.971) into High [0+, 3-] (E = 0.00) and Normal [2+, 0-] (E = 0.00).]
Example: PlayTennis (8)
Best attribute? Gain(D_sunny, Humidity) = 0.971; Gain(D_sunny, Wind) = 0.020; Gain(D_sunny, Temperature) = 0.571.

[Figure: the partial tree. Outlook splits [9+, 5-] (E = 0.940) into Sunny, Overcast (YES), and Rain [3+, 2-] (D4, D5, D6, D10, D14); under Sunny, Humidity splits into High (NO) and Normal (YES).]
Example: PlayTennis (9)
Entropy of D_rain:

$$Entropy(D_{rain}) = Entropy([3+, 2-]) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971$$

Day | Outlook | Temperature | Humidity | Wind | PlayTennis
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D10 | Rain | Mild | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
Example: PlayTennis (10)
Attribute Wind: D_weak = [3+, 0-], D_strong = [0+, 2-].

$$Gain(D_{rain}, Wind) = Entropy(D_{rain}) - \sum_{v \in \{weak,\, strong\}} \frac{|D_v|}{|D_{rain}|} Entropy(D_v) = 0.971 - \frac{3}{5}(0.00) - \frac{2}{5}(0.00) = 0.971$$

[Figure: Wind splits [3+, 2-] (E = 0.971) into Weak [3+, 0-] (E = 0.00) and Strong [0+, 2-] (E = 0.00).]
Example: PlayTennis (11)
Attribute Humidity: D_high = [1+, 1-], D_normal = [2+, 1-].

$$Gain(D_{rain}, Humidity) = Entropy(D_{rain}) - \sum_{v \in \{high,\, normal\}} \frac{|D_v|}{|D_{rain}|} Entropy(D_v) = 0.971 - \frac{2}{5}(1.00) - \frac{3}{5}(0.918) = 0.020$$

[Figure: Humidity splits [3+, 2-] (E = 0.971) into High [1+, 1-] (E = 1.00) and Normal [2+, 1-] (E = 0.918).]
Example: PlayTennis (12)
Best attribute? Gain(D_rain, Humidity) = 0.020; Gain(D_rain, Wind) = 0.971; Gain(D_rain, Temperature) = 0.020.

[Figure: the final tree. Outlook is the root; Sunny leads to a Humidity test (High: NO, Normal: YES), Overcast leads to YES, and Rain leads to a Wind test (Strong: NO, Weak: YES).]
Avoiding Overfitting the Data
Definition: given a hypothesis space H, a hypothesis h ∈ H is said to overfit the data if there exists some alternative hypothesis h' ∈ H such that h has a smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.
Occam's Razor: prefer the simplest hypothesis that fits the data.
Avoiding Overfitting the Data
[Figure: accuracy over the training data keeps increasing with tree size, while accuracy over independent test data first rises and then falls.]
Solutions to Overfitting
1. Partition the examples into training, test, and validation sets.
2. Use all data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set.
3. Use an explicit measure of the complexity of encoding the training examples and the decision tree, halting growth of the tree when this encoding is minimized.
Summary
Decision trees provide a practical method for concept learning and for learning discrete-valued functions.
ID3 searches a complete hypothesis space.
Overfitting is an important issue in decision tree learning.
Day 1, Parts 4-5: Hands-on Exercises
Plan for the ML Exercises
Day 1: Classification. Program: Weka. Agenda: classification by Neural Network (NN) and Decision Tree (DT).
Day 2: Clustering. Program: Genesis. Agenda: k-means / hierarchical clustering, SOM.
Day 3: Bayesian Networks. Program: GeNIe. Agenda: designing / learning / inference in Bayesian networks.
Classification: The Tool for Practice
Weka 3: Data Mining Software in Java. A collection of machine learning algorithms for data mining tasks. With Weka you can do data pre-processing, feature selection, classification, regression, clustering, association rules, and visualization.
Weka is open source software issued under the GNU General Public License.
How to get it: http://www.cs.waikato.ac.nz/ml/weka/ or just search for 'Weka' on Google.
Classification Using Weka: Problems
Pima Indians Diabetes (2-class problem): the Pima Indians have the highest prevalence of diabetes in the world. We will build classification models that diagnose whether a patient shows signs of diabetes.
Handwritten Digit Recognition (10-class problem): the MNIST database of handwritten digits contains digits written by office workers and students. We will build a recognition model based on classifiers, using a reduced subset of MNIST.
Classification Using Weka: Algorithms
Neural Network (Multilayer Perceptron)
Decision Tree
Classification Using Weka
Load a file that contains the training data by clicking the 'Open file' button; 'ARFF' and 'CSV' formats are readable.
Click the 'Classify' tab, click the 'Choose' button, and select 'weka - functions - MultilayerPerceptron'.
Click 'MultilayerPerceptron', set the parameters for the MLP, set the parameters for testing, and click 'Start' to begin learning.
Classification Using Weka (cont.)
Result output: various measures appear.
Options for testing the trained model.
Day 1, Part 6: k-Nearest Neighbor
Different Learning Methods
Eager learning: an explicit description of the target function is built from the whole training set.
Instance-based learning: learning = storing all training instances; classification = assigning a target value to a new instance; referred to as "lazy" learning.
Kinds of instance-based learning: k-nearest neighbor algorithm, locally weighted regression, case-based reasoning.
Local vs. Distributed Representation
Lazy Learning vs. Eager Learning
k-Nearest Neighbor
Features:
All instances correspond to points in an n-dimensional Euclidean space.
Classification is delayed until a new instance arrives.
Classification is done by comparing feature vectors of the different points.
The target function may be discrete or real-valued.
k-Nearest Neighbor Classifier (kNN): Memory-Based Learning

[Figure: a learning set D = {(x1, y1=1), (x2, y2=1), (x3, y3=0), (x4, y4=0), (x5, y5=1), (x6, y6=0), ..., (xN, yN=1)} and a query pattern (x, y=?). With k = 5, the match set contains three 1's vs. two 0's, so y = 1.]
kNN as a Neural Network

[Figure: a network view of kNN with an input layer (vector xq), a hidden layer of stored instances (xi, yi), and an output yq.]
k-Nearest Neighbor (kNN)
Training: for each training example <x, f(x)>, add the example to the learning set D.
Classification: given a query instance x_q, let x_1, ..., x_k denote the k instances from D that are nearest to x_q, and return

$$\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$$

where δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise. This is memory-based or case-based learning. A minimal implementation follows.
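A direct sketch of this procedure in Python, with a toy two-cluster dataset as an illustrative assumption:

```python
import numpy as np
from collections import Counter

def knn_classify(X, y, x_q, k=5):
    """Store D = (X, y); classify x_q by majority vote of its k nearest points."""
    dists = np.linalg.norm(X - x_q, axis=1)        # Euclidean distances to x_q
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    return Counter(y[nearest]).most_common(1)[0][0]

X = np.array([[0.0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X, y, np.array([0.5, 0.5]), k=3))   # -> 0
```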
Generalizing kNN
Divide the input space into local regions and learn simple (constant/linear) models in each patch.
Unsupervised: competitive, online clustering.
Supervised: radial-basis function networks, mixtures of experts.
Radial-Basis Function Network
Locally-tuned units:

$$y^t = \sum_{h=1}^{H} w_h p_h^t + w_0, \qquad p_h^t = \exp\left(-\frac{\|\mathbf{x}^t - \mathbf{m}_h\|^2}{2 s_h^2}\right)$$

[Figure: network with a scalar input, a hidden layer of locally-tuned Gaussian units, and a scalar output.]
Training RBF
Hybrid learning:
First layer (centers and spreads): unsupervised k-means.
Second layer (weights): supervised gradient descent.
Fully supervised training is also possible (Broomhead and Lowe, 1988; Moody and Darken, 1989). A sketch of the hybrid scheme follows.
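Here is a rough sketch of that hybrid scheme, assuming a 1-D toy regression task; for simplicity the second layer is fit by linear least squares rather than gradient descent, and the single shared spread is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, H, iters=20):
    """Unsupervised first layer: place H centers with plain k-means."""
    m = X[rng.choice(len(X), H, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - m[None]) ** 2).sum(-1), axis=1)
        m = np.array([X[assign == h].mean(0) if np.any(assign == h) else m[h]
                      for h in range(H)])
    return m

def rbf_fit(X, r, H=10, s=1.0):                  # s: shared spread, assumed constant
    m = kmeans(X, H)
    P = np.exp(-((X[:, None] - m[None]) ** 2).sum(-1) / (2 * s ** 2))
    P = np.hstack([P, np.ones((len(X), 1))])     # append the bias unit w0
    w, *_ = np.linalg.lstsq(P, r, rcond=None)    # supervised second layer
    return P @ w                                  # fitted outputs y^t

X = np.linspace(-3, 3, 60)[:, None]
r = np.sin(X).ravel()
print(round(float(np.abs(rbf_fit(X, r) - r).max()), 3))  # small residual
```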
Regression

$$E\left(\{\mathbf{m}_h, s_h, w_{ih}\}_h \mid X\right) = \frac{1}{2} \sum_t \sum_i \left(r_i^t - y_i^t\right)^2, \qquad y_i^t = \sum_{h=1}^{H} w_{ih} p_h^t + w_{i0}$$

$$\Delta w_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) p_h^t$$

$$\Delta m_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) w_{ih}\right] p_h^t \frac{\left(x_j^t - m_{hj}\right)}{s_h^2}$$

$$\Delta s_h = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) w_{ih}\right] p_h^t \frac{\|\mathbf{x}^t - \mathbf{m}_h\|^2}{s_h^3}$$
Classification

$$E\left(\{\mathbf{m}_h, s_h, w_{ih}\}_h \mid X\right) = -\sum_t \sum_i r_i^t \log y_i^t, \qquad y_i^t = \frac{\exp\left(\sum_h w_{ih} p_h^t + w_{i0}\right)}{\sum_k \exp\left(\sum_h w_{kh} p_h^t + w_{k0}\right)}$$
Rules and Exceptions

$$y^t = \sum_{h=1}^{H} w_h p_h^t + \mathbf{v}^T \mathbf{x}^t + v_0$$

[Figure: the linear term provides the default rule; the localized Gaussian units model the exceptions.]
Rule-Based Knowledge
Incorporation of prior knowledge (before training); rule extraction (after training) (Tresp et al., 1997); fuzzy membership functions and fuzzy rules.

IF ((x1 ≈ a) AND (x2 ≈ b)) OR (x3 ≈ c) THEN y ≈ 0.1

$$p_1 = \exp\left(-\frac{(x_1 - a)^2}{2 s_1^2}\right) \exp\left(-\frac{(x_2 - b)^2}{2 s_2^2}\right) \text{ with } w_1 = 0.1$$

$$p_2 = \exp\left(-\frac{(x_3 - c)^2}{2 s_3^2}\right) \text{ with } w_2 = 0.1$$
Day 1, Part 7: Support Vector Machines
Support Vector Machines

[Figure: SVM architecture. Inputs x1, ..., xn feed a hidden layer of m inner-product kernels K(x, x1), ..., K(x, xm); a weighted sum with bias b produces the output neuron y.]

Type of SVM | Inner product kernel K(x, x_i)
Polynomial learning machine | $(\mathbf{x}^T \mathbf{x}_i + 1)^p$
Radial-basis function network | $\exp\left(-\frac{1}{2\sigma^2}\|\mathbf{x} - \mathbf{x}_i\|^2\right)$
Two-layer perceptron | $\tanh\left(\beta_0\, \mathbf{x}^T \mathbf{x}_i + \beta_1\right)$
Support Vector Machines
The line that maximizes the minimum margin is a good bet.
The model class of "hyperplanes with a margin of m" has a low VC dimension if m is big.
This maximum-margin separator is determined by a subset of the datapoints, called the "support vectors". It is computationally useful if only a small fraction of the datapoints are support vectors, because we use the support vectors to decide which side of the separator a test case is on.
[Figure: the support vectors are indicated by the circles around them.]
Training a linear SVM
To find the maximum-margin separator, we have to solve the following optimization problem:

$$\mathbf{w} \cdot \mathbf{x}^c + b > 1 \text{ for positive cases}, \qquad \mathbf{w} \cdot \mathbf{x}^c + b < -1 \text{ for negative cases}, \qquad \text{and } \|\mathbf{w}\|^2 \text{ as small as possible}$$

This is tricky but it's a convex problem: there is only one optimum, and we can find it without fiddling with learning rates, weight decay, or early stopping.
Don't worry about the optimization problem. It has been solved; it's called quadratic programming. It takes time proportional to N^2, which is really bad for very big datasets, so for big datasets we end up doing approximate optimization!
Testing a linear SVM
The separator is defined as the set of points for which

$$\mathbf{w} \cdot \mathbf{x} + b = 0$$

so if $\mathbf{w} \cdot \mathbf{x}^c + b > 0$, say it's a positive case, and if $\mathbf{w} \cdot \mathbf{x}^c + b < 0$, say it's a negative case.
Introducing slack variables
Slack variables are constrained to be non-negative. When they are greater than zero, they allow us to cheat by putting the plane closer to the datapoint than the margin, so we need to minimize the amount of cheating. This means we have to pick a value for lambda (this sounds familiar!):

$$\mathbf{w} \cdot \mathbf{x}^c + b \ge 1 - \xi^c \text{ for positive cases}$$
$$\mathbf{w} \cdot \mathbf{x}^c + b \le -1 + \xi^c \text{ for negative cases}$$
$$\text{with } \xi^c \ge 0 \text{ for all } c, \qquad \frac{\|\mathbf{w}\|^2}{2} + \lambda \sum_c \xi^c \text{ as small as possible}$$
A picture of the best plane with a slack variable
How to make a plane curved
Fitting hyperplanes as separators is mathematically easy: the mathematics is linear.
By replacing the raw input variables with a much larger set of features, we get a nice property: a planar separator in the high-dimensional space of feature vectors is a curved separator in the low-dimensional space of the raw input variables.
[Figure: a planar separator in a 20-D feature space projected back to the original 2-D space.]
A potential problem and a magic solution
If we map the input vectors into a very high-dimensional feature space, surely the task of finding the maximum-margin separator becomes computationally intractable? The mathematics is all linear, which is good, but the vectors have a huge number of components, so taking the scalar product of two vectors is very expensive.
The way to keep things tractable is to use "the kernel trick".
The kernel trick makes your brain hurt when you first learn about it, but it's actually very simple.
The kernel trick
For many mappings from a low-D space to a high-D space, there is a simple operation on two vectors in the low-D space that can be used to compute the scalar product of their two images in the high-D space:

$$K(\mathbf{x}_a, \mathbf{x}_b) = \phi(\mathbf{x}_a) \cdot \phi(\mathbf{x}_b)$$

Letting the kernel do the work replaces doing the scalar product in the obvious way on the high-D images $\phi(\mathbf{x}_a)$ and $\phi(\mathbf{x}_b)$.
Kernel Examples
1. Gaussian kernel
2. Polynomial kernel
The classification ruleThe classification rule The final classification rule is quite simple:
All the cleverness goes into selecting the support vectors that maximize the margin and computing the weight to use on each support vector.
We also need to choose a good kernel function and we may need to choose a lambda for dealing with non-separable cases.
SVs
stests xxKwbias
0),(
The set of support vectors
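A sketch of this rule in Python; the support vectors, weights, and bias are assumed to come from a solved optimization (here they are made-up toy values), and the RBF kernel is one of the common choices listed below.

```python
import numpy as np

def svm_predict(x_test, support_vectors, weights, bias, kernel):
    """sign(bias + sum_s w_s * K(x_test, x_s)) over the support vectors."""
    score = bias + sum(w_s * kernel(x_test, x_s)
                       for w_s, x_s in zip(weights, support_vectors))
    return 1 if score > 0 else -1

rbf = lambda x, y, sigma=1.0: np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
svs = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]   # toy support vectors
ws, b = [1.0, -1.0], 0.0                               # toy weights and bias
print(svm_predict(np.array([0.9, 1.2]), svs, ws, b, rbf))   # -> 1
```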
Some commonly used kernels

Polynomial: $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^p$
Gaussian radial basis function: $K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x} - \mathbf{y}\|^2 / 2\sigma^2}$
Neural net: $K(\mathbf{x}, \mathbf{y}) = \tanh(k\, \mathbf{x} \cdot \mathbf{y} - \delta)$

The parameters ($p$, $\sigma$, $k$, $\delta$) must be chosen by the user.
For the neural network kernel, there is one "hidden unit" per support vector, so the process of fitting the maximum-margin hyperplane decides how many hidden units to use. Also, it may violate Mercer's condition.
Support Vector Machines are Perceptrons!
SVMs use each training case, x, to define a feature K(x, .), where K is chosen by the user; so the user designs the features.
Then they do "feature selection" by picking the support vectors, and they learn how to weight the features by solving a big optimization problem.
So an SVM is just a very clever way to train a standard perceptron. All of the things that a perceptron cannot do cannot be done by SVMs (but it's a long time since 1969, so people have forgotten this).
Supplement: SVM as a Kernel Machine

Kernel Methods Approach
The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space. The expectation is that the feature space has a much higher dimension than the input space.
Form of the Functions
Kernel methods therefore use linear functions in a feature space,

$$f(\mathbf{x}) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b$$

For regression this can serve directly as the predicted value; for classification the output must be thresholded, e.g. $\mathrm{sign}(f(\mathbf{x}))$.
Controlling generalisation
The critical method of controlling generalisation is to force a large margin on the training data.
Support Vector Machines
The SVM optimization addresses the generalization issue, but not the computational cost of dealing with large feature vectors.
Complexity problem
Applying the quadratic feature map to a 20 x 30 image of 600 pixels gives approximately 180000 dimensions! It would be computationally infeasible to work in this space directly.
Dual Representation
Suppose the weight vector is a linear combination of the training examples: $\mathbf{w} = \sum_i \alpha_i \mathbf{x}_i$. Then we can evaluate the inner product with a new example as $\langle \mathbf{w}, \mathbf{x} \rangle = \sum_i \alpha_i \langle \mathbf{x}_i, \mathbf{x} \rangle$.
Learning the dual variablesLearning the dual variables Since any component orthogonal to the space spanned by the
training data has no effect, general result that weight vectors have dual representation: the representer theorem.
Hence, can reformulate algorithms to learn dual variables rather than weight vector directly
Dual Form of SVM
The dual form of the SVM can also be derived by taking the dual optimisation problem. In its standard form: maximise $\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$ subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.
Note that the threshold b must be determined from border examples.
Using Kernels
The critical observation is that, again, only inner products are used. Suppose that we have a shortcut method of computing $\kappa(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle$. Then we do not need to compute the feature vectors explicitly, either in training or in testing.
Kernel example
As an example, consider the quadratic feature map $\phi(\mathbf{x}) = (x_i x_j)_{i,j=1}^{n}$, which lists all pairwise products of the components. Here we have a shortcut: $\langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle = \langle \mathbf{x}, \mathbf{z} \rangle^2$.
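A quick numerical check of this shortcut (toy vectors; the explicit feature map is the pairwise-product map just described):

```python
import numpy as np

def phi(x):
    """Quadratic feature map: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
explicit = phi(x) @ phi(z)             # scalar product in the feature space
shortcut = (x @ z) ** 2                # kernel computed in the input space
print(np.isclose(explicit, shortcut))  # True
```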
Efficiency
Hence, in the pixel example, rather than work with 180000-dimensional vectors, we compute a 600-dimensional inner product and then square the result! We can even work in infinite-dimensional spaces, e.g. using the Gaussian kernel $\kappa(\mathbf{x}, \mathbf{z}) = \exp\left(-\|\mathbf{x} - \mathbf{z}\|^2 / 2\sigma^2\right)$.
Constraints on the kernelConstraints on the kernel There is a restriction on the function:
This restriction for any training set is enough to guarantee function is a kernel
What Have We Achieved?
The problem of choosing a neural network architecture is replaced by a kernel definition. This is arguably more natural to define, though the restriction is a bit unnatural. It is not a silver bullet, since the fit with the data is key, but it can be applied to non-vectorial (or high-dimensional) data.
We gained more flexible regularization/generalization control, and a convex optimization problem, i.e. NO local minima!
However, choosing the right kernel remains a design issue.
Questions (Day 1)
What are the differences among supervised, unsupervised, and reinforcement learning?
Explain learning by error correction in a Neural Net (NN).
Explain entropy-based learning in a Decision Tree (DT).
What is the learning principle of the k-Nearest Neighbor (kNN) method?
What is the learning principle of the Support Vector Machine (SVM)?
Compare the strengths and weaknesses of NN, DT, kNN, and SVM. Which applications is each suited to?