Machine Learning, Day 1
Samsung Electronics, Samsung Advanced Institute of Technology (SAIT), Advanced Technology Training Institute
October 6-8, 2010
Byoung-Tak Zhang (장병탁)
Center for Bioinformation Technology (CBIT), Seoul National University
School of Electrical and Computer Engineering; joint appointments in the interdisciplinary programs in Cognitive Science, Brain Science, and Bioinformatics
http://bi.snu.ac.kr/
Course Overview
Day 1 (10/6): Supervised Learning (Neural Nets, Decision Trees, Support Vector Machines)
Day 2 (10/7): Unsupervised Learning (Self-Organizing Maps, Clustering Algorithms, Reinforcement Learning, Evolutionary Learning)
Day 3 (10/8): Probabilistic Graphical Models (Bayesian Networks, Markov Random Fields, Particle Filters)
Questions (Day 1)
What are the differences among supervised, unsupervised, and reinforcement learning?
Explain learning by error correction in a Neural Net (NN).
Explain entropy-based learning in a Decision Tree (DT).
What is the learning principle of the k-Nearest Neighbor (kNN) method?
What is the learning principle of the Support Vector Machine (SVM)?
Compare the strengths and weaknesses of NN, DT, kNN, and SVM. Which applications is each suited to?
Approaches to Artificial Intelligence
Symbolic AI Rule-Based Systems
Connectionist AI Neural Networks
Evolutionary AI Genetic Algorithms
Molecular AI: DNA Computing
Research Areas and Approaches

[Figure: Artificial Intelligence research organized by paradigm, research area, and application.]
Paradigms: Rationalism (Logical), Empiricism (Statistical), Connectionism (Neural), Evolutionary (Genetic), Biological (Molecular)
Research areas: Learning Algorithms, Inference Mechanisms, Knowledge Representation, Intelligent System Architecture
Applications: Intelligent Agents, Information Retrieval, Electronic Commerce, Data Mining, Bioinformatics, Natural Language Processing, Expert Systems
Day 1, Part 1: Concept of Machine Learning
Learning: Definition
Learning is the improvement of performance in some environment through the acquisition of knowledge resulting from experience in that environment.
[Figure: the definition broken into its elements: the improvement of behavior / on some performance task / through acquisition of knowledge / based on partial task experience.]
Neural Network (MLP)

[Figure: a multilayer perceptron with inputs x1, x2, x3 feeding an input layer, a hidden layer, and an output layer through weighted connections; each unit applies an activation (scaling) function. Information propagates forward; after the output is compared with the target, the error is backpropagated.]

$$E_d(\mathbf{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$$

$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$

$$o = f(\mathbf{x})$$
Application Example: Autonomous Land Vehicle (ALV)
An NN learns to steer an autonomous vehicle: 960 input units, 4 hidden units, 30 output units; driving at speeds up to 70 miles per hour.
[Figure: weight values for one of the hidden units; image from a forward-mounted camera.]
ALVINN System
DARPA Grand Challenge: Machine-Learning-Based Autonomous Driving
The Stanford team applied machine learning to autonomous driving, winning the 2005 Grand Challenge (a $2 million prize) and finishing second in the 2007 Urban Challenge.
Video. 2005 mission: drive 175 miles of desert autonomously within 10 hours.
2007 mission: drive 96 km in an urban environment autonomously within 6 hours.
[Sebastian Thrun, Stanley & Junior, Stanford Univ.]
DARPA Grand Challenge: Final Part 1
Autonomous Helicopter Flight
Using reinforcement learning, an RC helicopter was controlled autonomously and successfully performed a variety of high-difficulty maneuvers.
Autonomous control via reinforcement learning (RL)
An RC helicopter fitted with acceleration and velocity sensors
High-difficulty flight under autonomous control
(Reference: Andrew Ng, Stanford Univ.) Stanford Autonomous Helicopter - Airshow #2: http://www.youtube.com/watch?v=VCdxqn0fcnE
Home/Office Assistant Robots
http://www.youtube.com/watch?v=mgHUNfqIhAc&feature=related
PR2 Robot Plays Pool
PR2 Robot Cleans Up
http://www.youtube.com/watch?v=gYqfa-YtvW4&feature=related
PR2 Robot of Willow Garage
(Reference: Willow Garage)
Web Browsing on Mobile Devices
To let users zoom in on segments of a web page on a small phone screen, a Decision Tree was used to implement content-aware segmentation.
[Figure: uniform segmentation of the screen into a 3x3 grid (regions 1-9) makes the content hard to follow; Decision-Tree-based, content-aware segmentation keeps each sentence or image within a single region.]
(Reference: Baluja, WWW2006)
Drivatar: AI for a Racing Game
In Forza 2, a racing game for the MS XBOX 360, the player's driving patterns are modeled with machine learning and then imitated probabilistically, achieving human-level play.
Probability-based modeling of the driver's patterns: position on the road, driving lane, speed per course segment, brake/accelerator.
Every route is segmented, and the optimal racing line chosen by the gamer is learned (Imitation Approach).
The Future of Racing Games: http://www.youtube.com/watch?v=TaUyzlKKu-E
Because the model is probabilistic, it can generate unlimited driving variations at the same skill level.
Microsoft Research in Cambridge, UK
Driving patterns observed during game play
(Reference: Thore Graepel, MS Research Cambridge) Whole-audience Control of a Racing Game: http://www.youtube.com/watch?v=NS_L3Yyv2RI
Reading Thoughts by Analyzing Brain Activity Signals
Application example: a lie detector.
By analyzing brain activity signals with machine learning, a person's mental states (thoughts, percepts) can be inferred.
(Reference: Nature Reviews Neuroscience, 2006)
Machine Learning: Three Tasks

Supervised Learning: estimate an unknown mapping from known input-target output pairs. Learn $f_w$ from a training set $D = \{(x, y)\}$ such that $f_w(x) = y \approx f(x)$. Classification: y is discrete. Regression: y is continuous.

Unsupervised Learning: only input values are provided. Learn $f_w$ from $D = \{x\}$ such that $f_w(x) \approx x$. Compression; clustering.

Reinforcement Learning: not targets but rewards (critiques) are provided. Learn a heuristic function $h_w(x, a, c)$ from $D = \{(x, a, c)\}$, for action selection and policy learning.
A Catalog of Industrial Application Examples
Robotics: DARPA Urban Challenge, autonomous helicopter flight, home/office assistant robots
Mobile devices: web browsing, activity-recognition phones
Web services: spam filtering, Amazon's recommendation service, opinion polling from blogs and news
Computer vision: portrait search and matching, automatic analysis of astronomical images, automatic matching of archaeological artifacts
Bioinformatics: genome structure prediction, biological network analysis
Medicine: HIV vaccine design
Finance: credit scoring of loan customers
Architecture: search for optimal truss structures
Computer games: Drivatar, AI for a racing game
Neuroscience: reading mental states from brain signals; connectomics (mapping neuronal structure)
Social networks / crime prevention: the NYPD's real-time crime center
Machine Learning: Types and Models

Model structure | Representation | Example ML models
Logic formulas | propositional logic, predicate logic, Prolog programs | Version Space, Inductive Logic Programming (ILP)
Rules | If-Then rules, decision rules | AQ
Functions | sigmoid, polynomial, kernel | neural networks, RBF networks, SVM, kernel machines
Trees | genetic programs, Lisp programs | decision trees, genetic programming, neural trees
Graphs | directed/undirected graphs, networks | probabilistic graphical models, Bayesian networks, HMM

Learning method | Example learning problems
Supervised learning | recognition, classification, diagnosis, prediction, regression
Unsupervised learning | clustering, density estimation, dimensionality reduction, feature extraction
Reinforcement learning | trial and error, reward functions, dynamic programming
Machine Learning Techniques: Representative Algorithms
Symbolic Learning: Version Space Learning, Case-Based Learning
Neural Learning: Multilayer Perceptrons, Self-Organizing Maps, Support Vector Machines, Kernel Machines
Evolutionary Learning: Evolution Strategies, Evolutionary Programming, Genetic Algorithms, Genetic Programming, Molecular Programming
Probabilistic Learning: Bayesian Networks, Helmholtz Machines, Markov Random Fields, Hypernetworks, Latent Variable Models, Generative Topographic Mapping
Other Methods: Decision Trees, Reinforcement Learning, Boosting Algorithms, Mixture of Experts, Independent Component Analysis
Application Areas of Machine Learning

Application area | Example applications
Internet information retrieval | text mining, web-log analysis, spam filtering, document classification, filtering, extraction, summarization, recommendation
Computer vision | character recognition, pattern recognition, object recognition, face recognition, scene-change detection, image restoration
Speech recognition / language processing | speech recognition, word-sense disambiguation, translation word selection, grammar learning, dialogue pattern analysis
Mobile HCI | motion recognition, gesture recognition, recognition from mobile-device sensors, shake reduction
Bioinformatics | gene finding, protein classification, gene regulatory network analysis, DNA chip analysis, disease diagnosis
Biometrics | iris recognition, heart-rate measurement, blood-pressure measurement, blood-glucose measurement, fingerprint recognition
Computer graphics | data-driven animation, character motion control, inverse kinematics, behavior evolution, virtual reality
Robotics | obstacle recognition, object classification, map building, autonomous driving, path planning, motor control
Services | customer analysis, market cluster analysis, customer relationship management (CRM), marketing, product recommendation
Manufacturing | anomaly detection, energy consumption prediction, process analysis and planning, fault prediction and classification

[Byoung-Tak Zhang, 정보과학회지 (Communications of KIISE), special issue, March 2007]
Day 1, Part 2: Neural Network Learning
From Biological Neuron to Artificial Neuron
[Figure: a biological neuron with dendrite, cell body, and axon.]
From Biology to Artificial Neural Nets
Properties of Artificial Neural Networks
A network of artificial neurons.
Characteristics: nonlinear I/O mapping, adaptivity, generalization ability, fault tolerance (graceful degradation), biological analogy.
<Multilayer Perceptron Network>
Problems Appropriate for Neural Networks
Many training examples are available.
Outputs can be discrete or continuous-valued, or vectors of these.
The training examples may contain noise.
Long training times are tolerable.
Fast execution time is required.
It is not necessary to explain the prediction results.
Example Applications
NETtalk [Sejnowski]: inputs are English text, outputs are spoken phonemes.
Phoneme recognition [Waibel]: inputs are waveform features, outputs are b, c, d, ...
Robot control [Pomerleau]: inputs are perceived features, outputs are steering controls.
Application: Autonomous Land Vehicle (ALV)
An NN learns to steer an autonomous vehicle: 960 input units, 4 hidden units, 30 output units; driving at speeds up to 70 miles per hour.
[Figure: weight values for one of the hidden units; image from a forward-mounted camera.]
ALVINN System
Application: Data Restoration by a Hopfield Network
[Figure: original target data; corrupted input data; restored data after 10 iterations; after 20 iterations; fully restored data after 35 iterations.]
Perceptron and Gradient Descent Algorithm
(Simple) Perceptron: A Neural Net with a Single Neuron
Perceptron = a linear threshold unit (LTU). Note: a linear perceptron = a linear unit (see below).
Input: a vector of real values. Output: 1 or -1 (binary). Activation function: threshold function.
Linearly Separable vs. Linearly Nonseparable
(a) Decision surface for a linearly separable set of examples (correctly classified by a straight line).
(b) A set of training examples that is not linearly separable.
Perceptron Training Rule
Note: the output value o is +1 or -1 (not a real value).
The perceptron rule is a learning rule for a threshold unit. Conditions for convergence:
The training examples are linearly separable.
The learning rate is sufficiently small.
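To make the rule concrete, here is a minimal Python sketch of perceptron training. The update $w_i \leftarrow w_i + \eta\,(t - o)\,x_i$ is the standard perceptron rule; the AND-function data, function names, and hyperparameters are illustrative assumptions, not from the slides.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20):
    """Perceptron rule: w_i <- w_i + eta * (t - o) * x_i."""
    X = np.hstack([np.ones((len(X), 1)), X])      # prepend a bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if w @ x > 0 else -1            # linear threshold unit
            w += eta * (target - o) * x           # update only on mistakes
    return w

# Toy task: the AND function, with targets in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([-1, -1, -1, 1])
print(train_perceptron(X, t))
```

On this linearly separable toy set the weight vector stops changing once every example is classified correctly, which is exactly the convergence condition stated above.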
Delta Rule: Least Mean Square (LMS) Error
Linear unit (linear perceptron). Note: the output value o is a real value (not binary).
The delta rule is a learning rule for an unthresholded perceptron (i.e., a linear unit), and it is a gradient-descent rule.
Gradient Descent Method
Delta Rule for Error Minimization

$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}$$
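A similarly small sketch of the batch delta (LMS) rule, assuming a toy linear-regression dataset and illustrative names; unlike the perceptron rule, it updates an unthresholded linear unit using the summed gradient.

```python
import numpy as np

def delta_rule(X, t, eta=0.1, epochs=100):
    """Batch LMS: delta_w_i = eta * sum_{d in D} (t_d - o_d) * x_id."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                      # real-valued outputs (no threshold)
        w += eta * X.T @ (t - o)       # gradient descent on E = 1/2 sum (t - o)^2
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # first column is a bias input
t = np.array([1.0, 2.0, 3.0])                        # targets follow t = 1 + x
print(delta_rule(X, t))                              # approaches [1.0, 1.0]
```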
Perceptron Learning Algorithm
Multilayer Perceptron

[Figure: a multilayer perceptron with an input layer (x1, ..., xn), a hidden layer, and an output layer (y1, ..., ym); x is the input vector, y the output vector.]

Supervised learning; gradient search; noise immunity.
Learning algorithm: Backpropagation.
Multilayer Perceptrons

$$\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}, \qquad E_d(\mathbf{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$$
Multilayer Networks and their Decision Boundaries
Decision regions of a multilayer feedforward network. The network was trained to recognize 1 of 10 vowel sounds occurring in the context "h_d".
The network input consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound.
The 10 network outputs correspond to the 10 possible vowel sounds.
Differentiable Threshold Unit
Sigmoid function: nonlinear, differentiable.
Error function for BP: E is defined as a sum of the squared errors over all the output units k, for all the training examples d:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$$

The error surface can have multiple local minima. Convergence toward some local minimum is guaranteed, but there is no guarantee of reaching the global minimum.
Backpropagation Learning Algorithm for MLP
Original weight update rule for BP:

$$\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji}$$

Adding momentum:

$$\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji} + \alpha\, \Delta w_{ji}(n-1), \qquad 0 \le \alpha < 1$$

Momentum helps escape small local minima in the error surface and speeds up convergence.
Derivation of the BP Rule
Notations:
$x_{ij}$: the ith input to unit j
$w_{ij}$: the weight associated with the ith input to unit j
$net_j$: the weighted sum of inputs for unit j
$o_j$: the output computed by unit j
$t_j$: the target output for unit j
$\sigma$: the sigmoid function
outputs: the set of units in the final layer of the network
Downstream(j): the set of units whose immediate inputs include the output of unit j
Error measure: $E_d(\mathbf{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$

Gradient descent: $\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}$

Chain rule: $\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_{ji}$
Case 1: Rule for Output Unit Weights

Step 1: $net_j = \sum_i w_{ji} x_{ji}$, so $\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j}$

Step 2: $\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = -(t_j - o_j)$

Step 3: $\frac{\partial o_j}{\partial net_j} = \frac{\partial \sigma(net_j)}{\partial net_j} = o_j (1 - o_j)$

All together: $\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} = \eta\, (t_j - o_j)\, o_j (1 - o_j)\, x_{ji}$

Case 2: Rule for Hidden Unit Weights

Step 1:
$$\frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, o_j (1 - o_j)$$

Thus: $\Delta w_{ji} = \eta\, \delta_j\, x_{ji}$, where $\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k w_{kj}$
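Putting the two cases together, here is a compact Python sketch of batch backpropagation with the momentum term from the earlier slide; the XOR task, network size, random seed, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)        # XOR targets

W1 = rng.normal(0.0, 0.5, (2, 3)); b1 = np.zeros(3)    # input -> 3 hidden units
W2 = rng.normal(0.0, 0.5, (3, 1)); b2 = np.zeros(1)    # hidden -> 1 output unit
eta, alpha = 0.5, 0.5                                  # learning rate, momentum
dW1 = np.zeros_like(W1); dW2 = np.zeros_like(W2)

for epoch in range(10000):
    h = sigmoid(X @ W1 + b1)                     # forward pass
    o = sigmoid(h @ W2 + b2)
    d_out = (T - o) * o * (1 - o)                # Case 1: (t - o) o (1 - o)
    d_hid = h * (1 - h) * (d_out @ W2.T)         # Case 2: o(1-o) sum_k delta_k w_kj
    dW2 = eta * h.T @ d_out + alpha * dW2        # momentum update rule
    dW1 = eta * X.T @ d_hid + alpha * dW1
    W2 += dW2; W1 += dW1
    b2 += eta * d_out.sum(0); b1 += eta * d_hid.sum(0)

print(o.ravel().round(2))    # typically close to [0, 1, 1, 0]
```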
BP for MLP: Revisited
BP has the ability to discover useful intermediate representations at the hidden unit layers inside the network, which capture the properties of the input space that are most relevant to learning the target function.
When more layers of units are used in the network, more complex features can be invented.
But the representations of the hidden layers are very hard for humans to understand.
Hidden Layer Representations

Hidden Layer Representation for the Identity Function
[Figure: the evolving sum of squared errors for each of the eight output units as the number of training iterations (epochs) increases.]
[Figure: the evolving hidden layer representation for the input string "01000000".]
[Figure: the evolving weights for one of the three hidden units.]
Generalization and Overfitting
Continuing training until the training error falls below some predetermined threshold is a poor strategy, since BP is susceptible to overfitting. We need to measure the generalization accuracy over a validation set (distinct from the training set).
Two different types of overfitting:
Generalization error first decreases, then increases, even as the training error continues to decrease.
Generalization error decreases, then increases, then decreases again, while the training error continues to decrease.

Two Kinds of Overfitting Phenomena
[Figure: error curves over training iterations illustrating the two phenomena described above.]
Techniques for Overcoming the Overfitting Problem
Weight decay: decrease each weight by some small factor during each iteration. This is equivalent to modifying the definition of E to include a penalty term corresponding to the total magnitude of the network weights. The motivation is to keep weight values small, biasing learning against complex decision surfaces.
k-fold cross-validation: cross-validation is performed k different times, each time using a different partitioning of the data into training and validation sets, and the results are averaged over the k runs (see the sketch below).
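As a sketch of the k-fold procedure, assuming generic train/accuracy callables supplied by the caller (hypothetical names, not any particular library):

```python
import numpy as np

def k_fold_cv(X, y, train, accuracy, k=5, seed=0):
    """Average validation accuracy over k different train/validation splits."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]                                 # held-out validation fold
        tr = np.concatenate(folds[:i] + folds[i+1:])   # remaining k-1 folds
        model = train(X[tr], y[tr])
        scores.append(accuracy(model, X[val], y[val]))
    return np.mean(scores)                             # averaged over k runs
```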
Designing an Artificial Neural Network for a Face Recognition Application

ANN for Face Recognition
A 960 x 3 x 4 network is trained on gray-level images of faces to predict whether a person is looking to their left, right, ahead, or up.

Problem Definition
Possible learning tasks: classifying camera images of faces of people in various poses; direction, identity, gender, ...
Data: 624 grayscale images of 20 different people, 32 images per person, varying the person's expression (happy, sad, angry, neutral), the direction (left, right, straight ahead, up), and with and without sunglasses. Image resolution: 120 x 128, each pixel with a grayscale intensity between 0 (black) and 255 (white).
Task: learning the direction in which the person is facing.
Factors for ANN Design in the Face Recognition Task
Input encoding
Output encoding
Network graph structure
Other learning algorithm parameters
Input Coding for Face Recognition
Possible solutions: extract key features by preprocessing, or use a coarse-resolution encoding.
Feature extraction: edges, regions of uniform intensity, other local image features. Drawback: high preprocessing cost and a variable number of features.
Coarse resolution: encode the image as a fixed set of 30 x 32 pixel intensity values, with one network input per pixel. The 30 x 32 pixel image is a coarse-resolution summary of the original 120 x 128 pixel image. Coarse resolution reduces the number of inputs and weights to a much more manageable size, thereby reducing computational demands.
Output Coding for Face Recognition
Possible coding schemes: one output unit with multiple threshold values, or multiple output units with a single threshold value.
One-unit scheme: assign 0.2, 0.4, 0.6, and 0.8 to encode the four-way classification.
Multiple-units scheme (1-of-n output encoding): use four distinct output units, each representing one of the four possible face directions, with the highest-valued output taken as the network prediction.
Advantages of the 1-of-n output encoding scheme: it provides more degrees of freedom to the network for representing the target function, and the difference between the highest-valued and the second-highest output can be used as a measure of confidence in the network prediction.
Target values for the output units in the 1-of-n encoding scheme: <1, 0, 0, 0> vs. <0.9, 0.1, 0.1, 0.1>. The former forces the weights to grow without bound; with the latter, the network will have finite weights. A small sketch of this encoding follows.
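A tiny sketch of the 1-of-n encoding with the soft targets 0.9/0.1 recommended above; the label convention (integers 0-3 for the four face directions) is an assumption of this sketch.

```python
import numpy as np

def one_of_n(labels, n=4, on=0.9, off=0.1):
    """Soft 1-of-n targets: 0.9 for the true class, 0.1 elsewhere."""
    T = np.full((len(labels), n), off)
    T[np.arange(len(labels)), labels] = on
    return T

print(one_of_n([2, 0]))  # rows: [0.1 0.1 0.9 0.1] and [0.9 0.1 0.1 0.1]
```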
Network Structure for Face Recognition
One hidden layer vs. more hidden layers; how many hidden nodes should be used?
Using 3 hidden units: test accuracy on the face data = 90%; training time = 5 min on a Sun SPARC 5.
Using 30 hidden units: test accuracy on the face data = 91.5%; training time = 1 hour on a Sun SPARC 5.
Other Parameters for Face Recognition
Learning rate = 0.3; momentum = 0.3; weight initialization: small random values near 0.
Number of iterations: determined by cross-validation. After every 50 iterations, the performance of the network was evaluated over the validation set. The final selected network is the one with the highest accuracy over the validation set.
Day 1, Part 3: Decision Tree Learning
Decision Trees

[Figure: an example decision tree over word attributes (quicktime, unix, computer, clinton, space); edges are labeled yes/no and leaves carry the class labels YES/NO.]

Nodes: attributes. Edges: attribute values. Terminal nodes: class labels.
Learning algorithm: C4.5

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$
Main Idea
Classification by partitioning the example space. Goal: approximating discrete-valued target functions.
Appropriate problems: examples are represented by attribute-value pairs; the target function has discrete output values; disjunctive descriptions may be required; the training data may contain missing attribute values.
Example Problem (PlayTennis)

Day | Outlook | Temperature | Humidity | Wind | PlayTennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Strong | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
Example Space
Yes (Outlook = Overcast)
No (Outlook = Sunny & Humidity = High)
Yes (Outlook = Sunny & Humidity = Normal)
Yes (Outlook = Rain & Wind = Weak)
No (Outlook = Rain & Wind = Strong)
Decision Tree RepresentationDecision Tree Representation
OutlookOutlook
HumidityHumidity WindWind
SunnyOvercast
Rain
YES
High Normal
YESNO NO YES
Basic Decision Tree Learning
Which attribute is best? Select the attribute that is most useful for classifying examples.
Quantitative measure: the information gain of attribute A, relative to a collection of data D, is the expected reduction of entropy:

$$Gain(D, A) = Entropy(D) - \sum_{v \in Values(A)} \frac{|D_v|}{|D|} Entropy(D_v)$$
Entropy
The impurity of an arbitrary collection of examples; the minimum number of bits of information needed to encode the classification of an arbitrary member of D:

$$Entropy(D) = \sum_{i=1}^{c} -p_i \log_2 p_i$$

[Figure: Entropy(S) as a function of the proportion of positive examples, rising from 0.0 to 1.0 at p = 0.5 and falling back to 0.0 at p = 1.0.]
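These two definitions translate directly into code. The following sketch recomputes the PlayTennis numbers used in the worked example below; the column encodings are assumptions of the sketch.

```python
import numpy as np

def entropy(labels):
    """Entropy(D) = sum_i -p_i log2 p_i over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain(values, labels):
    """Gain(D, A) = Entropy(D) - sum_v |D_v|/|D| * Entropy(D_v)."""
    g = entropy(labels)
    for v in np.unique(values):
        mask = values == v
        g -= mask.mean() * entropy(labels[mask])   # weighted child entropy
    return g

# Wind and PlayTennis columns of the 14-example table above
wind = np.array("Weak Strong Weak Weak Weak Strong Strong Weak "
                "Weak Weak Strong Strong Weak Strong".split())
play = np.array("No No Yes Yes Yes No Yes No Yes Yes Yes Yes Yes No".split())
print(round(entropy(play), 3))      # 0.940
print(round(gain(wind, play), 3))   # 0.048
```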
Constructing the Decision Tree
Example: PlayTennis (1)
Entropy of D:

$$Entropy(D) = Entropy([9+, 5-]) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$
Example: PlayTennis (2)
Attribute Wind: D = [9+, 5-], D_weak = [6+, 2-], D_strong = [3+, 3-].

$$Gain(D, Wind) = Entropy(D) - \sum_{v \in \{weak,\, strong\}} \frac{|D_v|}{|D|} Entropy(D_v) = 0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1.00) = 0.048$$

[Figure: Wind splits [9+, 5-] (E = 0.940) into Weak [6+, 2-] (E = 0.811) and Strong [3+, 3-] (E = 1.0).]
Example: PlayTennis (3)
Attribute Humidity: D_high = [3+, 4-], D_normal = [6+, 1-].

$$Gain(D, Humidity) = Entropy(D) - \sum_{v \in \{high,\, normal\}} \frac{|D_v|}{|D|} Entropy(D_v) = 0.940 - \frac{7}{14}(0.985) - \frac{7}{14}(0.592) = 0.151$$

[Figure: Humidity splits [9+, 5-] (E = 0.940) into High [3+, 4-] (E = 0.985) and Normal [6+, 1-] (E = 0.592).]
Example: PlayTennis (4)
Best attribute? Gain(D, Outlook) = 0.246; Gain(D, Humidity) = 0.151; Gain(D, Wind) = 0.048; Gain(D, Temperature) = 0.029.

[Figure: Outlook splits [9+, 5-] (E = 0.940) into Sunny [2+, 3-] (D1, D2, D8, D9, D11), Overcast [4+, 0-] (D3, D7, D12, D13), labeled YES, and Rain [3+, 2-] (D4, D5, D6, D10, D14).]
Example: PlayTennis (5)
Entropy of D_sunny:

$$Entropy(D_{sunny}) = Entropy([2+, 3-]) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$$

Day | Outlook | Temperature | Humidity | Wind | PlayTennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
Example: PlayTennis (6)
Attribute Wind: D_weak = [1+, 2-], D_strong = [1+, 1-].

$$Gain(D_{sunny}, Wind) = Entropy(D_{sunny}) - \sum_{v \in \{weak,\, strong\}} \frac{|D_v|}{|D_{sunny}|} Entropy(D_v) = 0.971 - \frac{3}{5}(0.918) - \frac{2}{5}(1.00) = 0.020$$

[Figure: Wind splits [2+, 3-] (E = 0.971) into Weak [1+, 2-] (E = 0.918) and Strong [1+, 1-] (E = 1.0).]
Example: PlayTennis (7)
Attribute Humidity: D_high = [0+, 3-], D_normal = [2+, 0-].

$$Gain(D_{sunny}, Humidity) = Entropy(D_{sunny}) - \sum_{v \in \{high,\, normal\}} \frac{|D_v|}{|D_{sunny}|} Entropy(D_v) = 0.971 - \frac{3}{5}(0.00) - \frac{2}{5}(0.00) = 0.971$$

[Figure: Humidity splits [2+, 3-] (E = 0.971) into High [0+, 3-] (E = 0.00) and Normal [2+, 0-] (E = 0.00).]
Example: PlayTennis (8)
Best attribute? Gain(D_sunny, Humidity) = 0.971; Gain(D_sunny, Wind) = 0.020; Gain(D_sunny, Temperature) = 0.571.

[Figure: the partial tree. Outlook splits [9+, 5-] (E = 0.940) into Sunny, Overcast (YES), and Rain [3+, 2-] (D4, D5, D6, D10, D14); under Sunny, Humidity splits into High (NO) and Normal (YES).]
Example: PlayTennis (9)
Entropy of D_rain:

$$Entropy(D_{rain}) = Entropy([3+, 2-]) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971$$

Day | Outlook | Temperature | Humidity | Wind | PlayTennis
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D10 | Rain | Mild | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
Example: PlayTennis (10)
Attribute Wind: D_weak = [3+, 0-], D_strong = [0+, 2-].

$$Gain(D_{rain}, Wind) = Entropy(D_{rain}) - \sum_{v \in \{weak,\, strong\}} \frac{|D_v|}{|D_{rain}|} Entropy(D_v) = 0.971 - \frac{3}{5}(0.00) - \frac{2}{5}(0.00) = 0.971$$

[Figure: Wind splits [3+, 2-] (E = 0.971) into Weak [3+, 0-] (E = 0.00) and Strong [0+, 2-] (E = 0.00).]
Example: PlayTennis (11)
Attribute Humidity: D_high = [1+, 1-], D_normal = [2+, 1-].

$$Gain(D_{rain}, Humidity) = Entropy(D_{rain}) - \sum_{v \in \{high,\, normal\}} \frac{|D_v|}{|D_{rain}|} Entropy(D_v) = 0.971 - \frac{2}{5}(1.00) - \frac{3}{5}(0.918) = 0.020$$

[Figure: Humidity splits [3+, 2-] (E = 0.971) into High [1+, 1-] (E = 1.00) and Normal [2+, 1-] (E = 0.918).]
Example: PlayTennis (12)
Best attribute? Gain(D_rain, Humidity) = 0.020; Gain(D_rain, Wind) = 0.971; Gain(D_rain, Temperature) = 0.020.

[Figure: the final tree. Outlook is the root; Sunny leads to a Humidity test (High: NO, Normal: YES), Overcast leads to YES, and Rain leads to a Wind test (Strong: NO, Weak: YES).]
Avoiding Overfitting the Data
Definition: given a hypothesis space H, a hypothesis h ∈ H is said to overfit the data if there exists some alternative hypothesis h' ∈ H such that h has a smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.
Occam's Razor: prefer the simplest hypothesis that fits the data.
Avoiding Overfitting the Data
[Figure: accuracy over the training data keeps increasing with tree size, while accuracy over independent test data first rises and then falls.]
Solutions to Overfitting
1. Partition the examples into training, test, and validation sets.
2. Use all data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set.
3. Use an explicit measure of the complexity of encoding the training examples and the decision tree, halting growth of the tree when this encoding is minimized.
Summary
Decision trees provide a practical method for concept learning and for learning discrete-valued functions.
ID3 searches a complete hypothesis space.
Overfitting is an important issue in decision tree learning.
Day 1, Parts 4-5: Hands-on Exercises
Plan for the ML Exercises
Day 1: Classification. Program: Weka. Agenda: classification by Neural Network (NN) and Decision Tree (DT).
Day 2: Clustering. Program: Genesis. Agenda: k-means / hierarchical clustering, SOM.
Day 3: Bayesian Networks. Program: GeNIe. Agenda: designing / learning / inference in Bayesian networks.
Classification: The Tool for Practice
Weka 3: Data Mining Software in Java. A collection of machine learning algorithms for data mining tasks. With Weka you can do data pre-processing, feature selection, classification, regression, clustering, association rules, and visualization.
Weka is open source software issued under the GNU General Public License.
How to get it: http://www.cs.waikato.ac.nz/ml/weka/ or just search for 'Weka' on Google.
Classification Using Weka: Problems
Pima Indians Diabetes (2-class problem): the Pima Indians have the highest prevalence of diabetes in the world. We will build classification models that diagnose whether a patient shows signs of diabetes.
Handwritten Digit Recognition (10-class problem): the MNIST database of handwritten digits contains digits written by office workers and students. We will build a recognition model based on classifiers, using a reduced subset of MNIST.
Classification Using Weka: Algorithms
Neural Network (Multilayer Perceptron)
Decision Tree
Classification Using Weka
Load a file that contains the training data by clicking the 'Open file' button; 'ARFF' and 'CSV' formats are readable.
Click the 'Classify' tab, click the 'Choose' button, and select 'weka - functions - MultilayerPerceptron'.
Click 'MultilayerPerceptron', set the parameters for the MLP, set the parameters for testing, and click 'Start' to begin learning.
Classification Using Weka (cont.)
Result output: various measures appear.
Options for testing the trained model.
Day 1, Part 6: k-Nearest Neighbor
Different Learning Methods
Eager learning: an explicit description of the target function is built from the whole training set.
Instance-based learning: learning = storing all training instances; classification = assigning a target value to a new instance; referred to as "lazy" learning.
Kinds of instance-based learning: k-nearest neighbor algorithm, locally weighted regression, case-based reasoning.
Local vs. Distributed Representation
Lazy Learning vs. Eager Learning
k-Nearest Neighbor
Features:
All instances correspond to points in an n-dimensional Euclidean space.
Classification is delayed until a new instance arrives.
Classification is done by comparing feature vectors of the different points.
The target function may be discrete or real-valued.
k-Nearest Neighbor Classifier (kNN): Memory-Based Learning

[Figure: a learning set D = {(x1, y1=1), (x2, y2=1), (x3, y3=0), (x4, y4=0), (x5, y5=1), (x6, y6=0), ..., (xN, yN=1)} and a query pattern (x, y=?). With k = 5, the match set contains three 1's vs. two 0's, so y = 1.]
kNN as a Neural Network

[Figure: a network view of kNN with an input layer (vector xq), a hidden layer of stored instances (xi, yi), and an output yq.]
k-Nearest Neighbor (kNN)
Training: for each training example <x, f(x)>, add the example to the learning set D.
Classification: given a query instance x_q, let x_1, ..., x_k denote the k instances from D that are nearest to x_q, and return

$$\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$$

where δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise. This is memory-based or case-based learning. A minimal implementation follows.
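A direct sketch of this procedure in Python, with a toy two-cluster dataset as an illustrative assumption:

```python
import numpy as np
from collections import Counter

def knn_classify(X, y, x_q, k=5):
    """Store D = (X, y); classify x_q by majority vote of its k nearest points."""
    dists = np.linalg.norm(X - x_q, axis=1)        # Euclidean distances to x_q
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    return Counter(y[nearest]).most_common(1)[0][0]

X = np.array([[0.0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X, y, np.array([0.5, 0.5]), k=3))   # -> 0
```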
Generalizing kNN
Divide the input space into local regions and learn simple (constant/linear) models in each patch.
Unsupervised: competitive, online clustering.
Supervised: radial-basis function networks, mixtures of experts.
Radial-Basis Function Network
Locally-tuned units:

$$y^t = \sum_{h=1}^{H} w_h p_h^t + w_0, \qquad p_h^t = \exp\left(-\frac{\|\mathbf{x}^t - \mathbf{m}_h\|^2}{2 s_h^2}\right)$$

[Figure: network with a scalar input, a hidden layer of locally-tuned Gaussian units, and a scalar output.]
Training RBF
Hybrid learning:
First layer (centers and spreads): unsupervised k-means.
Second layer (weights): supervised gradient descent.
Fully supervised training is also possible (Broomhead and Lowe, 1988; Moody and Darken, 1989). A sketch of the hybrid scheme follows.
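Here is a rough sketch of that hybrid scheme, assuming a 1-D toy regression task; for simplicity the second layer is fit by linear least squares rather than gradient descent, and the single shared spread is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, H, iters=20):
    """Unsupervised first layer: place H centers with plain k-means."""
    m = X[rng.choice(len(X), H, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - m[None]) ** 2).sum(-1), axis=1)
        m = np.array([X[assign == h].mean(0) if np.any(assign == h) else m[h]
                      for h in range(H)])
    return m

def rbf_fit(X, r, H=10, s=1.0):                  # s: shared spread, assumed constant
    m = kmeans(X, H)
    P = np.exp(-((X[:, None] - m[None]) ** 2).sum(-1) / (2 * s ** 2))
    P = np.hstack([P, np.ones((len(X), 1))])     # append the bias unit w0
    w, *_ = np.linalg.lstsq(P, r, rcond=None)    # supervised second layer
    return P @ w                                  # fitted outputs y^t

X = np.linspace(-3, 3, 60)[:, None]
r = np.sin(X).ravel()
print(round(float(np.abs(rbf_fit(X, r) - r).max()), 3))  # small residual
```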
Regression

$$E\left(\{\mathbf{m}_h, s_h, w_{ih}\}_h \mid X\right) = \frac{1}{2} \sum_t \sum_i \left(r_i^t - y_i^t\right)^2, \qquad y_i^t = \sum_{h=1}^{H} w_{ih} p_h^t + w_{i0}$$

$$\Delta w_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) p_h^t$$

$$\Delta m_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) w_{ih}\right] p_h^t \frac{\left(x_j^t - m_{hj}\right)}{s_h^2}$$

$$\Delta s_h = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) w_{ih}\right] p_h^t \frac{\|\mathbf{x}^t - \mathbf{m}_h\|^2}{s_h^3}$$
Classification

$$E\left(\{\mathbf{m}_h, s_h, w_{ih}\}_h \mid X\right) = -\sum_t \sum_i r_i^t \log y_i^t, \qquad y_i^t = \frac{\exp\left(\sum_h w_{ih} p_h^t + w_{i0}\right)}{\sum_k \exp\left(\sum_h w_{kh} p_h^t + w_{k0}\right)}$$
Rules and Exceptions

$$y^t = \sum_{h=1}^{H} w_h p_h^t + \mathbf{v}^T \mathbf{x}^t + v_0$$

[Figure: the linear term provides the default rule; the localized Gaussian units model the exceptions.]
Rule-Based Knowledge
Incorporation of prior knowledge (before training); rule extraction (after training) (Tresp et al., 1997); fuzzy membership functions and fuzzy rules.

IF ((x1 ≈ a) AND (x2 ≈ b)) OR (x3 ≈ c) THEN y ≈ 0.1

$$p_1 = \exp\left(-\frac{(x_1 - a)^2}{2 s_1^2}\right) \exp\left(-\frac{(x_2 - b)^2}{2 s_2^2}\right) \text{ with } w_1 = 0.1$$

$$p_2 = \exp\left(-\frac{(x_3 - c)^2}{2 s_3^2}\right) \text{ with } w_2 = 0.1$$
Day 1, Part 7: Support Vector Machines
Support Vector Machines

[Figure: SVM architecture. Inputs x1, ..., xn feed a hidden layer of m inner-product kernels K(x, x1), ..., K(x, xm); a weighted sum with bias b produces the output neuron y.]

Type of SVM | Inner product kernel K(x, x_i)
Polynomial learning machine | $(\mathbf{x}^T \mathbf{x}_i + 1)^p$
Radial-basis function network | $\exp\left(-\frac{1}{2\sigma^2}\|\mathbf{x} - \mathbf{x}_i\|^2\right)$
Two-layer perceptron | $\tanh\left(\beta_0\, \mathbf{x}^T \mathbf{x}_i + \beta_1\right)$
Support Vector Machines
The line that maximizes the minimum margin is a good bet.
The model class of "hyperplanes with a margin of m" has a low VC dimension if m is big.
This maximum-margin separator is determined by a subset of the datapoints, called the "support vectors". It is computationally useful if only a small fraction of the datapoints are support vectors, because we use the support vectors to decide which side of the separator a test case is on.
[Figure: the support vectors are indicated by the circles around them.]
Training a linear SVM
To find the maximum-margin separator, we have to solve the following optimization problem:

$$\mathbf{w} \cdot \mathbf{x}^c + b > 1 \text{ for positive cases}, \qquad \mathbf{w} \cdot \mathbf{x}^c + b < -1 \text{ for negative cases}, \qquad \text{and } \|\mathbf{w}\|^2 \text{ as small as possible}$$

This is tricky but it's a convex problem: there is only one optimum, and we can find it without fiddling with learning rates, weight decay, or early stopping.
Don't worry about the optimization problem. It has been solved; it's called quadratic programming. It takes time proportional to N^2, which is really bad for very big datasets, so for big datasets we end up doing approximate optimization!
Testing a linear SVM
The separator is defined as the set of points for which

$$\mathbf{w} \cdot \mathbf{x} + b = 0$$

so if $\mathbf{w} \cdot \mathbf{x}^c + b > 0$, say it's a positive case, and if $\mathbf{w} \cdot \mathbf{x}^c + b < 0$, say it's a negative case.
Introducing slack variables
Slack variables are constrained to be non-negative. When they are greater than zero, they allow us to cheat by putting the plane closer to the datapoint than the margin, so we need to minimize the amount of cheating. This means we have to pick a value for lambda (this sounds familiar!):

$$\mathbf{w} \cdot \mathbf{x}^c + b \ge 1 - \xi^c \text{ for positive cases}$$
$$\mathbf{w} \cdot \mathbf{x}^c + b \le -1 + \xi^c \text{ for negative cases}$$
$$\text{with } \xi^c \ge 0 \text{ for all } c, \qquad \frac{\|\mathbf{w}\|^2}{2} + \lambda \sum_c \xi^c \text{ as small as possible}$$
A picture of the best plane with a slack variable
How to make a plane curved
Fitting hyperplanes as separators is mathematically easy: the mathematics is linear.
By replacing the raw input variables with a much larger set of features, we get a nice property: a planar separator in the high-dimensional space of feature vectors is a curved separator in the low-dimensional space of the raw input variables.
[Figure: a planar separator in a 20-D feature space projected back to the original 2-D space.]
A potential problem and a magic solution
If we map the input vectors into a very high-dimensional feature space, surely the task of finding the maximum-margin separator becomes computationally intractable? The mathematics is all linear, which is good, but the vectors have a huge number of components, so taking the scalar product of two vectors is very expensive.
The way to keep things tractable is to use "the kernel trick".
The kernel trick makes your brain hurt when you first learn about it, but it's actually very simple.
The kernel trick
For many mappings from a low-D space to a high-D space, there is a simple operation on two vectors in the low-D space that can be used to compute the scalar product of their two images in the high-D space:

$$K(\mathbf{x}_a, \mathbf{x}_b) = \phi(\mathbf{x}_a) \cdot \phi(\mathbf{x}_b)$$

Letting the kernel do the work replaces doing the scalar product in the obvious way on the high-D images $\phi(\mathbf{x}_a)$ and $\phi(\mathbf{x}_b)$.
Kernel Examples
1. Gaussian kernel
2. Polynomial kernel
The classification ruleThe classification rule The final classification rule is quite simple:
All the cleverness goes into selecting the support vectors that maximize the margin and computing the weight to use on each support vector.
We also need to choose a good kernel function and we may need to choose a lambda for dealing with non-separable cases.
SVs
stests xxKwbias
0),(
The set of support vectors
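A sketch of this rule in Python; the support vectors, weights, and bias are assumed to come from a solved optimization (here they are made-up toy values), and the RBF kernel is one of the common choices listed below.

```python
import numpy as np

def svm_predict(x_test, support_vectors, weights, bias, kernel):
    """sign(bias + sum_s w_s * K(x_test, x_s)) over the support vectors."""
    score = bias + sum(w_s * kernel(x_test, x_s)
                       for w_s, x_s in zip(weights, support_vectors))
    return 1 if score > 0 else -1

rbf = lambda x, y, sigma=1.0: np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
svs = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]   # toy support vectors
ws, b = [1.0, -1.0], 0.0                               # toy weights and bias
print(svm_predict(np.array([0.9, 1.2]), svs, ws, b, rbf))   # -> 1
```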
Some commonly used kernels

Polynomial: $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^p$
Gaussian radial basis function: $K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x} - \mathbf{y}\|^2 / 2\sigma^2}$
Neural net: $K(\mathbf{x}, \mathbf{y}) = \tanh(k\, \mathbf{x} \cdot \mathbf{y} - \delta)$

The parameters ($p$, $\sigma$, $k$, $\delta$) must be chosen by the user.
For the neural network kernel, there is one "hidden unit" per support vector, so the process of fitting the maximum-margin hyperplane decides how many hidden units to use. Also, it may violate Mercer's condition.
Support Vector Machines are Perceptrons!
SVMs use each training case, x, to define a feature K(x, .), where K is chosen by the user; so the user designs the features.
Then they do "feature selection" by picking the support vectors, and they learn how to weight the features by solving a big optimization problem.
So an SVM is just a very clever way to train a standard perceptron. All of the things that a perceptron cannot do cannot be done by SVMs (but it's a long time since 1969, so people have forgotten this).
Supplement: SVM as a Kernel Machine

Kernel Methods Approach
The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space. The expectation is that the feature space has a much higher dimension than the input space.
Form of the Functions
Kernel methods therefore use linear functions in a feature space,

$$f(\mathbf{x}) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b$$

For regression this can serve directly as the predicted value; for classification the output must be thresholded, e.g. $\mathrm{sign}(f(\mathbf{x}))$.
Controlling generalisation
The critical method of controlling generalisation is to force a large margin on the training data.
Support Vector Machines
The SVM optimization addresses the generalization issue, but not the computational cost of dealing with large feature vectors.
Complexity problem
Applying the quadratic feature map to a 20 x 30 image of 600 pixels gives approximately 180000 dimensions! It would be computationally infeasible to work in this space directly.
Dual Representation
Suppose the weight vector is a linear combination of the training examples: $\mathbf{w} = \sum_i \alpha_i \mathbf{x}_i$. Then we can evaluate the inner product with a new example as $\langle \mathbf{w}, \mathbf{x} \rangle = \sum_i \alpha_i \langle \mathbf{x}_i, \mathbf{x} \rangle$.
Learning the dual variablesLearning the dual variables Since any component orthogonal to the space spanned by the
training data has no effect, general result that weight vectors have dual representation: the representer theorem.
Hence, can reformulate algorithms to learn dual variables rather than weight vector directly
Dual Form of SVM
The dual form of the SVM can also be derived by taking the dual optimisation problem. In its standard form: maximise $\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$ subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.
Note that the threshold b must be determined from border examples.
Using Kernels
The critical observation is that, again, only inner products are used. Suppose that we have a shortcut method of computing $\kappa(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle$. Then we do not need to compute the feature vectors explicitly, either in training or in testing.
Kernel example
As an example, consider the quadratic feature map $\phi(\mathbf{x}) = (x_i x_j)_{i,j=1}^{n}$, which lists all pairwise products of the components. Here we have a shortcut: $\langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle = \langle \mathbf{x}, \mathbf{z} \rangle^2$.
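A quick numerical check of this shortcut (toy vectors; the explicit feature map is the pairwise-product map just described):

```python
import numpy as np

def phi(x):
    """Quadratic feature map: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
explicit = phi(x) @ phi(z)             # scalar product in the feature space
shortcut = (x @ z) ** 2                # kernel computed in the input space
print(np.isclose(explicit, shortcut))  # True
```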
Efficiency
Hence, in the pixel example, rather than work with 180000-dimensional vectors, we compute a 600-dimensional inner product and then square the result! We can even work in infinite-dimensional spaces, e.g. using the Gaussian kernel $\kappa(\mathbf{x}, \mathbf{z}) = \exp\left(-\|\mathbf{x} - \mathbf{z}\|^2 / 2\sigma^2\right)$.
Constraints on the kernelConstraints on the kernel There is a restriction on the function:
This restriction for any training set is enough to guarantee function is a kernel
What Have We Achieved?
The problem of choosing a neural network architecture is replaced by a kernel definition. This is arguably more natural to define, though the restriction is a bit unnatural. It is not a silver bullet, since the fit with the data is key, but it can be applied to non-vectorial (or high-dimensional) data.
We gained more flexible regularization/generalization control, and a convex optimization problem, i.e. NO local minima!
However, choosing the right kernel remains a design issue.
Questions (Day 1)
What are the differences among supervised, unsupervised, and reinforcement learning?
Explain learning by error correction in a Neural Net (NN).
Explain entropy-based learning in a Decision Tree (DT).
What is the learning principle of the k-Nearest Neighbor (kNN) method?
What is the learning principle of the Support Vector Machine (SVM)?
Compare the strengths and weaknesses of NN, DT, kNN, and SVM. Which applications is each suited to?