inc 551 artificial intelligence

INC 551 Artificial Intelligence

Lecture 9

Introduction to Machine Learning

What is machine learning (or computer learning)?

In practice, it means finding a suitable function that maps inputs to outputs.

In terms of purpose, it means the computer adapts itself based on the data fed into it.

Definition of Learning

A computer program is said to “learn” from experience E with respect to some class of tasks T and performance measure P, if its performance improves with experience E

Tom Mitchell, 1997

To learn =

To change parameters in the world model

Deliberative Agent

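[Figure: a deliberative agent in a loop with its environment. The agent senses and perceives the environment, uses its world model to make a decision, and acts on the environment.]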

How do we create a world model that represents the real world?

Car Model

Input: throttle amount (x)
Output: speed (v)

$v = 3x^2 + 2x + 5$

$v = ax^2 + bx + c$

Learning as function Mapping

Find better function mapping

Target function: Add+3

Before adapting: input 2 → output 4, performance error = 1
After adapting: input 2 → output 4.8, performance error = 0.2

Learning Design Issues

1. Components of the performance to be learned

2. Feedback (supervised, reinforcement, unsupervised)

3. Representation (function, tree, neural net, state-action model, genetic code)

Types of Learning

• Supervised Learning: a teacher tells the learner what is good and what is bad, like studying in a classroom.

• Reinforcement Learning: learning by doing. The student chooses what to try, and the teacher only says whether the result was good or bad.

• Unsupervised Learning: no teacher says anything. The student sorts things into two groups, but still does not know which group is good and which is bad.

Supervised Learning

In general, the data are divided into a training set and a test set. The training set is used for learning and the test set for testing. The data specify which samples are Type A, B, C, …

Learning: Training set (features, type) → Learner
Usage: Test set (features) → Learner → type (answer)

Graph Fitting

Find a function f that is consistent with all samples

x     f(x)   type
1     3.2    B
1.5   5.6    A
4     2.3    B
2     -3.1   B
7     4.4    A
5.5   1.1    B
5     4.2    A

Data Mapping

[Figure: three numbered plot panels (1, 2, 3)]

Use the Least Mean Square algorithm.

Overfit

Ockham’s razor principle

“Prefer the simplest”

Least Mean Square Algorithm

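Let $y = \mathbf{w}^T \mathbf{x}$.

We can find the weight vector recursively using

$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu\,\mathbf{x}(n)\,[\,y(n) - \mathbf{w}^T(n)\,\mathbf{x}(n)\,]$

where $n$ = current state and $\mu$ = step size.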

MATLAB Example

Match x, y pairs:
x = [1 2 7 4 -3.2]
y = [2.1 4.2 13.7 8.1 -6.5]

[Figure: five plots of the fitted line against the x, y pairs, after Epoch 1, Epoch 2, Epoch 3, Epoch 4, and Epoch 8. Horizontal axis from -4 to 8, vertical axis from -8 to 14.]
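The LMS recursion can be sketched in Python on the x, y pairs from the MATLAB example. This is a minimal sketch, not the lecture's MATLAB code: the function name `lms_fit`, the step size `mu=0.01`, the zero initial weights, and the constant bias input of 1.0 are all assumptions made for illustration.

```python
def lms_fit(xs, ys, mu=0.01, epochs=8):
    """Fit y ~ w[0]*x + w[1] with the LMS rule w <- w + mu * x * (y - w.x)."""
    w = [0.0, 0.0]  # [slope, intercept]; starting from zero is an assumption
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            inputs = [x, 1.0]  # assumed constant input so w[1] acts as a bias
            prediction = sum(wi * xi for wi, xi in zip(w, inputs))
            error = y - prediction  # y(n) - w^T(n) x(n)
            w = [wi + mu * xi * error for wi, xi in zip(w, inputs)]
    return w


xs = [1, 2, 7, 4, -3.2]
ys = [2.1, 4.2, 13.7, 8.1, -6.5]

for epochs in (1, 2, 3, 4, 8):
    slope, intercept = lms_fit(xs, ys, epochs=epochs)
    print(f"epoch {epochs}: slope={slope:.3f}, intercept={intercept:.3f}")
```

Each pass over the five pairs is one epoch, so the loop prints the fitted line at the same epochs as the plots. The step size has to stay small because the update is scaled by the input: with x = 7, a large `mu` would make the weights overshoot.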

Neural Network

1 brain = 100,000,000,000 neurons

Neuron model

Mathematical Model of Neuron

Activation Function

Step function Sigmoid function

Network of Neurons

Networks can be divided into two kinds:

1. Single layer feed-forward network (perceptron)
2. Multilayer feed-forward network

Single layer Network

[Figure: a single neuron with inputs x0, x1, x2, weights W0, W1, W2, and output y]

Assume there is no activation function:

$y = \sum_{j=0}^{n} W_j x_j$

which is a linear equation.

When the activation function is a step function, the neuron acts like a straight line that separates the groups. Where this line falls depends on the weights $w_j$.

Perceptron Algorithm

Derived from the least-squares method. It is used to adjust the neuron's weights to suitable values.

α = learning rate
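The slide's update formula did not survive extraction, so the sketch below uses the commonly taught perceptron rule, $W_j \leftarrow W_j + \alpha\,(\text{target} - \text{output})\,x_j$, as an assumption rather than as the lecture's exact equation. The AND-gate data, the names `train_perceptron` and `predict`, and the value `alpha=0.5` are illustrative choices.

```python
def step(z):
    """Step activation: fire (1) when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0


def train_perceptron(samples, alpha=0.5, epochs=20):
    """Train a single neuron with inputs x0 = 1 (bias), x1, x2."""
    w = [0.0, 0.0, 0.0]  # W0, W1, W2
    for _ in range(epochs):
        for (x1, x2), target in samples:
            x = [1.0, x1, x2]
            output = step(sum(wj * xj for wj, xj in zip(w, x)))
            error = target - output  # 0 when the sample is already correct
            w = [wj + alpha * error * xj for wj, xj in zip(w, x)]
    return w


# Assumed toy data: the logical AND of two binary inputs.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(and_gate)


def predict(x1, x2):
    return step(weights[0] + weights[1] * x1 + weights[2] * x2)


print(weights, [predict(a, b) for (a, b), _ in and_gate])
```

The learned weights define the straight line that separates the two groups. AND is linearly separable, so the rule settles after a few epochs; a problem such as XOR would never converge with a single layer.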

Multilayer Neural Network

Hidden layers are added. The network learns with the Back-Propagation Algorithm.

Back-Propagation Algorithm

Learning Progress

The training set is fed in many times; each pass is called an epoch.

Types of Learning

• Supervised Learning: a teacher tells the learner what is good and what is bad, like studying in a classroom.

• Reinforcement Learning: learning by doing. The student chooses what to try, and the teacher only says whether the result was good or bad.

• Unsupervised Learning: no teacher says anything. The student sorts things into two groups, but still does not know which group is good and which is bad.

Source: Reinforcement Learning: An Introduction, Richard Sutton and Andrew Barto, MIT Press, 2002

Supervised Learning

Inputs → Supervised Learning System → Outputs
Training Info = desired (target) outputs (features/class)

Reinforcement Learning

Inputs → RL System → Outputs (“actions”)
Evaluations (“rewards” / “penalties”)
Environment

Properties of RL

• Learner is not told which actions to take

• Trial-and-Error search

• Possibility of delayed reward
  - Sacrifice short-term gains for greater long-term gains

• The need to explore and exploit

• Considers the whole problem of a goal-directed agent interacting with an uncertain environment

Model of RL

[Figure: the agent sends an action to the environment; the environment returns a state and a reward to the agent]

Key components: state, action, reward, and transition

Agent: I am in state 134 and I choose action 12.
Environment: You receive 29 points and move to state 113.

State   Action   Reward   Next state
134     12       29       113
:       :        :        :

Example: Tic-Tac-Toe

[Figure: game tree of tic-tac-toe positions, branching from the starting board through alternating levels labeled x's move, o's move, x's move, and so on]

Assume an imperfect opponent: he/she sometimes makes mistakes

1. Make a table with one entry per state:

State                            V(s): estimated probability of winning
[board]                          .5   ?
[board]                          .5   ?
. . .                            . . .
[board, x has three in a row]    1    win
. . .                            . . .
[board, o has three in a row]    0    loss
. . .                            . . .
[board, full with no winner]     0    draw

2. Now play lots of games.

To pick our moves, look ahead one step from the current state to the various possible next states. Just pick the next state with the highest estimated probability of winning, the largest V(s).

s: the state before the move
s': the state after the move

We increment each V(s) toward V(s'), a backup:

$V(s) \leftarrow V(s) + \alpha\,[\,V(s') - V(s)\,]$

where $\alpha$ is the step-size parameter, a small positive fraction, e.g., $\alpha = 0.1$.
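The one-step look-ahead and the backup toward V(s') can be sketched as follows. This is a minimal sketch under stated assumptions: states are represented by placeholder strings rather than real board positions, unseen states default to 0.5 as in the table above, and the names `pick_greedy` and `td_backup` are made up for illustration.

```python
from collections import defaultdict


def pick_greedy(values, next_states):
    """Look ahead one step: choose the next state with the largest V(s)."""
    return max(next_states, key=lambda s: values[s])


def td_backup(values, state, next_state, alpha=0.1):
    """Move V(state) a fraction alpha of the way toward V(next_state)."""
    values[state] += alpha * (values[next_state] - values[state])


# Value table: unknown states start at 0.5, known wins at 1, losses at 0.
values = defaultdict(lambda: 0.5)
values["x_wins"] = 1.0
values["o_wins"] = 0.0

state_before = "before_move"  # hypothetical state label
state_after = pick_greedy(values, ["x_wins", "o_wins", "unexplored"])
td_backup(values, state_before, state_after)
print(state_after, values[state_before])
```

After one backup the value of the earlier state has moved from 0.5 toward 1, by the fraction `alpha`. Repeating this over many games lets information about wins and losses flow backward to earlier positions.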

Table → Generalizing Function

[Figure: a table listing a value V for each state s_1, s_2, s_3, ..., s_N, next to a function that maps a state to V]

This is the same as function mapping.

Value Table

State Value Table (V-table): 1 dimension, indexed by state
Action Value Table (Q-table): 2 dimensions, indexed by state and action

[Figure: an example V-table and an example Q-table filled with numeric values]

Examples of RL Implementations

TD-Gammon (Tesauro, 1992–1995)

Start with a random network
Play very many games against self
Learn a value function from this simulated experience
Action selection by 2–3 ply search

Network output: Value
TD error: $V_{t+1} - V_t$

Elevator Dispatching (Crites and Barto, 1996)

10 floors, 4 elevator cars

STATES: button states; positions, directions, and motion states of cars; passengers in cars & in halls

ACTIONS: stop at, or go by, next floor

REWARDS: roughly, -1 per time step for each person waiting

Conservatively about $10^{22}$ states

Issues in Reinforcement Learning

• Trade-off between exploration and exploitation
  - ε-greedy
  - softmax

• Algorithms to find the value function for delayed reward
  - Dynamic Programming
  - Monte Carlo
  - Temporal Difference

n-Armed Bandit Problem

Slot Machine

[Figure: a slot machine with five levers, numbered 1 to 5]

A slot machine has several levers, and each lever pays a different reward.

Suppose that after playing for a while we have observed the following:

Lever   Plays   Average reward (baht/play)
1       26      4
2       14      3
3       10      2
4       16      102

Exploration and Exploitation

Two questions arise:
• Should we go on and try lever 5?
• How accurate are the average rewards observed so far?

Exploitation means using what has been learned, i.e. playing lever 4 all the time. This is called being greedy.
Exploration means continuing to explore by trying more of what has not been tried yet.

The two must be balanced.

ε-greedy Action Selection

Greedy: choose the action that gives the highest estimated return.

$a_t = a_t^* = \arg\max_a Q_t(a)$

ε-greedy:

$a_t = a_t^*$ with probability $1 - \varepsilon$
$a_t$ = random action with probability $\varepsilon$
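The selection rule can be written as a few lines of Python. This is a sketch under assumptions: the function name `epsilon_greedy` and the zero-based action indices are illustrative, and the estimates reuse the average rewards for levers 1 to 4 from the slot machine example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: any action
    # exploit: the action with the largest current estimate Q_t(a)
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Average rewards observed for levers 1..4 (indices 0..3).
q_estimates = [4.0, 3.0, 2.0, 102.0]

counts = [0] * len(q_estimates)
for _ in range(1000):
    counts[epsilon_greedy(q_estimates, epsilon=0.1)] += 1
print(counts)
```

With `epsilon=0` the rule is purely greedy and always returns lever 4 (index 3). With a small positive `epsilon` the other levers are still tried occasionally, which is what keeps the estimates from going stale.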

Test: 10-armed Bandit Problem

• n = 10 possible actions
• Each $Q^*(a)$ is chosen randomly from a normal distribution: $N(0, 1)$
• Each $r_t$ is also normal: $N(Q^*(a_t), 1)$
• 1000 plays
• Repeat the whole thing 2000 times and average the results
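A scaled-down version of this test can be sketched as below. The assumptions are that action values are estimated by sample averages starting from zero, that 200 repetitions are used instead of 2000 to keep the run short, and that the names `run_bandit` and `average_reward` are illustrative. The numbers it prints will not match the lecture's plots exactly.

```python
import random


def run_bandit(epsilon, n_arms=10, plays=1000, rng=None):
    """Play one randomly generated bandit task; return the reward at each play."""
    rng = rng or random.Random()
    true_values = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]  # Q*(a) ~ N(0, 1)
    estimates = [0.0] * n_arms  # Q_t(a), assumed to start at zero
    counts = [0] * n_arms
    rewards = []
    for _ in range(plays):
        if rng.random() < epsilon:
            action = rng.randrange(n_arms)  # explore
        else:
            action = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_values[action], 1.0)  # r_t ~ N(Q*(a_t), 1)
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        rewards.append(reward)
    return rewards


def average_reward(epsilon, runs=200, plays=1000, seed=0):
    """Average the reward at each play over many independent bandit tasks."""
    rng = random.Random(seed)
    totals = [0.0] * plays
    for _ in range(runs):
        for t, reward in enumerate(run_bandit(epsilon, plays=plays, rng=rng)):
            totals[t] += reward
    return [total / runs for total in totals]


for eps in (0.0, 0.01, 0.1):
    curve = average_reward(eps)
    print(f"epsilon={eps}: mean reward over last 100 plays = "
          f"{sum(curve[-100:]) / 100:.2f}")
```

Averaging over many independently generated tasks is what smooths the curves: a single run is dominated by the luck of which $Q^*(a)$ values were drawn.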

Results

SoftMax Action Selection

Softmax chooses actions according to the size of their estimated reward: the larger the reward, the higher the probability of choosing the greedy action. The probabilities are computed from the Gibbs-Boltzmann distribution.

Choose action $a$ on play $t$ with probability

$\dfrac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$,

where $\tau$ is the “computational temperature”.
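The effect of the temperature is easiest to see numerically. The sketch below computes the Gibbs-Boltzmann probabilities for a list of estimates; the function name `softmax_probabilities` and the example values are assumptions, and the maximum is subtracted before exponentiating only for numerical stability (it does not change the result).

```python
import math


def softmax_probabilities(q_values, tau):
    """Probability of each action under the Gibbs-Boltzmann distribution."""
    largest = max(q_values)  # subtracting it avoids overflow in exp()
    weights = [math.exp((q - largest) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q_estimates = [4.0, 3.0, 2.0]
for tau in (0.1, 1.0, 100.0):
    probs = softmax_probabilities(q_estimates, tau)
    print(tau, [round(p, 3) for p in probs])
```

A low temperature makes the choice almost greedy, while a high temperature makes all actions nearly equally likely. Unlike ε-greedy, exploration here favors actions whose estimates are close to the best one.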

Algorithms to find the Value Function

• Incremental Implementation
• Markov decision process (MDP)
• Value Function Characteristics
• Bellman Equation
• Solution methods

Incremental Implementation

The value function is the average of the rewards received over many plays:

$Q_k = \dfrac{r_1 + r_2 + \cdots + r_k}{k}$

Computed this way, Q must be recalculated every time a reward arrives, and every past reward must be stored.

The incremental approach instead updates Q with each reward as it arrives.

This is a common form for update rules:

NewEstimate = OldEstimate + StepSize[Target – OldEstimate]
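A short sketch shows that this update rule reproduces the plain average when the step size is 1/k. The function name `incremental_mean` and the reward values are made up for illustration.

```python
def incremental_mean(rewards):
    """Running average using NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    estimate = 0.0
    for k, reward in enumerate(rewards, start=1):
        step_size = 1.0 / k  # this step size makes the estimate the exact mean
        estimate += step_size * (reward - estimate)
    return estimate


rewards = [4.0, 2.0, 6.0, 4.0]
batch_average = sum(rewards) / len(rewards)
print(incremental_mean(rewards), batch_average)
```

Only the current estimate and the count k need to be kept, not the full reward history. Using a constant step size instead of 1/k gives more weight to recent rewards, which is the form used in the tic-tac-toe backup.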

Policy

A policy is the way an action is chosen based on the state the agent is in.

Policy at step $t$, $\pi_t$: a mapping from states to action probabilities

$\pi_t(s, a)$ = probability that $a_t = a$ when $s_t = s$

Goal: To maximize total reward (how is the total reward computed?)

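Types of Tasks

Episodic tasks can be divided into episodes, e.g. games, mazes.

$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$, where $T$ is the time step at which the terminal state is reached.

Non-episodic tasks have no end point, so the discount method is used.

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,

where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.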
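The discounted sum can be evaluated for a finite list of future rewards with a short backward loop. This is a sketch: the function name `discounted_return` and the reward sequence are illustrative, and an infinite task is approximated by truncating the sequence.

```python
def discounted_return(rewards, gamma):
    """R_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ... for a finite list."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total  # R_t = r_{t+1} + gamma * R_{t+1}
    return total


future_rewards = [1.0, 1.0, 1.0]  # r_{t+1}, r_{t+2}, r_{t+3}
print(discounted_return(future_rewards, gamma=0.5))
print(discounted_return(future_rewards, gamma=1.0))
```

With `gamma=1` the result is the plain sum used for episodic tasks. With `gamma` below 1, later rewards count for less, which keeps the return finite even when the task never ends.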

Markov Decision Process

RL assumes that the model of the problem takes the form of a Markov Decision Process (MDP). This model has four key parts:

State, Action, Reward, Transition

State = $s$
Action = $a$

Reward:

$R^a_{ss'} = E\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}$ for all $s, s' \in S$, $a \in A(s)$.

Transition:

$P^a_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\}$ for all $s, s' \in S$, $a \in A(s)$.

An MDP can be drawn as a state transition diagram.

Value Function

• The value of a state is the expected return starting from that state; it depends on the agent's policy.

State-value function for policy $\pi$:

$V^\pi(s) = E_\pi\{\, R_t \mid s_t = s \,\} = E_\pi\{\, \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \,\}$

• The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.

Action-value function for policy $\pi$:

$Q^\pi(s, a) = E_\pi\{\, R_t \mid s_t = s,\ a_t = a \,\} = E_\pi\{\, \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s,\ a_t = a \,\}$

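Bellman Equation for a Policy

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots$

$R_t = r_{t+1} + \gamma\,(r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots)$

$R_t = r_{t+1} + \gamma R_{t+1}$

So: $V^\pi(s) = E_\pi\{\, R_t \mid s_t = s \,\} = E_\pi\{\, r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \,\}$

Or, without the expectation operator:

$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \,[\, R^a_{ss'} + \gamma V^\pi(s') \,]$

This computes the value function from the policy $\pi$.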

Example: Grid World

Action = {up, down, left, right}

• From A, any action moves the agent to A' with reward = 10.
• From B, any action moves the agent to B' with reward = 5.
• Hitting the wall gives reward = -1.
• Otherwise reward = 0.

$\gamma = 0.9$

The value function can be computed as shown in figure (b).
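The Bellman equation for a policy can be applied to this grid world by sweeping over the states until the values stop changing. The figure did not survive extraction, so the layout below is an assumption taken from the version of this example in the Sutton and Barto book: a 5x5 grid, A at row 0, column 1 jumping to A' at row 4, column 1, B at row 0, column 3 jumping to B' at row 2, column 3, a policy that picks each of the four actions with probability 0.25, and a discount rate of 0.9.

```python
GAMMA = 0.9                                   # assumed discount rate
SIZE = 5                                      # assumed 5x5 grid
A, A_PRIME = (0, 1), (4, 1)                   # assumed positions (row, col)
B, B_PRIME = (0, 3), (2, 3)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right


def step(state, move):
    """Deterministic transition: return (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    row, col = state[0] + move[0], state[1] + move[1]
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return (row, col), 0.0
    return state, -1.0                        # hit the wall: stay in place


def evaluate_random_policy(tolerance=1e-6):
    """Iterate V(s) = sum_a pi(s,a) * [R + gamma * V(s')] until it converges."""
    values = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    while True:
        delta = 0.0
        for state in values:
            new_value = 0.0
            for move in MOVES:
                next_state, reward = step(state, move)
                new_value += 0.25 * (reward + GAMMA * values[next_state])
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value
        if delta < tolerance:
            return values


values = evaluate_random_policy()
for row in range(SIZE):
    print(" ".join(f"{values[(row, col)]:5.1f}" for col in range(SIZE)))
```

Because every transition here is deterministic, the sum over next states collapses to a single term per action. The value at A ends up below its immediate reward of 10, since A' sits on the bottom edge where the random policy often hits the wall.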

Optimal Value Function

• For finite MDPs, policies can be partially ordered:

$\pi \ge \pi'$ if and only if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$

• There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all $\pi^*$.

• Optimal policies share the same optimal state-value function:

$V^*(s) = \max_\pi V^\pi(s)$ for all $s \in S$

• Optimal policies also share the same optimal action-value function:

$Q^*(s, a) = \max_\pi Q^\pi(s, a)$ for all $s \in S$ and $a \in A(s)$

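Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state:

$V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s, a)$

$V^*(s) = \max_{a \in A(s)} E\{\, r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s,\ a_t = a \,\}$

$V^*(s) = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \,[\, R^a_{ss'} + \gamma V^*(s') \,]$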

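Bellman Optimality Equation for Q*

$Q^*(s, a) = E\{\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s,\ a_t = a \,\}$

$Q^*(s, a) = \sum_{s'} P^a_{ss'} \,[\, R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \,]$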

Why Optimal State-Value Functions Are Useful

Any policy that is greedy with respect to $V^*$ is an optimal policy.

Therefore, given $V^*$, one-step-ahead search produces the long-term optimal actions.
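One way to see this is to compute the optimal values by repeatedly applying the Bellman optimality backup (value iteration, one of the dynamic programming methods) and then read off the greedy action with a one-step look-ahead. The sketch reuses the grid world under the same assumptions as the Sutton and Barto version: 5x5 grid, A at (0, 1) jumping to (4, 1) with reward 10, B at (0, 3) jumping to (2, 3) with reward 5, and a discount rate of 0.9. The function names are illustrative.

```python
GAMMA = 0.9                                   # assumed discount rate
SIZE = 5                                      # assumed 5x5 grid
JUMPS = {(0, 1): ((4, 1), 10.0),              # assumed: A -> A', reward 10
         (0, 3): ((2, 3), 5.0)}               # assumed: B -> B', reward 5
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, move):
    """Deterministic transition: return (next_state, reward)."""
    if state in JUMPS:
        return JUMPS[state]
    row, col = state[0] + move[0], state[1] + move[1]
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return (row, col), 0.0
    return state, -1.0                        # hit the wall: stay in place


def backup(values, state, move):
    """One-step look-ahead value: reward + gamma * V(next state)."""
    next_state, reward = step(state, move)
    return reward + GAMMA * values[next_state]


def value_iteration(tolerance=1e-6):
    """Apply V(s) <- max_a [R + gamma * V(s')] until the values settle."""
    values = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    while True:
        delta = 0.0
        for state in values:
            best = max(backup(values, state, m) for m in MOVES.values())
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tolerance:
            return values


def greedy_action(values, state):
    """Pick the action whose one-step look-ahead value is largest."""
    return max(MOVES, key=lambda name: backup(values, state, MOVES[name]))


optimal_values = value_iteration()
print(round(optimal_values[(0, 1)], 1), greedy_action(optimal_values, (4, 1)))
```

Once the optimal values are known, no further planning is needed: looking one step ahead from A' already selects the move that heads back toward A, even though the reward for doing so arrives several steps later.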

Example: Grid World

The job of RL is to find the optimal value function.
