Reinforcement Learning. Speaker: 虞台文, Graduate Institute of Computer Science and Engineering, 大同大學 (Tatung University), Intelligent Multimedia Lab
TRANSCRIPT
Reinforcement Learning
Speaker: 虞台文
Graduate Institute of Computer Science and Engineering, 大同大學 (Tatung University), Intelligent Multimedia Lab
Content
– Introduction
– Main Elements
– Markov Decision Process (MDP)
– Value Functions
Reinforcement Learning
Introduction
Reinforcement Learning
Learning from interaction (with the environment)
Goal-directed learning
Learning what to do and its effects
Trial-and-error search and delayed reward
– the two most important distinguishing features of reinforcement learning
Exploration and Exploitation
The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future.
Dilemma: neither exploration nor exploitation can be pursued exclusively without failing at the task.
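One common way to strike this balance (not named on the slide, but standard in the literature) is the ε-greedy rule: explore with a small probability ε, exploit otherwise. A minimal sketch, where `q_values` is assumed to hold the agent's current action-value estimates:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (pick a uniformly random action);
    otherwise exploit (pick the action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With ε = 0 the rule is pure exploitation; with ε = 1 it is pure exploration, so ε directly parameterizes the dilemma above.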
Supervised Learning
Supervised Learning System: Inputs → Outputs
Training info = desired (target) outputs
Error = (target output − actual output)
Reinforcement Learning
RL System: Inputs → Outputs ("actions")
Training Info = evaluations (“rewards” / “penalties”)
Objective: get as much reward as possible
Reinforcement Learning
Main Elements
Main Elements
The agent interacts with the environment in a loop: the environment presents a state, the agent takes an action, and the environment returns a reward together with the next state. The agent's objective is to maximize value.
Example (Bioreactor)
state – current temperature and other sensory readings, composition, target chemical
actions – how much heating, how much stirring, which ingredients to add
reward – moment-by-moment production of the desired chemical
Example (Pick-and-Place Robot)
state – current positions and velocities of the joints
actions – voltages to apply to the motors
reward – reaching the end-position successfully, speed, smoothness of the trajectory
Example (Recycling Robot)
state – charge level of the battery
actions – look for cans, wait for a can, go recharge
reward – positive for finding cans, negative for running out of battery
Main Elements
Environment
– its state is perceivable
Reinforcement Function
– generates the reward
– a function of states (or state/action pairs)
Value Function
– the potential to reach the goal (with maximum total reward)
– determines the policy
– a function of state
The Agent-Environment Interface
At each time step $t$, the agent observes the state $s_t$, takes an action $a_t$, and receives a reward $r_{t+1}$ as the environment moves to state $s_{t+1}$, producing the sequence
$s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, \ldots$
Frequently, we model the environment as a Markov Decision Process (MDP).
Reward Function
A reward function defines the goal in a reinforcement learning problem.
– Roughly speaking, it maps perceived states (or state–action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state.
$r : S \to \mathbb{R}$ or $r : S \times A \to \mathbb{R}$
($S$: a set of states; $A$: a set of actions)
Goals and Rewards
The agent's goal is to maximize the total amount of reward it receives.
This means maximizing not just immediate reward, but cumulative reward in the long run.
Goals and Rewards
Reward = 0
Reward = 1
Can you design another reward function?
Goals and Rewards
state | reward
Win | +1
Loss | −1
Draw or non-terminal | 0
Goals and Rewards
The reward signal is the way of communicating to the agent what we want it to achieve, not how we want it achieved.
Reinforcement Learning
Markov Decision Processes
Definition
An MDP consists of:
– a set of states $S$ and a set of actions $A$;
– a transition distribution
$P^a_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\}$, for $s, s' \in S$, $a \in A$;
– expected next rewards
$R^a_{ss'} = E[\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,]$, for $s, s' \in S$, $a \in A$.
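One way to hold these two quantities in code is with nested dictionaries for $P^a_{ss'}$ and $R^a_{ss'}$. The two-state, one-action MDP below is invented purely to illustrate the encoding; it is not from the slides.

```python
# P[s][a] lists (next_state, probability) pairs; R[s][a][s2] is the
# expected next reward R^a_{ss'}. This tiny MDP is a made-up example.
P = {
    's0': {'a': [('s0', 0.5), ('s1', 0.5)]},
    's1': {'a': [('s1', 1.0)]},
}
R = {
    's0': {'a': {'s0': 1.0, 's1': 0.0}},
    's1': {'a': {'s1': 0.0}},
}

def expected_next_reward(s, a):
    """E[r_{t+1} | s_t = s, a_t = a], marginalizing over the next state."""
    return sum(p * R[s][a][s2] for s2, p in P[s][a])
```

For example, `expected_next_reward('s0', 'a')` combines the two possible successors weighted by their probabilities.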
Decision Making
Many stochastic processes can be modeled within the MDP framework.
The process is controlled by choosing actions in each state so as to attain the maximum long-term reward.
How do we find the optimal policy $\pi^* : S \to A$?
Example (Recycling Robot)
Two states: high and low (battery charge).
– high, search: stay in high with probability $\alpha$, drop to low with probability $1-\alpha$; expected reward $R^{\text{search}}$ either way.
– high, wait: stay in high with probability 1; reward $R^{\text{wait}}$.
– low, search: stay in low with probability $\beta$, with reward $R^{\text{search}}$; with probability $1-\beta$ the battery runs out (reward $-3$) and the robot is rescued and returned to high.
– low, wait: stay in low with probability 1; reward $R^{\text{wait}}$.
– low, recharge: move to high with probability 1; reward 0.
Example (Recycling Robot)
$R^{\text{search}}$: expected number of cans collected while searching
$R^{\text{wait}}$: expected number of cans collected while waiting
$R^{\text{search}} > R^{\text{wait}}$
$S = \{\text{high}, \text{low}\}$
$A(\text{high}) = \{\text{search}, \text{wait}\}$
$A(\text{low}) = \{\text{search}, \text{wait}, \text{recharge}\}$
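This MDP is small enough to write out in full. The sketch below substitutes illustrative numbers for the symbolic quantities (the specific values of `alpha`, `beta`, `R_search`, and `R_wait` are assumptions; the slides leave them symbolic):

```python
# Recycling-robot MDP with illustrative numbers (alpha, beta, R_search,
# R_wait are assumed values, not given on the slides).
alpha, beta = 0.9, 0.6
R_search, R_wait = 2.0, 1.0   # R_search > R_wait, as required above

# P[(s, a)] -> list of (next_state, probability, reward) outcomes
P = {
    ('high', 'search'):   [('high', alpha, R_search), ('low', 1 - alpha, R_search)],
    ('high', 'wait'):     [('high', 1.0, R_wait)],
    ('low',  'search'):   [('low', beta, R_search), ('high', 1 - beta, -3.0)],
    ('low',  'wait'):     [('low', 1.0, R_wait)],
    ('low',  'recharge'): [('high', 1.0, 0.0)],
}

# Sanity check: each (state, action) defines a proper probability distribution.
for outcomes in P.values():
    assert abs(sum(p for _, p, _ in outcomes) - 1.0) < 1e-9
```

Note that the action sets differ by state, exactly as $A(\text{high})$ and $A(\text{low})$ specify: `('high', 'recharge')` simply has no entry.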
Reinforcement Learning
Value Functions
Value Functions
$V : S \to \mathbb{R}$ or $Q : S \times A \to \mathbb{R}$
Value functions estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
The notion of "how good" here is defined in terms of the future rewards that can be expected or, to be precise, in terms of the expected return.
Value functions are defined with respect to particular policies.
Returns
Episodic Tasks
– finite-horizon tasks
– indefinite-horizon tasks
Continuing Tasks
– infinite-horizon tasks
Finite-Horizon Tasks
Return at time $t$: $R_t = r_{t+1} + r_{t+2} + \cdots + r_{t+T}$
Expected return at time $t$: $E[R_t]$
Example: the $k$-armed bandit problem.
Indefinite-Horizon Tasks
Return at time $t$: $R_t = r_{t+1} + r_{t+2} + \cdots + r_T$ (the episode ends at some finite, but not predetermined, time $T$)
Expected return at time $t$: $E[R_t]$
Example: playing chess.
Infinite-Horizon Tasks
Return at time $t$: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
Expected return at time $t$: $E[R_t]$
Example: control tasks.
Unified Notation
Episodic tasks can be reformulated as continuing tasks by adding an absorbing terminal state that transitions only to itself with reward 0 (e.g., $s_0, s_1, s_2$ with rewards $r_1, r_2, r_3$, then $r_4 = r_5 = \cdots = 0$).
Discounted return at time $t$: $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
$\gamma$: the discounting factor, $0 \le \gamma \le 1$ ($\gamma = 1$ is possible only for episodic tasks; continuing tasks require $\gamma < 1$).
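With this unified notation, computing the discounted return of a finite reward sequence (an episodic task, where trailing rewards are zero) is a one-liner. A small sketch:

```python
def discounted_return(rewards, gamma):
    """R_t = sum_{k>=0} gamma^k * r_{t+k+1}, where rewards[k] = r_{t+k+1}.
    For an episodic task the sequence is finite (trailing rewards are 0)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))
```

For example, `discounted_return([1, 1, 1], 0.5)` is $1 + 0.5 + 0.25 = 1.75$, and with $\gamma = 1$ the function reduces to the undiscounted sum of rewards.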
Policies
A policy, $\pi$, is a mapping from states $s \in S$ and actions $a \in A(s)$ to the probability $\pi(s, a)$ of taking action $a$ when in state $s$.
Value Functions under a Policy
State-Value Function
$V^\pi(s) = E_\pi[\, R_t \mid s_t = s \,] = E_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right]$
Action-Value Function
$Q^\pi(s, a) = E_\pi[\, R_t \mid s_t = s,\ a_t = a \,] = E_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right]$
Bellman Equation for a Policy $\pi$: State-Value Function
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots$
$\;= r_{t+1} + \gamma \left( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots \right)$
$\;= r_{t+1} + \gamma R_{t+1}$
Therefore,
$V^\pi(s) = E_\pi[\, R_t \mid s_t = s \,] = E_\pi[\, r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \,]$
$\;= \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
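The Bellman equation translates directly into an algorithm: sweep over the states, replacing each $V(s)$ by the right-hand side, until the values stop changing (iterative policy evaluation). A sketch, where `step(s, a)` is an assumed helper returning `(next_state, probability, reward)` outcomes and `pi(s)` returns `(action, probability)` pairs:

```python
def policy_evaluation(states, pi, step, gamma, tol=1e-10):
    """Repeatedly apply V(s) <- sum_a pi(s,a) sum_s' P [R + gamma V(s')]
    until the largest change in any state's value falls below tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(prob_a * sum(p * (r + gamma * V[s2])
                                 for s2, p, r in step(s, a))
                    for a, prob_a in pi(s))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```

Updating `V[s]` in place (rather than from a frozen copy) still converges to the same fixed point and is a common implementation choice.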
Backup Diagram: State-Value Function
$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
(The diagram is rooted at state $s$, branches first over actions $a$, then over rewards $r$ and successor states $s'$.)
Bellman Equation for a Policy $\pi$: Action-Value Function
$Q^\pi(s, a) = E_\pi[\, R_t \mid s_t = s,\ a_t = a \,]$
Using $R_t = r_{t+1} + \gamma R_{t+1}$:
$Q^\pi(s, a) = E_\pi[\, r_{t+1} + \gamma R_{t+1} \mid s_t = s,\ a_t = a \,]$
$\;= E[\, r_{t+1} \mid s_t = s,\ a_t = a \,] + \gamma E_\pi[\, R_{t+1} \mid s_t = s,\ a_t = a \,]$
$\;= \sum_{s'} P^a_{ss'} R^a_{ss'} + \gamma \sum_{s'} P^a_{ss'} E_\pi[\, R_{t+1} \mid s_{t+1} = s' \,]$
$\;= \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
Backup Diagram: Action-Value Function
$Q^\pi(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
(The diagram is rooted at the pair $(s, a)$, branches over successor states $s'$, then over successor actions $a'$.)
Bellman Equation for a Policy $\pi$
$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
$Q^\pi(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
This is a set of equations (in fact, linear), one for each state.
The value function for $\pi$ is their unique solution. It can be regarded as a consistency condition between the values of states, the values of their successor states, and the rewards.
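Because the system is linear, $V^\pi$ can also be obtained in closed form: writing $P_\pi$ for the state-to-state transition matrix under $\pi$ and $r_\pi$ for the expected one-step rewards, the equations read $V = r_\pi + \gamma P_\pi V$, so $V^\pi = (I - \gamma P_\pi)^{-1} r_\pi$. A sketch with NumPy, using a made-up two-state example:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2],    # P_pi[i, j]: probability of moving i -> j under pi
                 [0.1, 0.9]])   # (illustrative numbers, not from the slides)
r_pi = np.array([1.0, 0.0])     # expected one-step reward in each state under pi

# Solve (I - gamma * P_pi) V = r_pi rather than inverting the matrix explicitly.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
```

The solution satisfies the consistency condition exactly: `V` equals `r_pi + gamma * P_pi @ V` up to floating-point error. The direct solve costs roughly cubic time in the number of states, which is why iterative methods are preferred for large state spaces.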
Example (Grid World)
State: the agent's position on the grid.
Actions: north, south, east, west; deterministic.
Reward: an action that would take the agent off the grid leaves the position unchanged but yields reward $-1$; all other actions yield reward 0, except those taken in the special states A and B, which move the agent as shown.
State-value function for the equiprobable random policy; $\gamma = 0.9$.
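This value function can be reproduced numerically. The sketch below assumes the standard 5×5 layout of this example (from Sutton and Barto, since the slide's figure is not reproduced in the transcript): every action taken in A = (0, 1) jumps to (4, 1) with reward +10, every action in B = (0, 3) jumps to (2, 3) with reward +5, off-grid moves leave the position unchanged with reward −1, and the equiprobable policy takes each of the four actions with probability 1/4.

```python
def gridworld_values(gamma=0.9, tol=1e-10):
    """Evaluate the equiprobable random policy on the 5x5 grid world
    by sweeping the Bellman equation until convergence."""
    n = 5
    A, A_dst, B, B_dst = (0, 1), (4, 1), (0, 3), (2, 3)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # north, south, west, east
    V = {(i, j): 0.0 for i in range(n) for j in range(n)}
    while True:
        delta = 0.0
        for s in V:
            total = 0.0
            for di, dj in moves:
                if s == A:                      # special state A: jump, +10
                    s2, r = A_dst, 10.0
                elif s == B:                    # special state B: jump, +5
                    s2, r = B_dst, 5.0
                else:
                    i, j = s[0] + di, s[1] + dj
                    if 0 <= i < n and 0 <= j < n:
                        s2, r = (i, j), 0.0
                    else:                       # bumping the edge: stay, -1
                        s2, r = s, -1.0
                total += 0.25 * (r + gamma * V[s2])
            delta = max(delta, abs(total - V[s]))
            V[s] = total
        if delta < tol:
            return V
```

Under these assumptions the computed values come out near 8.8 at A and 5.3 at B; A is worth less than its immediate +10 because the jump lands next to the bottom edge, where the random policy keeps paying the −1 bumping penalty.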
Optimal Policy ($\pi^*$)
Policies are partially ordered: $\pi \ge \pi'$ if and only if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$. A policy $\pi^*$ is optimal if $V^{\pi^*}(s) \ge V^\pi(s)$ for every policy $\pi$ and every state $s$.
Optimal State-Value Function: $V^*(s) = V^{\pi^*}(s) = \max_\pi V^\pi(s)$
Optimal Action-Value Function: $Q^*(s, a) = \max_\pi Q^\pi(s, a)$
What is the relation between them?
Optimal Value Functions
Bellman Optimality Equations:
$V^*(s) = \max_{a \in A(s)} Q^*(s, a) = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$
$Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a' \in A(s')} Q^*(s', a') \right]$
Optimal Value Functions
Bellman Optimality Equations:
$V^*(s) = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$
$Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a' \in A(s')} Q^*(s', a') \right]$
How do we apply the value function to determine the action to take in each state?
How do we compute the value function? How do we store it?
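One standard answer to these questions (for small, known MDPs) is value iteration: store $V$ as a table and repeatedly apply the max-form Bellman optimality backup; acting greedily with respect to the converged values then gives an optimal policy. A sketch, where `actions(s)` and `step(s, a)` are assumed helpers describing the MDP:

```python
def value_iteration(states, actions, step, gamma, tol=1e-10):
    """Repeatedly apply V(s) <- max_a sum_s' P [R + gamma V(s')]
    until the values converge. step(s, a) yields
    (next_state, probability, reward) outcomes."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(sum(p * (r + gamma * V[s2]) for s2, p, r in step(s, a))
                    for a in actions(s))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def greedy_action(s, actions, step, V, gamma):
    """Acting: pick the action maximizing the one-step lookahead under V."""
    return max(actions(s),
               key=lambda a: sum(p * (r + gamma * V[s2])
                                 for s2, p, r in step(s, a)))
```

Storing $V$ as an explicit table is exactly what becomes infeasible when the state space is huge, which motivates the approximations discussed below.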
Example (Grid World)
(Figure: the optimal state-value function $V^*$ and optimal policy $\pi^*$ for the grid world, compared with the random policy.)
Finding an Optimal Solution via the Bellman Equation
Finding an optimal policy by solving the Bellman optimality equation requires:
– accurate knowledge of the environment dynamics;
– enough space and time to do the computation;
– the Markov property.
Optimality and Approximation
How much space and time do we need?
– polynomial in the number of states (via dynamic programming methods)
– BUT the number of states is often huge (e.g., backgammon has about $10^{20}$ states)
We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman Optimality Equation.