TRANSCRIPT
Learning Coordination strategies using reinforcement learning
-- Myriam Z. Abramson , dissertation, 2003
Presenter: Dorgon Chang (張景照)
2012-06-14
Outline
• Coordination problem (the problem to be solved)
• Evaluation of Go
• Reinforcement learning
• Temporal Difference learning (using Sarsa)
• Learning Vector Quantization (LVQ)
• Sarsa LVQ (SLVQ) <= the method proposed by the author
Coordination problem
• The coordination strategy problem is, simply put, an action selection problem.
• When only the local situation is known, how do we choose a correct action, without relying on the end-game state, so that it combines well with the other actions?
• How do local tactics affect the overall strategy?
Evaluation of Go
This method conveys the spatial connectivity between the stones.
• ε is a user-defined threshold: when a point's influence exceeds ε, it keeps spreading outward.
• Black stones spread an influence of +1 outward; white stones spread −1.
• Summing the values over every point of the board gives an evaluation of the position.
• This evaluation is used as the reward in the methods that follow.
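The spreading rule above can be sketched as follows. This is a minimal illustration, not the dissertation's exact procedure; the decay factor, board size, and threshold value are assumptions:

```python
def influence_evaluation(stones, size=9, eps=0.125, decay=0.5):
    """Evaluate a position by spreading each stone's influence outward.

    stones: dict mapping (row, col) -> +1 for black, -1 for white.
    Influence decays by `decay` per step and keeps spreading only while
    its magnitude exceeds the threshold eps.  The sum over the whole
    board is the evaluation (positive favours black).
    """
    board = [[0.0] * size for _ in range(size)]
    for (r, c), colour in stones.items():
        val = float(colour)
        frontier, seen = {(r, c)}, set()
        while frontier and abs(val) > eps:
            for (i, j) in frontier:
                board[i][j] += val       # deposit influence on this ring
            seen |= frontier
            nxt = set()                  # next ring of unvisited neighbours
            for (i, j) in frontier:
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < size and 0 <= nj < size and (ni, nj) not in seen:
                        nxt.add((ni, nj))
            frontier = nxt
            val *= decay                 # influence fades with distance
    return sum(map(sum, board))
```

A lone black stone yields a positive evaluation, a lone white stone the exact mirror image, so the sign of the sum says which side the position favours.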
Reinforcement learning: Introduction
• The goal of machine learning is to produce agents; RL is one such method, characterized by trial-and-error search and delayed reward.
[Agent-environment diagram: the action is playing the next stone; the reward is a terminal outcome such as a win, a loss, or a draw; the agent predicts the board position a few moves ahead.]
Reinforcement learning: Value Function
• π = the policy the agent uses to select actions.
• s = the current state.
• V^π(s): the expected reward obtained from state s under policy π.
• Q^π(s, a): the expected reward obtained by taking action a in state s under policy π.
• The most common policy is ε-greedy (others include greedy, ε-soft, softmax, ...).
• ε lies between 0 and 1; the higher it is, the more exploration is encouraged (exploration vs. exploitation).
• ε-greedy: choose the action with the highest estimated reward most of the time; with a small probability ε, choose an action at random.
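The ε-greedy rule is easy to state in code; this small sketch assumes the action values are held in a plain list:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action index from a list of estimated action values.

    With probability epsilon explore (uniform random action);
    otherwise exploit (the action with the highest value).
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With epsilon = 0 this is the pure greedy policy; with epsilon = 1 it is pure random exploration.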
Temporal Difference learning
• TD learning is a method for estimating the value function in RL.
DP: the current estimate is built on previously learned estimates (bootstrapping).
MC: play out random games and use the statistics of the outcomes to handle problems that may arise later.
The TD method combines both: it learns directly from experience like MC, and bootstraps from existing estimates like DP.
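The simplest instance of this combination is the one-step TD(0) update; the dictionary-based value table below is an illustrative choice:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update: move V(s) toward the bootstrapped target
    r + gamma * V(s_next).  Unseen states default to 0."""
    v = V.get(s, 0.0)
    target = r + gamma * V.get(s_next, 0.0)   # bootstrap like DP
    V[s] = v + alpha * (target - v)           # learn from experience like MC
    return V[s]
```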
Temporal Difference learning: Forward View of TD(λ) (1)
• Monte Carlo: observe the reward over all the steps in an episode.
• TD(0): observe one step only; an n-step method observes n steps.
• TD(λ) is a method for averaging all the n-step returns:
  R_t^(1) = r_{t+1} + γ V(s_{t+1})
  R_t^(2) = r_{t+1} + γ r_{t+2} + γ^2 V(s_{t+2})
  R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^(n)
• Value update: ΔV_t(s_t) = α [R_t^λ − V_t(s_t)]
• Set λ = 0 to get TD(0); set λ = 1 to get Monte Carlo.
• r_{t+1} = the reward at time t+1; γ = the discount rate on future rewards.
• R_t^(n) is the total reward observed n steps ahead of time t; it is a scalar.
Temporal Difference learning: Forward View of TD(λ) (2)
• Substituting λ = 0 into the λ-return, only the n = 1 term survives:
  R_t^λ = R_t^(1) = r_{t+1} + γ V(s_{t+1})
  which is exactly the TD(0) target.
Temporal Difference learning: Forward View of TD(λ) (3)
• Substituting λ = 1, every weight (1 − λ)λ^{n−1} vanishes and only the final term with weight λ^{T−t−1} = 1 remains:
  R_t^λ = R_t = r_{t+1} + γ r_{t+2} + ... + γ^{T−t−1} r_T
  i.e. the complete return of the episode, the Monte Carlo target.
Temporal Difference learning: Forward View of TD(λ) (4)
• T is the total number of steps in a game; t is the index of the current step.
• For an episode ending at step T, each n-step return R_t^(n) (n = 1 .. T−t−1) gets weight (1 − λ)λ^{n−1}, and the final return gets the remaining weight λ^{T−t−1}.
• Example with λ = 0.5, t = 0, T = 3 (states s_0 → s_1 → s_2 → s_3):
  w_1 = (1 − λ) = 0.5
  w_2 = (1 − λ)λ = 0.25
  w_3 = λ^2 = 0.25 (weight of the final return)
• Normalization ensures the weights sum to 1: (1 − λ)(1 + λ + ... + λ^{T−t−2}) + λ^{T−t−1} = 1.
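The weighting scheme above can be checked numerically. This helper illustrates the finite-horizon λ-return weights; it is not code from the dissertation:

```python
def lambda_return_weights(lam, t, T):
    """Weights of the n-step returns in the forward-view lambda-return
    for an episode that ends at step T (finite-horizon form):
    (1 - lam) * lam**(n - 1) for n = 1 .. T-t-1, plus lam**(T-t-1)
    for the final (Monte Carlo) return.  They always sum to 1."""
    n_max = T - t - 1
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, n_max + 1)]
    weights.append(lam ** n_max)          # remaining mass on the final return
    return weights
```

The slide's example falls out directly: lambda_return_weights(0.5, 0, 3) gives [0.5, 0.25, 0.25], while lam = 0 puts all the weight on the one-step return and lam = 1 puts it all on the final return.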
Temporal Difference learning: Forward View of TD(λ) (5)
• The final weight λ^{T−t−1} equals the sum of the weights (1 − λ)λ^{n−1} of all the n-step returns beyond the end of the episode (n ≥ T − t).
• The lower λ is, the faster the weights fall off with n, so the early (short, TD-like) returns dominate; the higher λ is, the slower they fall off, so the later (long, MC-like) returns dominate. For example, λ = 0.1 gives a first weight of 1 − λ = 0.9, while λ = 0.9 gives only 0.1.
• In summary, λ serves two purposes:
  1. It is the bridge between the TD and MC methods.
  2. It governs how we punish or reward an action that has no immediate effect, which leads to eligibility traces.
• Result for λ = 0.5, t = 0, T = 3: weights 0.5, 0.25, 0.25.
Temporal Difference learning: Backward View of TD(λ) (1)
• Eligibility traces (recursive definition):
  e_t(s) = γλ e_{t−1}(s)       if s ≠ s_t
  e_t(s) = γλ e_{t−1}(s) + 1   if s = s_t
• Non-recursive definition: e_t(s) = Σ_{k=0}^{t} (γλ)^{t−k} I_{ss_k}, where I_{ss_k} = 1 if s = s_k and 0 otherwise.
• Reinforcing event (the TD error): δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)
• Value update: ΔV_t(s) = α δ_t e_t(s)
• The reinforcing event δ_t is propagated backwards through the traces, updating the previously visited states step by step.
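The backward-view update loop can be sketched as follows, using accumulating traces; the episode format and dictionary value table are illustrative assumptions:

```python
def td_lambda_episode(episode, V, alpha=0.1, gamma=0.9, lam=0.5):
    """One episode of backward-view TD(lambda) with accumulating traces.

    episode: list of (state, reward, next_state) transitions.
    V: dict mapping state -> value estimate (updated in place).
    """
    e = {}                                   # eligibility traces
    for s, r, s_next in episode:
        # reinforcing event: delta = r + gamma*V(s') - V(s)
        delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
        e[s] = e.get(s, 0.0) + 1.0           # bump trace of the visited state
        for state in list(e):
            V[state] = V.get(state, 0.0) + alpha * delta * e[state]
            e[state] *= gamma * lam          # all traces decay by gamma*lambda
    return V
```

Note how a reward at the end of the episode still credits earlier states through their decayed traces, which is exactly the backward propagation the slide describes.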
Temporal Difference learning: Backward View of TD(λ) (2)
• Eligibility traces, reinforcing events, and value updates as defined before:
  e_t(s) = γλ e_{t−1}(s) + I_{ss_t},  δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t),  ΔV_t(s) = α δ_t e_t(s)
• Setting λ = 0, the trace reduces to e_t(s) = I_{ss_t}: only the current state s_t is updated, and its update target is R_t^(1) = r_{t+1} + γ V(s_{t+1}), i.e. TD(0).
Temporal Difference learning: Why Backward View?
• Forward view
  – theoretical view: conceptually easier to understand
  – not directly implementable: the information it needs still has to be obtained by simulation, since it looks forward in time.
• Backward view
  – mechanistic view: easier to implement
  – simple conceptually and computationally
  – in the offline case it achieves the same result as the forward view (this can be proved).
Temporal Difference learning: Equivalence of the Forward and Backward Views
• Summed over a complete episode, the two views produce the same total value update:
  Σ_{t=0}^{T−1} ΔV_t^b(s) = Σ_{t=0}^{T−1} ΔV_t^f(s_t) I_{ss_t}
  where I_{ss_t} = 1 if s = s_t and 0 otherwise (backward view on the left, forward view on the right).
• Ref: 7.4 Equivalence of the Forward and Backward Views, http://www.cs.ualberta.ca/~sutton/book/7/node1.html (proof that the two are equal in the offline case).
• Sum of the forward view: with λ = 1 (MC) and T = 3, the summed forward-view updates reduce to the Monte Carlo update.
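The offline equivalence can be verified numerically on a toy episode in which every state is visited once and all value estimates are held at zero; both helpers below are illustrative, not from the dissertation:

```python
def forward_updates(rewards, gamma, lam, alpha):
    """Offline forward-view TD(lambda) updates, one per visited state.
    rewards[t] is the reward received on leaving state s_t; with all
    V estimates fixed at 0, the n-step returns are plain reward sums."""
    T = len(rewards)
    deltas = []
    for t in range(T):
        n_max = T - t - 1
        lam_return = 0.0
        for n in range(1, n_max + 1):
            G = sum(gamma ** k * rewards[t + k] for k in range(n))
            lam_return += (1 - lam) * lam ** (n - 1) * G
        full = sum(gamma ** k * rewards[t + k] for k in range(T - t))
        lam_return += lam ** n_max * full          # weight of the final return
        deltas.append(alpha * lam_return)          # alpha * (R_t^lambda - 0)
    return deltas

def backward_updates(rewards, gamma, lam, alpha):
    """Offline backward-view updates via eligibility traces (V fixed at 0,
    one distinct state per time step)."""
    T = len(rewards)
    dV = [0.0] * T
    e = [0.0] * T
    for t in range(T):
        delta = rewards[t]                         # r_{t+1} + gamma*0 - 0
        e[t] += 1.0
        for s in range(T):
            dV[s] += alpha * delta * e[s]
            e[s] *= gamma * lam
    return dV
```

For any rewards, gamma, and lambda, the two functions return the same per-state totals, which is the content of the equivalence claim.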
Temporal Difference learning: the Sarsa algorithm
• Sarsa is on-policy: the behavior policy (which selects the moves) and the estimation policy (which is being evaluated) are the same.
• For each game, Q is updated after every stone played:
  Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
• How many steps R_t looks ahead in the update depends on the variant used, e.g. Sarsa(λ), or the one-step target R_t^(1) = r_{t+1} + γ V(s_{t+1}).
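A minimal one-step Sarsa sketch; the environment interface (env_step returning reward, next state, and a done flag) is an assumption for illustration:

```python
import random
from collections import defaultdict

def sarsa(env_step, actions, start_state, episodes=100,
          alpha=0.1, gamma=0.9, epsilon=0.1):
    """One-step on-policy Sarsa.

    env_step(s, a) -> (reward, next_state, done); `actions` lists the
    legal actions.  The same epsilon-greedy policy both selects moves
    and is evaluated, which is what makes Sarsa on-policy.
    """
    Q = defaultdict(float)

    def policy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, a = start_state, policy(start_state)
        done = False
        while not done:
            r, s2, done = env_step(s, a)
            a2 = policy(s2)                      # next action from same policy
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Sarsa update
            s, a = s2, a2
    return Q
```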
Learning Vector Quantization
• Main purpose: data compression.
• Basic idea: represent the whole input sample space with a smaller number of clusters => find a representative point (prototype) for each class.
• VQ applies to data without class labels; LVQ applies to data with class labels.
• [Figure: M = 3 prototypes; O = prototype vector, + = input data; prototypes m1, m2, m3.]
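A single LVQ1 update step might look like this; the learning rate and list-based vectors are illustrative choices:

```python
def lvq1_update(prototypes, labels, x, x_label, lr=0.1):
    """One LVQ1 step: move the nearest prototype toward the input if
    their class labels match, and away from it otherwise.

    prototypes: list of vectors (lists of floats); labels: their classes.
    Returns the index of the winning prototype.
    """
    def dist2(p):                       # squared Euclidean distance
        return sum((pi - xi) ** 2 for pi, xi in zip(p, x))
    w = min(range(len(prototypes)), key=lambda i: dist2(prototypes[i]))
    sign = 1.0 if labels[w] == x_label else -1.0
    prototypes[w] = [pi + sign * lr * (xi - pi)
                     for pi, xi in zip(prototypes[w], x)]
    return w
```

Repeating this over labeled samples pulls each prototype toward its own class region, so a few prototypes end up representing the whole sample space.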
SLVQ: architecture (1)
• Prototypes (representative points) are initially scattered at random over the board.
• [Figure: an idea of what a SOM looks like.]
• Build n agents = the pattern database.
• The SOM algorithm can dynamically decide how many prototypes M are needed, i.e. the number of patterns can grow or shrink dynamically.
• Each agent records the values of the state/action pairs it has tried; through the LVQ algorithm, Q(s, a) => Q(m, a), so the size of the state space is compressed drastically.
SLVQ: architecture (2)
• Set M = 3; the weights of the prototypes m1, m2, m3 are initialized at random.
• At the end of each game, the prototypes are updated (using LVQ); the update uses the backward view.
• When updating a prototype, a similarity computation (geometric distance) finds the matching pattern. Ref: S. Santini and R. Jain. Similarity measures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9), 1999.
• With more training games, the prototypes become more representative => they gradually converge.
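The overall SLVQ idea, prototype matching plus an end-of-game LVQ update, can be sketched as below. All names, the update schedule, and the win/loss signal are illustrative assumptions, not the dissertation's implementation:

```python
import random

class SLVQ:
    """Sketch of the SLVQ idea: board states map to their nearest
    prototype, Q-values live on (prototype, action) pairs, and at the
    end of a game the matched prototypes drift toward or away from the
    positions they represented."""

    def __init__(self, n_prototypes, dim, seed=0):
        rng = random.Random(seed)
        # prototypes scattered at random over the board representation
        self.prototypes = [[rng.uniform(-1, 1) for _ in range(dim)]
                           for _ in range(n_prototypes)]
        self.Q = {}                    # (prototype index, action) -> value

    def match(self, state):
        """Best-matching prototype by squared Euclidean distance."""
        def dist2(p):
            return sum((pi - si) ** 2 for pi, si in zip(p, state))
        return min(range(len(self.prototypes)),
                   key=lambda i: dist2(self.prototypes[i]))

    def q(self, state, action):
        """Q(s, a) looked up as Q(m, a): the state space is compressed."""
        return self.Q.get((self.match(state), action), 0.0)

    def end_of_game_update(self, visited_states, won, lr=0.05):
        """LVQ step at game end: pull the matched prototypes toward the
        positions they represented if the game was won, push otherwise."""
        sign = 1.0 if won else -1.0
        for s in visited_states:
            m = self.match(s)
            self.prototypes[m] = [pi + sign * lr * (si - pi)
                                  for pi, si in zip(self.prototypes[m], s)]
```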
Candidate Moves (1)
• Empirically, a move is better when it has multiple meanings. The features of moves in Go:
• Attack: reduce the opponent's liberties
• Defend: increase one's own liberties
• Claim: increase one's own influence
• Invade: decrease the opponent's influence
• Connect: join two groups
• Conquer: enclose liberties
Candidate Moves (2)
• Attack: A, B, C, D, E, F => reduce the opponent's liberties (Black is the attacking side).
• Defend: N, O, P, G, Q => increase one's own liberties.
• No use: M, L, K, J, I, H => removed from the candidate-move list.
• [Figure: the possible attack and defense points of one agent in the pattern database, matched against a prototype m.]
Reference (1)
• English:
• Myriam Z. Abramson, Learning Coordination Strategies Using Reinforcement Learning, dissertation, George Mason University, Fairfax, VA, 2003
• Shin Ishii, Control of exploitation-exploration meta-parameter in reinforcement learning, Nara Institute of Science and Technology, Neural Networks 15(4-6), pp. 665-687, 2002
• Simon Haykin, Neural Networks and Learning Machines, Third Edition, Chapter 12, Pearson Education
• Richard S. Sutton, A Convergent O(n) Algorithm for Off-Policy Temporal-Difference Learning with Linear Function Approximation, Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta
Reference (2)
• Chinese:
• Chen Han-Hung (陳漢鴻), Self-Learning in Computer Chinese Chess (電腦象棋的自我學習), Master's thesis, Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, June 2006
Reference (3)
• Web:
• Reinforcement Learning, http://www.cse.unsw.edu.au/~cs9417ml/RL1/index.html, 2009.12.03
• Cyber Rodent Project, http://www.cns.atr.jp/cnb/crp/, 2009.12.03
• Off-Policy Learning, http://rl.cs.mcgill.ca/Projects/off-policy.html, 2009.12.03
• [MATH] Monte Carlo Method, http://www.wretch.cc/blog/glCheng/3431370, 2009.12.03
• Intelligent agent, http://en.wikipedia.org/wiki/Intelligent_agent, 2009.12.03
• Simple Competitive Learning, http://www.willamette.edu/~gorr/classes/cs449/Unsupervised/competitive.html, 2009.12.12
• Eligibility Traces, http://www.cs.ualberta.ca/~sutton/book/7/node1.html, 2009.12.12
• Tabu search, http://sjchen.im.nuu.edu.tw/Project_Courses/ML/Tabu.pdf, 2009.12.12
• Self Organizing Maps, http://davis.wpi.edu/~matt/courses/soms/, 2009.12.16
• Reinforcement Learning, http://www.informatik.uni-freiburg.de/~ki/teaching/ws0607/advanced/recordings/reinforcement.pdf, 2009.12.25