An Introduction to PO-MDP, presented by Alp Sardağ
Post on 20-Dec-2015
TRANSCRIPT
An Introduction to PO-MDP
Presented by
Alp Sardağ
MDP
Components:
– State
– Action
– Transition
– Reinforcement
Problem:
– Choose the action that makes the right tradeoffs between the immediate rewards and the future gains, to yield the best possible solution.
Solution:
– Policy: value function
Definition
Horizon length
Value Iteration:
– Temporal Difference Learning:
Q(x,a) ← Q(x,a) + α(r + γ max_b Q(y,b) - Q(x,a))
where α is the learning rate and γ the discount rate.
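The TD update on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the state names, reward, and the values of alpha and gamma are made-up examples.

```python
from collections import defaultdict

# One temporal-difference (Q-learning) update, matching the slide's formula:
# Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_b Q(y,b) - Q(x,a))
def td_update(Q, x, a, r, y, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(y, b)] for b in actions)   # max_b Q(y,b)
    Q[(x, a)] += alpha * (r + gamma * best_next - Q[(x, a)])
    return Q[(x, a)]

Q = defaultdict(float)        # unseen (state, action) pairs default to 0
actions = ["a1", "a2"]
td_update(Q, "s1", "a1", r=1.0, y="s2", actions=actions)
print(Q[("s1", "a1")])        # 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1
```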
Adding PO to CO-MDP is not trivial:
– A CO-MDP requires complete observability of the state.
– PO clouds the current state.
PO-MDP
Components:
– States
– Actions
– Transitions
– Reinforcement
– Observations
Mapping in CO-MDP & PO-MDP
In CO-MDPs, the mapping is from states to actions.
In PO-MDPs, the mapping is from probability distributions (over states) to actions.
VI in CO-MDP & PO-MDP
In a CO-MDP,
– Track our current state
– Update it after each action
In a PO-MDP,
– Keep a probability distribution over states
– Perform an action and make an observation, then update the distribution
Belief State and Space
Belief State: probability distribution over states.
Belief Space: the entire probability space.
Example:
– Assume a two-state PO-MDP.
– P(s1) = p and P(s2) = 1 - p, so the belief space is a line segment.
– The line becomes a hyperplane in higher dimensions.
Belief Transform
Assumptions:
– Finite actions
– Finite observations
– Next belief state = T(cbf, a, o), where cbf: current belief state, a: action, o: observation
Finite number of possible next belief states
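The belief transform T(cbf, a, o) can be sketched for a two-state PO-MDP. This is an illustrative sketch: the transition and observation probabilities below are invented numbers, not values from the talk.

```python
# Belief update: b'(s') is proportional to O[a][s'][o] * sum_s T[a][s][s'] * b(s),
# then normalized so the new belief sums to 1.
def belief_update(b, T, O, a, o):
    n = len(b)
    new_b = [O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n))
             for s2 in range(n)]
    total = sum(new_b)              # this is P(o | b, a)
    return [x / total for x in new_b]

# T[a][s][s']: transition probabilities; O[a][s'][o]: observation probabilities
T = [[[0.7, 0.3], [0.2, 0.8]]]      # one action, two states
O = [[[0.9, 0.1], [0.3, 0.7]]]      # two observations
b = [0.5, 0.5]
b_next = belief_update(b, T, O, a=0, o=0)
print(b_next)
```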
PO-MDP into continuous CO-MDP
The process is Markovian; the next belief state depends on:
– Current belief state
– Current action
– Observation
A discrete PO-MDP problem can be converted into a continuous-space CO-MDP problem where the continuous space is the belief space.
Problem
Using VI in continuous state space.
No nice tabular representation as before.
PWLC
Restrictions on the form of the solutions to the continuous-space CO-MDP:
– The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
– The value of a belief point is simply the dot product of two vectors (the belief state and a linear segment's coefficient vector).
GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.
Steps in VI
Represent the value function for each horizon as a set of vectors.
– This overcomes the problem of representing a value function over a continuous space.
Find the vector that has the largest dot product with the belief state.
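These two steps can be sketched directly: store the value function as a set of vectors, and evaluate a belief by taking the largest dot product. The vectors below are arbitrary illustrative numbers, not taken from the deck.

```python
# V(b) = max over stored vectors of dot(vector, b)
def value(belief, vectors):
    return max(sum(v * p for v, p in zip(vec, belief)) for vec in vectors)

vectors = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]  # one vector per linear segment
v = value([0.2, 0.8], vectors)
print(v)    # max(0.8, 0.5, 0.2) = 0.8
```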
PO-MDP Value Iteration Example
Assumptions:
– Two states
– Two actions
– Three observations
Ex: horizon length is 1.
Immediate rewards:
        s1    s2
  a1     1     0
  a2     0    1.5

b = [0.25 0.75]
V(a1, b) = 0.25 × 1 + 0.75 × 0 = 0.25
V(a2, b) = 0.25 × 0 + 0.75 × 1.5 = 1.125
(Figure: the belief space splits into a region where a1 is the best action and a region where a2 is the best.)
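The horizon-1 computation above is just a dot product per action; a short sketch with the slide's numbers:

```python
# Rewards per action as vectors over states [s1, s2]; the value of each
# action at belief b is the dot product of its reward vector with b.
b = [0.25, 0.75]
R = {"a1": [1.0, 0.0], "a2": [0.0, 1.5]}
values = {a: sum(r * p for r, p in zip(R[a], b)) for a in R}
print(values)                        # {'a1': 0.25, 'a2': 1.125}
best = max(values, key=values.get)
print(best)                          # a2
```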
The value of a belief state for horizon length 2, given b, a1, z1:
– immediate action plus the value of the next action.
– Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
PO-MDP Value Iteration Example
PO-MDP Value Iteration Example
Find the value for all the belief points given this fixed action and observation.
The transformed value function is also PWLC.
How to compute the value of a belief state given only the action?
The horizon-2 value of the belief state, given that:
– Values for each observation: z1: 0.8, z2: 0.7, z3: 1.2
– P(z1 | b, a1) = 0.6; P(z2 | b, a1) = 0.25; P(z3 | b, a1) = 0.15
0.6 × 0.8 + 0.25 × 0.7 + 0.15 × 1.2 = 0.835
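The expected-value computation above, written out as code with the slide's numbers (weighting the best achievable value after each observation by that observation's probability):

```python
# Expected horizon-2 value over the three possible observations
probs  = {"z1": 0.6, "z2": 0.25, "z3": 0.15}   # P(z | b, a1)
values = {"z1": 0.8, "z2": 0.7, "z3": 1.2}     # best value after each z
expected = sum(probs[z] * values[z] for z in probs)
print(round(expected, 3))                       # 0.835
```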
PO-MDP Value Iteration Example
Transformed Value Functions
Each of these transformed functions partitions the belief space differently.
The best next action to perform depends upon the initial belief state and observation.
Best Value For Belief States
The value of every single belief point is the sum of:
– Immediate reward.
– The line segments from the S() functions for each observation's future strategy.
Since adding lines gives you lines, it is linear.
All the useful future strategies are easy to pick out:
Best Strategy for any Belief Point
Value Function and Partition
For the specific action a1, the value function and corresponding partitions:
Value Function and Partition
For the specific action a2, the value function and corresponding partitions:
Which Action to Choose?
Put the value functions for each action together to see where each action gives the highest value.
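Putting the per-action value functions together can be sketched as a pointwise maximum: at each belief, evaluate every action's set of vectors and keep the winner. The alpha vectors below are illustrative stand-ins (the horizon-1 reward vectors), not the actual horizon-2 functions from the figures.

```python
# For each action, value = max dot product over its vectors; the chosen
# action is the one whose value function is highest at this belief point.
def best_action(belief, action_vectors):
    scored = {a: max(sum(v * p for v, p in zip(vec, belief)) for vec in vecs)
              for a, vecs in action_vectors.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

action_vectors = {"a1": [[1.0, 0.0]], "a2": [[0.0, 1.5]]}
print(best_action([0.9, 0.1], action_vectors))    # ('a1', 0.9)
print(best_action([0.25, 0.75], action_vectors))  # ('a2', 1.125)
```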
Compact Horizon 2 Value Function
Value Function for Action a1 with a Horizon of 3
Value Function for Action a2 with a Horizon of 3
Value Function for Both Actions with a Horizon of 3
Value Function for Horizon of 3