An Introduction to PO-MDP, presented by Alp Sardağ
Post on 20-Dec-2015
TRANSCRIPT
An Introduction to PO-MDP
Presented by
Alp Sardağ
MDP
Components:
– State
– Action
– Transition
– Reinforcement
Problem:
– Choose the action that makes the right tradeoffs between the immediate rewards and the future gains, to yield the best possible solution.
Solution:
– Policy: value function
Definition
Horizon length
Value Iteration:
– Temporal Difference Learning:
Q(x,a) ← Q(x,a) + α(r + γ max_b Q(y,b) - Q(x,a))
where α is the learning rate and γ the discount rate.
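The TD update on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the state names, reward, and the values of alpha and gamma are made-up examples.

```python
from collections import defaultdict

# One temporal-difference (Q-learning) update, matching the slide's formula:
# Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_b Q(y,b) - Q(x,a))
def td_update(Q, x, a, r, y, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(y, b)] for b in actions)   # max_b Q(y,b)
    Q[(x, a)] += alpha * (r + gamma * best_next - Q[(x, a)])
    return Q[(x, a)]

Q = defaultdict(float)        # unseen (state, action) pairs default to 0
actions = ["a1", "a2"]
td_update(Q, "s1", "a1", r=1.0, y="s2", actions=actions)
print(Q[("s1", "a1")])        # 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1
```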
Adding PO to CO-MDP is not trivial:
– A CO-MDP requires complete observability of the state.
– PO clouds the current state.
PO-MDP
Components:
– States
– Actions
– Transitions
– Reinforcement
– Observations
Mapping in CO-MDP & PO-MDP
In CO-MDPs, the mapping is from states to actions.
In PO-MDPs, the mapping is from probability distributions (over states) to actions.
VI in CO-MDP & PO-MDP
In a CO-MDP,
– Track our current state
– Update it after each action
In a PO-MDP,
– Keep a probability distribution over states
– Perform an action and make an observation, then update the distribution
Belief State and Space
Belief State: probability distribution over states.
Belief Space: the entire probability space.
Example:
– Assume a two-state PO-MDP.
– P(s1) = p and P(s2) = 1 - p, so the belief space is a line segment.
– The line becomes a hyperplane in higher dimensions.
Belief Transform
Assumptions:
– Finite actions
– Finite observations
– Next belief state = T(cbf, a, o), where cbf: current belief state, a: action, o: observation
Finite number of possible next belief states
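The belief transform T(cbf, a, o) can be sketched for a two-state PO-MDP. This is an illustrative sketch: the transition and observation probabilities below are invented numbers, not values from the talk.

```python
# Belief update: b'(s') is proportional to O[a][s'][o] * sum_s T[a][s][s'] * b(s),
# then normalized so the new belief sums to 1.
def belief_update(b, T, O, a, o):
    n = len(b)
    new_b = [O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n))
             for s2 in range(n)]
    total = sum(new_b)              # this is P(o | b, a)
    return [x / total for x in new_b]

# T[a][s][s']: transition probabilities; O[a][s'][o]: observation probabilities
T = [[[0.7, 0.3], [0.2, 0.8]]]      # one action, two states
O = [[[0.9, 0.1], [0.3, 0.7]]]      # two observations
b = [0.5, 0.5]
b_next = belief_update(b, T, O, a=0, o=0)
print(b_next)
```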
PO-MDP into continuous CO-MDP
The process is Markovian; the next belief state depends on:
– Current belief state
– Current action
– Observation
A discrete PO-MDP problem can be converted into a continuous-space CO-MDP problem where the continuous space is the belief space.
Problem
Using VI in continuous state space.
No nice tabular representation as before.
PWLC
Restrictions on the form of the solutions to the continuous-space CO-MDP:
– The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
– The value of a belief point is simply the dot product of two vectors (the belief state and a linear segment's coefficient vector).
GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.
Steps in VI
Represent the value function for each horizon as a set of vectors.
– This overcomes the problem of representing a value function over a continuous space.
Find the vector that has the largest dot product with the belief state.
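These two steps can be sketched directly: store the value function as a set of vectors, and evaluate a belief by taking the largest dot product. The vectors below are arbitrary illustrative numbers, not taken from the deck.

```python
# V(b) = max over stored vectors of dot(vector, b)
def value(belief, vectors):
    return max(sum(v * p for v, p in zip(vec, belief)) for vec in vectors)

vectors = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]  # one vector per linear segment
v = value([0.2, 0.8], vectors)
print(v)    # max(0.8, 0.5, 0.2) = 0.8
```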
PO-MDP Value Iteration Example
Assumptions:
– Two states
– Two actions
– Three observations
Ex: horizon length is 1.
Immediate rewards:
        s1    s2
  a1     1     0
  a2     0    1.5

b = [0.25 0.75]
V(a1, b) = 0.25 × 1 + 0.75 × 0 = 0.25
V(a2, b) = 0.25 × 0 + 0.75 × 1.5 = 1.125
(Figure: the belief space splits into a region where a1 is the best action and a region where a2 is the best.)
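The horizon-1 computation above is just a dot product per action; a short sketch with the slide's numbers:

```python
# Rewards per action as vectors over states [s1, s2]; the value of each
# action at belief b is the dot product of its reward vector with b.
b = [0.25, 0.75]
R = {"a1": [1.0, 0.0], "a2": [0.0, 1.5]}
values = {a: sum(r * p for r, p in zip(R[a], b)) for a in R}
print(values)                        # {'a1': 0.25, 'a2': 1.125}
best = max(values, key=values.get)
print(best)                          # a2
```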
The value of a belief state for horizon length 2, given b, a1, z1:
– immediate action plus the value of the next action.
– Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
PO-MDP Value Iteration Example
PO-MDP Value Iteration Example
Find the value for all the belief points given this fixed action and observation.
The transformed value function is also PWLC.
How to compute the value of a belief state given only the action?
The horizon-2 value of the belief state, given that:
– Values for each observation: z1: 0.8, z2: 0.7, z3: 1.2
– P(z1 | b, a1) = 0.6; P(z2 | b, a1) = 0.25; P(z3 | b, a1) = 0.15
0.6 × 0.8 + 0.25 × 0.7 + 0.15 × 1.2 = 0.835
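The expected-value computation above, written out as code with the slide's numbers (weighting the best achievable value after each observation by that observation's probability):

```python
# Expected horizon-2 value over the three possible observations
probs  = {"z1": 0.6, "z2": 0.25, "z3": 0.15}   # P(z | b, a1)
values = {"z1": 0.8, "z2": 0.7, "z3": 1.2}     # best value after each z
expected = sum(probs[z] * values[z] for z in probs)
print(round(expected, 3))                       # 0.835
```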
PO-MDP Value Iteration Example
Transformed Value Functions
Each of these transformed functions partitions the belief space differently.
The best next action to perform depends upon the initial belief state and observation.
Best Value For Belief States
The value of every single belief point is the sum of:
– Immediate reward.
– The line segments from the S() functions for each observation's future strategy.
Since adding lines gives you lines, it is linear.
All the useful future strategies are easy to pick out:
Best Strategy for any Belief Point
Value Function and Partition
For the specific action a1, the value function and corresponding partitions:
Value Function and Partition
For the specific action a2, the value function and corresponding partitions:
Which Action to Choose?
Put the value functions for each action together to see where each action gives the highest value.
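Putting the per-action value functions together can be sketched as a pointwise maximum: at each belief, evaluate every action's set of vectors and keep the winner. The alpha vectors below are illustrative stand-ins (the horizon-1 reward vectors), not the actual horizon-2 functions from the figures.

```python
# For each action, value = max dot product over its vectors; the chosen
# action is the one whose value function is highest at this belief point.
def best_action(belief, action_vectors):
    scored = {a: max(sum(v * p for v, p in zip(vec, belief)) for vec in vecs)
              for a, vecs in action_vectors.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

action_vectors = {"a1": [[1.0, 0.0]], "a2": [[0.0, 1.5]]}
print(best_action([0.9, 0.1], action_vectors))    # ('a1', 0.9)
print(best_action([0.25, 0.75], action_vectors))  # ('a2', 1.125)
```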
Compact Horizon 2 Value Function
Value Function for Action a1 with a Horizon of 3
Value Function for Action a2 with a Horizon of 3
Value Function for Both Actions with a Horizon of 3
Value Function for Horizon of 3