An Object-oriented Representation for Efficient Reinforcement Learning
Carlos Diuk, Andre Cohen and Michael L. Littman
Rutgers Laboratory for Real-Life Reinforcement Learning (RL)3
Department of Computer ScienceRutgers University (New Jersey, USA)
ICML 2008 – Helsinki, Finland
Motivation
How would YOU play this game?
What’s in a state?
A simple hash code that tells you if you’ve been “there” before.
What we (the agent) can actually “see”: objects, interactions, spatial relationships.
s1 -> a0 -> s5
s5 -> a2 -> s24
s24 -> a1 -> s1
If we know that our agents are interacting spatially with objects, let's just tell them so.
What we did
• Grab ideas from Relational RL and come up with a representation that:
  – is suitable for a wide enough range of domains
  – is tractable
  – provides opportunities for generalization
  – enables smart exploration
• Strike a balance between generality and tractability.
OO representation
• Problem defined by a set of objects and their attributes.
• Example: Objects in Pitfall defined by a bounding box on a set of pixels, based on color:
  Man.<x,y>
  Hole.<x,y>
  Ladder.<x,y>
  Wall.<x,y>
  Log.<x,y>
• State is the union of all objects' attribute values.
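As a concrete illustration, a minimal Python sketch of such a state (the class and helper names are illustrative, not the paper's code):

```python
# Sketch of an OO-MDP state: a set of objects, each with attribute values.
# Object and attribute names mirror the Pitfall example above.

class Obj:
    def __init__(self, name, **attrs):
        self.name = name
        self.attrs = dict(attrs)  # e.g. {'x': 10, 'y': 4}

def state_of(objects):
    """A state is the union of all objects' attribute values."""
    return tuple(sorted((o.name, a, v)
                        for o in objects
                        for a, v in o.attrs.items()))

man = Obj('Man', x=10, y=4)
ladder = Obj('Ladder', x=10, y=0)
print(state_of([man, ladder]))
```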
OO representation
• For any given state s, there is a function c(s) that tells us which relations occur under s.
• Dynamics defined by preconditions and effects.
• Preconditions are conjunctions of terms:
  – Relations between objects:
    • touchN/S/E/W(object_i, object_j)
    • on(object_i, object_j)
  – Any (boolean) function on the attributes.
  – Any other function encoding prior knowledge.
• Actions have effects that determine how objects' attributes get modified.
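The relation function c(s) can be sketched as follows; the relation definitions here (unit-width objects on grid coordinates) are simplifying assumptions, not the paper's:

```python
# Sketch of c(s): given a state, return the set of relations that hold.
# touchE and on are simplified, illustrative definitions.

def touchE(a, b):
    # b is immediately to the east of a (assumes unit-width grid objects)
    return a['y'] == b['y'] and a['x'] + 1 == b['x']

def on(a, b):
    # a occupies the same cell as b (simplification)
    return a['x'] == b['x'] and a['y'] == b['y']

def c(state):
    """state: dict mapping object name -> attribute dict."""
    rels = set()
    for name_a, a in state.items():
        for name_b, b in state.items():
            if name_a == name_b:
                continue
            if touchE(a, b):
                rels.add(f'touchE({name_a},{name_b})')
            if on(a, b):
                rels.add(f'on({name_a},{name_b})')
    return rels

s = {'Man': {'x': 9, 'y': 0}, 'Wall': {'x': 10, 'y': 0}}
print(c(s))  # {'touchE(Man,Wall)'}
```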
Example: Action Up
  Condition: on(Man, Ladder)
  Effect: Man.y = Man.y + 8
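The Up example reads as a condition-effect rule; a minimal sketch, with the on() relation simplified to a coordinate check:

```python
# Sketch of OO-MDP dynamics as a condition-effect pair for the Up action.
# The on() relation is a simplified stand-in, not the paper's definition.

def on(a, b):
    # Hypothetical: Man is on the Ladder when their x coordinates match.
    return a['x'] == b['x']

def action_up(man, ladder):
    """If the precondition on(Man, Ladder) holds, apply the effect Man.y += 8."""
    if on(man, ladder):
        return dict(man, y=man['y'] + 8)  # effect: Man.y = Man.y + 8
    return man  # precondition fails: no effect

man = {'x': 10, 'y': 4}
ladder = {'x': 10, 'y': 0}
print(action_up(man, ladder))  # precondition holds, y goes from 4 to 12
```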
DOORMax
• An algorithm for efficient learning of deterministic OO-MDPs.
• When objects interact and an effect is observed, DOORMax learns the conjunction of terms that enabled the effect.
• Belongs to the R-Max family of algorithms:
  – Guides exploration to make objects interact
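A conjunction of enabling terms can be learned from positive examples by intersection; a simplified sketch of that idea (DOORMax itself additionally distinguishes effect types and handles no-effect observations):

```python
# Sketch: learn the conjunctive condition for an effect by intersecting
# the sets of terms that were true each time the effect was observed.
# This is a generic positive-example conjunction learner, not the full
# DOORMax algorithm.

def learn_condition(observations):
    """observations: list of sets of terms true when the effect occurred."""
    cond = None
    for terms in observations:
        cond = set(terms) if cond is None else cond & terms
    return cond

obs = [
    {'on(Man,Ladder)', 'touchE(Man,Wall)'},
    {'on(Man,Ladder)', 'touchW(Man,Log)'},
]
print(learn_condition(obs))  # only on(Man,Ladder) survives the intersection
```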
Pitfall video
DOORMax Analysis
• Let n be the number of terms.
• Assume that:
  – The number of effects per action is bounded by a (small) constant m.
  – Each effect has a unique conjunctive condition.
• As long as effects are observed (that is, some effect occurs given an action a), DOORMax will learn the condition-effect pairs that determine the dynamics of a in O(nm). There is a worst-case bound, when lots of no-effects are observed, of O(nm).
Results
What about this game?
Videogame
Representations in Taxi
Algorithm        # of steps   Time per step   What we have to tell it
Q-learning       47157        <1ms            #states, #actions
MaxQ             6298         10ms            Task hierarchy, DBNs for each task
Flat Rmax        4151         ~40ms           #states, #actions
Factored Rmax    1676         44ms            DBN
DSHP             319          11ms            Task hierarchy, DBNs for each task
DOORMax          529          14ms            Object representation
Bigger Taxi
                                 Taxi 5x5     Taxi 10x10   Ratio
# States                         500          7200         14.40
Factored Rmax                    1676 steps   19100 steps   11.39
DOORmax                          529 steps    821 steps     1.55
DOORmax with transfer from 5x5   529 steps    529 steps     1
Conclusions and future work
• OO-MDPs provide a natural way of modeling an interesting set of domains, while enabling generalization and smart exploration.
• DOORMax learns deterministic OO-MDPs, outperforming state-of-the-art algorithms for factored-state representations.
• DOORMax scales very nicely with respect to the size of the state space, as long as transition dynamics between objects do not change.
• We do not have a provably efficient algorithm for stochastic OO-MDPs.
• We do not yet handle inheritance between classes of objects.