An Object-oriented Representation for Efficient Reinforcement Learning
Carlos Diuk, Andre Cohen and Michael L. Littman
Rutgers Laboratory for Real-Life Reinforcement Learning (RL)3
Department of Computer ScienceRutgers University (New Jersey, USA)
ICML 2008 – Helsinki, Finland
Motivation
How would YOU play this game?
What’s in a state?
A simple hash code that tells you if you’ve been “there” before.
What we (the agent) can actually “see”: objects, interactions, spatial relationships.
s1 -> a0 -> s5
s5 -> a2 -> s24
s24 -> a1 -> s1
If we know that our agents are interacting spatially with objects, let's just tell them so.
What we did
• Grab ideas from Relational RL and come up with a representation that:
  – is suitable for a wide enough range of domains
  – is tractable
  – provides opportunities for generalization
  – enables smart exploration
• Strike a balance between generality and tractability.
OO representation
• Problem defined by a set of objects and their attributes.
• Example: Objects in Pitfall defined by a bounding box on a set of pixels, based on color:
  Man.<x,y>
  Hole.<x,y>
  Ladder.<x,y>
  Wall.<x,y>
  Log.<x,y>
• State is the union of all objects' attribute values.
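As a concrete illustration, a minimal Python sketch of such a state (the class and helper names are illustrative, not the paper's code):

```python
# Sketch of an OO-MDP state: a set of objects, each with attribute values.
# Object and attribute names mirror the Pitfall example above.

class Obj:
    def __init__(self, name, **attrs):
        self.name = name
        self.attrs = dict(attrs)  # e.g. {'x': 10, 'y': 4}

def state_of(objects):
    """A state is the union of all objects' attribute values."""
    return tuple(sorted((o.name, a, v)
                        for o in objects
                        for a, v in o.attrs.items()))

man = Obj('Man', x=10, y=4)
ladder = Obj('Ladder', x=10, y=0)
print(state_of([man, ladder]))
```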
OO representation
• For any given state s, there is a function c(s) that tells us which relations occur under s.
• Dynamics defined by preconditions and effects.
• Preconditions are conjunctions of terms:
  – Relations between objects:
    • touchN/S/E/W(object_i, object_j)
    • on(object_i, object_j)
  – Any (boolean) function on the attributes.
  – Any other function encoding prior knowledge.
• Actions have effects that determine how objects' attributes get modified.
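The relation function c(s) can be sketched as follows; the relation definitions here (unit-width objects on grid coordinates) are simplifying assumptions, not the paper's:

```python
# Sketch of c(s): given a state, return the set of relations that hold.
# touchE and on are simplified, illustrative definitions.

def touchE(a, b):
    # b is immediately to the east of a (assumes unit-width grid objects)
    return a['y'] == b['y'] and a['x'] + 1 == b['x']

def on(a, b):
    # a occupies the same cell as b (simplification)
    return a['x'] == b['x'] and a['y'] == b['y']

def c(state):
    """state: dict mapping object name -> attribute dict."""
    rels = set()
    for name_a, a in state.items():
        for name_b, b in state.items():
            if name_a == name_b:
                continue
            if touchE(a, b):
                rels.add(f'touchE({name_a},{name_b})')
            if on(a, b):
                rels.add(f'on({name_a},{name_b})')
    return rels

s = {'Man': {'x': 9, 'y': 0}, 'Wall': {'x': 10, 'y': 0}}
print(c(s))  # {'touchE(Man,Wall)'}
```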
Example: Action Up
  Condition: on(Man, Ladder)
  Effect: Man.y = Man.y + 8
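The Up example reads as a condition-effect rule; a minimal sketch, with the on() relation simplified to a coordinate check:

```python
# Sketch of OO-MDP dynamics as a condition-effect pair for the Up action.
# The on() relation is a simplified stand-in, not the paper's definition.

def on(a, b):
    # Hypothetical: Man is on the Ladder when their x coordinates match.
    return a['x'] == b['x']

def action_up(man, ladder):
    """If the precondition on(Man, Ladder) holds, apply the effect Man.y += 8."""
    if on(man, ladder):
        return dict(man, y=man['y'] + 8)  # effect: Man.y = Man.y + 8
    return man  # precondition fails: no effect

man = {'x': 10, 'y': 4}
ladder = {'x': 10, 'y': 0}
print(action_up(man, ladder))  # precondition holds, y goes from 4 to 12
```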
DOORMax
• An algorithm for efficient learning of deterministic OO-MDPs.
• When objects interact and an effect is observed, DOORMax learns the conjunction of terms that enabled the effect.
• Belongs to the R-Max family of algorithms:
  – Guides exploration to make objects interact
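A conjunction of enabling terms can be learned from positive examples by intersection; a simplified sketch of that idea (DOORMax itself additionally distinguishes effect types and handles no-effect observations):

```python
# Sketch: learn the conjunctive condition for an effect by intersecting
# the sets of terms that were true each time the effect was observed.
# This is a generic positive-example conjunction learner, not the full
# DOORMax algorithm.

def learn_condition(observations):
    """observations: list of sets of terms true when the effect occurred."""
    cond = None
    for terms in observations:
        cond = set(terms) if cond is None else cond & terms
    return cond

obs = [
    {'on(Man,Ladder)', 'touchE(Man,Wall)'},
    {'on(Man,Ladder)', 'touchW(Man,Log)'},
]
print(learn_condition(obs))  # only on(Man,Ladder) survives the intersection
```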
Pitfall video
DOORMax Analysis
• Let n be the number of terms.
• Assume that:
  – The number of effects per action is bounded by a (small) constant m.
  – Each effect has a unique conjunctive condition.
• As long as effects are observed (that is, some effect occurs given an action a), DOORMax will learn the condition-effect pairs that determine the dynamics of a in O(nm). There is a worst-case bound, when lots of no-effects are observed, of O(nm).
Results
What about this game?
Videogame
Representations in Taxi
Algorithm        # of steps   Time per step   What we have to tell it
Q-learning       47157        <1ms            #states, #actions
MaxQ             6298         10ms            Task hierarchy, DBNs for each task
Flat Rmax        4151         ~40ms           #states, #actions
Factored Rmax    1676         44ms            DBN
DSHP             319          11ms            Task hierarchy, DBNs for each task
DOORMax          529          14ms            Object representation
Bigger Taxi
                                 Taxi 5x5     Taxi 10x10   Ratio
# States                         500          7200         14.40
Factored Rmax                    1676 steps   19100 steps   11.39
DOORmax                          529 steps    821 steps     1.55
DOORmax with transfer from 5x5   529 steps    529 steps     1
Conclusions and future work
• OO-MDPs provide a natural way of modeling an interesting set of domains, while enabling generalization and smart exploration.
• DOORMax learns deterministic OO-MDPs, outperforming state-of-the-art algorithms for factored-state representations.
• DOORMax scales very nicely with respect to the size of the state space, as long as transition dynamics between objects do not change.
• We do not have a provably efficient algorithm for stochastic OO-MDPs.
• We do not yet handle inheritance between classes of objects.