
Page 1

Hybridizing evolutionary computation and reinforcement learning for the design of almost universal controllers for autonomous robots (Neurocomputing 2009)

Soft Computing Lab.

Yongjun Kim

7th May, 2009

Dario Maravall, Javier de Lope, Jose Antonio Martin H.

Page 2

Outline

• Introduction

• Proposed Approach
  – Guidelines for the design of an optimum controller for autonomous robots by the combination of evolutionary algorithms and RL
  – A hard motion robot control problem
  – Evolving the table of situations-actions
  – The transition from innate behavior to knowledge-based behavior by means of on-line experience

• Experimental results

• Conclusions and further research work

• Discussion

Page 3

Introduction: How important are representations in robotics?

• A representation of the external world is necessary for an agent to carry out a particular task in a specific world.

• The nature and the characteristics of representations depend strongly on the physical nature of the agent itself.

• Robot representations are computational, while human mind representations are “phenomenal”.

• The development of intelligent robots therefore hinges on obtaining proper computational representations of the robot’s environments.

Page 4

Introduction: How important are representations in robotics?

• The dominant approaches are based on the manipulation of mathematical models of the environment with varying levels of formal representation.

– Extremely demanding as regards the reasoning and perceptual abilities of robots.

– No robot has ever been able to navigate in truly real environments.

• The question then is how to progress toward higher levels of robot autonomy.

– Reduce the complexity of both the reasoning and the perceptual tasks to be accomplished by the robots.

– Mainly through a direct coupling between perception and action.

• Adaptation and learning have become the central issues concerning world representations.

Page 5

Introduction: How important are representations in robotics?

• In the reactive approach, robots are practically restricted to the interaction between perception and action.
  – The close coupling between perception and action severely reduces the activities and the possibilities of reactive robots.
    • Reactive robots use only a primitive reasoning ability.

• Reactive navigation schemes can nevertheless produce robots able to adapt to very dynamic, uncertain and unknown environments.

• It is almost compulsory to integrate reactive schemes in any autonomous navigation system (a hybrid navigation system).

• The representation issue in the robotics field coincides with the sensory-motor coordination problem, which is ultimately the problem of designing the robotic controllers.

Page 6

Guidelines for the design of an optimum controller for autonomous robots by the combination of evolutionary algorithms and RL

• Current robotic systems require controllers able to solve complex problems in uncertain and dynamic environments.

• Reinforcement Learning (RL) is particularly attractive and efficient in the very common and hard situation in which the designer does not have all the necessary information.

• The RL approach seamlessly fits the usual modeling of the robot-environment interaction as a Markov decision process (MDP).

– Control techniques for MDPs like dynamic programming may be applied to the design of the robot controller.

• The main drawback of RL is the curse of dimensionality.
  – Hence the proposal to initialize the typical Q-learning algorithm with a look-up table of situations-actions obtained by means of an evolutionary algorithm.

Page 7

Guidelines for the design of an optimum controller for autonomous robots by the combination of evolutionary algorithms and RL

• The proposed method consists of the following stages:
  – The selection and subsequent granulation of the state variables involved, which is a designer-dependent task.
  – The obtaining of the knowledge rule base by means of a genetic algorithm.
  – The starting of the standard Q-learning algorithm to build its Q-table.

Page 8

Guidelines for the design of an optimum controller for autonomous robots by the combination of evolutionary algorithms and RL

• The first step is (1) the identification and selection and (2) the subsequent granulation of the state variables associated with the robot-environment pair.
  – The granulation of state variables can be done by applying one of the following:
    • A fuzzy set concept: a fuzzy knowledge rule base (FKRB).
    • A Boolean sets-based granulation: a Boolean KRB (BKRB).

• The next step is obtaining the knowledge base of production or control rules of the type “if situation then action”.
  – When the number of possible control rules is high, the search for the optimum knowledge base of rules can only be undertaken by efficient, parallel, population-based search techniques like evolutionary algorithms.

• The last step is starting the standard Q-learning algorithm.
  – Exploit the knowledge provided by the genetic algorithm.
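A minimal Python sketch of how these three stages chain together; the function names and arguments are placeholders for illustration, not the paper’s code:

    def design_controller(evolve_rule_table, q_learn_with_innate_policy):
        # Hedged sketch of the proposed design method.
        # Stage 1 (selection and granulation of the state variables) is a
        # design-time step, assumed to have produced the discrete state and
        # action sets used by the two callables below.
        innate_policy = evolve_rule_table()             # Stage 2: GA search of the rule base
        Q = q_learn_with_innate_policy(innate_policy)   # Stage 3: Q-learning seeded by innate behavior
        return Q, innate_policy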

Page 9

A hard motion robot control problem

• A two-link L-shaped robot moving in a cluttered environment with polygonal obstacles.

• The robot has several degrees of freedom:
  – The linear movement along the XY Cartesian axes of the robot’s middle joint (x, y).
  – The rotational movement for controlling the robot’s orientation (Φ).
  – Two additional independent rotational movements around the central joint (θ1, θ2).
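For concreteness, the robot’s configuration could be represented as below; the field names are illustrative only, not taken from the paper:

    from dataclasses import dataclass

    @dataclass
    class LShapedRobotConfig:
        # Configuration of the two-link L-shaped robot (illustrative names).
        x: float        # middle-joint position along the X axis
        y: float        # middle-joint position along the Y axis
        phi: float      # robot orientation (Φ)
        theta1: float   # rotation of the first link around the central joint (θ1)
        theta2: float   # rotation of the second link around the central joint (θ2)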

Page 10

Evolving the table of situations-actions

• Distinguish two different groups or classes of state variables to describe all the possible states:

  – Variables for reaching a desired goal position.
  – Variables for collision avoidance.

• Use three different state variables:
  – εd: the error between the current position and the target position.
    • Zero (Z), small (S), big (B).
  – εΦ: the error between the current orientation and the target orientation.
    • Zero (Z), small (S), big (B).
  – ρo: the distance to the nearest obstacle.
    • Very small and could collide (VS), small (S), big (B), very far (VB).

• Discretize the actions:
  – Positional movements: move forward, move backward, move left, move right, and four diagonal movements.
  – Rotational movements: turn left, turn right.
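A minimal Python sketch of this granulated state space and discretized action set; the label strings and helper names are illustrative, not the paper’s code:

    from itertools import product
    import random

    # Granulation levels for the simple (stick) case: 3 x 3 x 4 = 36 states.
    EPS_D   = ["Z", "S", "B"]          # position error
    EPS_PHI = ["Z", "S", "B"]          # orientation error
    RHO_O   = ["VS", "S", "B", "VB"]   # distance to the nearest obstacle

    # 8 positional movements + 2 rotational movements = 10 discrete actions.
    ACTIONS = ["forward", "backward", "left", "right",
               "fwd-left", "fwd-right", "back-left", "back-right",
               "turn-left", "turn-right"]

    STATES = list(product(EPS_D, EPS_PHI, RHO_O))
    assert len(STATES) == 36 and len(ACTIONS) == 10

    # One candidate controller is a complete situations-actions look-up table;
    # the GA searches over the space of all such tables.
    def random_rule_table():
        return {s: random.choice(ACTIONS) for s in STATES}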

Page 11

Evolving the table of situations-actions

• Encoding mechanism
  – Take the complete knowledge rule base (look-up table) as the genotype of an individual (i.e. the Pittsburgh approach).

• Only the mutation operator is applied, because a meaningful crossover operator could not be clearly identified for the proposed problem.
  – Two possible adjacent actions are defined in an internal table.
  – This internal table is fixed and “circular”.
    • When one of the bounds is encountered, the other bound is used as the value for the mutation.

• Replacing policy
  – The n worst individuals are replaced by new randomly generated ones.
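A hedged Python sketch of the mutation and replacement operators described above, reusing the ACTIONS list and random_rule_table sketch from the previous slide; the mutation rate and the adjacency-by-list-position convention are assumptions, not details from the paper:

    import random

    def mutate(rule_table, actions, rate=0.05):
        # For each state, with some probability replace the action by one of
        # its two adjacent actions in the fixed, "circular" internal table.
        mutated = dict(rule_table)
        for state, action in mutated.items():
            if random.random() < rate:
                i = actions.index(action)
                step = random.choice((-1, +1))                        # one of the two adjacent actions
                mutated[state] = actions[(i + step) % len(actions)]   # wrap around at the bounds
        return mutated

    def replace_worst(population, fitness, n, new_individual):
        # Replacing policy: the n worst individuals are replaced by new
        # randomly generated ones (n >= 1 assumed).
        ranked = sorted(population, key=fitness, reverse=True)
        return ranked[:-n] + [new_individual() for _ in range(n)]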

Page 12

The transition from innate behavior to knowledge-based behavior by means of on-line experience

• A stationary deterministic policy πd commits to a single action choice per state.

– πd : S -> A, πd(s) indicates the action that the agent takes in state s.

• The goal is to produce a robot controller that is initially based on its innate behavior and undergoes a transition to a knowledge-based behavior by means of on-line experience.

– Use RL paradigm.

• A classical temporal difference (TD) learning rule.

– This basic update rule can be directly derived from the formula of the expected value.

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ], where s' is the next state and r the immediate reward.
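In Python, this update could be applied as in the following sketch; the learning-rate and discount-factor values are illustrative assumptions, not values from the paper:

    ALPHA = 0.1   # learning rate (illustrative)
    GAMMA = 0.9   # discount factor (illustrative)

    def td_update(Q, s, a, reward, s_next, actions):
        # Q is a dict keyed by (state, action) pairs.
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += ALPHA * (reward + GAMMA * best_next - Q[(s, a)])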

Page 13

The transition from innate behavior to knowledge-based behavior by means of on-line experience

• It is proved that the Q-learning algorithm reaches an optimal control policy under certain strong assumptions.

• Since Q-learning is an off-policy algorithm, we can separate the policy used to select actions from the policy that the experience-based behavior controller is learning.

– Use πd as the initial behavioral policy while learning a new policy π which will be based on the information provided by the Q-table.

• The method consists of behaving with innate capabilities until enough experience has been gained from the interaction with the environment and the Q-table has converged to the optimal values.

• Once the Q-learning controller has been trained, the policy πd can be inhibited, letting the controller behave with the new policy π.
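A hedged sketch of this innate-to-learned transition, reusing the td_update sketch above; the environment interface (reset/step) is an assumption introduced for illustration:

    def run_episode(env, Q, innate_policy, actions, use_innate=True):
        # Behave with the innate policy pi_d (the GA-evolved look-up table)
        # while Q-learning updates the Q-table off-policy; once the Q-table
        # has converged, call this with use_innate=False so the controller
        # behaves with the learned greedy policy pi instead.
        s = env.reset()
        done = False
        while not done:
            if use_innate:
                a = innate_policy[s]                          # innate behavior pi_d
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])   # learned behavior pi
            s_next, reward, done = env.step(a)
            td_update(Q, s, a, reward, s_next, actions)       # learn from experience either way
            s = s_next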

Page 14

The transition from innate behavior to knowledge-based behavior by means of on-line experience

• The final controller will be an adaptive on-line experience-based behavior controller.

– It will remain adaptive in the sense that it can adapt to changes in the environment.

– It is rational in the sense that it has an action selection mechanism over different choices based on cause and effect relations.

– It will perform on-line, that is, it can learn continuously by direct real-time interaction with the environment.

Page 15

Experimental Results

• Simple case
  – Move a stick in a cluttered environment with unknown obstacles.
  – Consider a total of 36 (3×3×4) possible states and 10 possible robot movements.
  – The dimension of the search space (number of possible situation-action tables) is not extremely high (10^36).
  – The standard Q-learning algorithm can find the optimum solution.
  – The genetic algorithm (GA) can also find the optimum solution.

Page 16

Experimental Results

• Complex case
  – Control a two-link L-shaped robot.
  – Add two new rotational movements, left and right, for each link.
  – Consider a total of 324 (3×3×4×3×3) possible states and 14 possible robot movements.
  – The dimension of the search space (number of possible situation-action tables) is extremely high (14^324).
  – The standard Q-learning algorithm cannot find the optimum solution.
  – The genetic algorithm (GA) can find the optimum solution (within 150 generations).
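The search-space sizes quoted above follow from assigning one action to each granulated state, as this small check shows:

    simple_states  = 3 * 3 * 4            # eps_d x eps_phi x rho_o         -> 36 states
    complex_states = 3 * 3 * 4 * 3 * 3    # plus two granulated link angles -> 324 states

    simple_tables  = 10 ** simple_states   # 10 actions per state -> 10**36 candidate look-up tables
    complex_tables = 14 ** complex_states  # 14 actions per state -> 14**324 candidate look-up tables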

Page 17

Conclusions and further research work

• The advantages and disadvantages of RL are discussed.
  – RL presents very attractive features for real-time and on-line applications.
  – RL sometimes shows difficulties when the dimension of the state space is extremely high.

• Propose the combination of both paradigms for solving real-time, extremely high-dimensional state space problems.
  – Show the efficiency of the hybrid approach on a complex problem.

• Future work
  – Use a fuzzy granulation rather than a Boolean granulation.
  – Use the Michigan approach rather than the Pittsburgh approach in the GA.
    • Encode each individual as a single knowledge rule.

Page 18

Discussion

• Evolutionary algorithms have produced many successful results in various areas, especially when the search space is enormous.

– e.g., Multi-Agent Systems (MAS)

• However, it may be unfair to compare them directly to traditional methods, since they require more computing power.

• The comparison in this paper is especially unfair, since the evolutionary algorithm is given (number of generations × population size) evaluation chances.