Learning Algorithms for Small Mobile Robots: Case Study on Maze Exploration

Stanislav Slušný, Roman Neruda and Petra Vidnerová
Institute of Computer Science, Academy of Sciences of the Czech Republic
Pod vodárenskou věží 2, Prague 8, Czech Republic
slusny@cs.cas.cz

Abstract. The emergence of intelligent behavior within a simple robotic agent is studied in this paper. Two control mechanisms for the agent are considered: relational reinforcement learning, a recent direction in reinforcement learning, and a radial basis function neural network trained by an evolutionary algorithm. Relational reinforcement learning is an interdisciplinary approach combining logic programming with traditional reinforcement learning. Radial basis function networks offer wider interpretation possibilities than the commonly used multilayer perceptrons. Results are discussed on the maze exploration problem.

1 Introduction

One of the key questions of Artificial Intelligence is how to design intelligent agents. Several approaches have been studied so far; in our previous work we have examined mainly Evolutionary Robotics (ER). The ER approach attacks the problem through a self-organization process based on artificial evolution [13]. The robot control system is typically realized by a neural network, which provides a direct mapping from the robot's sensors to its effectors. Most current applications use traditional multi-layer perceptron networks. In our approach we utilize a local-unit network architecture called the radial basis function (RBF) network, which has competitive performance, more learning options, and (due to its local nature) better interpretation possibilities [18, 19].

This article summarizes our experiences and compares them to Reinforcement Learning (RL), another widely studied approach in Artificial Intelligence. RL focuses on an agent that interacts with the environment through its sensors and effectors. This interaction process helps the agent to learn effective behavior. These kinds of tasks are commonly studied on miniature mobile robots such as the Khepera [2] and the e-puck [1].

2 Related work

The book [16] provides a comprehensive introduction to ER, with a focus on robot systems. Recently, effort has been made to study the emergence of intelligent behavior within groups of robots. Pioneering work was done by Martinoli [14], who solved a task in which a group of simulated Khepera robots had to find "food items" randomly distributed in an arena; the control system was developed by artificial evolution. Our work with a single robot and with robot teams was published in [19, 18]. Reinforcement learning has been gaining increasing attention in recent years; a basic overview of the field can be found in [20].

3 Evolutionary robotics

Evolutionary algorithms (EA) [13, 12] represent a stochastic search technique used to find approximate solutions to optimization and search problems. They use techniques inspired by evolutionary biology such as mutation, selection, and crossover. An EA typically works with a population of individuals representing abstract encodings of feasible solutions. Each individual is assigned a fitness that measures how good a solution it represents: the better the solution, the higher the fitness value. The population evolves toward better solutions. The evolution starts from a population of completely random individuals and iterates in generations. In each generation, the fitness of each individual is evaluated. Individuals are stochastically selected from the current population (based on their fitness) and modified by the mutation and crossover operators to form a new population, which is then used in the next iteration of the algorithm.

Feed-forward neural networks used as robot controllers are encoded in order to use them in the evolutionary algorithm. Each network is represented as a floating-point vector of real parameters determining the network weights. Typical evolutionary operators for this case have been used, namely uniform crossover and a mutation which performs a slight additive change in the parameter value. The rate of these operators is kept rather high, ensuring the exploration capabilities of the evolutionary learning. A standard roulette-wheel selection is used together with a small elitism rate. A detailed discussion of the fitness function is presented in the experiment section.
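To make this encoding concrete, the following minimal sketch shows a direct floating-point genome together with the uniform crossover and additive mutation operators just described. The genome length and operator rates are illustrative assumptions, not the values used in our experiments; roulette-wheel selection and elitism are omitted.

```python
import random

# Illustrative settings only -- not the parameters used in the experiments.
GENOME_LEN = 40   # number of network weights encoded as floats (assumed)
P_MUT = 0.3       # per-gene probability of an additive change (assumed)
MUT_STEP = 0.2    # maximal size of the additive change (assumed)

def random_genome():
    """A controller is encoded directly as a vector of real-valued weights."""
    return [random.uniform(-1.0, 1.0) for _ in range(GENOME_LEN)]

def uniform_crossover(parent_a, parent_b):
    """Uniform crossover: each gene is swapped between parents with probability 0.5."""
    child_a, child_b = parent_a[:], parent_b[:]
    for i in range(len(parent_a)):
        if random.random() < 0.5:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b

def additive_mutation(genome):
    """Mutation performs a slight additive change of selected parameter values."""
    return [g + random.uniform(-MUT_STEP, MUT_STEP) if random.random() < P_MUT else g
            for g in genome]
```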
4 Relational Reinforcement Learning

The lack of theoretical insight into EAs is the most serious problem of the previous approach. RL, in contrast, is based on dynamic programming [6], which has been studied for more than 50 years. It has a solid theoretical background built around Markov chains and several proven fundamental results. On the other hand, it is usually not possible to fulfill the theoretical assumptions in experiments.

The general model of agent-environment interaction is formulated through the notion of rewards. The essential assumption of RL is that the agent is able to sense rewards coming from the environment. Rewards evaluate the actions taken, and the agent's task is to maximize them. The next assumption is that the agent works in discrete time steps. The symbol S will denote a finite discrete set of states and the symbol A a set of actions. In each time step t, the agent determines its actual state and chooses one action. Therefore, the agent's life can be written as a sequence

    s_0 a_0 r_0 s_1 a_1 r_1 ...    (1)

where s_t denotes the state, determined by processing the sensor input, a_t ∈ A is the action, and r_t ∈ R is the reward received at time t. Formally, the agent's task is to maximize

    V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... = Σ_{i=0}^{∞} γ^i r_{t+i},    (2)

where the quantity V^π(s_t) [16] is called the discounted cumulative reward. It tells us what reward can be expected if the agent starts in state s_t and follows policy π; 0 ≤ γ < 1 is a constant that determines the relative value of delayed versus immediate rewards.

The most serious assumption of RL algorithms is the Markov property, which states that the agent does not need the history of previous states to make a decision; the decision is based on the last state s_t only. When this property holds, we can use theory coming from the field of Markov decision processes (MDP).

The policy π, which determines what action is chosen in a particular state, can be defined as a function π : S → A with π(s_t) = a_t. The agent's task is to find the optimal strategy π*, i.e. the one that maximizes the expected reward. In an MDP, a single optimal deterministic strategy always exists, no matter in what state the agent has started. The optimal strategy π* can be defined as

    π* = argmax_π V^π(s), ∀s ∈ S.    (3)

To simplify the notation, let us write V*(s) instead of V^{π*}(s), the value function corresponding to the optimal strategy π*:

    V*(s) = max_π V^π(s).    (4)

The first breakthrough of RL was the Q-learning algorithm [21, 4], which computes the optimal strategy under the described conditions. The key idea of the algorithm is to define the so-called Q-values. Q^π(s, a) is the expected reward if the agent takes action a in state s and then follows policy π:

    Q^π(s, a) = r(s, a) + γ V^π(s'),    (5)

where s' is the state the agent reaches by taking action a in state s (s' = δ(s, a)).

Q-learning is probably the most commonly used RL algorithm, mainly because of its simplicity. However, several improvements have been suggested to speed it up: in real-life applications the state space is usually too big and convergence toward the optimal strategy is slow. In recent years much effort has been devoted to rethinking the idea of states by using function approximators [7] and by defining the notions of options and hierarchical abstractions [5]. Relational reinforcement learning [11] is an approach that combines RL with Inductive Logic Programming.

The distinction between classical RL and relational reinforcement learning lies in the way the Q-values are represented. In the classical Q-learning algorithm the Q-values are stored in a table. In the relational version of the algorithm, they are stored in a structure called a logical decision tree [8]. In our experiments, we have used logical decision trees as implemented in the program TILDE [8] from the ACE-ilProlog package [9]. The Q-learning scheme is summarized in Fig. 1.

– for each s, a do
  • initialize the table entry Q'(s, a) = 0
  • e = 0
– do forever
  • e = e + 1
  • i = 0
  • generate a random state s_0
  • while not goal(s_i) do
    ∗ select an action a_i and execute it
    ∗ receive an immediate reward r_i = r(s_i, a_i)
    ∗ observe the new state s_{i+1}
    ∗ i = i + 1
  • endwhile
  • for j = i − 1 to 0 do
    ∗ update Q'(s_j, a_j) = r_j + γ max_{a'} Q'(s_{j+1}, a')

Fig. 1. Scheme of the Q-learning algorithm, taken from [11].
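For concreteness, here is a minimal table-based sketch of the episodic scheme in Fig. 1. The environment interface (`random_state`, `is_goal`, `choose_action`, `step`) is hypothetical and stands in for the simulator; only the backward Q-update follows the figure.

```python
from collections import defaultdict

GAMMA = 0.9  # discount factor, 0 <= gamma < 1

def zero_q_table():
    """Table of Q'(s, a) values; entries default to 0 as in the initialization step of Fig. 1."""
    return defaultdict(float)

def q_learning_episode(q, actions, random_state, is_goal, choose_action, step):
    """One episode of the Fig. 1 scheme: act until a goal state is reached,
    then update the Q-table backwards over the visited transitions.
    The callbacks random_state, is_goal, choose_action and step describe the
    environment and are our assumptions, not part of the original paper."""
    s = random_state()
    trajectory = []                            # (s_i, a_i, r_i, s_{i+1}) tuples
    while not is_goal(s):
        a = choose_action(q, s, actions)       # e.g. epsilon-greedy selection
        r, s_next = step(s, a)                 # immediate reward and new state
        trajectory.append((s, a, r, s_next))
        s = s_next
    # Backward pass: Q'(s_j, a_j) = r_j + gamma * max_a' Q'(s_{j+1}, a')
    for s_j, a_j, r_j, s_next in reversed(trajectory):
        best_next = max(q[(s_next, a)] for a in actions)
        q[(s_j, a_j)] = r_j + GAMMA * best_next
```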
5 Evolutionary RBF Networks

Evolutionary robotics combines two AI approaches: neural networks and evolutionary algorithms. The neural network receives input values from the robot's sensors and outputs control signals to the wheels; in this way it realizes the control system of the robot. Evolutionary algorithms [13, 12] are then used to train such a network. It would be difficult to apply traditional supervised learning algorithms, since they require instant feedback in each step: here we can typically evaluate each run of the robot as good or bad, but it is impossible to assess each single move. Thus, the evolutionary algorithm represents one of the few possibilities for training the network.

The RBF network [17, 15, 10] used in this work is a feed-forward neural network with one hidden layer of RBF units and a linear output layer. A hidden unit computes

    y(x) = φ(||x − c|| / b),    (6)

and the network function is given by

    f_s(x) = Σ_{j=1}^{h} w_{js} φ(||x − c_j|| / b_j),    (7)

where f_s is the output of the s-th output unit, y is the output of a hidden unit, and φ is an activation function, typically the Gaussian φ(s) = e^{−s^2}.

Fig. 2. A scheme of a Radial Basis Function Network.

The evolutionary algorithm is summarized in Fig. 3. It works with a population of individuals representing abstract encodings of feasible solutions. Each individual is assigned a fitness that measures how good a solution it represents. The evolution starts from a population of completely random individuals and iterates in generations. Individuals are stochastically selected from the current population (based on their fitness) and modified by means of the genetic operators crossover and mutation to form a new generation.

1. START: Create population P(0) = {I_1, ..., I_N}.
2. FITNESS EVALUATION: Evaluate the fitness of each individual.
3. TEST: If the stop criterion is satisfied, return the solution.
4. NEW GENERATION: Create an empty population P(i + 1) and repeat the following procedure until P(i + 1) has N individuals:
   i) Selection: select two individuals from P(i): I_1 ← selection(P_i), I_2 ← selection(P_i).
   ii) Crossover: with probability p_c: (I_1, I_2) ← crossover(I_1, I_2).
   iii) Mutation: with probability p_m: I_k ← mutate(I_k), k = 1, 2.
   iv) Insert: insert I_1, I_2 into P_{i+1}.
5. LOOP: Go to step 2.

Fig. 3. Scheme of an evolutionary algorithm.

In the case of RBF network learning, each individual encodes one RBF network. The individual consists of h blocks

    I_RBF = {B_1, ..., B_h},    (8)

where h is the number of hidden units. Each block contains the parameter values of one RBF unit:

    B_k = {c_{k1}, ..., c_{kn}, b_k, w_{k1}, ..., w_{km}},    (9)

where n is the number of inputs, m is the number of outputs, c_k = {c_{k1}, ..., c_{kn}} is the k-th unit's centre, b_k its width, and w_k = {w_{k1}, ..., w_{km}} the weights connecting the k-th hidden unit to the output layer. The parameter values are encoded using direct floating-point encoding. We use standard tournament selection, 1-point crossover and additive mutation. Additive mutation changes the values in the individual by adding a small value randomly drawn from ⟨−ε, ε⟩.

The fitness function should reflect how good the robot is at the given task and is therefore always problem dependent. A detailed description of the fitness function is included in the experiment section.
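As an illustration of this encoding, the sketch below decodes an individual of the form (8)–(9) and evaluates the corresponding RBF network of Eq. (7). The dimensions are passed in as parameters and the function names are ours.

```python
import math

def rbf_network_output(individual, x, n_inputs, n_outputs):
    """Evaluate the RBF network of Eq. (7).

    `individual` is a flat list of h blocks, each block holding
    [c_k1, ..., c_kn, b_k, w_k1, ..., w_km] as in Eq. (9)."""
    block_len = n_inputs + 1 + n_outputs
    assert len(individual) % block_len == 0, "malformed individual"
    outputs = [0.0] * n_outputs
    for start in range(0, len(individual), block_len):
        block = individual[start:start + block_len]
        centre = block[:n_inputs]
        width = block[n_inputs]
        weights = block[n_inputs + 1:]
        # Gaussian unit of Eq. (6): phi(||x - c|| / b) with phi(s) = exp(-s^2)
        distance = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, centre)))
        activation = math.exp(-(distance / width) ** 2)
        # Linear output layer: weighted sum of the unit activations
        for s in range(n_outputs):
            outputs[s] += weights[s] * activation
    return outputs
```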
6 Experiments

In order to compare the performance and properties of the described algorithms, we conducted a simulated experiment. A miniature robot of the e-puck type [1] was trained to explore the environment and avoid walls. The e-puck is a mobile robot with a diameter of 70 mm and a weight of 50 g. It is supported by two lateral wheels that can rotate in both directions and by two rigid pivots at the front and the back. The sensory system employs eight "active infrared light" sensors distributed around the body, six on one side and two on the other side. In "passive mode" they measure the amount of infrared light in the environment, which is roughly proportional to the amount of visible light. In "active mode" the sensors emit a ray of infrared light and measure the amount of reflected light: the closer they are to a surface, the higher the amount of infrared light measured. The e-puck sensors can detect a white paper at a maximum distance of approximately 8 cm. Sensors return values from the interval [0, 4095]. Effectors accept values from the interval [−1000, 1000]; the higher the value, the faster the motor moves.

Fig. 4. Miniature e-puck robot.
Fig. 5. Agent was trained in a simulated environment of size 100 x 60 cm.
Fig. 6. Simulated testing environment of size 110 x 100 cm.

Sensor value   Meaning
0-50           NOWHERE
51-300         FEEL
301-500        VERYFAR
501-1000       FAR
1001-2000      NEAR
2001-3000      VERYNEAR
3001-4095      CRASHED

Table 1. Sensor values and their meaning.

Without any preprocessing of the sensor and effector values, the state space would be too big. Therefore, instead of raw sensor values, the learning algorithms worked with "perceptions": instead of thousands of raw sensor values, we used only the few symbolic perceptions listed in Table 1. Effector values were processed in a similar way: instead of 2000 values, the learning algorithm chose from the values [−500, −100, 200, 300, 500]. To reduce the state space even more, we grouped pairs of sensors together, and the back sensors were not used at all.

The agent was trained in a simulated environment of size 100 x 60 cm and tested in a more complex environment of size 110 x 100 cm. We used the Webots [3] simulation software. The simulation process consisted of a predefined number of steps; in each simulation step the agent processed the sensor values and set the speed of the left and right motor. One simulation step took 32 ms.
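A minimal sketch of the sensor and effector preprocessing described above, using the thresholds of Table 1 and the five admissible motor speeds. The way the eight sensors are grouped into pairs is our assumption, since the text states only that pairs were grouped and the back sensors were ignored.

```python
# Upper bound (inclusive) of each perception range from Table 1.
PERCEPTIONS = [
    (50, "NOWHERE"), (300, "FEEL"), (500, "VERYFAR"), (1000, "FAR"),
    (2000, "NEAR"), (3000, "VERYNEAR"), (4095, "CRASHED"),
]

# Discrete motor commands the learning algorithms choose from.
MOTOR_VALUES = [-500, -100, 200, 300, 500]

def perceive(raw_value):
    """Map a raw sensor reading in [0, 4095] to its symbolic perception."""
    for upper, name in PERCEPTIONS:
        if raw_value <= upper:
            return name
    raise ValueError("sensor value out of range")

def discretize_state(sensor_values, pairs=((0, 1), (2, 3), (4, 5))):
    """Reduce the eight raw readings to a small symbolic state.

    `pairs` indexes the front sensor pairs that are kept (assumed grouping);
    the remaining (back) sensors are ignored, as described in the text."""
    return tuple(perceive(max(sensor_values[i], sensor_values[j])) for i, j in pairs)
```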
6.1 Evolutionary RBF Networks

The evolutionary RBF networks were applied to the maze exploration task. The network input and output values are preprocessed in the same way as for the reinforcement learning.

To stimulate maze exploration, the agent is rewarded when it passes through a zone. The zone is a randomly located area which cannot be sensed by the agent. Accordingly, Δ_j is 1 if the agent passed through the zone in the j-th trial and 0 otherwise. The fitness value is then computed as

    Fitness = Σ_{j=1}^{4} (S_j + Δ_j),    (10)

where the quantity S_j is computed by summing the normalized trial gains T_{k,j} over the simulation steps k of trial j:

    S_j = (1/800) Σ_{k=1}^{800} T_{k,j}.    (11)

The three-component term T_{k,j} motivates the agent to move and to avoid obstacles:

    T_{k,j} = V_{k,j} (1 − √(ΔV_{k,j})) (1 − i_{k,j}).    (12)

The first component V_{k,j} is computed by summing the absolute values of the motor speeds in the k-th simulation step of the j-th trial, generating a value between 0 and 1. The second component (1 − √(ΔV_{k,j})) encourages the two wheels to rotate in the same direction. The last component (1 − i_{k,j}) supports the agent's ability to avoid obstacles: the value i_{k,j} of the most active sensor in the k-th simulation step of the j-th trial provides a conservative measure of how close the robot is to an object (the closer it is, the higher the value, in the range from 0 to 1). Thus T_{k,j} lies in the range from 0 to 1, too.
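The fitness of Eqs. (10)–(12) can be computed along the following lines; the normalization of V_{k,j} and ΔV_{k,j} by the maximal motor speed is our assumption, as the text only states that each factor lies between 0 and 1.

```python
import math

MAX_SPEED = 500.0        # largest admissible motor command (normalization assumption)
STEPS_PER_TRIAL = 800    # simulation steps per trial

def step_gain(left, right, nearest_obstacle):
    """T_{k,j} of Eq. (12).

    `left` and `right` are the chosen motor speeds; `nearest_obstacle` is the
    value i_{k,j} of the most active sensor, already scaled to [0, 1]."""
    v = (abs(left) + abs(right)) / (2 * MAX_SPEED)   # V_{k,j}: rewards movement
    dv = abs(left - right) / (2 * MAX_SPEED)         # Delta V_{k,j}: penalizes spinning
    return v * (1 - math.sqrt(dv)) * (1 - nearest_obstacle)

def trial_score(step_gains, passed_zone):
    """S_j + Delta_j for one trial, following Eqs. (10)-(11)."""
    s_j = sum(step_gains) / STEPS_PER_TRIAL
    return s_j + (1 if passed_zone else 0)

def fitness(trials):
    """Eq. (10): sum over the four trials; each trial is a pair
    (list of per-step gains, flag whether the hidden zone was reached)."""
    return sum(trial_score(gains, zone) for gains, zone in trials)
```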
The experiment was repeated 10 times, each run lasting 200 generations (each generation corresponding to 800 simulation steps). In all cases a successful behavior was found, i.e. the evolved robot was able to explore the whole maze without crashing into the walls. See Fig. 7 for the mean, minimal and maximal fitness over the 10 runs.

Fig. 7. The mean, minimal and maximal fitness function over 10 runs of evolution. Fitness is scaled so that a successful walk through the whole maze corresponds to a fitness of 600 or higher.

Table 2 and Figure 8 show the parameters of an evolved network with five RBF units. For the sake of clarity, the listed parameters are also discretized. We can understand them as rules providing a mapping from the input sensor space to motor control. However, these "rules" act in accord, since the whole network computes a linear sum of the five corresponding Gaussians.

Sensor (left, front, right)       Width   Motor (left, right)
VERYNEAR  NEAR     VERYFAR        1.56     500  -100
FEEL      NOWHERE  NOWHERE        1.93    -500   500
NEAR      NEAR     NOWHERE        0.75     500  -500
FEEL      NOWHERE  NEAR           0.29     500  -500
VERYFAR   NOWHERE  NOWHERE        0.16     500   500

Table 2. Rules represented by the RBF units (the listed values are the original RBF network parameters after discretization).

Fig. 8. The evolved RBF network (see also Tab. 2). Local unit responses plotted in the 2D input space corresponding to the left and right sensory inputs.

6.2 Reinforcement learning

The same experiment has been performed by means of the relational reinforcement learning algorithm described above, in the same simulated environment and under identical conditions. The performance of the reinforcement learning agent is shown in Figure 9. The graph shows the average number of steps in each learning episode. It can be seen that after 10 000 episodes the agent has learned the successful behavior. This number roughly corresponds to the time complexity of the GA, where 200 generations of 50 individuals also result in 10 000 simulations. The fitness of the solution found by RL is slightly better than that of the GA-found solution; on the other hand, the inner representation of the neural network is much more compact.

Fig. 9. Learning curve for the reinforcement learning agent, averaged over 10 runs.

7 Discussion

This article presented a survey of popular approaches used in mobile robotics for robot behavior synthesis. In our future work, we would like to design a hybrid intelligent system combining the advantages of these approaches. This way the agent would benefit from three widely studied fields: Inductive Logic Programming, neural networks and reinforcement learning. Reinforcement learning has a strong mathematical background; on the other hand, in real experiments some of its assumptions are not realistic. Neural networks are very popular in robotics because they provide a straightforward mapping from input signals to output signals, several levels of adaptation, and robustness to noise. Inductive logic programming allows the agent to reason about states, thus concentrating attention on the most promising parts of the state space.

The experiments showed that preprocessing plays a rather important role in the case of robotic agent control. In our approach we have chosen a rather strong preprocessing of inputs and outputs, which is suitable mainly for the RL algorithms. In our future work we would like to study control with less preprocessed inputs and outputs, which can be used mainly for the neural network controller. Another immediate task is to extract the most frequently used state transitions from the RL algorithm and interpret them as rules, in a similar fashion as we did with the RBF network.
Acknowledgements

This work has been supported by the Ministry of Education of the Czech Republic under the project Center of Applied Cybernetics No. 1M684077004 (1M0567). S. Slušný has been partially supported by the Czech Science Foundation under contract no. 201/05/H014G.

References

1. E-puck, online documentation. http://www.e-puck.org.
2. Khepera II documentation. http://k-team.com.
3. Webots simulator. http://www.cyberbotics.com/.
4. A. Barto, S. Bradtke, and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, pages 81–138.
5. A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. 13:341–379.
6. R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.
7. D. Bertsekas and J. Tsitsiklis. Neuro-dynamic Programming. Athena Scientific, 1996.
8. H. Blockeel and L. De Raedt. Top-down induction of first order logical decision trees. Artificial Intelligence, 101:285–297, 1998.
9. H. Blockeel, L. Dehaspe, B. Demoen, G. Janssens, J. Ramon, and H. Vandecasteele. Improving the efficiency of inductive logic programming through the use of query packs. Journal of Artificial Intelligence Research, 16:135–166.
10. D. S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321–355, 1988.
11. S. Dzeroski, L. De Raedt, and K. Driessens. Relational reinforcement learning. Machine Learning, 43:7–52, 2001.
12. D. B. Fogel. Evolutionary Computation: The Fossil Record. MIT-IEEE Press, 1998.
13. J. Holland. Adaptation in Natural and Artificial Systems. MIT Press, reprinted edition, 1992.
14. A. Martinoli. Swarm Intelligence in Autonomous Collective Robotics: From Tools to the Analysis and Synthesis of Distributed Control Strategies. Lausanne: Computer Science Department, EPFL, 1999.
15. J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:289–303, 1989.
16. S. Nolfi and D. Floreano. Evolutionary Robotics — The Biology, Intelligence and Technology of Self-Organizing Machines. The MIT Press, 2000.
17. T. Poggio and F. Girosi. A theory of networks for approximation and learning. Technical report, Cambridge, MA, USA, 1989. A. I. Memo No. 1140, C.B.I.P. Paper No. 31.
18. S. Slušný and R. Neruda. Evolving homing behaviour for team of robots. Computational Intelligence, Robotics and Autonomous Systems. Palmerston North: Massey University, 2007.
19. S. Slušný, R. Neruda, and P. Vidnerová. Evolution of simple behavior patterns for autonomous robotic agent. WSEAS Press, pages 411–417, 2007.
20. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
21. C. J. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.