Learning Algorithms for Small Mobile Robots: Case Study on Maze Exploration

Stanislav Slušný, Roman Neruda and Petra Vidnerová
Institute of Computer Science, Academy of Sciences of the Czech Republic
Pod vodárenskou věží 2, Prague 8, Czech Republic
slusny@cs.cas.cz

Abstract. The emergence of intelligent behavior within a simple robotic agent is studied in this paper. Two control mechanisms for the agent are considered: relational reinforcement learning, a recent direction in reinforcement learning, and a radial basis function neural network trained by an evolutionary algorithm. Relational reinforcement learning is an interdisciplinary approach combining logic programming with traditional reinforcement learning. Radial basis function networks offer wider interpretation possibilities than the commonly used multilayer perceptrons. Results are discussed on the maze exploration problem.

1 Introduction

One of the key questions of Artificial Intelligence is how to design intelligent agents. Several approaches have been studied so far; in our previous work we have examined mainly Evolutionary Robotics (ER). The ER approach attacks the problem through a self-organization process based on artificial evolution [13]. The robot control system is typically realized by a neural network, which provides a direct mapping from the robot's sensors to its effectors. Most current applications use traditional multi-layer perceptron networks. In our approach we utilize a local-unit network architecture called the radial basis function (RBF) network, which has competitive performance, more learning options, and (due to its local nature) better interpretation possibilities [18, 19].

This article summarizes our experiences and compares them to Reinforcement Learning (RL), another widely studied approach in Artificial Intelligence. RL focuses on an agent that interacts with the environment through its sensors and effectors. This interaction process helps the agent to learn effective behavior. These kinds of tasks are commonly studied on miniature mobile robots such as the Khepera [2] and the e-puck [1].

2 Related work

The book [16] provides a comprehensive introduction to ER, with a focus on robot systems. Recently, effort has been made to study the emergence of intelligent behavior within groups of robots. Pioneering work was done by Martinoli [14], who solved a task in which a group of simulated Khepera robots had to find "food items" randomly distributed in an arena; the control system was developed by artificial evolution. Our work with a single robot and with robot teams was published in [19, 18]. Reinforcement learning has been gaining increasing attention in recent years; a basic overview of the field can be found in [20].

3 Evolutionary robotics

Evolutionary algorithms (EA) [13, 12] represent a stochastic search technique used to find approximate solutions to optimization and search problems. They use techniques inspired by evolutionary biology such as mutation, selection, and crossover. An EA typically works with a population of individuals representing abstract encodings of feasible solutions. Each individual is assigned a fitness that measures how good a solution it represents: the better the solution, the higher the fitness value. The population evolves toward better solutions. The evolution starts from a population of completely random individuals and iterates in generations. In each generation, the fitness of each individual is evaluated. Individuals are stochastically selected from the current population (based on their fitness) and modified by the mutation and crossover operators to form a new population, which is then used in the next iteration of the algorithm.

Feed-forward neural networks used as robot controllers are encoded in order to use them in the evolutionary algorithm. Each network is represented as a floating-point vector of real parameters determining the network weights. Typical evolutionary operators for this case have been used, namely uniform crossover and a mutation which performs a slight additive change in the parameter value. The rate of these operators is kept rather high, ensuring the exploration capabilities of the evolutionary learning. A standard roulette-wheel selection is used together with a small elitism rate. A detailed discussion of the fitness function is presented in the experiment section.
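To make this encoding concrete, the following minimal sketch shows a direct floating-point genome together with the uniform crossover and additive mutation operators just described. The genome length and operator rates are illustrative assumptions, not the values used in our experiments; roulette-wheel selection and elitism are omitted.

```python
import random

# Illustrative settings only -- not the parameters used in the experiments.
GENOME_LEN = 40   # number of network weights encoded as floats (assumed)
P_MUT = 0.3       # per-gene probability of an additive change (assumed)
MUT_STEP = 0.2    # maximal size of the additive change (assumed)

def random_genome():
    """A controller is encoded directly as a vector of real-valued weights."""
    return [random.uniform(-1.0, 1.0) for _ in range(GENOME_LEN)]

def uniform_crossover(parent_a, parent_b):
    """Uniform crossover: each gene is swapped between parents with probability 0.5."""
    child_a, child_b = parent_a[:], parent_b[:]
    for i in range(len(parent_a)):
        if random.random() < 0.5:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b

def additive_mutation(genome):
    """Mutation performs a slight additive change of selected parameter values."""
    return [g + random.uniform(-MUT_STEP, MUT_STEP) if random.random() < P_MUT else g
            for g in genome]
```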
4 Relational Reinforcement Learning

The lack of theoretical insight into EAs is the most serious problem of the previous approach. RL, in contrast, is based on dynamic programming [6], which has been studied for more than 50 years. It has a solid theoretical background built around Markov chains and several proven fundamental results. On the other hand, it is usually not possible to fulfill the theoretical assumptions in experiments.

The general model of agent-environment interaction is formulated through the notion of rewards. The essential assumption of RL is that the agent is able to sense rewards coming from the environment. Rewards evaluate the actions taken, and the agent's task is to maximize them. The next assumption is that the agent works in discrete time steps. The symbol S will denote a finite discrete set of states and the symbol A a set of actions. In each time step t, the agent determines its actual state and chooses one action. Therefore, the agent's life can be written as a sequence

    s_0 a_0 r_0 s_1 a_1 r_1 ...    (1)

where s_t denotes the state, determined by processing the sensor input, a_t ∈ A is the action, and r_t ∈ R is the reward received at time t. Formally, the agent's task is to maximize

    V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... = Σ_{i=0}^{∞} γ^i r_{t+i},    (2)

where the quantity V^π(s_t) [16] is called the discounted cumulative reward. It tells us what reward can be expected if the agent starts in state s_t and follows policy π; 0 ≤ γ < 1 is a constant that determines the relative value of delayed versus immediate rewards.

The most serious assumption of RL algorithms is the Markov property, which states that the agent does not need the history of previous states to make a decision; the decision is based on the last state s_t only. When this property holds, we can use theory coming from the field of Markov decision processes (MDP).

The policy π, which determines what action is chosen in a particular state, can be defined as a function π : S → A with π(s_t) = a_t. The agent's task is to find the optimal strategy π*, i.e. the one that maximizes the expected reward. In an MDP, a single optimal deterministic strategy always exists, no matter in what state the agent has started. The optimal strategy π* can be defined as

    π* = argmax_π V^π(s), ∀s ∈ S.    (3)

To simplify the notation, let us write V*(s) instead of V^{π*}(s), the value function corresponding to the optimal strategy π*:

    V*(s) = max_π V^π(s).    (4)

The first breakthrough of RL was the Q-learning algorithm [21, 4], which computes the optimal strategy under the described conditions. The key idea of the algorithm is to define the so-called Q-values. Q^π(s, a) is the expected reward if the agent takes action a in state s and then follows policy π:

    Q^π(s, a) = r(s, a) + γ V^π(s'),    (5)

where s' is the state the agent reaches by taking action a in state s (s' = δ(s, a)).

Q-learning is probably the most commonly used RL algorithm, mainly because of its simplicity. However, several improvements have been suggested to speed it up: in real-life applications the state space is usually too big and convergence toward the optimal strategy is slow. In recent years much effort has been devoted to rethinking the idea of states by using function approximators [7] and by defining the notions of options and hierarchical abstractions [5]. Relational reinforcement learning [11] is an approach that combines RL with Inductive Logic Programming.

The distinction between classical RL and relational reinforcement learning lies in the way the Q-values are represented. In the classical Q-learning algorithm the Q-values are stored in a table. In the relational version of the algorithm, they are stored in a structure called a logical decision tree [8]. In our experiments, we have used logical decision trees as implemented in the program TILDE [8] from the ACE-ilProlog package [9]. The Q-learning scheme is summarized in Fig. 1.

– for each s, a do
  • initialize the table entry Q'(s, a) = 0
  • e = 0
– do forever
  • e = e + 1
  • i = 0
  • generate a random state s_0
  • while not goal(s_i) do
    ∗ select an action a_i and execute it
    ∗ receive an immediate reward r_i = r(s_i, a_i)
    ∗ observe the new state s_{i+1}
    ∗ i = i + 1
  • endwhile
  • for j = i − 1 to 0 do
    ∗ update Q'(s_j, a_j) = r_j + γ max_{a'} Q'(s_{j+1}, a')

Fig. 1. Scheme of the Q-learning algorithm, taken from [11].
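For concreteness, here is a minimal table-based sketch of the episodic scheme in Fig. 1. The environment interface (`random_state`, `is_goal`, `choose_action`, `step`) is hypothetical and stands in for the simulator; only the backward Q-update follows the figure.

```python
from collections import defaultdict

GAMMA = 0.9  # discount factor, 0 <= gamma < 1

def zero_q_table():
    """Table of Q'(s, a) values; entries default to 0 as in the initialization step of Fig. 1."""
    return defaultdict(float)

def q_learning_episode(q, actions, random_state, is_goal, choose_action, step):
    """One episode of the Fig. 1 scheme: act until a goal state is reached,
    then update the Q-table backwards over the visited transitions.
    The callbacks random_state, is_goal, choose_action and step describe the
    environment and are our assumptions, not part of the original paper."""
    s = random_state()
    trajectory = []                            # (s_i, a_i, r_i, s_{i+1}) tuples
    while not is_goal(s):
        a = choose_action(q, s, actions)       # e.g. epsilon-greedy selection
        r, s_next = step(s, a)                 # immediate reward and new state
        trajectory.append((s, a, r, s_next))
        s = s_next
    # Backward pass: Q'(s_j, a_j) = r_j + gamma * max_a' Q'(s_{j+1}, a')
    for s_j, a_j, r_j, s_next in reversed(trajectory):
        best_next = max(q[(s_next, a)] for a in actions)
        q[(s_j, a_j)] = r_j + GAMMA * best_next
```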
5 Evolutionary RBF Networks

Evolutionary robotics combines two AI approaches: neural networks and evolutionary algorithms. The neural network receives input values from the robot's sensors and outputs control signals to the wheels; in this way it realizes the control system of the robot. Evolutionary algorithms [13, 12] are then used to train such a network. It would be difficult to apply traditional supervised learning algorithms, since they require instant feedback in each step: here we can typically evaluate each run of the robot as good or bad, but it is impossible to assess each single move. Thus, the evolutionary algorithm represents one of the few possibilities for training the network.

The RBF network [17, 15, 10] used in this work is a feed-forward neural network with one hidden layer of RBF units and a linear output layer. A hidden unit computes

    y(x) = φ(||x − c|| / b),    (6)

and the network function is given by

    f_s(x) = Σ_{j=1}^{h} w_{js} φ(||x − c_j|| / b_j),    (7)

where f_s is the output of the s-th output unit, y is the output of a hidden unit, and φ is an activation function, typically the Gaussian φ(s) = e^{−s^2}.

Fig. 2. A scheme of a Radial Basis Function Network.

The evolutionary algorithm is summarized in Fig. 3. It works with a population of individuals representing abstract encodings of feasible solutions. Each individual is assigned a fitness that measures how good a solution it represents. The evolution starts from a population of completely random individuals and iterates in generations. Individuals are stochastically selected from the current population (based on their fitness) and modified by means of the genetic operators crossover and mutation to form a new generation.

1. START: Create population P(0) = {I_1, ..., I_N}.
2. FITNESS EVALUATION: Evaluate the fitness of each individual.
3. TEST: If the stop criterion is satisfied, return the solution.
4. NEW GENERATION: Create an empty population P(i + 1) and repeat the following procedure until P(i + 1) has N individuals:
   i) Selection: select two individuals from P(i): I_1 ← selection(P_i), I_2 ← selection(P_i).
   ii) Crossover: with probability p_c: (I_1, I_2) ← crossover(I_1, I_2).
   iii) Mutation: with probability p_m: I_k ← mutate(I_k), k = 1, 2.
   iv) Insert: insert I_1, I_2 into P_{i+1}.
5. LOOP: Go to step 2.

Fig. 3. Scheme of an evolutionary algorithm.

In the case of RBF network learning, each individual encodes one RBF network. The individual consists of h blocks

    I_RBF = {B_1, ..., B_h},    (8)

where h is the number of hidden units. Each block contains the parameter values of one RBF unit:

    B_k = {c_{k1}, ..., c_{kn}, b_k, w_{k1}, ..., w_{km}},    (9)

where n is the number of inputs, m is the number of outputs, c_k = {c_{k1}, ..., c_{kn}} is the k-th unit's centre, b_k its width, and w_k = {w_{k1}, ..., w_{km}} the weights connecting the k-th hidden unit to the output layer. The parameter values are encoded using direct floating-point encoding. We use standard tournament selection, 1-point crossover and additive mutation. Additive mutation changes the values in the individual by adding a small value randomly drawn from ⟨−ε, ε⟩.

The fitness function should reflect how good the robot is at the given task and is therefore always problem dependent. A detailed description of the fitness function is included in the experiment section.
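As an illustration of this encoding, the sketch below decodes an individual of the form (8)–(9) and evaluates the corresponding RBF network of Eq. (7). The dimensions are passed in as parameters and the function names are ours.

```python
import math

def rbf_network_output(individual, x, n_inputs, n_outputs):
    """Evaluate the RBF network of Eq. (7).

    `individual` is a flat list of h blocks, each block holding
    [c_k1, ..., c_kn, b_k, w_k1, ..., w_km] as in Eq. (9)."""
    block_len = n_inputs + 1 + n_outputs
    assert len(individual) % block_len == 0, "malformed individual"
    outputs = [0.0] * n_outputs
    for start in range(0, len(individual), block_len):
        block = individual[start:start + block_len]
        centre = block[:n_inputs]
        width = block[n_inputs]
        weights = block[n_inputs + 1:]
        # Gaussian unit of Eq. (6): phi(||x - c|| / b) with phi(s) = exp(-s^2)
        distance = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, centre)))
        activation = math.exp(-(distance / width) ** 2)
        # Linear output layer: weighted sum of the unit activations
        for s in range(n_outputs):
            outputs[s] += weights[s] * activation
    return outputs
```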
6 Experiments

In order to compare the performance and properties of the described algorithms, we conducted a simulated experiment. A miniature robot of the e-puck type [1] was trained to explore the environment and avoid walls. The e-puck is a mobile robot with a diameter of 70 mm and a weight of 50 g. It is supported by two lateral wheels that can rotate in both directions and by two rigid pivots at the front and the back. The sensory system employs eight "active infrared light" sensors distributed around the body, six on one side and two on the other side. In "passive mode" they measure the amount of infrared light in the environment, which is roughly proportional to the amount of visible light. In "active mode" the sensors emit a ray of infrared light and measure the amount of reflected light: the closer they are to a surface, the higher the amount of infrared light measured. The e-puck sensors can detect a white paper at a maximum distance of approximately 8 cm. Sensors return values from the interval [0, 4095]. Effectors accept values from the interval [−1000, 1000]; the higher the value, the faster the motor moves.

Fig. 4. Miniature e-puck robot.
Fig. 5. Agent was trained in a simulated environment of size 100 x 60 cm.
Fig. 6. Simulated testing environment of size 110 x 100 cm.

Sensor value   Meaning
0-50           NOWHERE
51-300         FEEL
301-500        VERYFAR
501-1000       FAR
1001-2000      NEAR
2001-3000      VERYNEAR
3001-4095      CRASHED

Table 1. Sensor values and their meaning.

Without any preprocessing of the sensor and effector values, the state space would be too big. Therefore, instead of raw sensor values, the learning algorithms worked with "perceptions": instead of thousands of raw sensor values, we used only the few symbolic perceptions listed in Table 1. Effector values were processed in a similar way: instead of 2000 values, the learning algorithm chose from the values [−500, −100, 200, 300, 500]. To reduce the state space even more, we grouped pairs of sensors together, and the back sensors were not used at all.

The agent was trained in a simulated environment of size 100 x 60 cm and tested in a more complex environment of size 110 x 100 cm. We used the Webots [3] simulation software. The simulation process consisted of a predefined number of steps; in each simulation step the agent processed the sensor values and set the speed of the left and right motor. One simulation step took 32 ms.
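A minimal sketch of the sensor and effector preprocessing described above, using the thresholds of Table 1 and the five admissible motor speeds. The way the eight sensors are grouped into pairs is our assumption, since the text states only that pairs were grouped and the back sensors were ignored.

```python
# Upper bound (inclusive) of each perception range from Table 1.
PERCEPTIONS = [
    (50, "NOWHERE"), (300, "FEEL"), (500, "VERYFAR"), (1000, "FAR"),
    (2000, "NEAR"), (3000, "VERYNEAR"), (4095, "CRASHED"),
]

# Discrete motor commands the learning algorithms choose from.
MOTOR_VALUES = [-500, -100, 200, 300, 500]

def perceive(raw_value):
    """Map a raw sensor reading in [0, 4095] to its symbolic perception."""
    for upper, name in PERCEPTIONS:
        if raw_value <= upper:
            return name
    raise ValueError("sensor value out of range")

def discretize_state(sensor_values, pairs=((0, 1), (2, 3), (4, 5))):
    """Reduce the eight raw readings to a small symbolic state.

    `pairs` indexes the front sensor pairs that are kept (assumed grouping);
    the remaining (back) sensors are ignored, as described in the text."""
    return tuple(perceive(max(sensor_values[i], sensor_values[j])) for i, j in pairs)
```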
6.1 Evolutionary RBF Networks

The evolutionary RBF networks were applied to the maze exploration task. The network input and output values are preprocessed in the same way as for the reinforcement learning.

To stimulate maze exploration, the agent is rewarded when it passes through a zone. The zone is a randomly located area which cannot be sensed by the agent. Accordingly, Δ_j is 1 if the agent passed through the zone in the j-th trial and 0 otherwise. The fitness value is then computed as

    Fitness = Σ_{j=1}^{4} (S_j + Δ_j),    (10)

where the quantity S_j is computed by summing the normalized trial gains T_{k,j} over the simulation steps k of trial j:

    S_j = (1/800) Σ_{k=1}^{800} T_{k,j}.    (11)

The three-component term T_{k,j} motivates the agent to move and to avoid obstacles:

    T_{k,j} = V_{k,j} (1 − √(ΔV_{k,j})) (1 − i_{k,j}).    (12)

The first component V_{k,j} is computed by summing the absolute values of the motor speeds in the k-th simulation step of the j-th trial, generating a value between 0 and 1. The second component (1 − √(ΔV_{k,j})) encourages the two wheels to rotate in the same direction. The last component (1 − i_{k,j}) supports the agent's ability to avoid obstacles: the value i_{k,j} of the most active sensor in the k-th simulation step of the j-th trial provides a conservative measure of how close the robot is to an object (the closer it is, the higher the value, in the range from 0 to 1). Thus T_{k,j} lies in the range from 0 to 1, too.
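The fitness of Eqs. (10)–(12) can be computed along the following lines; the normalization of V_{k,j} and ΔV_{k,j} by the maximal motor speed is our assumption, as the text only states that each factor lies between 0 and 1.

```python
import math

MAX_SPEED = 500.0        # largest admissible motor command (normalization assumption)
STEPS_PER_TRIAL = 800    # simulation steps per trial

def step_gain(left, right, nearest_obstacle):
    """T_{k,j} of Eq. (12).

    `left` and `right` are the chosen motor speeds; `nearest_obstacle` is the
    value i_{k,j} of the most active sensor, already scaled to [0, 1]."""
    v = (abs(left) + abs(right)) / (2 * MAX_SPEED)   # V_{k,j}: rewards movement
    dv = abs(left - right) / (2 * MAX_SPEED)         # Delta V_{k,j}: penalizes spinning
    return v * (1 - math.sqrt(dv)) * (1 - nearest_obstacle)

def trial_score(step_gains, passed_zone):
    """S_j + Delta_j for one trial, following Eqs. (10)-(11)."""
    s_j = sum(step_gains) / STEPS_PER_TRIAL
    return s_j + (1 if passed_zone else 0)

def fitness(trials):
    """Eq. (10): sum over the four trials; each trial is a pair
    (list of per-step gains, flag whether the hidden zone was reached)."""
    return sum(trial_score(gains, zone) for gains, zone in trials)
```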
The experiment was repeated 10 times, each run lasting 200 generations (each generation corresponding to 800 simulation steps). In all cases a successful behavior was found, i.e. the evolved robot was able to explore the whole maze without crashing into the walls. See Fig. 7 for the mean, minimal and maximal fitness over the 10 runs.

Fig. 7. The mean, minimal and maximal fitness function over 10 runs of evolution. Fitness is scaled so that a successful walk through the whole maze corresponds to a fitness of 600 or higher.

Table 2 and Figure 8 show the parameters of an evolved network with five RBF units. For the sake of clarity, the listed parameters are also discretized. We can understand them as rules providing a mapping from the input sensor space to motor control. However, these "rules" act in accord, since the whole network computes a linear sum of the five corresponding Gaussians.

Sensor (left, front, right)       Width   Motor (left, right)
VERYNEAR  NEAR     VERYFAR        1.56     500  -100
FEEL      NOWHERE  NOWHERE        1.93    -500   500
NEAR      NEAR     NOWHERE        0.75     500  -500
FEEL      NOWHERE  NEAR           0.29     500  -500
VERYFAR   NOWHERE  NOWHERE        0.16     500   500

Table 2. Rules represented by the RBF units (the listed values are the original RBF network parameters after discretization).

Fig. 8. The evolved RBF network (see also Tab. 2). Local unit responses plotted in the 2D input space corresponding to the left and right sensory inputs.

6.2 Reinforcement learning

The same experiment has been performed by means of the relational reinforcement learning algorithm described above, in the same simulated environment and under identical conditions. The performance of the reinforcement learning agent is shown in Figure 9. The graph shows the average number of steps in each learning episode. It can be seen that after 10 000 episodes the agent has learned the successful behavior. This number roughly corresponds to the time complexity of the GA, where 200 generations of 50 individuals also result in 10 000 simulations. The fitness of the solution found by RL is slightly better than that of the GA-found solution; on the other hand, the inner representation of the neural network is much more compact.

Fig. 9. Learning curve for the reinforcement learning agent, averaged over 10 runs.

7 Discussion

This article presented a survey of popular approaches used in mobile robotics for robot behavior synthesis. In our future work, we would like to design a hybrid intelligent system combining the advantages of these approaches. This way the agent would benefit from three widely studied fields: Inductive Logic Programming, neural networks and reinforcement learning. Reinforcement learning has a strong mathematical background; on the other hand, in real experiments some of its assumptions are not realistic. Neural networks are very popular in robotics because they provide a straightforward mapping from input signals to output signals, several levels of adaptation, and robustness to noise. Inductive logic programming allows the agent to reason about states, thus concentrating attention on the most promising parts of the state space.

The experiments showed that preprocessing plays a rather important role in the case of robotic agent control. In our approach we have chosen a rather strong preprocessing of inputs and outputs, which is suitable mainly for the RL algorithms. In our future work we would like to study control with less preprocessed inputs and outputs, which can be used mainly for the neural network controller. Another immediate task is to extract the most frequently used state transitions from the RL algorithm and interpret them as rules, in a similar fashion as we did with the RBF network.
Acknowledgements

This work has been supported by the Ministry of Education of the Czech Republic under the project Center of Applied Cybernetics No. 1M684077004 (1M0567). S. Slušný has been partially supported by the Czech Science Foundation under contract no. 201/05/H014G.

References

1. E-puck, online documentation. http://www.e-puck.org.
2. Khepera II documentation. http://k-team.com.
3. Webots simulator. http://www.cyberbotics.com/.
4. A. Barto, S. Bradtke, and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, pages 81–138.
5. A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. 13:341–379.
6. R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.
7. D. Bertsekas and J. Tsitsiklis. Neuro-dynamic Programming. Athena Scientific, 1996.
8. H. Blockeel and L. De Raedt. Top-down induction of first order logical decision trees. Artificial Intelligence, 101:285–297, 1998.
9. H. Blockeel, L. Dehaspe, B. Demoen, G. Janssens, J. Ramon, and H. Vandecasteele. Improving the efficiency of inductive logic programming through the use of query packs. Journal of Artificial Intelligence Research, 16:135–166.
10. D. S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321–355, 1988.
11. S. Dzeroski, L. De Raedt, and K. Driessens. Relational reinforcement learning. Machine Learning, 43:7–52, 2001.
12. D. B. Fogel. Evolutionary Computation: The Fossil Record. MIT-IEEE Press, 1998.
13. J. Holland. Adaptation in Natural and Artificial Systems. MIT Press, reprinted edition, 1992.
14. A. Martinoli. Swarm Intelligence in Autonomous Collective Robotics: From Tools to the Analysis and Synthesis of Distributed Control Strategies. Lausanne: Computer Science Department, EPFL, 1999.
15. J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:289–303, 1989.
16. S. Nolfi and D. Floreano. Evolutionary Robotics — The Biology, Intelligence and Technology of Self-Organizing Machines. The MIT Press, 2000.
17. T. Poggio and F. Girosi. A theory of networks for approximation and learning. Technical report, Cambridge, MA, USA, 1989. A. I. Memo No. 1140, C.B.I.P. Paper No. 31.
18. S. Slušný and R. Neruda. Evolving homing behaviour for team of robots. Computational Intelligence, Robotics and Autonomous Systems. Palmerston North: Massey University, 2007.
19. S. Slušný, R. Neruda, and P. Vidnerová. Evolution of simple behavior patterns for autonomous robotic agent. WSEAS Press, pages 411–417, 2007.
20. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
21. C. J. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.