Proc. of the 16th Workshop "From Object to Agents" (WOA15), June 17-19, Naples, Italy

A Case Study on Goal Oriented Obstacle Avoidance

Pasquale Caianiello and Domenico Presutti
DISIM, Università dell'Aquila, Italy
Email: pasquale.caianiello@univaq.it, domenico.presutti@gmail.com

Abstract—We report on several test experiments with a mobile agent equipped with an artificial neural net control to achieve a basic route direction goal reflex in a 2-dimensional environment with obstacles. A real assembled 4tronix Initio robot kit agent is reproduced with its sensor and motor characteristics in a virtual environment for experimenting and comparing the behavior of its artificial neural net control with two different learning approaches: a standard supervised error back propagation training with examples, and an unsupervised reinforcement learning with environmental feedback.

I. INTRODUCTION

Obstacle avoidance is a basic task for mobile agents. Commercial and research applications address the problem using state of the art adaptive techniques and mathematical modeling. In this work we confronted the task of equipping a real 4tronix Initio [22] robot kit agent with a basic obstacle avoidance control based on a simple neural net architecture. The neural net takes its inputs from a pre-processing of the sensor information of the robot and provides an output that controls its motor actuators. The preliminary aim of our project, as reported in this paper in its start-up phase, is to construct the virtual counterpart of the 4tronix Initio robot agent, which lets us perform fast and reliable test experiments of possible control strategies at the reflex proactive level with real time response.

As a result we identified a simple artificial neural net architecture and a pair of training strategies that let the agent show simple adaptive/reactive capabilities in avoiding obstacles while achieving a given target position goal. The agent's control neural net model is deliberately kept at the reflex level, the basic input/output transduction of a perceptron, with no high level ontologies or primitives for describing the environment, no environment model or map acquisition capabilities, and no planning ability of any sort. The environment exists only through the pattern that it induces on the agent sensors at a given position and orientation. The agent has to react by issuing control settings to its motor components, taking into account its given goal.

II. TOOLS AND METHODS

A. The agent

The agent hardware is a 4tronix Initio [22] robot kit controlled by a Raspberry Pi B+ [25] computer that emulates the artificial neural net, performs low level sensor acquisitions, and issues motor control primitives to a PiRoCon v.2 motor controller. The agent is equipped with a pan-tilt HC-SR04 [23] ultrasonic sensor that acquires a panoramic view of the environment, two front mounted close range infrared proximity sensors, and a GY-80 multi-chip module that integrates a 3-axis gyroscope, a 3-axis accelerometer, and a 3-axis digital compass used for determining the agent's absolute orientation.

B. The virtual agent and environment

The virtual environment is constructed as a bit matrix of dimension m × n. Each bit represents a virtual position for the agent. A bit is set to 1 if the position is blocked by an obstacle and cannot be occupied by the agent. The goal area of the environment is a circle identified by a center position and a boundary radius. The virtual agent emulates the real hardware in performing its basic control commands to the hardware controllers. The basic functioning cycle has three steps: sample the sensors, compute orientation and velocity, run for a time unit.
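To make the simulation setup concrete, the following is a minimal sketch of how such a bit-matrix environment and the three-step base cycle could be coded in Java; the class and method names (VirtualEnvironment, VirtualAgent, baseCycle) and the random obstacle placement are our own illustration, not the authors' implementation.

```java
import java.util.Random;

/** Illustrative virtual environment: an m x n occupancy bit matrix with a circular goal area. */
public class VirtualEnvironment {
    private final boolean[][] blocked;             // true: position holds an obstacle, cannot be occupied
    private final double goalX, goalY, goalRadius;

    public VirtualEnvironment(int m, int n, double obstacleDensity,
                              double goalX, double goalY, double goalRadius, long seed) {
        this.blocked = new boolean[m][n];
        Random rnd = new Random(seed);
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                this.blocked[i][j] = rnd.nextDouble() < obstacleDensity;  // random obstacle placement
        this.goalX = goalX;
        this.goalY = goalY;
        this.goalRadius = goalRadius;
    }

    /** A position outside the matrix, or whose bit is set, is blocked. */
    public boolean isBlocked(int x, int y) {
        return x < 0 || y < 0 || x >= blocked.length || y >= blocked[0].length || blocked[x][y];
    }

    /** The goal area is the circle defined by its center and boundary radius. */
    public boolean inGoalArea(double x, double y) {
        double dx = x - goalX, dy = y - goalY;
        return dx * dx + dy * dy <= goalRadius * goalRadius;
    }
}

/** Base functioning cycle of the (virtual or real) agent: sample, compute, run. */
abstract class VirtualAgent {
    abstract double[] sampleSensors();              // step 1: build the sensor/goal input pattern
    abstract double[] computeCommand(double[] s);   // step 2: orientation and run distance (e.g. from the ANN)
    abstract void run(double[] command);            // step 3: execute the command for one time unit

    final void baseCycle() {
        run(computeCommand(sampleSensors()));
    }
}
```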
C. The artificial neural net control module

The artificial neural net (ANN) is emulated through the Neuroph [24] Java framework, used to implement a 3-layer fully connected feed-forward net with 31 input units, 62 hidden nodes, and 10 output units, as depicted in fig. 1. All artificial neurons in the net are sigmoid. The 31 input units collect the infrared proximity information, the distances measured by the ultrasonic sensor at 9 predefined pan positions (normalized to the unit interval), the distance from the agent to the center of the goal area, and the computed orientation angle with respect to the given direction goal, represented as 18 binary direction intervals of 20° covering the whole 360° range. The output units consist of 9 binary orientation directions (covering the front 180°) and one real value in the unit interval representing the distance to be covered. The control protocol interprets the ANN output by setting the agent to the given orientation and letting it run for the given distance.

Fig. 1. The perceptron architecture for proactive reflex input/output transduction.

The experiments are described with the ANN trained according to two different protocols: supervised error Back Propagation (BP) and environment Reinforcement Learning (RL).
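The net described above maps directly onto the Neuroph API. The following is a minimal sketch assuming a Neuroph 2.x MultiLayerPerceptron; the output-decoding helper and the 20° sector spacing are our own assumptions, not the paper's code.

```java
import org.neuroph.nnet.MultiLayerPerceptron;
import org.neuroph.util.TransferFunctionType;

public class ReflexControl {
    // 3-layer fully connected feed-forward net: 31 inputs, 62 hidden nodes, 10 outputs, all sigmoid.
    private final MultiLayerPerceptron net =
            new MultiLayerPerceptron(TransferFunctionType.SIGMOID, 31, 62, 10);

    /** Feeds a 31-dimensional sensor/goal pattern and decodes the net output into a motor command. */
    public double[] control(double[] inputPattern) {
        net.setInput(inputPattern);
        net.calculate();
        double[] out = net.getOutput();

        // Units 0..8: binary orientation directions covering the front 180 degrees.
        int best = 0;
        for (int i = 1; i < 9; i++)
            if (out[i] > out[best]) best = i;
        // Assumption: 9 sectors of 20 degrees each, with centers from -80 to +80 degrees.
        double orientationDeg = -90.0 + (best + 0.5) * 20.0;

        // Unit 9: distance to be covered, a real value in the unit interval.
        double distance = out[9];
        return new double[] { orientationDeg, distance };
    }
}
```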
1) The Supervised error back propagation protocol: The BP protocol was experimented with training sets (TS) constructed in two different ways: the first (BPp) by sampling the control choices of a human pilot, the second (BPh) by recording the behavior choices of a heuristic evaluation function over a wide enumeration of input sensor patterns. The use of BPh allowed the easy construction of much larger virtual TS, while TS construction with BPp required synchronized reading of the agent sensors and sampling of the human pilot, who on the other hand was allowed to employ higher level cognitive faculties and to look at the environment map while making his decisions.

To preserve generalization capabilities and avoid overfitting of the ANN, the training process was stopped at predetermined network mean square error limit values. For this purpose, each training set was split into a training subset and a test subset, containing respectively 90% and 10% of the original TS samples. Optimal limit network error values were determined by training the network on the training subset and testing the network response on the test subset: when the error on the test subset became stationary or increasing, training was paused and finally completed by using the full TS and the minimum network error limit.

After the training process, the trained ANNs were tested on the agent control system and some critical aspects emerged, such as the dimensional insufficiency of the TS obtained with BPp when compared to the dimension of the input state configurations. It did, in fact, bring about insufficient polarization of the network output response on new input patterns and a random-like behavior of the agent in specific configurations. On the other hand, the ANNs trained with the TS constructed with BPh were often trapped in stationary or cyclic behavior at sub-optimal positions with respect to the navigation goal. In consideration of the complementary critical aspects described for the BPp and BPh cases, a third instance of the ANN was trained with an incremental training process that combined both the pilot driving and the heuristic evaluation TS. In this case the learning process consisted of two training phases. In the first phase, the BPh TS was used to train the ANN for a small number of training epochs, in order to give the network base response capabilities covering a wide range of input configurations. In the second phase, the BPp TS was used until training completion. When the incrementally trained network was finally tested on the robot, the critical behaviors were relieved. All the test results reported in the following are obtained by running the net configuration obtained at the end of the training process as described, on a sample environment problem.

2) The Reinforcement Learning protocol: The same net architecture has been used with an unsupervised reinforcement training protocol, Q-learning [15], with a reward/reinforce function taking into account the distance from the goal, the run length, and the route declination from the goal direction. The reward was corrected by the Q function [15] to take into account the future effects of actions and to loosen excessive local/opportunistic behavior. The net was trained on a collection of problem samples with random selection of obstacle configuration, starting position and goal area, to obtain the trained net that was used in the comparative experiments, where its behavior is sampled at different levels of maturation.
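For concreteness, the following sketch shows two of the Q-learning ingredients mentioned above: the Boltzmann (softmax) stochastic action selector with temperature T and the discounted update target, using the parameter values reported in Section III. In the paper the ANN itself approximates the Q function; here the rule is shown abstractly over generic Q-value arrays, and the reward weights are illustrative assumptions, not the authors' reward function.

```java
import java.util.Random;

public class QLearningSketch {
    static final double LEARNING_RATE = 0.036;  // value reported in Section III; scales the net update toward qTarget
    static final double DISCOUNT = 0.24;        // future actions discount factor g
    static double temperature = 4.0;            // T = 4 while learning, lowered to 0.4 at test time
    static final Random rnd = new Random();

    /** Boltzmann (softmax) stochastic action selector over the Q-values of the candidate actions. */
    static int selectAction(double[] qValues) {
        double[] weight = new double[qValues.length];
        double sum = 0.0;
        for (int a = 0; a < qValues.length; a++) {
            weight[a] = Math.exp(qValues[a] / temperature);  // a high T flattens the distribution
            sum += weight[a];
        }
        double r = rnd.nextDouble() * sum, acc = 0.0;
        for (int a = 0; a < qValues.length; a++) {
            acc += weight[a];
            if (r <= acc) return a;
        }
        return qValues.length - 1;
    }

    /** Discounted Q-learning target r + g * max_a' Q(s', a'); the net is trained toward this value. */
    static double qTarget(double reward, double[] nextQValues) {
        double best = nextQValues[0];
        for (double q : nextQValues) best = Math.max(best, q);
        return reward + DISCOUNT * best;
    }

    /** Illustrative reward over distance from goal, run length, and declination from the goal direction. */
    static double reward(double distToGoal, double runLength, double declinationRad) {
        return -distToGoal - 0.1 * runLength - 0.1 * Math.abs(declinationRad);  // weights are assumptions
    }
}
```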
III. EXPERIMENTAL RESULTS

The experimentation was performed after training the ANN both with the BP protocol and with the Q-learning protocol, leading to two net configurations named SUPERVISED (SU) and REINFORCED (RE) respectively.

The RE network was trained for 4000 learning sessions of 100 base cycles. For each session a random start and target is assigned in a randomly generated environment. The main learning parameters are set by an empirical optimization process to the following values: learning rate = 0.036, future actions discount factor g = 0.24, and stochastic action selector temperature T = 4. A high temperature T is necessary during the learning process to maximize reinforcements. The T value is subsequently lowered to 0.4 in the tests to appreciate the neural network response and the control system behavior.

SU, on the other hand, is trained for one learning epoch on the BPh TS, containing 1,152,000 training samples, resulting in a 0.14 mean square error after training. The following 2076 training epochs are performed with the BPp TS, with 400 training samples. The mean square error at the end of the training process is 0.07. Both the BPh and BPp training sets are generated over several base generation sessions of 50 iterations each. Network weights are initialized with random values in the [-0.02, 0.02] interval.

After training, the nets are tested on a battery of several tests on the same environment, organized in groups of increasing environment complexity. Their behavior, while trying to achieve the target area goal, is sampled as reported in fig. 2 and fig. 3. The whole experiment ranges over 8 batteries of 10 random problems in the same environment, for 10 different random choices of start/target positions. In the figures we show the outcome of just two test batteries, where each row collects an order preserving under-sampling of five out of the ten snapshots in the battery, each representing the trajectory of the agent behavior in the same environment. Snapshots of RE and SU behavior in the test problems are in the bottom two rows. The top two rows record the agent behavior when controlled by an UNTRAINED (UN), randomly selected net configuration, and when controlled by a Q-learning ADAPTIVE (AD) net. AD is always in the learning phase: it is randomly initialized at the first test in the battery, and retains its net configuration through subsequent tests in the battery. As AD gets trained while testing, it is expected to converge to that of RE. Each snapshot records the trajectory of the corresponding row agent. Trajectory position points have a time scale color, starting from yellow and going to darker red as time passes. Test problems in a row are presented in the same order as they were performed. The order is irrelevant for UN, RE, and SU, but AD's behavior changes (and improves) while solving a problem. For subsequent tests in the battery, AD's behavior change gives an idea of how a Q-learning net evolves from an UN to a finally trained RE.

Fig. 2. Simulation results, environment complexity 3. Each snapshot records the trajectory of the corresponding row agent. Trajectory position points have a time scale color, starting from yellow and going to darker red as time passes. The goal area is green. See text.

Fig. 3. Simulation results, environment complexity 7. Each snapshot records the trajectory of the corresponding row agent. Trajectory position points have a time scale color, starting from yellow and going to darker red as time passes. The goal area is green. See text.

Figure 5 reports statistics over the tests, performed with a single test time of 1600 iterations. The first index indicates ineffectiveness on task achievement, obtained by measuring the time taken by the agent to reach the goal area. High values of ineffectiveness are generally associated with wandering behavior or stationary dead ends encountered during navigation. The second index computes effectiveness and persistence, by measuring the percentage of time spent inside the goal area after the area is reached. Low values of this index are generally associated with excessive random behavior or fortuitous goal area achievements. The tests are performed at increasing environment complexity levels, with a correspondingly growing time needed to succeed. The UN network shows negative performance in all test conditions, while the SU network shows the best performance especially in low complexity environments, with a low number of obstacles, moving straight to the goal area in a few cycles, basically focusing on the target. RE shows the best performance particularly in high complexity environments, proving better exploration capabilities and the ability to overcome stationary configurations.

Fig. 5. Comparative statistics of reflex effectiveness in reaching the goal.

Figure 4 reports performance statistics indexes over increasing learning time of the AD neural network, from 0 to 400,000 learning cycles in steps of 50,000. The progressive trend is evident and demonstrates the adaptive capabilities of the reinforcement learning protocol. Positive performance indexes show a clear increasing trend, while negative performance indexes show a decreasing trend. The performance charts show an acceleration between 100,000 and 200,000 learning cycles, with inflection points in this interval, and a final stabilization after 250,000 learning iterations.

Fig. 4. Performance of RE while training.
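As an illustration of how the two effectiveness indexes summarized in Fig. 5 could be computed from a logged test run, consider the following sketch; the log format and method names are our assumptions, not the authors' measurement code.

```java
public class EffectivenessIndexes {
    /**
     * inGoal[t] is true when the agent is inside the goal area at iteration t of a test run
     * (e.g. 1600 iterations). Returns { ineffectiveness, persistence }.
     */
    public static double[] compute(boolean[] inGoal) {
        int firstReach = -1;
        for (int t = 0; t < inGoal.length; t++) {
            if (inGoal[t]) { firstReach = t; break; }
        }

        // First index: ineffectiveness, the fraction of the test time taken to first reach the goal
        // area (1.0 if the goal area is never reached).
        double ineffectiveness = (firstReach < 0) ? 1.0 : (double) firstReach / inGoal.length;

        // Second index: persistence, the percentage of time spent inside the goal area after it is
        // first reached (0 if it is never reached).
        double persistence = 0.0;
        if (firstReach >= 0) {
            int inside = 0;
            for (int t = firstReach; t < inGoal.length; t++)
                if (inGoal[t]) inside++;
            persistence = 100.0 * inside / (inGoal.length - firstReach);
        }
        return new double[] { ineffectiveness, persistence };
    }
}
```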
IV. CONCLUSION

We presented simulations of the behavior of a mobile agent equipped with a neural net reflex-like control in avoiding obstacles and achieving a given target position goal. At this stage of the project we use no high level ontologies or primitives for describing the environment, no environment model or map acquisition capabilities, and no planning abilities. We implemented the artificial neural net control with two different learning approaches: a standard supervised error back propagation training with examples, and an unsupervised reinforcement learning with environmental feedback. We constructed both a real robot agent and a virtual agent-environment simulation system, in order to perform fast and reliable test experiments. The virtual environment let us perform advanced integrated training and test sessions with progressive complexity levels and random configurations, leading to a high grade of generalization for the neural net control. We collected statistical data on several test experiments and compared the performance of the two learning approaches. The analysis of the critical aspects and capabilities of the control system, as observed in the simulations, guided fixing and improving the data presentation in the training protocol.

REFERENCES

[1] Anvar A.M., Anvar A.P. (2011). AUV Robots Real-time Control Navigation System Using Multi-layer Neural Networks Management, 19th International Congress on Modelling and Simulation, Perth, Australia.
[2] Awad H.A., Al-Zorkany M.A. (2007). Mobile Robot Navigation Using Local Model Networks, World Academy of Science, Engineering and Technology.
[3] Bing-Qiang Huang, Guang-Yi Cao, Min Guo (2005). Reinforcement Learning Neural Network to the Problem of Autonomous Mobile Robot Obstacle Avoidance, Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou.
[4] Chen C., Li H.X., Dong D. (2008). Hybrid Control for Robot Navigation - A Hierarchical Q-Learning Algorithm, Robotics and Automation Magazine, IEEE, 15(2), 37-47.
[5] Floreano D., Mondada F. (1994). Automatic creation of an autonomous agent: Genetic evolution of a neural network driven robot, Proceedings of the third international conference on Simulation of adaptive behavior: From Animals to Animats 3 (No. LIS-CONF-1994-003, pp. 421-430), MIT Press.
[6] Floreano D., Mondada F. (1996). Evolution of homing navigation in a real mobile robot, Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 26(3), 396-407.
[7] Janglova D. (2004). Neural Networks in Mobile Robot Motion, International Journal of Advanced Robotic Systems, Institute of Informatics SAS, vol. 1, no. 1, pp. 15-22.
[8] Glasius R., Komoda A., Gielen S.C. (1995). Neural network dynamics for path planning and obstacle avoidance, Neural Networks, 8(1), 125-133.
[9] Medina-Santiago A., et al. (2014). Neural Control System in Obstacle Avoidance in Mobile Robots Using Ultrasonic Sensors, Instituto Tecnológico de Tuxtla Gutiérrez, Chiapas, México, pp. 104-110.
[10] Michels J., Saxena A., Ng A.Y. (2005). High speed obstacle avoidance using monocular vision and reinforcement learning, Proceedings of the 22nd international conference on Machine learning (pp. 593-600), ACM.
[11] Millán J. (1995). Reinforcement Learning of Goal-Directed Obstacle-Avoiding Reaction Strategies in an Autonomous Mobile Robot, Robotics and Autonomous Systems, Volume 15, Issue 4, pp. 275-299.
[12] Na Y.K., Oh S.Y. (2003). Hybrid control for autonomous mobile robot navigation using neural network based behavior modules and environment classification, Autonomous Robots, 15(2), 193-206.
[13] Pomerleau D.A. (1991). Efficient Training of Artificial Neural Networks for Autonomous Navigation, Neural Computation, 3(1), pp. 88-97.
[14] Rogers T.T., McClelland J.L. (2014). Parallel Distributed Processing at 25: Further Explorations in the Microstructure of Cognition, Cognitive Science, 38, 1024-1077.
[15] Rummery G.A., Niranjan M. (1994). On-Line Q-Learning Using Connectionist Systems, Cambridge University.
[16] Tsankova D.D. (2010). Neural Networks Based Navigation and Control of a Mobile Robot in a Partially Known Environment, in Mobile Robots Navigation, Alejandra Barrera (Ed.), ISBN: 978-953-307-076-6, InTech.
[17] Ulrich I., Borenstein J. (2000). VFH*: Local Obstacle Avoidance with Look-Ahead Verification, International Conference on Robotics and Automation, San Francisco, CA, 2000, pp. 2505-2511.
[18] Yang G.S., Chen E.K., An C.W. (2004). Mobile robot navigation using neural Q-learning, Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on (Vol. 1, pp. 48-52), IEEE.
[19] Yang S.X., Luo C. (2004). A neural network approach to complete coverage path planning, Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 34(1), 718-724.
[20] Yang S.X., Meng M. (2000). An efficient neural network approach to dynamic robot motion planning, Neural Networks, 13(2), 143-148.
[21] Floreano D., Mattiussi C. (2002). Manuale sulle reti neurali, Il Mulino, Bologna.
[22] 4tronix website, http://4tronix.co.uk/
[23] HC-SR04 Ultrasonic Ranging Module, Iteadstudio, http://wiki.iteadstudio.com/Ultrasonic_Ranging_Module_HC-SR04
[24] Neuroph Framework, Neuroph website, http://neuroph.sourceforge.net/
[25] Raspberry Pi website, http://www.raspberrypi.org