Proc. of the 16th Workshop "From Object to Agents" (WOA15), June 17-19, Naples, Italy

A Case Study on Goal Oriented Obstacle Avoidance

Pasquale Caianiello and Domenico Presutti
DISIM, Università dell'Aquila, Italy
Email: pasquale.caianiello@univaq.it, domenico.presutti@gmail.com

Abstract—We report on several test experiments with a mobile agent equipped with an artificial neural net control to achieve a basic route direction goal reflex in a 2-dimensional environment with obstacles. A real assembled 4tronix Initio robot kit agent is reproduced with its sensor and motor characteristics in a virtual environment for experimenting and comparing the behavior of its artificial neural net control with two different learning approaches: a standard supervised error back propagation training with examples, and an unsupervised reinforcement learning with environmental feedback.

I. INTRODUCTION

Obstacle avoidance is a basic task for mobile agents. Commercial and research applications address the problem using state of the art adaptive techniques and mathematical modeling. In this work we confronted the task of equipping a real 4tronix Initio [22] robot kit agent with a basic obstacle avoidance control based on a simple neural net architecture. The neural net takes its inputs from a pre-processing of the sensor information of the robot and provides an output that controls its motor actuators. The preliminary aim of our project, as reported in this paper in its start-up phase, is to construct the virtual counterpart of the 4tronix Initio robot agent, which lets us perform fast and reliable test experiments of possible control strategies at the reflex proactive level with real time response.

As a result we identified a simple artificial neural net architecture and a pair of training strategies that let the agent show simple adaptive/reactive capabilities in avoiding obstacles while achieving a given target position goal. The agent's control neural net model is deliberately kept at the reflex level, the basic input/output transduction of a perceptron, with no high level ontologies or primitives for describing the environment, no environment model or map acquisition capabilities, and no planning ability of any sort. The environment exists only through the pattern that it induces on the agent sensors at a given position and orientation. The agent has to react by issuing control settings to its motor components, taking into account its given goal.

II. TOOLS AND METHODS

A. The agent

The agent hardware is a 4tronix Initio [22] robot kit controlled by a Raspberry Pi B+ [25] computer that emulates the artificial neural net, performs low level sensor acquisitions, and issues motor control primitives to a PiRoCon v.2 motor controller. The agent is equipped with a pan-tilt HC-SR04 [23] ultrasonic sensor that acquires a panoramic view of the environment, two front mounted close range infrared proximity sensors, and a GY-80 multi-chip module that integrates a 3-axis gyroscope, a 3-axis accelerometer, and a 3-axis digital compass used for determining the agent's absolute orientation.

B. The virtual agent and environment

The virtual environment is constructed as a bit matrix of dimension m × n. Each bit represents a virtual position for the agent. A bit is set to 1 if the position is blocked by an obstacle and cannot be occupied by the agent. The goal area of the environment is a circle identified by a center position and a boundary radius. The virtual agent emulates the real hardware in performing its basic control commands to the hardware controllers. The basic functioning cycle has three steps: sample the sensors, compute orientation and velocity, run for a time unit.
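To make the simulation setup concrete, the following is a minimal sketch of how such a bit-matrix environment and the three-step base cycle could be coded in Java; the class and method names (VirtualEnvironment, VirtualAgent, baseCycle) and the random obstacle placement are our own illustration, not the authors' implementation.

```java
import java.util.Random;

/** Illustrative virtual environment: an m x n occupancy bit matrix with a circular goal area. */
public class VirtualEnvironment {
    private final boolean[][] blocked;             // true: position holds an obstacle, cannot be occupied
    private final double goalX, goalY, goalRadius;

    public VirtualEnvironment(int m, int n, double obstacleDensity,
                              double goalX, double goalY, double goalRadius, long seed) {
        this.blocked = new boolean[m][n];
        Random rnd = new Random(seed);
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                this.blocked[i][j] = rnd.nextDouble() < obstacleDensity;  // random obstacle placement
        this.goalX = goalX;
        this.goalY = goalY;
        this.goalRadius = goalRadius;
    }

    /** A position outside the matrix, or whose bit is set, is blocked. */
    public boolean isBlocked(int x, int y) {
        return x < 0 || y < 0 || x >= blocked.length || y >= blocked[0].length || blocked[x][y];
    }

    /** The goal area is the circle defined by its center and boundary radius. */
    public boolean inGoalArea(double x, double y) {
        double dx = x - goalX, dy = y - goalY;
        return dx * dx + dy * dy <= goalRadius * goalRadius;
    }
}

/** Base functioning cycle of the (virtual or real) agent: sample, compute, run. */
abstract class VirtualAgent {
    abstract double[] sampleSensors();              // step 1: build the sensor/goal input pattern
    abstract double[] computeCommand(double[] s);   // step 2: orientation and run distance (e.g. from the ANN)
    abstract void run(double[] command);            // step 3: execute the command for one time unit

    final void baseCycle() {
        run(computeCommand(sampleSensors()));
    }
}
```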
C. The artificial neural net control module

The artificial neural net (ANN) is emulated through the Neuroph [24] Java framework, used to implement a 3-layer fully connected feed-forward net with 31 input units, 62 hidden nodes, and 10 output units, as depicted in fig. 1. All artificial neurons in the net are sigmoid. The 31 input units collect the infrared proximity information, the distances measured by the ultrasonic sensor at 9 predefined pan positions (normalized to the unit interval), the distance from the agent to the center of the goal area, and the computed orientation angle with respect to the given direction goal, represented as 18 binary direction intervals of 20° covering the whole 360° range. The output units consist of 9 binary orientation directions (covering the front 180°) and one real value in the unit interval representing the distance to be covered. The control protocol interprets the ANN output by setting the agent to the given orientation and letting it run for the given distance.

Fig. 1. The perceptron architecture for proactive reflex input/output transduction.

The experiments are described with the ANN trained according to two different protocols: supervised error Back Propagation (BP) and environment Reinforcement Learning (RL).
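The net described above maps directly onto the Neuroph API. The following is a minimal sketch assuming a Neuroph 2.x MultiLayerPerceptron; the output-decoding helper and the 20° sector spacing are our own assumptions, not the paper's code.

```java
import org.neuroph.nnet.MultiLayerPerceptron;
import org.neuroph.util.TransferFunctionType;

public class ReflexControl {
    // 3-layer fully connected feed-forward net: 31 inputs, 62 hidden nodes, 10 outputs, all sigmoid.
    private final MultiLayerPerceptron net =
            new MultiLayerPerceptron(TransferFunctionType.SIGMOID, 31, 62, 10);

    /** Feeds a 31-dimensional sensor/goal pattern and decodes the net output into a motor command. */
    public double[] control(double[] inputPattern) {
        net.setInput(inputPattern);
        net.calculate();
        double[] out = net.getOutput();

        // Units 0..8: binary orientation directions covering the front 180 degrees.
        int best = 0;
        for (int i = 1; i < 9; i++)
            if (out[i] > out[best]) best = i;
        // Assumption: 9 sectors of 20 degrees each, with centers from -80 to +80 degrees.
        double orientationDeg = -90.0 + (best + 0.5) * 20.0;

        // Unit 9: distance to be covered, a real value in the unit interval.
        double distance = out[9];
        return new double[] { orientationDeg, distance };
    }
}
```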
1) The Supervised error back propagation protocol: The BP protocol was experimented with training sets (TS) constructed in two different ways: the first (BPp) by sampling the control choices of a human pilot, the second (BPh) by recording the behavior choices of a heuristic evaluation function over a wide enumeration of input sensor patterns. The use of BPh allowed the easy construction of much larger virtual TS, while TS construction with BPp required synchronized reading of the agent sensors and sampling of the human pilot, who on the other hand was allowed to employ higher level cognitive faculties and to look at the environment map while making his decisions.

To preserve generalization capabilities and avoid overfitting of the ANN, the training process was stopped at predetermined network mean square error limit values. For this purpose, each training set was split into a training subset and a test subset, containing respectively 90% and 10% of the original TS samples. Optimal limit network error values were determined by training the network on the training subset and testing the network response on the test subset: when the error on the test subset became stationary or increasing, training was paused and finally completed by using the full TS and the minimum network error limit.

After the training process, the trained ANNs were tested on the agent control system and some critical aspects emerged, such as the dimensional insufficiency of the TS obtained with BPp when compared to the dimension of the input state configurations. It did, in fact, bring about insufficient polarization of the network output response on new input patterns and a random-like behavior of the agent in specific configurations. On the other hand, the ANNs trained with the TS constructed with BPh were often trapped in stationary or cyclic behavior at sub-optimal positions with respect to the navigation goal. In consideration of the complementary critical aspects described for the BPp and BPh cases, a third instance of the ANN was trained with an incremental training process that combined both the pilot driving and the heuristic evaluation TS. In this case the learning process consisted of two training phases. In the first phase, the BPh TS was used to train the ANN for a small number of training epochs, in order to give the network base response capabilities covering a wide range of input configurations. In the second phase, the BPp TS was used until training completion. When the incrementally trained network was finally tested on the robot, the critical behaviors were relieved. All the test results reported in the following are obtained by running the net configuration obtained at the end of the training process as described, on a sample environment problem.

2) The Reinforcement Learning protocol: The same net architecture has been used with an unsupervised reinforcement training protocol, Q-learning [15], with a reward/reinforce function taking into account the distance from the goal, the run length, and the route declination from the goal direction. The reward was corrected by the Q function [15] to take into account the future effects of actions and to loosen excessive local/opportunistic behavior. The net was trained on a collection of problem samples with random selection of obstacle configuration, starting position and goal area, to obtain the trained net that was used in the comparative experiments, where its behavior is sampled at different levels of maturation.
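For concreteness, the following sketch shows two of the Q-learning ingredients mentioned above: the Boltzmann (softmax) stochastic action selector with temperature T and the discounted update target, using the parameter values reported in Section III. In the paper the ANN itself approximates the Q function; here the rule is shown abstractly over generic Q-value arrays, and the reward weights are illustrative assumptions, not the authors' reward function.

```java
import java.util.Random;

public class QLearningSketch {
    static final double LEARNING_RATE = 0.036;  // value reported in Section III; scales the net update toward qTarget
    static final double DISCOUNT = 0.24;        // future actions discount factor g
    static double temperature = 4.0;            // T = 4 while learning, lowered to 0.4 at test time
    static final Random rnd = new Random();

    /** Boltzmann (softmax) stochastic action selector over the Q-values of the candidate actions. */
    static int selectAction(double[] qValues) {
        double[] weight = new double[qValues.length];
        double sum = 0.0;
        for (int a = 0; a < qValues.length; a++) {
            weight[a] = Math.exp(qValues[a] / temperature);  // a high T flattens the distribution
            sum += weight[a];
        }
        double r = rnd.nextDouble() * sum, acc = 0.0;
        for (int a = 0; a < qValues.length; a++) {
            acc += weight[a];
            if (r <= acc) return a;
        }
        return qValues.length - 1;
    }

    /** Discounted Q-learning target r + g * max_a' Q(s', a'); the net is trained toward this value. */
    static double qTarget(double reward, double[] nextQValues) {
        double best = nextQValues[0];
        for (double q : nextQValues) best = Math.max(best, q);
        return reward + DISCOUNT * best;
    }

    /** Illustrative reward over distance from goal, run length, and declination from the goal direction. */
    static double reward(double distToGoal, double runLength, double declinationRad) {
        return -distToGoal - 0.1 * runLength - 0.1 * Math.abs(declinationRad);  // weights are assumptions
    }
}
```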
III. EXPERIMENTAL RESULTS

The experimentation was performed after training the ANN both with the BP protocol and with the Q-learning protocol, leading to two net configurations named SUPERVISED (SU) and REINFORCED (RE) respectively.

The RE network was trained for 4000 learning sessions of 100 base cycles. For each session a random start and target is assigned in a randomly generated environment. The main learning parameters are set by an empirical optimization process to the following values: learning rate = 0.036, future actions discount factor g = 0.24, and stochastic action selector temperature T = 4. A high temperature T is necessary during the learning process to maximize reinforcements. The T value is subsequently lowered to 0.4 in the tests to appreciate the neural network response and the control system behavior.

SU, on the other hand, is trained for one learning epoch on the BPh TS, containing 1,152,000 training samples, resulting in a 0.14 mean square error after training. The following 2076 training epochs are performed with the BPp TS, with 400 training samples. The mean square error at the end of the training process is 0.07. Both the BPh and BPp training sets are generated over several base generation sessions of 50 iterations each. Network weights are initialized with random values in the [-0.02, 0.02] interval.

After training, the nets are tested on a battery of several tests on the same environment, organized in groups of increasing environment complexity. Their behavior, while trying to achieve the target area goal, is sampled as reported in fig. 2 and fig. 3. The whole experiment ranges over 8 batteries of 10 random problems in the same environment, for 10 different random choices of start/target positions. In the figures we show the outcome of just two test batteries, where each row collects an order preserving under-sampling of five out of the ten snapshots in the battery, each representing the trajectory of the agent behavior in the same environment. Snapshots of RE and SU behavior in the test problems are in the bottom two rows. The top two rows record the agent behavior when controlled by an UNTRAINED (UN), randomly selected net configuration, and when controlled by a Q-learning ADAPTIVE (AD) net. AD is always in the learning phase: it is randomly initialized at the first test in the battery, and retains its net configuration through subsequent tests in the battery. As AD gets trained while testing, it is expected to converge to that of RE. Each snapshot records the trajectory of the corresponding row agent. Trajectory position points have a time scale color, starting from yellow and going to darker red as time passes. Test problems in a row are presented in the same order as they were performed. The order is irrelevant for UN, RE, and SU, but AD's behavior changes (and improves) while solving a problem. For subsequent tests in the battery, AD's behavior change gives an idea of how a Q-learning net evolves from an UN to a finally trained RE.

Fig. 2. Simulation results, environment complexity 3. Each snapshot records the trajectory of the corresponding row agent. Trajectory position points have a time scale color, starting from yellow and going to darker red as time passes. The goal area is green. See text.

Fig. 3. Simulation results, environment complexity 7. Each snapshot records the trajectory of the corresponding row agent. Trajectory position points have a time scale color, starting from yellow and going to darker red as time passes. The goal area is green. See text.

Figure 5 reports statistics over the tests, performed with a single test time of 1600 iterations. The first index indicates ineffectiveness on task achievement, obtained by measuring the time taken by the agent to reach the goal area. High values of ineffectiveness are generally associated with wandering behavior or stationary dead ends encountered during navigation. The second index computes effectiveness and persistence, by measuring the percentage of time spent inside the goal area after the area is reached. Low values of this index are generally associated with excessive random behavior or fortuitous goal area achievements. The tests are performed at increasing environment complexity levels, with a correspondingly growing time needed to succeed. The UN network shows negative performance in all test conditions, while the SU network shows the best performance especially in low complexity environments, with a low number of obstacles, moving straight to the goal area in a few cycles, basically focusing on the target. RE shows the best performance particularly in high complexity environments, proving better exploration capabilities and the ability to overcome stationary configurations.

Fig. 5. Comparative statistics of reflex effectiveness in reaching the goal.

Figure 4 reports performance statistics indexes over increasing learning time of the AD neural network, from 0 to 400,000 learning cycles in steps of 50,000. The progressive trend is evident and demonstrates the adaptive capabilities of the reinforcement learning protocol. Positive performance indexes show a clear increasing trend, while negative performance indexes show a decreasing trend. The performance charts show an acceleration between 100,000 and 200,000 learning cycles, with inflection points in this interval, and a final stabilization after 250,000 learning iterations.

Fig. 4. Performance of RE while training.
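As an illustration of how the two effectiveness indexes summarized in Fig. 5 could be computed from a logged test run, consider the following sketch; the log format and method names are our assumptions, not the authors' measurement code.

```java
public class EffectivenessIndexes {
    /**
     * inGoal[t] is true when the agent is inside the goal area at iteration t of a test run
     * (e.g. 1600 iterations). Returns { ineffectiveness, persistence }.
     */
    public static double[] compute(boolean[] inGoal) {
        int firstReach = -1;
        for (int t = 0; t < inGoal.length; t++) {
            if (inGoal[t]) { firstReach = t; break; }
        }

        // First index: ineffectiveness, the fraction of the test time taken to first reach the goal
        // area (1.0 if the goal area is never reached).
        double ineffectiveness = (firstReach < 0) ? 1.0 : (double) firstReach / inGoal.length;

        // Second index: persistence, the percentage of time spent inside the goal area after it is
        // first reached (0 if it is never reached).
        double persistence = 0.0;
        if (firstReach >= 0) {
            int inside = 0;
            for (int t = firstReach; t < inGoal.length; t++)
                if (inGoal[t]) inside++;
            persistence = 100.0 * inside / (inGoal.length - firstReach);
        }
        return new double[] { ineffectiveness, persistence };
    }
}
```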
IV. CONCLUSION

We presented simulations of the behavior of a mobile agent equipped with a neural net reflex-like control in avoiding obstacles and achieving a given target position goal. At this stage of the project we use no high level ontologies or primitives for describing the environment, no environment model or map acquisition capabilities, and no planning abilities. We implemented the artificial neural net control with two different learning approaches: a standard supervised error back propagation training with examples, and an unsupervised reinforcement learning with environmental feedback. We constructed both a real robot agent and a virtual agent-environment simulation system, in order to perform fast and reliable test experiments. The virtual environment let us perform advanced integrated training and test sessions with progressive complexity levels and random configurations, leading to a high grade of generalization for the neural net control. We collected statistical data on several test experiments and compared the performance of the two learning approaches. The analysis of the critical aspects and capabilities of the control system, as observed in the simulations, guided fixing and improving the data presentation in the training protocol.

REFERENCES

[1] Anvar A.M., Anvar A.P. (2011). AUV Robots Real-time Control Navigation System Using Multi-layer Neural Networks Management, 19th International Congress on Modelling and Simulation, Perth, Australia.
[2] Awad H.A., Al-Zorkany M.A. (2007). Mobile Robot Navigation Using Local Model Networks, World Academy of Science, Engineering and Technology.
[3] Bing-Qiang Huang, Guang-Yi Cao, Min Guo (2005). Reinforcement Learning Neural Network to the Problem of Autonomous Mobile Robot Obstacle Avoidance, Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou.
[4] Chen C., Li H.X., Dong D. (2008). Hybrid Control for Robot Navigation - A Hierarchical Q-Learning Algorithm, Robotics and Automation Magazine, IEEE, 15(2), 37-47.
[5] Floreano D., Mondada F. (1994). Automatic creation of an autonomous agent: Genetic evolution of a neural network driven robot, Proceedings of the third international conference on Simulation of adaptive behavior: From Animals to Animats 3 (No. LIS-CONF-1994-003, pp. 421-430), MIT Press.
[6] Floreano D., Mondada F. (1996). Evolution of homing navigation in a real mobile robot, Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 26(3), 396-407.
[7] Janglova D. (2004). Neural Networks in Mobile Robot Motion, International Journal of Advanced Robotic Systems, Institute of Informatics SAS, vol. 1, no. 1, pp. 15-22.
[8] Glasius R., Komoda A., Gielen S.C. (1995). Neural network dynamics for path planning and obstacle avoidance, Neural Networks, 8(1), 125-133.
[9] Medina-Santiago A., et al. (2014). Neural Control System in Obstacle Avoidance in Mobile Robots Using Ultrasonic Sensors, Instituto Tecnológico de Tuxtla Gutiérrez, Chiapas, México, pp. 104-110.
[10] Michels J., Saxena A., Ng A.Y. (2005). High speed obstacle avoidance using monocular vision and reinforcement learning, Proceedings of the 22nd international conference on Machine learning (pp. 593-600), ACM.
[11] Millán J. (1995). Reinforcement Learning of Goal-Directed Obstacle-Avoiding Reaction Strategies in an Autonomous Mobile Robot, Robotics and Autonomous Systems, Volume 15, Issue 4, pp. 275-299.
[12] Na Y.K., Oh S.Y. (2003). Hybrid control for autonomous mobile robot navigation using neural network based behavior modules and environment classification, Autonomous Robots, 15(2), 193-206.
[13] Pomerleau D.A. (1991). Efficient Training of Artificial Neural Networks for Autonomous Navigation, Neural Computation, 3(1), pp. 88-97.
[14] Rogers T.T., McClelland J.L. (2014). Parallel Distributed Processing at 25: Further Explorations in the Microstructure of Cognition, Cognitive Science, 38, 1024-1077.
[15] Rummery G.A., Niranjan M. (1994). On-Line Q-Learning Using Connectionist Systems, Cambridge University.
[16] Tsankova D.D. (2010). Neural Networks Based Navigation and Control of a Mobile Robot in a Partially Known Environment, in Mobile Robots Navigation, Alejandra Barrera (Ed.), ISBN: 978-953-307-076-6, InTech.
[17] Ulrich I., Borenstein J. (2000). VFH*: Local Obstacle Avoidance with Look-Ahead Verification, International Conference on Robotics and Automation, San Francisco, CA, 2000, pp. 2505-2511.
[18] Yang G.S., Chen E.K., An C.W. (2004). Mobile robot navigation using neural Q-learning, Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on (Vol. 1, pp. 48-52), IEEE.
[19] Yang S.X., Luo C. (2004). A neural network approach to complete coverage path planning, Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 34(1), 718-724.
[20] Yang S.X., Meng M. (2000). An efficient neural network approach to dynamic robot motion planning, Neural Networks, 13(2), 143-148.
[21] Floreano D., Mattiussi C. (2002). Manuale sulle reti neurali, Il Mulino, Bologna.
[22] 4tronix website, http://4tronix.co.uk/
[23] HC-SR04 Ultrasonic Ranging Module, Iteadstudio, http://wiki.iteadstudio.com/Ultrasonic_Ranging_Module_HC-SR04
[24] Neuroph Framework, Neuroph website, http://neuroph.sourceforge.net/
[25] Raspberry Pi website, http://www.raspberrypi.org