<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Simulation of the navigation of a mobile robot by Q-Learning using artificial neural networks</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University Hadj Lakhdar of Batna, Algeria, Faculty of Engineering, Department of Electronics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a type of machine learning, reinforcement learning, an approach often used in the field of robotics. It aims to determine a control law for a mobile robot in an unknown environment. This kind of technique applies when one assumes that the only information on the quality of the actions performed by the mobile robot is a scalar signal, a reward or a punishment; the learning process then consists in improving the choice of actions so as to maximize the rewards. One of the most widely used algorithms for solving this problem is the Q-Learning algorithm, which is based on the Q-function. To generate this function, and to keep the learning system working in changing environments where mobile robots face wide open spaces, an artificial neural network is used. The action performed by the mobile robot in its environment is chosen by a selection function, and this action is evaluated by a scalar signal taking the values -1, 0 and 1.</p>
      </abstract>
      <kwd-group>
        <kwd>Reinforcement learning</kwd>
        <kwd>Q-Learning</kwd>
        <kwd>Q-Function</kwd>
        <kwd>Artificial Neural Networks</kwd>
        <kwd>Mobile Robot</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Learning is a process that improves the performance of a system on the basis of its past
experiences. It is used when the problem seems too complicated to solve in
real time, or when it seems impossible to solve the problem in a classical and
rigorous way. Reinforcement learning is one example of such learning methods.</p>
      <p>Reinforcement learning is a technique by which an executing agent acquires a
desired behavior through methods based on the concept of reward or punishment. The
optimal behavior of an agent is often difficult to implement directly, given the large number of
variables that may play a role. In the framework of reinforcement learning, the agent
can learn to behave much as we learn to ride a bicycle, from a signal known
as the reinforcement. One of the fundamental parts of a reinforcement learning
system is the Q-function; it allows the agent to learn how to choose good actions and
how to measure their utility.</p>
      <p>One of the goals of the navigation of autonomous mobile robots is the avoidance of
obstacles, and several techniques and methods are used for this purpose [1], [2]. A
navigation algorithm drives the mobile robot from a starting position to a final
position along a predefined path. If the robot encounters obstacles on its path,
the obstacle-avoidance algorithm takes control of the mobile robot; once the path of
the robot is free, the journey to the destination resumes [3]. In this article we solve the
problem of navigation with obstacle avoidance by a learning method: for this
purpose we use Q-Learning, with the Q-function generated by a neural network.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Reinforcement learning</title>
      <sec id="sec-2-1">
        <title>2.1 Principle</title>
        <p>Reinforcement learning is a learning technique based on trial and error and on the
interaction between the agent and its environment [4], [5]: from a state or
situation s of the environment, the agent selects and performs an action that causes a
transition to a state s'. It receives in return a reinforcement signal r, which can be a
reward or a punishment. The purpose of this learning is to maximize the future rewards [4].
Fig. 1 shows the interaction between the agent and the environment: the agent sends an
action to the environment, which returns a state s and a reinforcement r.</p>
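        <p>As a minimal illustration of this interaction loop (a toy sketch with an assumed environment, not the simulator of this paper), the agent repeatedly observes a state s, performs an action, and receives the next state s' and a reinforcement r:</p>
        <preformat><![CDATA[
import random

def env_step(s, action):
    """Hypothetical environment: returns the next state s' and reinforcement r."""
    s_next = s + action                  # toy transition
    r = -1 if abs(s_next) > 3 else 0     # punish leaving the allowed zone
    return s_next, r

s = 0
for t in range(10):
    action = random.choice([-1, 0, 1])   # the agent selects an action
    s, r = env_step(s, action)           # transition to s', reinforcement r
]]></preformat>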
      </sec>
      <sec id="sec-2-2">
        <title>Environment</title>
      </sec>
      <sec id="sec-2-3">
        <title>Agent</title>
        <p>S
t
a
t
S</p>
        <p>R
e
i
n
f
o
r
c
e
m
e
n
t
r</p>
        <p>The Q-Learning is a famous algorithm that is used for solving problems of
reinforcement learning, it was proposed in 1989 by C. J. Watkins [6], [7]. This
algorithm is based on three main functions, an evaluation function, a function of
strengthening and reinforcing function. Fig. 2 shows the general model for the
algorithms of reinforcement learning, by Q-Learning.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.2.2 Reinforcement Function</title>
        <p>After the agent executes an action in its environment, the reinforcement function
R provides, for each new state s', a signal r that can be a reward or a punishment. This
signal usually takes a single value, 1, -1 or 0, and it is used by the update function to
adjust the value of the Q-function associated with the pair (state, action) [8].</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.2.3 Update Function</title>
        <p>This function uses the value of the reinforcement to adjust the value associated with
the pair (state, action) that has just been visited [8].</p>
        <p>The principle of Q-Learning is to estimate the optimal Q-function, noted Q*, which is
defined by:</p>
        <p>Q*(s, a) = E[r_{t+1} + γ V*(s')] = E[r_{t+1} + γ max_{a'} Q*(s', a')]   (1)</p>
        <sec id="sec-2-5-1">
          <title>Using the update equation, one finds that:</title>
          <p>Q_{t+1}(s_t, a_t) = (1 - α_t) Q_t(s_t, a_t) + α_t [r_{t+1} + γ max_{a'} Q_t(s', a')]   (2)</p>
          <p>r_{t+1} is the reinforcement received when the agent selects the action "a" in the
state "s" and moves to the state "s'".</p>
          <p>α_t is a positive real number: α_t ∈ (0, 1].</p>
          <p>In principle, the agent should explore the environment randomly for a large number of
iterations so that Q-Learning converges to the optimal Q-function [8]; one can then
use the optimal policy defined by:</p>
          <p>π'(s) = arg max_{a ∈ A} Q*(s, a)   (3)</p>
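          <p>As an illustration, the update (2) and the greedy policy (3) can be sketched in tabular form as follows (the values of α_t and γ, and all names, are illustrative assumptions):</p>
          <preformat><![CDATA[
from collections import defaultdict

# Minimal tabular sketch of the Q-Learning update (2) and greedy policy (3).
Q = defaultdict(float)            # Q[(s, a)] approximates Q_t(s, a)
alpha, gamma = 0.1, 0.9           # learning rate alpha_t, discount factor gamma

def q_update(s, a, r, s_next, actions):
    # Q_{t+1}(s,a) = (1 - alpha) Q_t(s,a) + alpha [r + gamma max_{a'} Q_t(s',a')]
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def greedy_policy(s, actions):
    # pi'(s) = arg max_{a in A} Q(s, a)
    return max(actions, key=lambda a: Q[(s, a)])
]]></preformat>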
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 System architecture of reinforcement learning</title>
      <p>The objective of this proposal is a learning system that allows a robot
moving in a totally unknown environment to avoid obstacles.</p>
      <p>The generation of the Q-function is performed by an artificial neural network of
MLP type, whose inputs are the readings of the sensors placed on the three
sides of the robot, giving it the perception of its environment, together with the vector of
possible actions (Turn Left, Forward and Turn Right). The choice of the action that the agent
must perform is made by a function called the selection function.</p>
      <p>Fig. 3 shows the structure of reinforcement learning based on an Artificial Neural
Network: the inputs and the possible actions feed the ANN, which outputs a Q-value; a
selection function produces the action, and the reinforcement signal drives the learning.</p>
      <sec id="sec-3-1">
        <title>4 Generating the Q-function with an Artificial Neural Network</title>
        <p>The generation of the Q-function can be made by a table in which each cell
corresponds to an approximation of the Q-function for one configuration of the pair state
/ action. This severely limits the size of the problems we can solve; in fact, many
real-world problems, such as robotics, have a large state space. An effective approach to the
generation of the Q-function in the case of large spaces is to use Artificial Neural
Networks. The approximation of the Q-function is then obtained using Artificial Neural
Networks with the back-propagation learning algorithm [9], [10].</p>
        <p>In this implementation, the chosen Artificial Neural Network is a Multi-Layer
Perceptron, which takes as inputs the state of the environment and the possible actions, with a
layer containing nh hidden neurons and one output neuron that gives the value of the Q-function
[11]. The activation function of all neurons is the sigmoid function. Fig. 4 shows the
generation of the Q-function with an Artificial Neural Network of MLP type for
Q-Learning: the state vector S and the possible actions (Turn Left, Forward, Turn Right) enter
the network; a hidden layer and an output layer produce Q(S, a, w).</p>
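        <p>A minimal sketch of such an MLP (with assumed weight initialization, seven state bits and a one-hot coding of the three actions as inputs) might look as follows:</p>
        <preformat><![CDATA[
import numpy as np

# Sketch of the MLP of Fig. 4: 10 binary inputs (7 state bits + 3 action bits),
# nh sigmoid hidden neurons, one sigmoid output giving Q(S, a, w).
rng = np.random.default_rng(0)
nh = 7                                        # hidden neurons (as in Section 10)
w1, b1 = rng.normal(0.0, 0.1, (nh, 10)), np.zeros(nh)
w2, b2 = rng.normal(0.0, 0.1, (1, nh)), np.zeros(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def q_value(state_bits, action_index):
    # Input vector: the 7-bit state concatenated with a one-hot action coding.
    x = np.concatenate([state_bits, np.eye(3)[action_index]])
    hidden = sigmoid(w1 @ x + b1)               # hidden layer activations
    return float(sigmoid(w2 @ hidden + b2)[0])  # scalar Q(S, a, w)
]]></preformat>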
        <sec id="sec-3-3-1">
          <title>4.1 Learning of Artificial Neural Networks</title>
          <p>The learning of the network is based on updating or adjusting the weight matrices of the
neural network, using the update equation of the classic Q-Learning algorithm
together with the back-propagation algorithm.</p>
          <sec id="sec-3-3-2">
            <title>4.1.1 The output layer</title>
            <p>For the weight matrix:</p>
            <p>w_2(t) = w_2(t-1) + η [Q_target - Q(s_t, a_t, w_t)] ∂Q(s_t, a_t, w_t)/∂w_2   (4)</p>
            <p>For the bias vector:</p>
            <p>b_2(t) = b_2(t-1) + η [Q_target - Q(s_t, a_t, w_t)] ∂Q(s_t, a_t, w_t)/∂b_2   (5)</p>
          </sec>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Where:</title>
        <p>Q_target is a simplification of the Bellman optimality equation, and is given by the
following equation:</p>
        <p>Q_target = r(s_t, a_t) + γ max_{a} Q(S', a, w)   (6)</p>
        <p>f: activation function of the neurons in the output layer.</p>
        <p>Q(s_t, a_t, w): value of the state/action function corresponding to the action performed.</p>
        <p>S: state of the environment.</p>
        <p>These changes to the weight matrix and to the bias vector are applied when the
reinforcement signal is -1 or 0.</p>
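        <p>Reusing the hypothetical q_value function sketched above, the target (6) can be computed as follows (γ is an assumed discount factor):</p>
        <preformat><![CDATA[
# Sketch of the Bellman target (6); gamma is an assumed discount factor.
gamma = 0.9
ACTIONS = (0, 1, 2)   # Turn Left, Forward, Turn Right

def q_target(r, next_state_bits):
    # Q_target = r(s_t, a_t) + gamma * max_a Q(S', a, w)
    return r + gamma * max(q_value(next_state_bits, a) for a in ACTIONS)
]]></preformat>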
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 Action Selection Function</title>
      <p>The neural network allows us to generate the Q-function. The set of possible
actions is given by A = {a1, a2, a3}, where:
• a1: Turn Left action.
• a2: Forward action.
• a3: Turn Right action.</p>
      <p>The selection of the action that the robot must execute is based on the
Exploration/Exploitation policy (EEP) [12]. For this purpose we used the ε-greedy method,
which consists of choosing the greedy action with probability ε and
choosing a random action with probability 1 - ε. Let p ∈ [0, 1] be a random number.</p>
      <p>• If p &gt; ε, we choose a random action a ("Exploration"), where a ∈ A, the set of all
possible actions.</p>
      <p>• If p ≤ ε, we select ("Exploitation"):</p>
      <p>a(S) = arg max_{b ∈ A} Q(S, b, w)   (7)</p>
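      <p>A sketch of this selection function, following the convention stated above (greedy action with probability ε) and reusing the hypothetical q_value function of Section 4:</p>
      <preformat><![CDATA[
import random

# epsilon-greedy selection: greedy action with probability epsilon,
# random action with probability 1 - epsilon, as stated in the text.
def select_action(state_bits, epsilon, actions=(0, 1, 2)):
    p = random.random()                    # p in [0, 1]
    if p > epsilon:                        # "Exploration"
        return random.choice(actions)
    # "Exploitation": a(S) = arg max_{b in A} Q(S, b, w)
    return max(actions, key=lambda a: q_value(state_bits, a))
]]></preformat>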
      <sec id="sec-4-1">
        <title>4.1.2 The hidden layer</title>
        <p> For the weight matrix</p>
        <p>w1t   w1t 1  rst , at   aS   Qst , at , wt S
 For the vector of means we have:</p>
        <p>b1 t   b1 t  1   rst , at   aS   Qst , at , wt S</p>
        <sec id="sec-4-1-1">
          <title>With:</title>
          <p>f: activation function of the neurons in the hidden layer.</p>
          <p>S: state of the environment.</p>
        </sec>
      </sec>
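      <p>A hedged sketch of the updates (4), (5), (8) and (9), written as one back-propagation step on the error Q_target - Q(s_t, a_t, w_t) through the MLP sketched in Section 4 (η is an assumed learning rate):</p>
      <preformat><![CDATA[
import numpy as np

# Assumes w1, b1, w2, b2 and sigmoid from the MLP sketch of Section 4.
eta = 0.5   # assumed learning rate

def update_weights(state_bits, action_index, target):
    # One back-propagation step reducing (target - Q)^2; sketches (4),(5),(8),(9).
    global w1, b1, w2, b2
    x = np.concatenate([state_bits, np.eye(3)[action_index]])
    hidden = sigmoid(w1 @ x + b1)
    q = sigmoid(w2 @ hidden + b2)
    err = target - q                                    # Q_target - Q(s_t,a_t,w_t)
    delta2 = err * q * (1.0 - q)                        # output error; f' = q(1-q)
    delta1 = (w2.T @ delta2) * hidden * (1.0 - hidden)  # hidden error via w2
    w2 += eta * np.outer(delta2, hidden)                # (4)
    b2 += eta * delta2                                  # (5)
    w1 += eta * np.outer(delta1, x)                     # (8)
    b1 += eta * delta1                                  # (9)
]]></preformat>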
    </sec>
    <sec id="sec-5">
      <title>6 Environment</title>
      <p>The reading of the state of the environment is done through sensors placed on
three sides of the robot: two on the left, two on the right and three in front. The
sensors used can be of the proximity-detection type. The opening angle of each sensor
varies between -π/12 and π/12. The state vector S is chosen so as to obtain
information on the existence of obstacles on the three sides of the robot.</p>
      <p>This vector is composed of seven binary variables s_i, i = 1, ..., 7. The choice of these
variables is made in order to have all-or-nothing information; i.e., if s_i = 1 there is an
obstacle near the robot, and if s_i = 0 there is no obstacle near the robot.</p>
      <p>Fig. 5 shows a state of the environment that gives the value of the state vector S, namely
S = (1, 1, 0, 0, 0, 0, 0).</p>
    </sec>
    <sec id="sec-6">
      <title>7 Reinforcement function</title>
      <p>For each state in which the robot finds itself, the action performed is evaluated
by a signal known as the reinforcement signal. This reinforcement function allows the
robot to explore its environment; the signal is related to the values of the sensor
measurements that indicate the presence or absence of obstacles in the three directions,
left, front and right, which represent the state of the environment. The value of this
function, i.e. of the reinforcement signal, is given by:</p>
      <p>r = -1 if the robot hits an obstacle; r = 1 if the robot moves straight ahead; r = 0 in all
other cases.   (10)</p>
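      <p>A sketch of this reinforcement function, assuming the simulator provides the collision and straight-ahead tests:</p>
      <preformat><![CDATA[
# Sketch of the reinforcement function (10); the collision and straight-ahead
# tests are assumed to be provided by the simulator.
def reinforcement(collided, moved_straight):
    if collided:            # the robot hits an obstacle
        return -1
    if moved_straight:      # the robot moves straight ahead
        return 1
    return 0                # all other cases
]]></preformat>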
    </sec>
    <sec id="sec-7">
      <title>8 Model of the Robot</title>
      <p>The type of robot that we have chosen for the application is circular, of radius R.
This robot is driven by two independent wheels separated by a distance L. Fig. 6
shows this type of robot: its center C, its left and right wheels separated by L, and its
orientation θ in the (Ox, Oy) frame.</p>
      <sec id="sec-7-1">
        <title>8.1.1 The position of the robot</title>
        <p>xr k  1  xr k   vk  cos k   2k  
yr k  1  yr k   vk   sin k   2k  
xr (k) and yr (k) are the x and y of the robot in the landmark (Ox, Oy).</p>
      </sec>
      <sec id="sec-7-2">
        <title>8.1.2 The orientation of the robot</title>
        <p> k 1   k    k </p>
        <sec id="sec-7-2-1">
          <title>With:</title>
          <p> k  : Is the angular position of the robot in the landmark (Ox, Oy)</p>
        </sec>
      </sec>
      <sec id="sec-7-3">
        <title>8.1.3 The speed of the robot</title>
          <p>v(k) = (1/2) (ω_r(k) + ω_l(k))   (14)</p>
          <p>With:</p>
          <p>ω_r(k): the angular velocity of the right wheel of the robot.</p>
          <p>ω_l(k): the angular velocity of the left wheel of the robot.</p>
      </sec>
      <sec id="sec-7-4">
        <title>8.1.4 The change in theta</title>
        <p> k  
1
2  L
 r k  l k </p>
        <sec id="sec-7-4-1">
          <title>Where:</title>
        </sec>
        <sec id="sec-7-4-2">
          <title>L: is the distance between the right wheel and left wheel.</title>
        </sec>
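        <p>A sketch of this discrete kinematic model, with an assumed value for L:</p>
        <preformat><![CDATA[
import math

# Sketch of the discrete kinematic model (11)-(15); L is the wheel
# separation (assumed value), angular velocities in the simulator's units.
L = 0.3

def robot_step(x, y, theta, w_r, w_l):
    v = 0.5 * (w_r + w_l)                         # (14) linear speed
    d_theta = (w_r - w_l) / (2.0 * L)             # (15) change of orientation
    x = x + v * math.cos(theta + d_theta / 2.0)   # (11)
    y = y + v * math.sin(theta + d_theta / 2.0)   # (12)
    theta = theta + d_theta                       # (13)
    return x, y, theta
]]></preformat>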
      </sec>
    </sec>
    <sec id="sec-8">
      <title>9 Q-Learning Algorithm with Artificial Neural Networks</title>
      <p>(1) Initialize the weights W of the neural network randomly;</p>
      <p>(2) Give the initial position of the robot (x_r(0), y_r(0), θ_r(0));</p>
      <p>For k = 1 to K iterations:</p>
      <p>(3) Read the state S_t of the environment with the seven sensors (Left, Front, Right);</p>
      <p>(4) For this fixed state, compute the Q-function with the neural network for the three
possible actions (Turn Left, Forward, Turn Right);</p>
      <p>(5) Determine the optimal action, namely the action with the maximum Q-value
(optimal_action = max(Q));</p>
      <p>(6) Execute the optimal action with probability ε, or a random action with
probability 1 - ε;</p>
      <p>(7) Read the new state S_{t+1} of the environment through the seven sensors
(Left, Front, Right);</p>
      <p>(8) Compute the reinforcement;</p>
      <p>(9) Test the reinforcement:</p>
      <p>If r = -1 (there is an obstacle):</p>
      <p>(a) Update the weights and biases of the neural network with the Q-Learning formula;</p>
      <p>(b) Return the robot to its original state;</p>
      <p>End if</p>
      <p>(10) ε ← ε × 0.99, to decrease ε gradually;</p>
      <p>End For</p>
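      <p>A minimal sketch of this loop, combining the earlier sketches; read_sensors, execute and collided are hypothetical simulator functions not defined in the paper:</p>
      <preformat><![CDATA[
# Sketch of the full loop of Section 9, combining the earlier sketches
# (state_vector, select_action, q_target, update_weights, reinforcement).
# read_sensors, execute and collided are hypothetical simulator functions.
epsilon = 0.9                                  # assumed initial epsilon
x, y, theta = x0, y0, theta0 = 0.0, 0.0, 0.0   # (2) initial pose

for k in range(2500):                          # 2500 iterations, as in Fig. 8
    s = state_vector(read_sensors(x, y, theta))        # (3) read state
    a = select_action(s, epsilon)                      # (4)-(6) choose action
    x, y, theta = execute(a, x, y, theta)              # apply the action
    s_next = state_vector(read_sensors(x, y, theta))   # (7) new state
    r = reinforcement(collided(x, y), a == 1)          # (8) reinforcement
    if r == -1:                                        # (9) collision
        update_weights(s, a, q_target(r, s_next))      # (a) update the network
        x, y, theta = x0, y0, theta0                   # (b) return to start
    epsilon *= 0.99                                    # (10) decay epsilon
]]></preformat>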
    </sec>
    <sec id="sec-8b">
      <title>10 Simulation results</title>
      <p>The proposed algorithm was implemented to simulate a mobile robot in a
scene in which two obstacles are placed. The use of the Q-Learning algorithm with an
MLP-type Artificial Neural Network for obstacle avoidance is the major objective of
our simulation.</p>
      <p>The generation of the Q-function is based on the use of an Artificial Neural
Network of MLP type, whose learning is performed by the back-propagation algorithm. The
Artificial Neural Network is characterized by ten input neurons, where seven
neurons represent the state of the environment and the three other neurons represent the
possible actions. The values of these input neurons are binary (0, 1). A hidden layer
contains seven neurons whose activation function is the sigmoid function, and the
output layer contains one neuron, also with a sigmoid activation function. The adjustment of
the weights and biases of this neural network is done upon a collision of the robot with an
obstacle, after which the robot returns to its original state. The action performed by the robot
is based on the use of the ε-greedy method. The state of the environment is obtained by the
simulation of proximity sensors placed on the three sides of the robot. Fig. 7 shows the
evolution of the robot's environment, where the reinforcement is given by a scalar
signal which is set to -1 when the robot hits an obstacle, 1 when the robot moves straight
ahead, and 0 in all other cases.</p>
      <p>The trajectory followed by the robot during the learning stage is presented in Fig. 8,
for 2500 iterations. After learning, the trajectory of the robot is presented in Fig. 9, also
for 2500 iterations.</p>
    </sec>
    <sec id="sec-9">
      <title>11 Conclusion</title>
      <p>In this paper we presented the reinforcement learning technique, where we
chose the Q-Learning algorithm and used artificial neural networks for the
generation of the Q-function. This approach was applied to the navigation of a mobile robot
in an unknown environment while avoiding obstacles. The results are very satisfactory
and meet the target, which allows us to say that this approach can be used for the
navigation of a mobile robot in an unknown environment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Barraquand</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.C.</given-names>
            <surname>Latombe</surname>
          </string-name>
          ,
          <article-title>"Robot Motion Planning: A Distributed Representation Approach"</article-title>
          ,
          <source>The Int. Jour. of Robotics Research</source>
          , Vol.
          <volume>10</volume>
          , No.
          <issue>6</issue>
          ,
          <fpage>628</fpage>
          -
          <lpage>649</lpage>
          (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Hamid</given-names>
            <surname>Boubertakh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mohamed</given-names>
            <surname>Tadjine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pierre-Yves</given-names>
            <surname>Glorennec</surname>
          </string-name>
          ,
          <article-title>"A Simple Goal Seeking Navigation Method for a Mobile Robot Using Human Sense, Fuzzy Logic and Reinforcement Learning"</article-title>
          ,
          <source>KES (1)</source>
          <year>2008</year>
          :
          <fpage>666</fpage>
          -
          <lpage>673</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.T.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y. C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>"Neuro Fuzzy Behavior-Based Control of a Mobile Robot in an Unknown Environments"</article-title>
          ,
          <source>Proc. of the 3rd Int. Conf. on Machine Learning and Cyb., Shanghai</source>
          ,
          <fpage>26</fpage>
          -
          <lpage>29</lpage>
          ,
          <year>August</year>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Richard S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew G.</given-names>
            <surname>Barto</surname>
          </string-name>
          .
          <source>Reinforcement Learning: An Introduction</source>
          , Bradford Books, MIT Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Bing-Qiang</given-names>
            <surname>Huang</surname>
          </string-name>
          , Guang-Yi Cao, Min Guo.
          <article-title>Reinforcement Learning Neural Network to the Problem of Autonomous Mobile Robot Obstacle Avoidance</article-title>
          . August 2005
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Abdessemed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Benmahammed</surname>
          </string-name>
          and E. Monacelli, “
          <article-title>A Fuzzy-Based Reactive Controller For Non-holonomic Mobile Robot”</article-title>
          ,
          <source>Journal of Robotics and Autonomous Systems</source>
          ,
          <volume>47</volume>
          (
          <year>2004</year>
          )
          <fpage>31</fpage>
          -
          <lpage>46</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C. J. C. H.</given-names>
            <surname>Watkins</surname>
          </string-name>
          , Learning from Delayed Rewards,
          <source>PhD Thesis</source>
          , University of Cambridge, England,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Takanori</given-names>
            <surname>Fukao</surname>
          </string-name>
          , Takaaki Sumitomo, Norikatsu Ineyama and
          <string-name>
            <given-names>Norihiko</given-names>
            <surname>Adachi</surname>
          </string-name>
          .
          <article-title>Q-Learning Based on Regularization Theory to Treat the Continuous States and Actions</article-title>
          .
          <source>Department of Applied Systems Science</source>
          , Graduate School of Engineering, Kyoto University,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Claude F.</given-names>
            <surname>Touzet</surname>
          </string-name>
          .
          <article-title>Q-Learning for Robots</article-title>
          .
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Kristijan</given-names>
            <surname>Macek</surname>
          </string-name>
          , Ivan Petrovic and
          <string-name>
            <given-names>Nedjelko</given-names>
            <surname>Peric</surname>
          </string-name>
          .
          <article-title>A Reinforcement Learning Approach to Obstacle Avoidance of Mobile Robots</article-title>
          . University of Zagreb.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Carreras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ridao</surname>
          </string-name>
          and El-Fakdi, Institute of Informatics and Applications, University of Girona, Spain.
          <article-title>Semi-Online Neural Q-Learning for Real-Time Robot Learning</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Pierre-Yves</given-names>
            <surname>Glorennec</surname>
          </string-name>
          , Département d'informatique, INSA de Rennes / IRISA, 1999.
          <article-title>Algorithmes d'apprentissage pour les systèmes d'inférence floue</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>