<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Simulation of the navigation of a mobile robot by the Q-Learning using artificial neuron networks</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mezaache</forename><surname>Hatem</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Engineering Department of Electronics</orgName>
								<orgName type="institution">University Hadj Lakhdar of Batna</orgName>
								<address>
									<country key="DZ">Algeria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Abdessemed</forename><surname>Foudil</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Engineering Department of Electronics</orgName>
								<orgName type="institution">University Hadj Lakhdar of Batna</orgName>
								<address>
									<country key="DZ">Algeria</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Simulation of the navigation of a mobile robot by the Q-Learning using artificial neuron networks</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B644A9F0998FB5A32614F98F54385773</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T00:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Reinforcement learning</term>
					<term>Q-Learning</term>
					<term>Q-Function</term>
					<term>Artificial Neural Networks</term>
					<term>Mobile Robot</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents reinforcement learning, a type of machine learning often used in the field of robotics. It aims to determine a control law for a mobile robot in an unknown environment. This kind of technique applies when the only information available on the quality of the actions performed by the mobile robot is a scalar signal encoding a reward or a punishment; the learning process consists of improving the choice of actions so as to maximize the rewards. One of the most widely used algorithms for solving this problem is Q-Learning, which is based on the Q-function. To generate this function and keep the learning system working in changing environments where mobile robots face wide open spaces, an artificial neural network is used. The action performed by the mobile robot in its environment is chosen by a selection function, and this action is evaluated by a scalar signal taking the values -1, 0 and 1.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Learning is a process that improves the performance of a system on the basis of its past experiences. It is used when the problem seems too complicated to solve in real time, or when it seems impossible to solve the problem in a classical, rigorous way. Reinforcement learning is one example of such a method.</p><p>Reinforcement learning is a technique by which an executing agent acquires a desired behavior through methods based on the concept of reward or punishment. The optimal behavior of an agent is often difficult to implement directly, given the large number of variables that may play a role. Within the framework of reinforcement learning, the agent can learn to behave, much as we learn to ride a bicycle, from a signal known as the reinforcement. One of the fundamental parts of a reinforcement learning system is the Q-function; it allows the agent to learn how to choose good actions and how to measure their utility.</p><p>One of the goals of navigation for autonomous mobile robots is obstacle avoidance, and several techniques and methods have been used for this purpose <ref type="bibr" target="#b0">[1]</ref>, <ref type="bibr" target="#b1">[2]</ref>. A typical algorithm drives the mobile robot from a start position to a final position along a preset path; if the robot encounters obstacles on its way, an obstacle-avoidance algorithm takes control of the mobile robot, and once the path is free, travel toward the destination resumes <ref type="bibr" target="#b2">[3]</ref>. In this article we solve the problem of navigation with obstacle avoidance by a learning method: Q-Learning, with the Q-function generated by a neural network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Principle</head><p>Reinforcement learning is a technique based on trial and error and on the interaction between the agent and its environment <ref type="bibr" target="#b3">[4]</ref>, <ref type="bibr" target="#b4">[5]</ref>: from a state or situation s of the environment, the agent selects and performs an action that causes a transition to a state s'. In return it receives a reinforcement signal r, which can be a reward or a punishment. The purpose of this learning is to maximize the future rewards <ref type="bibr" target="#b3">[4]</ref>. Fig. <ref type="figure" target="#fig_0">1</ref> shows the interaction between the agent and the environment. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Q-Learning</head><p>Q-Learning is a well-known algorithm for solving reinforcement learning problems; it was proposed in 1989 by C. J. Watkins <ref type="bibr" target="#b5">[6]</ref>, <ref type="bibr" target="#b6">[7]</ref>. The algorithm is based on three main functions: an evaluation function, a reinforcement function, and an update function. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1">Evaluation function</head><p>The state of the environment and the action performed by the agent are evaluated by this function, called the Q-function. From the current state s of the environment observed by the agent, an action is performed on the basis of the knowledge available in the internal memory (this knowledge is stored in the form of a utility value associated with each "state, action" pair) <ref type="bibr" target="#b7">[8]</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2">Reinforcement Function</head><p>After the agent executes an action in its environment, the reinforcement function R provides, for each new state s', a signal r that can be a reward or a punishment. This signal usually takes a single value, 1, -1 or 0, and is used by the update function to adjust the Q-function value associated with the "state, action" pair <ref type="bibr" target="#b7">[8]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.3">Update Function</head><p>This function uses the reinforcement value to adjust the value associated with the "state, action" pair that has just been executed <ref type="bibr" target="#b7">[8]</ref>.</p><p>The principle of Q-Learning is to estimate the Q-function, noted Q, defined by:</p><formula xml:id="formula_0">Q^{*}(s,a) = E\left[ r_{t+1} + \gamma \max_{a'} Q^{*}(s',a') \right] \quad (1)</formula><p>Using this equation for the update, one finds:</p><formula xml:id="formula_1">Q_{t+1}(s_t,a_t) = (1-\alpha_t)\, Q_t(s_t,a_t) + \alpha_t \left[ r_{t+1} + \gamma \max_{a'} Q_t(s',a') \right] \quad (2)</formula><p>where</p><formula xml:id="formula_2">r_{t+1}</formula><p>is the reinforcement received when the agent selects the action "a" in the state "s" to move to the state "s'", and</p><formula xml:id="formula_3">\alpha_t \in (0,1]</formula><p>is a positive real. In principle, the environment should be explored randomly for a large number of iterations for Q-Learning to converge to the optimal Q-function <ref type="bibr" target="#b7">[8]</ref>; the optimal policy is then defined by:</p><formula xml:id="formula_4">\pi^{*}(s) = \arg\max_{a \in A} Q(s,a) \quad (3)</formula></div>
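As an illustration of the update rule of Eq. (2), the following is a minimal tabular sketch; the state/action encodings and the values of α and γ are illustrative assumptions, not taken from the paper:

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q[(s, a)]

# Hypothetical usage: one update after moving forward and receiving r = 1.
actions = ["turn_left", "forward", "turn_right"]
Q = {}
q_update(Q, "s0", "forward", 1.0, "s1", actions)
```

Repeated applications of this rule over many random explorations are what drives the convergence toward the optimal Q-function mentioned above.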
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">System architecture of reinforcement learning</head><p>The objective of this proposal is a learning system that allows a robot to move in a totally unknown environment while avoiding obstacles.</p><p>The generation of the Q-function is performed by an MLP-type artificial neural network, whose inputs are the sensor readings associated with the robot's three sides, giving it a perception of its environment, together with the vector of possible actions (Turn Left, Forward and Turn Right). The action the agent must perform is chosen by a function called the selection function. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Generating the Q-function with an Artificial Neural Network</head><p>The Q-function can be generated by a table in which each cell holds an approximation of the Q-function for one configuration of the state/action pair. This severely limits the size of the problems we can solve; indeed, many real-world problems, such as those in robotics, have a large state space. An effective approach to generating the Q-function over large spaces is to use artificial neural networks. The approximation of the Q-function is obtained with an artificial neural network trained by the back-propagation algorithm <ref type="bibr" target="#b8">[9]</ref>, <ref type="bibr" target="#b9">[10]</ref>. In this implementation, the network chosen is a Multi-Layer Perceptron whose inputs are the state of the environment and the possible actions, with a layer of n<hi rend="subscript">h</hi> hidden neurons and one output neuron giving the value of the Q-function <ref type="bibr" target="#b10">[11]</ref>. The activation function of all neurons is the sigmoid function.</p><p>For the output-layer weight vector we have:</p><formula xml:id="formula_5">w_2(t+1) = w_2(t) + \alpha \left[ Q_{target} - Q(s_t,a_t,w) \right] f(w_1 S + b_1) \quad (4)</formula><p>and for the bias we have:</p><formula xml:id="formula_6">b_2(t+1) = b_2(t) + \alpha \left[ Q_{target} - Q(s_t,a_t,w) \right] <label>(5)</label></formula><p>where Q<hi rend="subscript">target</hi> is a simplification of the Bellman optimality equation, given by:</p><formula xml:id="formula_7">Q_{target} = r(s_t,a_t) + \gamma \max_{a} Q(S',a,w)</formula><p>For the hidden-layer weight matrix we have:</p><formula xml:id="formula_8">w_1(t+1) = w_1(t) + \alpha \left[ Q_{target} - Q(s_t,a_t,w) \right] w_2\, f'(w_1 S + b_1)\, S \quad (7)</formula><p>and for the bias vector:</p><formula xml:id="formula_9">b_1(t+1) = b_1(t) + \alpha \left[ Q_{target} - Q(s_t,a_t,w) \right] w_2\, f'(w_1 S + b_1) \quad (8)</formula><p>With:</p><p>f : activation function of the neurons in the hidden layer. S : state of the environment.</p></div>
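The weight and bias updates of Eqs. (4)-(8) can be sketched as a small MLP trained by the delta rule. The layer sizes follow the text (ten inputs, seven hidden sigmoid neurons, one sigmoid output); the class name, learning rate, and initialization are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class QNet:
    """Sketch of an MLP Q-function approximator trained toward Q_target."""

    def __init__(self, n_in=10, n_hidden=7, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.5, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.5, n_hidden)
        self.b2 = 0.0

    def forward(self, x):
        self.h = sigmoid(self.w1 @ x + self.b1)     # hidden layer f(w1 S + b1)
        self.q = sigmoid(self.w2 @ self.h + self.b2)  # output Q(s, a, w)
        return self.q

    def backprop(self, x, q_target, alpha=0.5):
        # Delta rule through both sigmoid layers, as in Eqs. (4)-(8);
        # q_target would be r + gamma * max_a Q(S', a, w) in training.
        err = q_target - self.forward(x)
        d_out = err * self.q * (1.0 - self.q)            # output delta
        d_hid = d_out * self.w2 * self.h * (1.0 - self.h)  # hidden delta
        self.w2 += alpha * d_out * self.h                # Eq. (4)
        self.b2 += alpha * d_out                         # Eq. (5)
        self.w1 += alpha * np.outer(d_hid, x)            # Eq. (7)
        self.b1 += alpha * d_hid                         # Eq. (8)
```

Repeatedly calling `backprop` with the Bellman target drives the network output toward the desired Q-value for the presented state/action input.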
<div xmlns="http://www.tei-c.org/ns/1.0"><head> </head><formula xml:id="formula_10">Q(s_t,a_t,w)</formula><p>: the value of the state/action pair corresponding to the action performed.</p><p>These changes to the weight matrix and the bias vector are applied if the reinforcement signal is -1 or 0.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Action Selection Function</head><p>The neural network allows us to generate the Q-function. The set of possible actions is given by A = {a 1 , a 2 , a 3 }, where:</p><p>• a 1 : Turn Left Action.</p><p>• a 2 : Forward Action.</p><p>• a 3 : Turn Right Action.</p><p>The selection of the action that the robot must execute is based on an Exploration/Exploitation policy (EEP) <ref type="bibr" target="#b11">[12]</ref>. For this purpose we used the ε-greedy method, which consists of choosing the greedy action with probability ε and a random action with probability 1-ε:</p><p>• Let</p><formula xml:id="formula_11">p \in [0,1]</formula><p>be a random number.</p><p>• If p is greater than or equal to ε, choose a random action a ("Exploration"), where a belongs to A, the set of all possible actions.</p><p>• If p is less than ε, select</p><formula xml:id="formula_12">a(S) = \arg\max_{b \in A} Q(S,b,w) \quad ("Exploitation")<label>(9)</label></formula></div>
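The selection function can be sketched as follows; following the text's convention, the greedy action is chosen with probability ε and a random action with probability 1-ε (the function and variable names are illustrative):

```python
import random

def select_action(q_of, actions, eps=0.9, rng=random):
    """Return the greedy action w.p. eps, else a random action.

    q_of is any callable returning Q(S, a) for an action a.
    """
    if rng.random() < eps:
        return max(actions, key=q_of)   # Exploitation: argmax_b Q(S, b, w)
    return rng.choice(actions)          # Exploration: random a in A

# Hypothetical usage with fixed Q-values for one state.
actions = ["turn_left", "forward", "turn_right"]
q_values = {"turn_left": 0.2, "forward": 0.8, "turn_right": 0.1}
a = select_action(q_values.get, actions, eps=1.0)  # eps=1.0: always greedy
```

With ε close to 1 the robot mostly exploits what it has learned; lowering ε increases random exploration of the environment.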
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Environment</head><p>The state of the environment is read through sensors placed on three sides of the robot: two on the left, two on the right and three in front. Proximity-detection sensors can be used; the opening angle of each sensor varies between -π/12 and π/12. The state vector S is chosen so as to obtain information on the existence of obstacles on the three sides of the robot. This vector is composed of seven binary variables s i , i = 1,..,7, chosen so as to give all-or-nothing information: for example, if s i = 1 there is an obstacle near the robot, and if s i = 0 there is no obstacle near the robot.</p><p>Fig. <ref type="figure" target="#fig_5">5</ref> shows a state of the environment that gives the state vector</p><formula xml:id="formula_13">S = (1,1,0,0,0,0,0)</formula></div>
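Building the seven-bit state vector from proximity readings can be sketched as below; the distance threshold and function name are illustrative assumptions:

```python
def state_vector(distances, threshold=0.5):
    """Map seven proximity readings to the binary state vector S.

    s_i = 1 if an obstacle is closer than `threshold`, else 0.
    """
    return tuple(1 if d < threshold else 0 for d in distances)

# Hypothetical readings: obstacles near the first two sensors only.
S = state_vector([0.2, 0.3, 1.0, 1.0, 1.0, 1.0, 1.0])
```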
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Model of the Robot</head><p>The robot chosen for the application is circular with radius R. It is driven by two independent wheels separated by a distance L. Fig. <ref type="figure" target="#fig_6">6</ref> shows this type of robot. </p><formula xml:id="formula_14">x_r(k+1) = x_r(k) + v(k)\cos\left(\theta(k) + \frac{\Delta\theta(k)}{2}\right) \quad (11) \qquad y_r(k+1) = y_r(k) + v(k)\sin\left(\theta(k) + \frac{\Delta\theta(k)}{2}\right) <label>(12)</label></formula><p>x r (k) and y r (k) are the x and y coordinates of the robot in the reference frame (Ox, Oy).  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1.2">The orientation of the robot</head><formula xml:id="formula_15">\theta(k+1) = \theta(k) + \Delta\theta(k)<label>(13)</label></formula><formula xml:id="formula_16">v(k) = \frac{v_r(k) + v_l(k)}{2}<label>(14)</label></formula><formula xml:id="formula_17">\Delta\theta(k) = \frac{v_r(k) - v_l(k)}{L}<label>(15)</label></formula><p>Where: L: is the distance between the right wheel and the left wheel.</p></div>
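The kinematic model of Eqs. (11)-(15) can be sketched as a single update step; the wheel speeds and wheelbase value used in the example are illustrative:

```python
import math

def step(x, y, theta, v_r, v_l, L=0.3):
    """One discrete-time update of the differential-drive robot pose."""
    v = (v_r + v_l) / 2.0                        # Eq. (14): linear speed
    d_theta = (v_r - v_l) / L                    # Eq. (15): heading change
    x += v * math.cos(theta + d_theta / 2.0)     # Eq. (11)
    y += v * math.sin(theta + d_theta / 2.0)     # Eq. (12)
    theta += d_theta                             # Eq. (13)
    return x, y, theta

# Equal wheel speeds: the robot drives straight ahead along the x axis.
x, y, th = step(0.0, 0.0, 0.0, 1.0, 1.0)
```

Unequal wheel speeds produce a turn: the heading changes by (v_r - v_l)/L while the position advances along the mean heading over the step.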
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">Algorithm Q-Learning with Artificial Neural Networks</head><p>(1)-Initialize the weights W of the neural network at random;</p><p>(2)-Set the initial position of the robot [X r (0), Y r (0), θ r (0)];</p><p>For k = 1 to k iterations:</p><p>(3)-Read the state S t of the environment with the seven sensors (Left, Front, Right);</p><p>(4)-For this state, compute the Q-function with the neural network for the three possible actions (Turn Left, Forward, Turn Right);</p><p>(5)-Determine the optimal action, i.e. the action with the maximum Q-value (optimal_action ← Max(Q));</p><p>(6)-Execute the optimal action with probability ε, or a random action with probability 1-ε;</p><p>(7)-Read the new state S t+1 of the environment through the seven sensors;</p><p>To test the capability of the proposed artificial neural network, Fig. <ref type="figure" target="#fig_12">10</ref> shows a change of environment, where the robot moves through its environment while avoiding obstacles. </p></div>
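The steps above can be sketched as one loop. For brevity this sketch uses a tabular Q-function in place of the paper's neural network, together with a toy one-dimensional environment and reward scheme; all of these, and the parameter values, are illustrative assumptions:

```python
import random

ACTIONS = ["turn_left", "forward", "turn_right"]

def run_episode(iterations=100, eps=0.8, alpha=0.1, gamma=0.9, seed=1):
    rng = random.Random(seed)
    Q = {}                                     # (1) learned values (Q-table here)
    pos = 0                                    # (2) initial robot state
    for _ in range(iterations):
        s = pos                                # (3) read the current state
        qs = {a: Q.get((s, a), 0.0) for a in ACTIONS}  # (4) Q for all actions
        greedy = max(qs, key=qs.get)           # (5) action with maximum Q
        a = greedy if rng.random() < eps else rng.choice(ACTIONS)  # (6)
        pos = pos + 1 if a == "forward" else pos        # toy transition
        s_next = pos                           # (7) read the new state
        r = 1 if a == "forward" else 0         # (8) toy reinforcement
        best_next = max(Q.get((s_next, b), 0.0) for b in ACTIONS)
        # (9)-(10) Q-Learning update of the stored values
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q

Q = run_episode()
```

In the paper's setting, steps (4) and (9) would instead go through the MLP forward pass and its back-propagation update, and the reinforcement would come from collision detection rather than a toy rule.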
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>In this paper we presented the technique of reinforcement learning, where we chose the Q-Learning algorithm using artificial neural networks for the generation of the Q-function.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Interaction between the agent and the environment</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. The general model for the algorithms of reinforcement learning, by Q-Learning.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3</head><label>3</label><figDesc>shows the structure of reinforcement learning based on an Artificial Neural Network</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. The structure of reinforcement learning based on an Artificial Neural Network.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Q-function with Artificial Neural Network-type MLP for Q-Learning</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Fig. 5 .</head><label>5</label><figDesc>Fig. 5. State of the Environment.</figDesc><graphic coords="7,231.12,252.48,144.48,103.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Fig. 6 .</head><label>6</label><figDesc>Fig. 6. The robot Type.This robot is characterized by the following cinematic equations:</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head></head><label></label><figDesc>The angular position of the robot in the reference frame (Ox, Oy). 8.1.3 The speed of the robot</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head></head><label></label><figDesc>The angular velocity of the right wheel of the robot; the angular velocity of the left wheel of the robot. 8.1.4 The change in theta</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head></head><label></label><figDesc>(8)-Compute the reinforcement; (9)-Test the reinforcement: if r = -1 (there is an obstacle), update the weights and biases of the neural network with the Q-Learning formula.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Fig. 8 .</head><label>8</label><figDesc>Fig. 8. The trajectory followed by the robot during the learning stage for 2500 iterations.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_11"><head>Fig. 9 .</head><label>9</label><figDesc>Fig. 9. The trajectory of the robot after learning for 2500 iterations.</figDesc><graphic coords="11,225.60,297.48,155.52,110.64" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_12"><head>Fig. 10 .</head><label>10</label><figDesc>Fig. 10. Change of environment.</figDesc></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This approach has been used for the navigation of a mobile robot in an unknown environment while avoiding obstacles; the results are very satisfactory and meet the objective. This allows us to say that the approach can be used for the navigation of a mobile robot in an unknown environment.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>(10)-End if.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10">Simulation results</head><p>The proposed algorithm is implemented in a simulation of a mobile robot in a scene where two obstacles are placed. The use of the Q-Learning algorithm with an MLP-type artificial neural network for obstacle avoidance is the major objective of our simulation.</p><p>The generation of the Q-function is based on an MLP-type artificial neural network trained by the back-propagation algorithm. The network is characterized by ten input cells, where seven neurons present the state of the environment and the other three present the possible actions; the values of these input neurons are binary (0, 1). A hidden layer contains seven neurons whose activation function is the sigmoid, and the output layer contains one neuron, also with a sigmoid activation function. The adjustment of the weights and biases of this neural network is done when the robot collides with an obstacle, after which it returns to its initial state. The action performed by the robot is chosen with the ε-greedy method. The state of the environment is obtained by simulating proximity sensors placed on three sides of the robot. Fig. <ref type="figure">7</ref> shows the environment in which the robot evolves, where the reinforcement is given by a scalar signal set to -1 when the robot hits an obstacle, 1 when it goes straight ahead, and 0 in all other cases. The trajectory followed by the robot during the learning stage is presented in Fig. <ref type="figure">8</ref> for 2500 iterations. The trajectory of the robot after learning is presented in Fig. <ref type="figure">9</ref> for 2500 iterations. </p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Robot Motion Planning: A Distributed Representation Approach</title>
		<author>
			<persName><forename type="first">J</forename><surname>Barraquand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Latombe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Int. Jour. of Robotics Research</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="628" to="649" />
			<date type="published" when="1991">1991</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A Simple Goal Seeking Navigation Method for a Mobile Robot Using Human Sense, Fuzzy Logic and Reinforcement Learning</title>
		<author>
			<persName><forename type="first">Hamid</forename><surname>Boubertakh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohamed</forename><surname>Tadjine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pierre-Yves</forename><surname>Glorennec</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">KES</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="666" to="673" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Neuro Fuzzy Behavior-Based Control of a Mobile Robot in an Unknown Environments</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">C</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Of the 3rd Int. Conf. On Machine Learning and Cyb</title>
				<meeting>Of the 3rd Int. Conf. On Machine Learning and Cyb<address><addrLine>Shangai</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004-08">August, 2004</date>
			<biblScope unit="page" from="26" to="29" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Reinforcement Learning: An Introduction</title>
		<author>
			<persName><forename type="first">Richard</forename><forename type="middle">S</forename><surname>Sutton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><forename type="middle">G</forename><surname>Barto</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>MIT</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Reinforcement Learning Neural Network to the Problem of Autonomous Mobile Robot Obstacle Avoidance</title>
		<author>
			<persName><forename type="first">Bing-Qiang</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Guang-Yi</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Min</forename><surname>Guo</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2005-08">August 2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A Fuzzy-Based Reactive Controller For Non-holonomic Mobile Robot</title>
		<author>
			<persName><forename type="first">F</forename><surname>Abdessemed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Benmahammed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Monacelli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Robotics and Autonomous Systems</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="page" from="31" to="46" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Learning from Delayed Rewards</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Watkins</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1989">1989</date>
			<pubPlace>England</pubPlace>
		</imprint>
		<respStmt>
			<orgName>University of Cambridge</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD Thesis</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Q-Learning Based on Regularization Theory to Treat the Continuous States and Actions</title>
		<author>
			<persName><forename type="first">Takanori</forename><surname>Fukao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Takaaki</forename><surname>Sumitomo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Norikatsu</forename><surname>Ineyama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Norihiko</forename><surname>Adachi</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
		<respStmt>
			<orgName>Department of Applied Systems Science, Graduate School of Engineering, Kyoto University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Q-Learning for Robots</title>
		<author>
			<persName><forename type="first">Claude</forename><forename type="middle">F</forename><surname>Touzet</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">A reinforcement learning approach to obstacle avoidance of mobile robots</title>
		<author>
			<persName><forename type="first">Kristijan</forename><surname>Macek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ivan</forename><surname>Petrovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nedjelko</forename><surname>Peric</surname></persName>
		</author>
		<imprint/>
		<respStmt>
			<orgName>University of Zagreb</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Semi-Online Neuronal Q-Learning for Real Time Robot Learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Carreras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ridao</surname></persName>
		</author>
		<author>
			<persName><surname>El-Fakdi</surname></persName>
		</author>
		<imprint/>
		<respStmt>
			<orgName>Institute of Informatics and Applications, University of Girona, Spain</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Algorithmes d&apos;apprentissage pour les systèmes d&apos;inférence floue</title>
		<author>
			<persName><forename type="first">Pierre-Yves</forename><surname>Glorennec</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
		<respStmt>
			<orgName>Département informatique, INSA de Rennes / IRISA</orgName>
		</respStmt>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
