<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamic scheduling in Petroleum process using reinforcement learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nassima Aissani</string-name>
          <email>aissani.nassima@yahoo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bouziane Bedjilali</string-name>
          <email>bouzianebeldjilali@yahoo.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Oran University</institution>
          ,
          <addr-line>BP 1524 El M'nouer, Oran</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Petroleum industry production systems are highly automated. In this industry, all functions (e.g., planning, scheduling and maintenance) are automated, and in order to remain competitive, researchers attempt to design an adaptive control system which not only optimizes the process, but is also able to adapt to rapidly evolving demands at a fixed cost. In this paper, we present a multi-agent approach for dynamic task scheduling in a petroleum industry production system. Agents simultaneously ensure effective production scheduling and the continuous improvement of the solution quality by means of reinforcement learning, using the SARSA algorithm. Reinforcement learning allows the agents to adapt, learning the best behaviors for their various roles without reducing performance or reactivity. To demonstrate the innovation of our approach, we include a computer simulation of our model and the results of experimentation applying our model to an Algerian petroleum refinery.</p>
      </abstract>
      <kwd-group>
        <kwd>reactive scheduling</kwd>
        <kwd>reinforcement learning</kwd>
        <kwd>petroleum process</kwd>
        <kwd>multi-agent system</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Current oil and gas market trends, characterized by great competitiveness and
increasingly complex contradictory constraints, have pushed researchers to design an
adaptive control system that is not only able to react effectively, but is also able to
adapt to rapidly evolving demands at a fixed cost. The system does this by using the
available resources as efficiently as possible to optimize this adaptation. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] presented
an analysis of the needs of production systems, highlighting the advantages of
adopting a self-organized heterarchical control system. The term heterarchy is used
to describe a relationship between entities on the same hierarchical level [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Initially
proposed in the field of medical biology, it was then adapted for several other
domains [9; 10; 7]. In the multi-agent domain, the term heterarchy is relatively close
to the concept of "distribution", as used in "distributed systems". However, from our
point of view, the fact that the decisional capacities are distributed does not mean that
the multi-agent system is organized heterarchically, even though this is often the case
[15;17]. Nonetheless, the heterarchic organization of distributed systems is the
assumption that we make in this paper. From our point of view, this assumption is
justified by the system dynamics and the volatility of the information, which make a
purely or partially hierarchical approach inappropriate for creating an effective
reactive system [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>In this paper, we focus on the dynamic control of complex manufacturing systems,
such as those found in the petroleum industry. In this industry, all functions (e.g.,
planning, scheduling and maintenance) and resources (e.g., turbines, storage systems)
are automated.</p>
    </sec>
    <sec id="sec-2">
      <title>2 BRIEF DESCRIPTION OF UNIT3100 IN RA1Z REFINERY</title>
      <p>This unit is designed to produce finished oil from the base oil treated in units HB3
and HB4 and from imported additives; the base oil is received in tanks TK2501 to TK2506.
Each docking tank stocks a defined grade of oil (SPO, SAE10-30, BS), for a production of
132,000 t/year with an amount of 10% additives. If the type of oil stored in a tank
must be changed, the tank must first be rinsed for hours, which is often avoided. This
unit produces two major families of oil: engine oils, 81% of the production (gasoline, diesel,
transmission oils), and industrial oils (hydraulic (TISK), turbine (Torba), spiral
(Fodda), compressor (Torrada) and various oils). To do this, two methods are used:
continuous mixing (mixing line) and discontinuous mixing (batch) (see Figure 1).
In this article we focus on the mixing line. To produce finished oil, a recipe must be
applied:</p>
      <p>X1% HB1 + X2% HB2 + X3% Additif1</p>
      <p>where Xi is the rate and HBi is the base oil.</p>
      <p>Fig. 1. Unit 3100 model</p>
      <p>The mixing line receives its base oil from the docking tanks, which must fulfil the decade production plan (see Figure 2).</p>
      <p>In this paper, we aim to develop an adaptive control system for Unit3100 which
will dynamically produce efficient scheduling solutions, using resources in an optimal way.
We consider each resource and each oil tank as a decisional entity, and we model them
as agents.</p>
      <sec id="sec-2-1">
        <title>3 STATE OF THE ART</title>
        <p>We conducted a state-of-the-art review of the dynamic scheduling problem in the
literature. This section highlights the studies that reflect our point of view.
In manufacturing control, scheduling is the most important function. In this paper, we
focus on dynamic scheduling.</p>
        <p>
          [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] classified dynamic scheduling into three categories: predictive, proactive,
and reactive. The first, predictive, assumes a deterministic environment. Predictive
solutions call for a priori off-line resource allocation. However, when the
environment is uncertain, some data (e.g., the actual durations) only becomes
available when the solution is being executed. This kind of situation requires either a
proactive or reactive solution. Proactive solutions are certainly able to take
environmental uncertainties into account. They allocate the operations to resources
and define the order of the operations, though, because the durations are uncertain,
without precise starting times. However, such solutions can only be applied when the
durations of the operations are stochastic and the states of the resources are known
perfectly (e.g. stochastic job-shop scheduling) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The third type of dynamic
scheduling, reactive, is also able to deal with environmental uncertainties, but is better
suited for evolving processes.
        </p>
        <p>Reactive solutions call for on-line scheduling of resources. In fact, the resource
allocation process evolves, making more information available and thus allowing
decisions to be made in real-time [16; 11; 5; 1]. Naturally, a reactive solution is not a
simple objective function, but instead a resource allocation policy (i.e., a state-action
mapping) which controls the process. In this paper, we focus exclusively on reactive
solutions.</p>
        <sec id="sec-2-1-1">
          <title>3.2 Reinforcement learning</title>
          <p>Over the last few decades, scheduling researchers, whose methods were based
exclusively on operational research algorithms of exponential complexity, have been
inspired by artificial intelligence. Taking into account both effectiveness
and efficiency, which means optimizing several criteria, increases problem
complexity even more. Artificial intelligence has allowed such complex problems to
be solved, yielding satisfactory, if not always optimal, solutions.</p>
          <p>
            [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] used genetic algorithms (GA) to adapt the decision strategies of autonomous
controllers. Their control agents use pre-assigned decision rules for a limited amount
of time only, and obey a rule replacement policy that propagates the most successful
rules to the subsequent populations of concurrently operating agents. However, GAs
do not provide satisfactory solutions for reactive scheduling. Therefore, a reactive
technique must be integrated into the GA to allow the system to be controlled in real
time.
          </p>
          <p>Reinforcement learning (RL) might be an appropriate way to obtain quasi-real-time
solutions that can be improved over time. Reinforcement learning is learning by trial
and error, dedicated to agent learning. In this paradigm, agents can perceive their
individual states and perform actions for which numerical rewards are given. The goal
of the agents is thus to maximize the total reward they receive over time.</p>
          <p>
            [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] used reinforcement learning to optimize resource use in a very expensive
electric motor production system. Such systems are characterized by a variety of
products that are produced on request, which requires a great deal of flexibility and
adaptability. The assembly units must be autonomous and modular, which makes
performance control and development difficult. [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] considered these units as insect
colonies able to organize themselves to carry out a task. Self-organization can reduce
the number of resources used, allowing production risk problems to be solved more
easily.
          </p>
          <p>
            The most widely used reinforcement learning algorithm is Q-learning. [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] extended this
algorithm by using a reward function based on EMLT (Estimated Mean LaTeness)
scheduling criteria, which are effective though not efficient. [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] proposed an
intelligent agent-based scheduling system. They employed the Q-III algorithm to
dynamically select dispatching rules. Their state determination criteria were the
queue's mean slack time and the machine's buffer size. These authors take advantage
of domain knowledge and experience in the learning process.
          </p>
          <p>In this paper, however, we explore a more developed algorithm, the SARSA
algorithm, in a heterarchical organisation of agents. In short, we experiment with
reinforcement learning using the SARSA algorithm to conceive an
adaptive and reactive manufacturing control system for the petroleum process, based on
a heterarchical multi-agent architecture. In the next section, we present our system
architecture and motivate our choices.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 THE PROPOSED CONTROL SYSTEM</title>
      <p>
        A multi-agent system is a distributed system with localized decision-making and
interaction among agents. An agent is an autonomous entity with its own value
system and the means to communicate with other such entities. For a general survey
of the application of multi-agent systems in manufacturing, see the review by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In
order to develop a multi-agent system with a reactive decision capability in an uncertain
environment, the agents may be modelled as a Markov Decision Process (MDP) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. To
improve the system performance and learn an optimal policy in a Markov environment, if
the transition function T (modelling the system’s evolution from state to state) is
unknown while an objective can be identified, a learn-by-trial process such as RL
[12; 13] can be designed.
      </p>
      <sec id="sec-3-1">
        <title>4.1 The proposed manufacturing control system</title>
        <p>
          We consider that a petroleum refinery exists in a dynamic, uncertain and
unpredictable environment, since it is subject to internal stress (e.g., production risks)
and external constraints (e.g., forced markets, unexpected orders). According to[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ],
the decisions made in such environments involve Markov decision processes (MDP).
Clearly, in such a Markovian context, it is necessary to consider the transition
function T, modelling the system’s evolution from state to state, as an unknown.
According to [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], a learn-by-trial process, such as reinforcement learning,
should be used to determine the optimal policy. This modelling approach is widespread.
Figure 1 shows the main functions embedded in each agent.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2 The SARSA (State, Action, Reward, new State, new Action) algorithm to solve the dynamic scheduling problem</title>
        <p>An MDP is a tuple &lt;S, A, T, R&gt;, where S is a set of problem states, A is a set of
actions, T(s, a, s′) → [0, 1] is a function defining the probability that taking action a in
state s results in a transition to state s′, and R(s, a, s′) → R defines the reward received
after such a transition.</p>
        <sec id="sec-3-2-1">
          <title>RL → improvement of on-line scheduling performances</title>
          <p>If all the parameters of the MDP are known, an optimal policy can be found by
dynamic programming. If T and R are initially unknown (which is commonly the case
when considering industrial case studies), reinforcement learning (RL) methods can
learn an optimal policy by direct interaction with the environment. RL is learning to
act by trial and error. Agents perceive their individual states and perform actions for
which numerical rewards are given. The goal of the agents is thus to maximize the
total reward received over time. This technique is often used in robotics, in order to
teach a robot the behavior to achieve its goals and to overcome obstacles.
The SARSA algorithm is used to learn the function Qπ(s, a), defined as the expected
total discounted return when starting in state s, executing action a and thereafter using
the policy π to choose actions:</p>
          <p>Qπ(s, a) = Σs′ T(s, a, s′) [R(s, a, s′) + γ Qπ(s′, π(s′))]   (1)</p>
          <p>The discount factor γ ∈ [0, 1] determines the relative importance of short-term and
long-term rewards. For each s and a, we store a floating point number Q(s, a) for the
current estimate of Qπ(s, a).</p>
          <p>As experience tuples &lt;s, a, r, s′, a′&gt; are generated through interaction with the
environment, the table of Q-values is updated using the following rule:</p>
          <p>Q(s, a) = (1 − α) Q(s, a) + α (r + γ Q(s′, a′))   (2)</p>
          <p>The learning rate α ∈ [0, 1] determines how much the existing estimate of Qπ(s, a)
contributes to the new estimate.</p>
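          <p>Update rule (2) can be sketched as a few lines of Python; the tabular Q store and the state/action encodings here are illustrative assumptions, not the authors' implementation.</p>

```python
from collections import defaultdict

# Hypothetical tabular SARSA update following Eq. (2):
# Q(s, a) = (1 - alpha) * Q(s, a) + alpha * (r + gamma * Q(s', a'))
ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount factor

# Q[(state, action)] holds the current floating point estimate of Q_pi(s, a);
# unseen pairs default to 0.0
Q = defaultdict(float)

def sarsa_update(s, a, r, s_next, a_next):
    """One SARSA step on the experience tuple (s, a, r, s', a')."""
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * Q[(s_next, a_next)])
```

          <p>Unlike Q-learning, the update uses the action a′ actually chosen in s′, so the learned values follow the policy being executed.</p>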
          <p>If the agent's policy tends towards greedy choices as time passes, the Q(s, a) values
will eventually converge to the optimal value function Q*(s, a). To achieve this, we
use a Boltzmann probability which determines the probability of choosing a random
action.
In our case, this algorithm will make the Resource Agent learn its action policy π,
which in turn makes it able to choose the best action for each state (accept the
task/request, or not). This algorithm works with the following data.
State parameters are the current time t ∈ 0…T; the inventory of pumps p1…pn and
their states Sp1…Spn (e.g., maximum capacity, feeding, receiving); and the list of storage
tanks T1…Tm and their states ST1…STm (e.g., capacity). Actions concern the
reception (or not) of the product, and stopping or starting pumping. The reward function assigns no
reward to most of the states and positive rewards to a specific goal state. For more
precision and to obtain proper convergence, the reward function reflects the state
combination engendered by an action. One idea was to take into account the volume in
the tanks (Ci) and the feeding and unloading streams (Fdi) and (Udi) in the reward function:</p>
          <p>RPart−Ag = 1 if Ci(t) = Cmaxi; 0 if Ci(t) ≥ Cmini; −1 if Ci(t) &lt; Cmini</p>
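          <p>The Boltzmann exploration mentioned above can be sketched as follows; the function name and the way the temperature is supplied are our assumptions for illustration.</p>

```python
import math
import random

def boltzmann_choice(q_values, temperature):
    """Pick an action with probability proportional to exp(Q(s, a) / T).

    A high temperature T gives near-random exploration; lowering T over
    time makes the choice tend towards the greedy action, as required
    for convergence. q_values maps each action to its current Q(s, a).
    """
    actions = list(q_values)
    weights = [math.exp(q_values[a] / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]
```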
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>4.3 Multi-agent interaction</title>
        <p>RResource−Ag = 1 if Σi=1..6 Fdi = 1500 m3/h; −1 if Σi=1..6 Fdi = 0</p>
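        <p>The two reward functions transcribe directly into code; the function names and signatures are ours, and the zero reward for intermediate flows in the resource-agent case is our assumption (that case is not defined above).</p>

```python
def part_agent_reward(level, c_min, c_max):
    """R_Part-Ag: +1 when the tank reaches its maximum level, 0 while it
    stays above the minimum threshold, -1 once it drops below it."""
    if level == c_max:
        return 1
    if level >= c_min:
        return 0
    return -1

def resource_agent_reward(feed_streams, target=1500):
    """R_Resource-Ag over the six feeding streams Fd1..Fd6 (m3/h):
    +1 when the mixing line receives the target flow, -1 when starved.
    Returning 0 for intermediate flows is an assumption of this sketch."""
    total = sum(feed_streams)
    if total == target:
        return 1
    if total == 0:
        return -1
    return 0
```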
        <p>As shown in Figure 3, the MCSR (Manufacturing Control System using
Reinforcement learning) architecture consists of “resource agents” for the pumps,
“parts agents” for the tanks containing oil and an "observer agent" to control the
process.</p>
        <p>Based on Aalaadin modelling (Ferber and Gutknecht, 1998), the resource and parts
agents have certain properties, roles and groups. Initially, each agent must have
knowledge about its properties (e.g., tank number, capacity, characteristics… or pump
reference, flow stream…), its role (i.e., storage or pumping) and its group (e.g., tanks
containing the same product). The observer agent has a global view of the system, and
the state variables that it observes are the performance indicators.</p>
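        <p>Under this properties/role/group reading, each agent's initial knowledge can be sketched as a small record; all field names and sample values below are illustrative assumptions.</p>

```python
from dataclasses import dataclass, field

@dataclass
class RefineryAgent:
    """Minimal agent description in the Aalaadin properties/role/group
    style; field names and sample values are illustrative."""
    name: str
    role: str                 # "storage" or "pumping"
    group: str                # e.g. tanks containing the same product
    properties: dict = field(default_factory=dict)

tank = RefineryAgent("TK2501", role="storage", group="SAE10-30",
                     properties={"capacity_m3": 5000})
pump = RefineryAgent("P3102", role="pumping", group="mixing-line",
                     properties={"flow_m3_per_h": 250})
```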
        <p>Fig. 3. MCSR architecture (Manufacturing Control System using Reinforcement
learning)</p>
        <p>The Observer Agent receives the decade production demand. It sends the relevant
set of tasks to each agent. Each part agent (i.e., tanks containing oil) and resource
agent (i.e., pump) perceives its state which is a combination of its individual state
(e.g., stopped, busy) and the set of tasks that it must execute.</p>
        <p>
          To deal with agent interaction, we used the well-known Contract Net protocol [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
to determine the task allocation to resource agents.
        </p>
        <p>The idea is roughly the following: a part agent has a task request that it proposes to
resource agents, and then the resource agents give their propositions. The part agent
chooses the best proposition and establishes the contract. A detailed illustration of the
agent interaction is provided in figure 4.</p>
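        <p>One such Contract Net round can be sketched as follows; the bid format (a cost, or None for a refusal) and the lowest-cost selection rule are assumptions of this sketch, not details given above.</p>

```python
def contract_net_round(task, resource_agents):
    """One Contract Net round: the part agent announces a task, each
    resource agent bids a cost or declines (None), and the cheapest
    proposition wins the contract."""
    proposals = []
    for agent in resource_agents:
        cost = agent["bid"](task)
        if cost is not None:          # None means refusal (busy, incompatible...)
            proposals.append((cost, agent["name"]))
    if not proposals:
        return None                   # no resource can take the task
    best_cost, winner = min(proposals)
    return winner
```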
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 IMPLEMENTATION AND EXPERIMENTS</title>
      <p>Our model was simulated in the Borland JBuilder environment because of its potential
for facilitating communication and thread programming and because of its
compatibility with the MADKIT platform architecture chosen for MAS development
(see http://www.madkit.org/downloads). One of the advantages of reinforcement
learning algorithms is that they allow evaluation during learning. To permit this
evaluation, we selected the following criteria.</p>
      <sec id="sec-4-1">
        <title>5.1 Description of the process &amp; constraints</title>
        <p>A petroleum refinery is subject to many operational constraints. These include the
requirement that only one tank at a time can receive oil, although several can
simultaneously feed the mixing line, and the requirement that a tank cannot
receive and send oil at the same time. Problem inputs include the base oil arrival
schedule, which describes the volumes and qualities of the base oils and additives that
will be received in the refinery during the desired time horizon; the finished oil
demands; and the current levels and qualities of the base oil in the storage tanks. The
major constraints considered can be formalized as follows (see the parameter definitions
given in 4.2):
C1: the tank storage level can never be less than a given threshold: Ci(t) ≥ Cmini
C2: the tank storage level can never be greater than a given threshold: Ci(t) ≤ Cmaxi
C3: the mixing line must always contain oil: Σi=1..n Fdi(t) &gt; 0
C4: a tank cannot feed and receive at the same time: either Udi(t) &gt; 0 and Fdi(t) = 0, or Udi(t) = 0 and Fdi(t) ≥ 0
The base oil is stored in specific storage tanks (TK2501-TK2506, see Figure 5). The
total time horizon spans 160 hours, during which completely defined oil parcels have
to be received from the pipeline. Six oil tanks are available; all of them have the same
capacity, but different amounts of oil at the beginning of the time horizon (Figure 6).
The aims are to receive all the base oil, using the available pumps to feed tanks with
sufficient capacity, and to produce exactly the requested quantity with the available
quantity of base oils within the decade. For this reason, we consider as an evaluation
criterion the Cmax (the maximum duration required to produce the requested products).</p>
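        <p>Constraints C1-C4 can be checked directly for one time step; the function below is an illustrative sketch with symbols following the definitions in 4.2 (the signature and parameter layout are our assumptions).</p>

```python
def satisfies_constraints(levels, c_min, c_max, feeds, unloads):
    """Check constraints C1-C4 for all tanks at one time step.

    levels[i]  : Ci(t), current volume of tank i
    feeds[i]   : Fdi(t), flow from tank i into the mixing line
    unloads[i] : Udi(t), flow received by tank i from the pipeline
    """
    for ci, fd, ud in zip(levels, feeds, unloads):
        level_ok = ci >= c_min and c_max >= ci   # C1 and C2: level thresholds
        if not level_ok:
            return False
        if ud > 0 and fd > 0:                    # C4: no feeding and receiving at once
            return False
    if not sum(feeds) > 0:                       # C3: mixing line must contain oil
        return False
    return True
```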
      </sec>
      <sec id="sec-4-2">
        <title>5.2 Experimental results</title>
        <p>
          The experiment was conducted as follows: we launched the system with the data explained
above. The graph (Figure 7) shows the results for the first phase of the learning
algorithm. As this graph shows, before 5000 iterations, the Cmax variation is rather
high: it varied in the interval [100 h, 1500 h], which is a modest result. This can be
justified by the fact that the results are from the exploration phase, in which actions
are executed randomly according to the Boltzmann probability [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The second phase
is the exploitation phase, in which the choice of actions is based on Q values (just
before and after 5000 iterations), and the results are better. This phase produced
solutions with a very interesting Cmax of 45 h. Thus, we can state that our system
converges towards optimal solutions by minimizing the total time of production even
with maintenance tasks.
Although the process is kept relatively under control thanks to preventive maintenance plans,
perturbations are always possible in a refinery. To test our system when faced with such
random events, we caused system perturbations in order to observe the system’s
behavior.
        </p>
        <p>We caused the same perturbation (a breakdown of P3102) in the exploration phase at
the 2000th iteration and again in the exploitation phase at the 15000th iteration. When
such perturbations occur in the current system, some production tasks have to be
cancelled to allow the maintenance tasks to be performed. The human expert then has
to manually find a solution to replace the cancelled production tasks. However, in our
experiment, the disturbance in the exploitation phase was quickly compensated for
without the Cmax rising above 49 h, and the system was brought back to the level of
its best performances. These results show that our system is able to learn how to
establish a continuously improving optimal control policy to schedule maintenance
tasks within a production plan without reducing the production rate.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6 CONCLUSION AND FUTURE WORKS</title>
      <p>In this paper, we have presented a multi-agent model for dynamic scheduling in the
petroleum process. In this model, agents simultaneously ensure effective scheduling
and continuous improvement of the solution quality by means of reinforcement
learning, using the SARSA algorithm. We have also provided an overview of the
research done in the field of manufacturing control, focusing on dynamic and reactive
scheduling. The results of our experiments with this model show that our approach
can generate on-line scheduling solutions and improve their quality by minimizing
Cmax. Nevertheless, we want to widen the time horizon of our experimentation,
taking into consideration more complex production units. Finally, we are going to work
on a holonic version of our model for future comparison with the multi-agent model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Aissani. N</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Trentesaux</surname>
            and
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Beldjilali</surname>
          </string-name>
          ,
          <year>2008</year>
          ,
          <article-title>Use of Machine Learning for Continuous improvement of the Real Time Manufacturing control system performances</article-title>
          .
          <source>IJISE: International Journal of Industrial System Engineering</source>
          , Vol
          <volume>3</volume>
          , No 4, p
          <fpage>474</fpage>
          -
          <lpage>497</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Aydin. M. E</surname>
          </string-name>
          ,
          <string-name>
            <surname>Öztemel</surname>
          </string-name>
          . E, (
          <year>2000</year>
          ),
          <article-title>Dynamic job-shop scheduling using reinforcement learning agents</article-title>
          ,
          <source>Robotics and Autonomous Systems</source>
          , Vol
          <volume>33</volume>
          , No 2, p
          <fpage>169</fpage>
          -
          <lpage>178</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Bidot</surname>
            <given-names>J</given-names>
          </string-name>
          , T. Vidal,
          <string-name>
            <given-names>P.</given-names>
            <surname>Laborie</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <year>2007</year>
          , A
          <article-title>General Framework for Scheduling in a Stochastic Environment</article-title>
          .
          <source>Proc International Joint Conference on Artificial Intelligence IJICAI07</source>
          , P.
          <fpage>56</fpage>
          -
          <lpage>61</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Bousbia</surname>
            , S and
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Trentesaux</surname>
          </string-name>
          , (
          <year>2002</year>
          ),
          <article-title>Self-Organization in Distributed Manufacturing Control: state-of-the-art and future trends</article-title>
          ,
          <source>IEEE International conference on Systems, Man &amp; Cybernetics</source>
          , Hammamet, Tunisia, Vol
          <volume>5</volume>
          , 6 p.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Csaji</surname>
            <given-names>B. C</given-names>
          </string-name>
          and Monostori L..
          <year>2006</year>
          .
          <article-title>Adaptive algorithms in distributed resource allocation</article-title>
          .
          <source>Proc of the 6th International Workshop on Emergent Synthesis, August</source>
          <volume>18</volume>
          -19, The University of Tokyo, Japan, p.
          <fpage>69</fpage>
          -
          <lpage>75</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Duffie</surname>
            ,
            <given-names>N.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prabhu</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          (
          <year>1996</year>
          )
          <article-title>'Heterarchical control of highly distributed manufacturing Systems'</article-title>
          ,
          <source>International Journal of Computer Integrated Manufacturing</source>
          , Vol.
          <volume>9</volume>
          , No.
          <volume>4</volume>
          ,
          <issue>1996</issue>
          , p.
          <fpage>270</fpage>
          -
          <lpage>281</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Haruno</surname>
          </string-name>
          . M,
          <string-name>
            <surname>Kawato</surname>
          </string-name>
          . M (
          <year>2006</year>
          ),'
          <article-title>Heterarchical reinforcement-learning model for integration of multiple cortico-striatal loops: fMRI examination in stimulus-actionreward association learning'</article-title>
          ,
          <source>Neural Networks</source>
          , Vol
          <volume>19</volume>
          , (
          <year>2006</year>
          ), p
          <fpage>1242</fpage>
          -
          <lpage>1254</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Katalinic</surname>
          </string-name>
          . B and
          <string-name>
            <surname>Kordic</surname>
          </string-name>
          . V (
          <year>2004</year>
          ) '
          <article-title>Bionic assembly system: concept, structure and function'</article-title>
          <source>Proc of the 5th IDMME</source>
          <year>2004</year>
          ,
          <article-title>Bath</article-title>
          ,
          <string-name>
            <surname>UK</surname>
          </string-name>
          , April 5-
          <issue>7</issue>
          ,
          <fpage>2004</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Maione</surname>
          </string-name>
          . G and Naso. D, (
          <year>2003</year>
          ), '
          <article-title>Discrete-event modeling of heterarchical manufacturing control systems'</article-title>
          ,
          <source>Systems, Man and Cybernetics</source>
          , 2004 IEEE International Conference, Vol
          <volume>2</volume>
          ,
          <fpage>10</fpage>
          -
          <lpage>13</lpage>
          Oct.
          <year>2004</year>
          , p
          <fpage>1783</fpage>
          -
          <lpage>1788</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Prabhu</surname>
            <given-names>V. V.</given-names>
          </string-name>
          (
          <year>2003</year>
          ), '
          <article-title>Stability and Fault Adaptation in Distributed Control of Heterarchical Manufacturing Job Shops</article-title>
          ',
          <source>IEEE Transactions on Robotics and Automation</source>
          , Vol.
          <volume>19</volume>
          , No.
          <issue>1</issue>
          , p.
          <fpage>142</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Pujo</surname>
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Brun-Picard</surname>
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2002</year>
          ), '
          <article-title>Pilotage sans plan prévisionnel ni ordonnancement préalable</article-title>
          ', in
          <source>Méthodes du pilotage des systèmes de production</source>
          , Hermès, p.
          <fpage>129</fpage>
          -
          <lpage>162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Russell</surname>
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Norvig</surname>
            <given-names>P.</given-names>
          </string-name>
          (
          <year>1995</year>
          ), '
          <article-title>Artificial Intelligence: A Modern Approach</article-title>
          ' (The Intelligent Agent Book),
          <source>Prentice Hall Series in Artificial Intelligence</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Singh</surname>
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sutton</surname>
            <given-names>R.</given-names>
          </string-name>
          (
          <year>1996</year>
          ), '
          <article-title>Reinforcement learning with replacing eligibility traces</article-title>
          ',
          <source>Machine Learning</source>
          , Vol.
          <volume>22</volume>
          , No.
          <issue>1-3</issue>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Smith</surname>
            <given-names>R. G.</given-names>
          </string-name>
          (
          <year>1980</year>
          ), '
          <article-title>The Contract Net Protocol: High-Level Communication and Control in a Distributed Problem Solver</article-title>
          ',
          <source>IEEE Transactions on Computers</source>
          , Vol. C-
          <volume>29</volume>
          , No.
          <issue>12</issue>
          , p.
          <fpage>1104</fpage>
          -
          <lpage>1113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Trentesaux</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dindeleux</surname>
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Tahon</surname>
            <given-names>C.</given-names>
          </string-name>
          (
          <year>1998</year>
          ), '
          <article-title>A Multicriteria Decision Support System for Dynamic Task Allocation in a Distributed Production Activity Control Structure</article-title>
          ',
          <source>International Journal of Computer Integrated Manufacturing</source>
          , Vol.
          <volume>11</volume>
          , No.
          <issue>1</issue>
          , p.
          <fpage>3</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Trentesaux</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gzara</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hammadi</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tahon</surname>
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Borne</surname>
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2001</year>
          ), '
          <article-title>D-Sign: un cadre méthodologique pour l'ordonnancement décentralisé et réactif</article-title>
          ',
          <source>Journal Européen des Systèmes Automatisés</source>
          , p.
          <fpage>933</fpage>
          -
          <lpage>962</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Trentesaux</surname>
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2007</year>
          ), '
          <article-title>Les systèmes de pilotage hétérarchiques : innovations réelles ou modèles stériles ?</article-title>
          ',
          <source>Journal Européen des Systèmes Automatisés</source>
          , Vol.
          <volume>41</volume>
          , No.
          <issue>9-10</issue>
          , p.
          <fpage>1165</fpage>
          -
          <lpage>1202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Wei</surname>
            <given-names>Y.-Z.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zhao</surname>
            <given-names>M.-Y.</given-names>
          </string-name>
          (
          <year>2005</year>
          ), '
          <article-title>A reinforcement learning-based approach to dynamic job-shop scheduling</article-title>
          ',
          <source>Acta Automatica Sinica</source>
          , Vol.
          <volume>31</volume>
          , No.
          <issue>5</issue>
          , p.
          <fpage>765</fpage>
          -
          <lpage>771</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>