<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ADS2: Anytime Distributed Supervision of Distributed Systems that Face Unreliable or Costly Communication</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cédric Herpson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincent Corruble</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amal El Fallah Seghrouchni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sorbonne Universités</institution>
          ,
          <addr-line>UPMC Univ Paris 06, UMR 7606, LIP6, F-75005, Paris</addr-line>
          ,
          <country country="FR">France</country>
          <addr-line>CNRS, UMR 7606, LIP6, F-75005, Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2003</year>
      </pub-date>
      <fpage>151</fpage>
      <lpage>158</lpage>
      <abstract>
        <p>The purpose of a supervision system is to detect, identify and repair any fault that may occur in the system it supervises. Nowadays, industrial processes are mostly distributed, while their supervision systems remain centralized. Consequently, when communications are disrupted, the supervision process slows down or stops. Increasing production rates make this dependence on the state of the communications no longer acceptable. To allow the anytime supervision of such systems, we propose a distributed approach based on a multi-agent system where each supervision agent autonomously handles both diagnosis and repair on a given location. This degree of delegation, considered neither in the literature nor in industry outside of theoretical frameworks, requires overcoming several difficulties: How can one agent autonomously make a diagnosis with dynamically arriving information? How can several agents coordinate and reach a consensus on a given diagnosis or repair with asynchronous communication? Finally, how can a human be led to trust the decisions of such a system? This paper develops our proposal along these three axes and evaluates ADS2 on an industrial case study. Experiments demonstrate the relevance of our approach, with an overall reduction of the supervised system's down-time of 34%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Supervision systems were initially monitoring tools whose
role was limited to collecting and displaying information for
interpretation and use by a human expert. Today, the
advent of complex and physically distributed systems leads to
a semantic shift from supervision tools to supervision
systems. Indeed, as the complexity of systems increases,
humans can no longer process the flow of information arriving
at each instant. The need to minimize down-time and
to improve system effectiveness requires delegating
some of the decision-making power of the human
supervisor to the supervision system. This requirement has led to
the (re)birth of a research community around the notions of
autonomic computing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and self-* systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Our work
lies within this context.
      </p>
      <p>Within the Dem@tFactory project, our objective is thus
to improve the supervision of an existing digitizing chain
distributed over several sites (see Fig. 1). Different faults,
single or multiple, can occur and alter or prevent the
processing of documents (e.g. a scanner stops working,
a disruption of the connection between sites halts
or corrupts a data transfer, a poorly configured OCR software
generates unexploitable results, etc.).</p>
      <p>Centralized supervision systems are currently the most
common in industry. However, they do not perform well
in asynchronous contexts. Indeed, communication
malfunctions between the supervision system and the geographically
distributed regions of the supervised system delay repairs
and prevent a quick return to normal operation, even though
a number of malfunctions have local predefined repair
procedures available. The unbounded communication time
between the supervision system and the supervised system is the
main cause of this problem.</p>
      <p>To overcome this lack of robustness in the face of
unreliable communications, and to reduce the supervised system's
down-time, we present in this article ADS2: a multi-agent
architecture where each supervision agent autonomously
handles both diagnosis and repair on a given location. The
proposed architecture is composed of three mechanisms: a
decision mechanism, a coordination and consistency recovery
mechanism, and an intertwining mechanism. (Dem@tFactory is a
project of the French R&amp;D initiative Cap Digital, federating
4 industrial partners and 3 laboratories.) The
decision mechanism tackles the dynamicity of the information
available to an agent making a diagnosis. The
coordination and consistency mechanism deals with the problem
of reaching a consensus between several agents on a global
diagnosis (or repair) in a context of asynchronous
communications. Finally, the intertwining mechanism addresses the
problem of the size of the search space in a multiple-fault
context.</p>
      <p>In this article, we first present our fault and repair model
and the various assumptions made in section 2. We then
describe the three mechanisms of our multi-agent architecture
in sections 3 to 5. We then demonstrate the viability of our
proposal with experiments in section 6. Finally, we discuss
related work in section 7 before concluding.</p>
    </sec>
    <sec id="sec-3">
      <title>2 A Multi-Agent Architecture for the Supervision of Distributed Systems</title>
      <p>Our architecture comes within the scope of fault-model-based
approaches (no model of the system's correct behaviour is
available; the system can only use fault models, known a
priori or dynamically learned from observations of the system)
with spatially distributed knowledge.
The supervision process is distributed among several
autonomous agents, each having a local view of the system to
be supervised and endowed with diagnosis and repair
capabilities. The supervised system is partitioned into regions,
each supervised by one agent. As illustrated in Fig.
2, the supervision agents (Ai) exchange information in order
to establish a diagnosis and a repair consistent with the
observations (Oj) they get from the various units of the
supervised system (Uk). The links between the square units
represent the standard workflow of the supervised system. The
dashed arrows represent the fact that some elements may be
reprocessed if their quality is not sufficient. The arrows
between the units and the agents represent the communication
links used to transmit alarm logs. The remaining links
represent the communications between the supervision agents.
We consider that communications are asynchronous and that
there is no upper bound on transmission delay. We assume
that the messages exchanged between supervised units may
be lost or corrupted, and that some units are not supervised
(e.g. unit U2 in Fig. 2). This assumption is based on the
fact that a complex industrial process commonly involves
different actors that do not share their supervision
information (subcontractors, in the case of the Dem@tFactory
project).</p>
      <p>Moreover, we assume that the observations and the
messages between agents can be lost but not corrupted. The
agents are assumed to be reliable (no Byzantine behaviour).
Finally, we consider that the simultaneous occurrence of
different faults does not result in the masking of
observables.</p>
      <sec id="sec-3-1">
        <title>2.2 Fault model and repair plan</title>
        <p>Let F be the set of known faults of a system S and R be the
set of existing repair plans. The signature of a fault f is a
sequence of observable events generated by the occurrence
of f . The set of signatures of a given fault f is Sig(f ).</p>
        <p>To be able to represent any temporal dependencies, each
fault is modeled as a t-temporised Petri net (Fig. 3). Each
fault is supposed to be repairable, that is to say that there
exists at least one partially ordered sequence of atomic repairs
rk that repairs it (a repair plan).</p>
        <p>The supervised system is partitioned into regions rgj.
Each supervision agent is associated with one unique region
and knows the models of the faults that may occur in the
region it oversees. However, a fault can cover several regions.
In that case, an agent only knows the part of the model that
concerns its region. Its model is completed with the names
of the agents responsible for the other regions. This
hypothesis allows modeling workflows involving different actors
that do not share their data.</p>
        <p>For example, consider a fault f spanning two regions rgb and rgc,
with global signature Sig(f) = {o1^rgb o2^rgb o3^rgc o4^rgc}. Each agent
holds only its local part of the signature, completed with the name of
the agent responsible for the other region:
SigArgb(f) = o1 o2 Argc and SigArgc(f) = Argb o3 o4.</p>
        <p>Besides obtaining the fault models, the issue of defining a
global precedence relation between events that occur within
the supervised system remains. Indeed, there is no clock
common to the different regions. It is therefore necessary to
add to each agent a stamping mechanism that recreates
this order relation. We will not detail here the concept
of distributed clocks. We consider in the following that the
agents are able to recreate this partial-order relation.</p>
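        <p>As an illustrative sketch (not the paper's actual mechanism, which is left undetailed), such a stamping mechanism can be realised with Lamport-style logical clocks: each agent timestamps its events, and message receptions fast-forward the local clock, which recreates a partial order without any common physical clock.</p>

```python
class StampingAgent:
    """Hypothetical agent-side stamping mechanism based on a Lamport clock."""

    def __init__(self, name):
        self.name = name
        self.clock = 0

    def observe(self, label):
        # Every local observation gets a fresh logical timestamp.
        self.clock += 1
        return (self.clock, self.name, label)

    def send(self, label):
        # A message carries the sender's timestamp.
        self.clock += 1
        return (self.clock, self.name, label)

    def receive(self, stamped):
        # On reception, jump past the sender's clock so that the
        # send is ordered before the receive.
        self.clock = max(self.clock, stamped[0]) + 1
        return (self.clock, self.name, "recv:" + stamped[2])

a_rgb, a_rgc = StampingAgent("Argb"), StampingAgent("Argc")
msg = a_rgb.send("o2")
evt = a_rgc.receive(msg)
assert msg[0] < evt[0]  # the causal order send -> receive is preserved
```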
      </sec>
      <sec id="sec-3-2">
        <title>2.3 Diagnosis and multiple faults</title>
        <p>During the period of time [t − Δt, t], agent Ai collects
a sequence of observations seqObsAi(t, Δt) generated by
the occurrence of faults on the system. However, agent Ai
does not know which faults have occurred. It thus
analyses seqObs in order to determine the set of all faults
fpAi(t, Δt) whose signatures partially or totally match
elements of seqObs. A diagnosis dg is a set of faults that can
explain seqObs. Dg is the set of all possible diagnoses of
seqObs.</p>
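        <p>The matching of seqObs against fault signatures can be sketched as follows. This is an illustrative implementation: the list-of-events representation of signatures and the subsequence-matching policy are assumptions, not the paper's exact definitions.</p>

```python
def matches(signature, observed):
    """Classify how one fault signature matches the observed sequence:
    'total' if the whole signature occurs as a subsequence of the
    observations, 'partial' if only a prefix of it does."""
    i = 0
    for o in observed:
        if i < len(signature) and signature[i] == o:
            i += 1
    if i == len(signature):
        return "total"
    return "partial" if i > 0 else "none"

def candidate_faults(fault_signatures, seq_obs):
    """fpAi(t, Δt): the faults whose signatures partially or totally
    match elements of seqObs."""
    return {f for f, sigs in fault_signatures.items()
            if any(matches(s, seq_obs) != "none" for s in sigs)}

sigs = {"f1": [["a"]], "f2": [["a"]], "f3": [["b", "c"]]}
print(candidate_faults(sigs, ["a", "b"]))  # f1 and f2 match totally, f3 partially
```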
      </sec>
      <sec id="sec-3-3">
        <title>2.4 Fault cost and repair cost</title>
        <p>Finally, each fault f (respectively, each repair plan rp(f))
is associated with a malfunction cost which depends on
the fault duration, Ctdysf(f, t) (resp. an execution cost
CtEx(rp(f))). The cost of a diagnosis dg for the
supervision system is the aggregation of the respective
costs of the faults that compose it. In the general case:
Ctdysf(dgi, t) = Aggreg_{fj∈dgi}(Ctdysf(fj, t))   (1)</p>
        <p>Similarly, the execution cost of a repair plan rp
associated with a given diagnosis depends on the aggregation of the
respective costs of the repairs that compose it. Thus, in the
case where the repair plan depends directly on the faults:
CtEx(rp(dgi)) = Aggreg_{fj∈dgi}(CtEx(rp(fj)))   (2)</p>
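        <p>Equations (1) and (2) can be read as follows, taking summation as the aggregation operator (the additivity assumption used later in the experiments); the cost functions themselves are hypothetical.</p>

```python
def ct_dysf_diag(diagnosis, t, ct_dysf):
    # Equation (1): aggregate the malfunction costs of the faults
    # composing the diagnosis (here, Aggreg = sum).
    return sum(ct_dysf(f, t) for f in diagnosis)

def ct_ex_repair(diagnosis, ct_ex_fault):
    # Equation (2): aggregate the execution costs of the per-fault
    # repair plans.
    return sum(ct_ex_fault(f) for f in diagnosis)

# Hypothetical costs: f1's malfunction cost grows linearly with its
# duration, f2's is constant.
dysf = {"f1": lambda t: 2 * t, "f2": lambda t: 5}
diag = ["f1", "f2"]
print(ct_dysf_diag(diag, 3, lambda f, t: dysf[f](t)))  # 2*3 + 5 = 11
print(ct_ex_repair(diag, lambda f: 4))                 # 4 + 4 = 8
```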
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 Agent Decision Model</title>
      <p>We consider highly dynamic systems. Consequently, the
information available to an agent at a given time can be
insufficient to determine with certainty which action to
select. A supervision agent thus has to determine the
optimal decision (Dopt) between the immediate triggering
of the plan made under uncertainty (Dimmopt) and a
delayed action (Ddelayopt), which lets it wait and
communicate with other supervision agents during k time
steps. This waiting time can yield information that reduces
uncertainty and thereby improves decision-making. The
counterpart is that the elapsed time may have a significant
negative impact on the system. The expected potential gain
in accuracy must be balanced against the risks taken.</p>
      <p>Let Ct(x) be the cost of an action x and Ctwait(k) the cost
of the extra time k spent before selecting a repair plan. The
decision-making process of each supervision agent works as
follows:
1. Observation gathering.
2. Computation of the different sets of faults that can
explain the current observations: Dg (the set of diagnoses).
3. Determination of the immediate repair Dimmopt
based on the available information and on the constraints
we chose to focus on (most probable explanation, law
of parsimony, worst case, ...), and computation of its
estimated cost Ct(Dimmopt).
4. At time t, an agent knows the set of faults that may
be occurring in the region it supervises, fpAi(t, Δt).
Knowing their signatures, the agent is able to predict,
for each fault of fpAi(t, Δt), the set of observables
that can be expected to appear during the time interval
[t, t + k], with k an a priori fixed parameter. The agent
uses this information to compute the waiting cost
Ctwait(k), the expected potential gain of a delayed
repair Ddelayopt and its associated cost Ct(Ddelayopt).
5. Choice between the immediate repair Dimmopt and
the delayed repair Ddelayopt.</p>
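      <p>Step 5 reduces to a cost comparison. A minimal sketch, with all names and numeric values illustrative:</p>

```python
def choose_repair(ct_imm, ct_delay, ct_wait):
    """Step 5: keep the immediate repair unless waiting is expected to
    pay off, i.e. the delayed repair's cost plus the waiting cost is
    strictly lower."""
    return "immediate" if ct_imm <= ct_delay + ct_wait else "delayed"

# Waiting k units would refine the diagnosis (cheaper repair)...
print(choose_repair(ct_imm=100, ct_delay=40, ct_wait=30))  # delayed
# ...but not when the malfunction cost of waiting dominates.
print(choose_repair(ct_imm=100, ct_delay=40, ct_wait=90))  # immediate
```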
      <p>This algorithm is executed at each time-step by each
agent when faults occur. The value k represents an upper
bound on the delay, as an agent's decision is updated each time an
observation is received. We detail in the following
subsections steps 3 and 4, relative to the determination of the
immediate and delayed repairs and of their respective costs.</p>
      <sec id="sec-4-1">
        <title>3.1 Immediate repair Dimmopt</title>
        <p>Knowledge of the different fault signatures allows
us to establish a list of potential diagnoses Dg. We sort
these explanations according to the available information and to
the constraints we chose to focus on (e.g. the most probable
explanation). After this step, the first element of Dg is the
diagnosis considered the most relevant at the current time.
It is then necessary to estimate its cost.</p>
        <p>The cost of the immediate repair Ct(Dimmopt) must take
into account the execution cost of the repair plan associated
with the retained diagnosis (CtEx, equation 2), as well as a
cost representing the potential error attached to this
decision, CtErr. Indeed, if the only cost considered were that
of executing the repair plan, the final decision (step
5) would always favour an immediate action over a
delayed one, due to the additional waiting cost of the delayed
action.</p>
        <p>Ct(rp(dgi)) = CtEx(rp(dgi)) + CtErr(dgi|Dg\{dgi})
(3)</p>
        <p>The computation of the error cost CtErr relies on
our assumption that the correct diagnosis, and thus the
correct repair, belongs to the sorted list Dg of
potential diagnoses. Thus, in case of misdiagnosis when
selecting the first diagnosis dg1 of Dg, the system
loses a time equal to the execution time of the first
repair plan (CtExecTime(rp(dg1))), which is
supplemented by the execution cost of the newly chosen repair
plan (CtEx(rp(dg2))) associated with the second diagnosis of
Dg. As this second choice may also turn out to be an
error, we define CtErr recursively on Dg. Thus:</p>
        <p>CtErr(dg1|[]) = 0 // Dg is empty, the diagnosis is correct</p>
        <p>CtErr(dg1|Dg\{dg1}) = P(dg1|Dg\{dg1}) ×
[CtExecTime(rp(dg1)) + CtEx(rp(dg2)) + CtErr(dg2|Dg\{dg1, dg2})],
with P(dg1|Dg\{dg1}) the probability that choosing dg1
as the final diagnosis is an error.</p>
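        <p>The recursion above translates directly into code; the probabilities and costs below are hypothetical placeholders.</p>

```python
def ct_err(dgs, p_err, exec_time, exec_cost):
    """Recursive error cost over the sorted diagnosis list Dg.
    dgs: diagnoses sorted from most to least relevant.
    p_err(dg, rest): probability that choosing dg is an error."""
    if len(dgs) <= 1:
        return 0.0          # Dg exhausted: the diagnosis is correct
    dg1, dg2, rest = dgs[0], dgs[1], dgs[1:]
    return p_err(dg1, rest) * (
        exec_time(dg1)      # time lost executing the wrong plan
        + exec_cost(dg2)    # cost of the next candidate's plan
        + ct_err(rest, p_err, exec_time, exec_cost))

# Hypothetical numbers: two candidate diagnoses, 30% chance dg1 is wrong.
cost = ct_err(["dg1", "dg2"],
              p_err=lambda dg, rest: 0.3,
              exec_time=lambda dg: 10.0,
              exec_cost=lambda dg: 4.0)
print(cost)  # 0.3 * (10 + 4)
```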
      </sec>
      <sec id="sec-4-2">
        <title>3.2 Delayed repair Ddelayopt</title>
        <p>At time t, an agent knows the set of faults that may
be occurring in the region it supervises, fpAi(t, Δt). The
different fault models are represented using t-temporised
Petri nets (Fig. 3, page 2). The agent is thus able to predict,
for each fault of fpAi(t, Δt), the set of observables that
can be expected to appear during the time interval [t, t + k],
with k an a priori fixed parameter. Note that the agent uses
the current transmission duration (computed over the
interval [t − Δt, t]) to determine the set of potential observations.</p>
        <p>From this information, the agent builds the tree
representing the set of all possible futures up to the
current time plus k units of time, ArbAipossibles(k). Each node
of the tree is associated with a set of observations and
represents one possible future (Fig. 4). The agent then
computes, for each node of the tree, the set of diagnoses
that explain this future (Dg′).</p>
        <p>The agent can then compute, for each possible future,
the immediate decision considered optimal. At time
t, the determination of the delayed decision with horizon
k (Ddelayopt) involves choosing between the various
possible situations. This choice is realised by sorting
the first elements of each Dg′ of the tree of possible
futures against each other, using the same criterion as the
one used to identify the immediate repair in subsection 3.1.
(Fig. 4: tree of possible futures rooted at fpAi(t, Δt), whose
branches over the predicted observables ∅, o1 and o2 lead to
the diagnosis sets Dg′1 to Dg′4.)</p>
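        <p>One simple reading of the construction of the tree of possible futures treats each node as a subset of the observables predicted over [t, t + k]; this is an illustrative simplification that ignores arrival order.</p>

```python
from itertools import combinations

def possible_futures(predicted_obs):
    """Enumerate the possible futures at horizon k: every subset of the
    observables predicted (from the fault models) to appear in [t, t+k].
    Each node of the tree in Fig. 4 corresponds to one such set."""
    futures = []
    for r in range(len(predicted_obs) + 1):
        for combo in combinations(predicted_obs, r):
            futures.append(set(combo))
    return futures

# Two predicted observables give four futures: {}, {o1}, {o2}, {o1, o2}.
print(possible_futures(["o1", "o2"]))
```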
        <p>Once the delayed decision is identified, its cost
Ct(Ddelayopt) is established using equation (3). We then
add to this cost the waiting cost Ctwait, which
represents the consequences of the faults on the
supervised system during the time in which no action is triggered.
The computation of the waiting cost depends on the
respective costs of the malfunctions associated with the remaining
diagnoses and on the elapsed time.</p>
        <p>Ctwait(k) = Aggreg_{dgi∈Dg}(Ctdysf(dgi, k))   (4)</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4 Distributed Supervision and System Consistency</title>
      <p>In the previous section, we addressed the problem of one
agent making a decision. However, as each agent has a
local view of the system, a decision about a diagnosis and/or a
repair frequently requires information and knowledge from
other supervision agents. It therefore becomes necessary to
reach a consensus on the decision to make.</p>
      <p>However, distributed supervision operates in a context of
asynchronous and unbounded communication. Under these
constraints, the Fischer-Lynch-Paterson theorem [3] states
the impossibility of guaranteeing that a consensus between
different components will be reached.</p>
      <p>
        To circumvent this difficulty, the literature on supervision
frequently introduces hypotheses on the quality of the
communications. As our work aims at real-life
hypotheses, we make none regarding the (un)reliability
of the communications. We discuss in this section the use of
the multi-Paxos algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to reach a consensus when the
state of the communications allows it, and propose a
consistency mechanism to restore a common view of the system
by the agents after a unilateral decision taken by some of
them.
      </p>
      <sec id="sec-6-1">
        <title>4.1 Consensus algorithm</title>
        <p>In the general case, establishing a consensus must meet the
following properties: (Agreement) all correct processes
decide the same value. (Integrity) every process decides at
most once. (Validity) each decided value belongs to the
set of proposed values. (Termination) every correct process
eventually decides in finite time. However, under the
Fischer-Lynch-Paterson theorem, the supervision system
can only offer a “best-effort” guarantee, i.e., assure that
the consensus can be reached, but only if the system is stable
for a sufficiently long period of time [5].</p>
        <p>The multi-Paxos algorithm, initially developed for
reaching an agreement in a network of unreliable
processors, falls into this category. The interesting aspect of this
algorithm is that it was designed to resist halt failures
(with the possibility of recovery) of a number of processes,
including the coordinator. Its very small number of assumptions
makes it operational in an environment with unreliable
communications. These properties make it particularly
suited to multi-agent systems. Using multi-Paxos, each
agent is able to initiate, join or leave a coalition.</p>
        <p>The fact that there is no upper bound on the time needed
to reach a consensus will inexorably lead to
unilateral decision-making by agents or agent groups in case of
communication disruption. This feature of our system
guarantees the avoidance of deadlock situations when
communications are too unstable to let the agents reach a consensus.
However, this ability requires the introduction of an
algorithm to restore a consistent view of the system state by all
agents.</p>
      </sec>
      <sec id="sec-6-2">
        <title>4.2 System consistency</title>
        <p>Algorithm 1 works in a producer-consumer manner
with the decision-making process introduced in section 3.
The two algorithms share, within an agent, a common
inconsistency queue Finc. When at least one agent leaves a
coalition before a consensus is reached (due to a
communication breakdown or to an agent's decision), the members of
the coalition store their respective decision-making contexts
(the current sequence of observables, the set of considered
explanations and the list of agents belonging to the
coalition) in their own potential-inconsistency queue Finc.</p>
        <p>The consistency maintenance algorithm is available
within each agent as a behaviour; it continuously observes
the state of the queue Finc. When an entry is added to Finc,
the algorithm is automatically triggered.</p>
        <sec id="sec-6-2-1">
          <title>Algorithm 1 Check consistency</title>
          <p>Require: Pattern observer on Finc
1: if Finc ≠ ∅ then
2:   Try to contact Finc.getFirst().getCoalition()
3:   if contact successful then
4:     Send Finc.getFirst()
5:     Receive the other agents' decision contexts
6:     Make a pairing between the local decision context and the others
7:     if the pairing is ok then
8:       Finc.removeFirst()
9:     else
10:      Start a new Paxos instance
11:    end if
12:   end if
13: end if</p>
          <p>This algorithm lets each agent find a match between its
actions and those selected by the other members of the
coalition. Thus, in case of faults due to past inconsistent
decisions taken by the agents, they are able to trigger a
sequential diagnosis and to discriminate initial disturbances from
the consequences of their decisions.</p>
          <p>When the potential-inconsistency queue of an agent
contains an element, the agent tries to resolve it. The agent
tries to contact each of the agents of the coalition concerned
by this potential inconsistency Pinc. If these agents are
able to communicate (the communications are restored),
they exchange their respective decision-making
contexts. By comparing them, they are able to determine
whether the decisions made locally by the different groups
of agents are consistent with one another. If this is the case
(the faults are repaired, the system is in a stable and
correct state), each agent removes Pinc from the queue. Otherwise,
the subset of agents involved initiates a new coalition in
order to resynchronize their respective views of the system and
make a decision consistent at the system's scale. If
communications are too unstable (or too costly), this consensus will
not be reached, which results in a new entry being added to Finc.</p>
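          <p>Algorithm 1, with its communication primitives stubbed out as injected callables, can be rendered executable as follows (an illustrative sketch; the primitive names are assumptions):</p>

```python
from collections import deque

def check_consistency(f_inc, contact, exchange, pairing_ok, start_paxos):
    """Executable sketch of Algorithm 1. The communication primitives
    (contact, exchange, pairing_ok, start_paxos) are injected stubs."""
    if not f_inc:                      # line 1: Finc != empty ?
        return "empty"
    p_inc = f_inc[0]                   # oldest potential inconsistency
    coalition = p_inc["coalition"]
    if not contact(coalition):         # lines 2-3
        return "unreachable"           # retried when Finc is observed again
    others = exchange(p_inc)           # lines 4-5: swap decision contexts
    if pairing_ok(p_inc, others):      # lines 6-8: contexts are consistent
        f_inc.popleft()
        return "resolved"
    start_paxos(coalition)             # line 10: resynchronize views
    return "paxos"

f_inc = deque([{"coalition": ["A1", "A2"], "context": "obs"}])
res = check_consistency(f_inc,
                        contact=lambda c: True,
                        exchange=lambda p: ["ctx"],
                        pairing_ok=lambda p, o: True,
                        start_paxos=lambda c: None)
print(res)  # resolved
```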
          <p>Restoring the consistency of the system state as
perceived by the supervision agents thus again relies on the
stability of the communication links for a sufficient amount of
time.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5 Intertwining Diagnosis and Repair Stages</title>
      <p>In the previous sections, we endowed the supervision agents
with decision-making and coordination mechanisms. These
abilities allow the agents to dynamically adapt their
behaviours to the current state of the communications and of
the supervised system. In case of uncertainty regarding the
decision to make, the agents are thus able to explore the
solution space, collectively as well as individually. However,
the large size of this space remains a problem. Indeed, it is both
a source of misdiagnosis in case of local decision-making
and the cause of a large number of supervision messages
when a consensus must be reached. In order to reduce the
complexity of the decision process, we address in this section
the question of obtaining the minimal set of diagnoses and
of associated repair plans. To this aim, we discuss the idea
of intertwining the diagnosis and repair stages.</p>
      <p>
        This idea was introduced by Cordier et al. [6] in the
formalization of self-healing systems [
        <xref ref-type="bibr" rid="ref8">7</xref>
        ]. Several failures
may indeed have the same signature without calling into
question the repairability of the system; all that is needed
is that a repair be common to all of the faults involved
(the notion of macrofault).
      </p>
      <p>However, being restricted to the single-fault context, this
formal model defines the diagnosability and repairability of
a system as static properties that can be computed offline.
This is not the case in the multiple-fault context. Indeed,
the appearance of a fault can prevent the triggering of a
repair associated with another fault currently occurring in
the system, and the possible situations are endless. Being
able to represent this kind of interference is essential to our
work. This led us to introduce context-dependent notions
of diagnosability and repairability.</p>
      <sec id="sec-7-1">
        <title>Definition 1 (Conditional Diagnosability).</title>
        <p>Diagnosable(fi, t) ⇐⇒ ∀x ∈ CD(fi), x ∉ ss(t) ⇐⇒ CtD(fi) = ∅,
where ss(t) denotes the set of faults present in the system at time t.</p>
        <p>A fault f is diagnosable at time t if none of the faults that
may prevent its diagnosability (e.g. if they share the same
signature) is present in the system at this instant. This
set of faults is the conflict set in diagnosis of the fault f
(denoted CtD(f)). Following the same reasoning, we can
define CtR(f) as the conflict set in repair of the fault f.</p>
        <p>Finally, the uncertainty regarding the faults
currently occurring on the supervised system may lead
the supervision agents to “believe” in the occurrence of
faults which are not real and which prevent the repair of
the system. We call these situations virtual deadlocks.
To disambiguate these situations, we add a relationship
of innocuousness I to these definitions. Thus, for a fault
f, a repair plan r(f), its repair conflict set CR(f) and
the current state of the system, I returns the set
of sets of faults belonging to CtR on which the execution of
repair plan r(f) leaves the system unchanged. The result
of this innocuousness relation is the set of disambiguations
under repair, denoted DR(f). Taking all
this information into account, we are then able to propose an algorithm
to plan the order of the repairs and to resolve some conflicts.
We illustrate how it operates below.
Example: Let F = {f1, f2, f3} with Sig(f1) =
Sig(f2) = {a} and Sig(f3) = {b}. Rp(f1) = {r1},
Rp(f2) = {r2} and Rp(f3) = {r3}. Moreover, we know
that CR(f1) = {{f3}} and that DR(f2) = {{f2}, {f1}}.
As the signatures of f1 and f2 are identical, it follows that
CD(f1) = {{f2}} and CD(f2) = {{f1}}. We assume that
an agent detects the observables a ∧ b.</p>
        <p>(Fig. 6: the macrofault lattice for the example, with top element F
({a, b}), macrofaults MF13 and MF23 ({a, b}) and MF12 ({a}), the single
faults f1 and f2 ({a}) and f3 ({b}), and bottom element ∅ ({ok}).)</p>
        <p>In Fig. 6, the initialization of the algorithm determines, for
each potential fault fi, all the repair conflicts existing at the
current time, CtR(fi). The planning phase recursively builds the
repair order from CtD and CtR, adding the faults whose
conflict sets are empty and then updating the remaining ones.
At the end of this phase, if some faults remain, they are
potentially in a deadlock. In our example, the agent has to
choose between {f2, f3} and {f1, f3}. As highlighted in
Fig. 5, the agent can repair f3 but is unable to
distinguish between f1 and f2 (we assume that this conflict is
virtual and that only one of these faults is occurring).</p>
        <p>The disambiguation phase then attempts, using the
proven innocuousness of some repairs in the current
context, to solve these conflicts. If one of them is solved, the
planning phase is retried after updating the conflict sets. If
the disambiguation does not work, it means that the agent
does not have, at the current time, enough information to
solve the problem. The decision-making process
introduced in section 3 is then triggered. In our example,
repair r2 is selected. If the system returns to normal, then
both the diagnosis and repair phases end. If not, the previous
action guarantees the occurrence of f1, and the associated
repair plan is executed.</p>
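        <p>The planning phase can be sketched as follows. For compactness, the diagnosis and repair conflict sets (CtD and CtR) of the example are merged into a single per-fault conflict set, which is a simplification of the paper's algorithm:</p>

```python
def plan_repairs(faults, conflicts):
    """Sketch of the planning phase: repeatedly schedule the faults whose
    current conflict sets are empty, then remove them from the remaining
    conflict sets; faults left over are potential deadlocks."""
    remaining = {f: {c for cs in conflicts.get(f, []) for c in cs}
                 for f in faults}
    order, blocked = [], set(faults)
    progress = True
    while progress:
        progress = False
        for f in sorted(blocked):
            if not remaining[f]:
                order.append(f)          # f's conflict set is empty: repair it
                blocked.discard(f)
                for g in blocked:        # update the remaining conflict sets
                    remaining[g].discard(f)
                progress = True
                break
    return order, blocked

# Section 5's example: f3 has no conflict; f1 and f2 share signature {a}.
conf = {"f1": [{"f3"}, {"f2"}], "f2": [{"f1"}], "f3": []}
order, deadlock = plan_repairs(["f1", "f2", "f3"], conf)
print(order, deadlock)  # f3 is scheduled; {f1, f2} is the virtual deadlock
```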
      </sec>
    </sec>
    <sec id="sec-8">
      <title>6 Experimental Evaluation</title>
      <p>To evaluate our approach we developed a simulator for
distributed systems. Based on the JADE multi-agent platform
[8], our environment allows us to model both physical units
and communication links, and to simulate the occurrence
of failures in them. For a given simulation, a list of faults is
associated with each site and each communication link (a
communication link can increase its transmission time, a
unit may stop working properly, ...). Each fault is associated
with one or more trigger conditions: a date and/or the
occurrence of another failure. This allows us to simulate
cascading faults. Our supervision system is deployed
on this simulator. When the simulation runs, some
faults trigger the sending of an alarm message to the agent
responsible for the site where they appear. The agents
then try to determine the appropriate behaviour from these
messages.</p>
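      <p>The trigger conditions can be illustrated by a small cascade resolver; the specification format below is hypothetical, not the simulator's actual one.</p>

```python
def fired_faults(fault_specs, horizon):
    """Illustrative cascade resolution: a fault fires at its trigger date,
    or as soon as the fault it depends on has fired (whichever is later),
    as long as this happens within the simulation horizon."""
    fired = {}
    changed = True
    while changed:
        changed = False
        for name, spec in fault_specs.items():
            if name in fired:
                continue
            date = spec.get("date")
            dep = spec.get("after")
            t = None
            if date is not None and (dep is None or dep in fired):
                t = max(date, fired.get(dep, date))
            elif date is None and dep in fired:
                t = fired[dep]
            if t is not None and t <= horizon:
                fired[name] = t
                changed = True
    return fired

# A link slowdown at t=5 cascades into an OCR fault on the same site.
specs = {"link_slow": {"date": 5}, "ocr_bad": {"after": "link_slow"}}
print(fired_faults(specs, horizon=10))  # both fire at t=5
```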
      <p>Our agent decision model is generic. It can be
instantiated with various criteria (the most probable explanation,
the worst-case hypothesis, etc.) depending on the available
information and on the constraints we choose to focus on. In
the Dem@tFactory project, it appeared essential to consider
the utility of a decision based on the cost of the occurrence
of a given set of faults rather than on its occurrence
probability. This reasoning led us to favor a robust decision criterion:
the decisions taken by the agents therefore rely on the
worst-case hypothesis.</p>
      <p>The upper bound k, the horizon considered by
the agents of ADS2 for the computation of the delayed
decision, is set to 15 units of time. Moreover, we assume in
these experiments that the respective costs of the faults that
compose a diagnosis are additive. Finally, in order to have
benchmarks for the evaluation of the principles underlying
our architecture (ADS2), we also implemented a
centralized supervision system (SC) where all observables are
transmitted to a single supervision agent.</p>
      <sec id="sec-8-1">
        <title>6.1 Experimental setup</title>
        <p>Our goal is to study the behaviour of the supervision systems
on an industrial case study.</p>
        <p>Fig. 7 represents the digitizing chain of the
Dem@tFactory project used for the experiments. Each dotted
rectangle corresponds to a factory situated at a given
geographical location (2 in France, 1 in Madagascar and 1 in
Mauritius). The circles correspond to the various processes
required for digitization. The inter-rectangle links
correspond to communications between the different factories,
and the intra-rectangle ones to local communications. All
these components are modeled in the simulator and can
give rise to faults.</p>
        <p>We fixed a priori the locations and responsibilities (the
regions) of the supervision agents according to the
geographical location of the units that compose the supervised
process. We used a dataset provided by our industrial
partners (8 GB of data corresponding to 48 hours of logs) to
extract nineteen different fault models. From these data,
we determined that the probability of occurrence of n faults
per unit of time follows a Poisson distribution with
parameter λ = 0.043. Finally, using information gathered from
our partners, we were able to estimate the costs of the faults
over time (constant, logarithmic, ...) and the associated
repair costs.</p>
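<p>As a hedged sketch of the fault injection process inferred from the logs (fault names and the sampling helper are ours, not from the simulator), the per-tick fault count can be drawn from a Poisson distribution with λ = 0.043:</p>

```python
# Sample how many faults occur in one unit of time (Poisson, lambda = 0.043)
# and which of the nineteen fault models they instantiate.
import math
import random

LAMBDA = 0.043
FAULT_MODELS = ["fault_%02d" % i for i in range(19)]  # nineteen fault models

def poisson(rng, lam=LAMBDA):
    """Sample a Poisson-distributed count (Knuth's multiplication method)."""
    l, p, n = math.exp(-lam), 1.0, 0
    while True:
        p *= rng.random()
        if p <= l:
            return n
        n += 1

def faults_at_tick(rng):
    """Draw which fault models occur during one unit of time."""
    n = min(poisson(rng), len(FAULT_MODELS))
    return rng.sample(FAULT_MODELS, n)

rng = random.Random(42)
mean = sum(poisson(rng) for _ in range(100_000)) / 100_000
print(round(mean, 3))  # close to 0.043
```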
      </sec>
      <sec id="sec-8-2">
        <title>Experiments</title>
        <p>Our experiments study the behaviour of the supervisions
systems when varying the (heterogeneous) transmission
cost. We used a random generator to affect a transmission
time (between 0 to 30 units of time) to each transmission
link for each time unit of the simulation. We arbitrarily set
at 10% the probability of a link to get a transmission time
greater than 1. In order to obtain a baseline, we first
performed different simulations with homogeneous
transmission costs.</p>
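<p>A hedged reconstruction of this transmission-time generator: the paper only fixes the 10% probability and the 0–30 ut range, so the uniform draw for the slow case and the nominal delay of 1 ut are our assumptions.</p>

```python
# Each link, at each time unit, keeps a nominal delay of 1 ut with
# probability 0.9 and is otherwise assigned a larger delay of up to 30 ut.
# The uniform distribution of the slow case is an assumption.
import random

def link_delay(rng, p_slow=0.10, max_delay=30):
    """Transmission time of one link for one time unit of the simulation."""
    if rng.random() < p_slow:
        return rng.randint(2, max_delay)  # degraded link
    return 1                              # nominal transmission time

rng = random.Random(7)
delays = [link_delay(rng) for _ in range(10_000)]
print(round(sum(d > 1 for d in delays) / len(delays), 2))  # close to 0.1
```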
        <p>The performance evaluation is based on three criteria:
(1) The average response time to a malfunction. (2) The
average number of supervision messages exchanged. (3) The
average total cost of repairs made during the experiment.</p>
        <p>Figs. 8(a) and 8(b) present the evolution of the behaviour
of supervision systems ADS2 and SC for the two first
criteria in the case of homogeneous (Ho) and heterogeneous
(He) communication links. The vertical bar at t=15ut is the
horizon considered by the agents of ADS2 for the
computation of the delayed decision.</p>
        <p>Fig. 8(a) shows that our architecture is very robust,
allowing the supervised system to rapidly recover from
failures. The response time of ADS2 (Ho and He)
progressively stabilizes around 15 ut, whereas the response time of SC
increases over time and becomes higher than that of ADS2. This is
due to the fact that the agents of ADS2 can decide to act
without waiting for the reception of all the messages that
come from the units of the supervised system.</p>
        <p>We can observe an increase of the average response time
of ADS2(Ho) when the transmission delay is close to 15
ut. This is due to the parameter k of our algorithm, a
priori fixed to 15 ut. This parameter defines the agent’s
horizon for the computation of the delayed decision. When the
transmission time becomes greater than or equal to k, an agent
no longer sees any benefit in waiting or trying to exchange
information with other agents of ADS2; it therefore decides to act
despite the risk of making a mistake. The impact of
parameter k is smaller on ADS2(He). Indeed, as the
communication links are heterogeneous in this case, a
supervision agent is still able to exchange information with
some of the other agents. This leads to a better response
time for ADS2(He) than for ADS2(Ho). This behaviour
is clearly highlighted in Fig. 8(b). The number of
messages exchanged by the agents of ADS2 drops as the
transmission delay increases. We see a sudden drop of
this number when the transmission delay becomes greater
than 15 ut for ADS2(Ho), confirming the local
decision-making of the agents.</p>
<p>[Fig. 8: (a) Response time to a malfunction; (b) Communication cost; (c) Total cost of repairs]</p>
        <p>Fig. 8(c) shows that the decisions of ADS2(He) agents
generate a limited repair extra-cost (9%) in comparison to the
centralised approach. With the Dem@tFactory fault
models, the overall gain regarding the supervised system
down-time reaches 34%.</p>
<p>Considering the reactivity of ADS2 and the limited
repair extra-cost it generates, the communication extra-cost
at low transmission delays can be considered an
acceptable consequence compared with the total absence of
supervision (SC).</p>
        <p>Our next set of experiments evaluates the impact of the
intertwining of the diagnosis and repair phases on the
performances of the supervision system. In order to evaluate
the impact of this behaviour for a supervision agent, we
initially activated this capability in only one agent of ADS2.
We ran 100 simulations. The first 10 simulations are
performed with the number of simultaneous faults restricted to
1. The number is then gradually incremented every 10
simulations, up to 10 simultaneous faults. For each simulation,
the number of potential diagnoses considered by the agent is
saved at 5 specific time steps. In order to obtain a baseline,
we performed the same 100 simulations with the intertwining
behaviour deactivated. Fig. 9(a) shows that the interleaving
of the diagnosis and repair process does lead to a reduction
of the diagnosis search space of an agent between 10 to 20%
for the set of faults of the Dem@tFactory project.</p>
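<p>The idea behind this interleaving can be sketched as follows (the candidate diagnoses, repair names, and the consistency predicate are ours, for illustration only): the observed outcome of a repair action refutes every candidate diagnosis that predicted a different outcome, shrinking the search space.</p>

```python
# Keep only the candidate diagnoses consistent with the observed outcome
# of a repair action. All names below are hypothetical.

def prune(candidates, repair, observed_success):
    """A diagnosis predicts success iff the tried repair fixes it."""
    return [d for d in candidates
            if (repair in d["fixed_by"]) == observed_success]

candidates = [
    {"name": "loose cable",  "fixed_by": {"reseat_cable"}},
    {"name": "dead sensor",  "fixed_by": {"swap_sensor"}},
    {"name": "bad firmware", "fixed_by": {"reflash", "swap_sensor"}},
]

# The agent tries "swap_sensor" and the symptom persists:
remaining = prune(candidates, "swap_sensor", observed_success=False)
print([d["name"] for d in remaining])  # ['loose cable']
```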
<p>Fig. 9(b) shows that the reduction of the number of
potential explanations of each agent is sufficient
to allow the agent to reduce the number of supervision
messages. The average response time to a malfunction is not
significantly improved (Fig. 9(c)), but the repair extra-cost
falls from +9% (ADS2) to +7.2% (ADS2+) (with p&lt;0.05).</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Related Work</title>
<p>The supervision of a system consists of four steps:
Detection, Isolation, Identification and Repair. The literature
aggregates the first three steps under the name FDI (Fault
Detection and Isolation) [9]. Although several approaches for
the distributed supervision of distributed systems have been
proposed in the literature, whether from the
diagnosis and control communities [10; 11] or from the multi-agent
domain [12; 13], they do not cover the repair phase.</p>
      <p>In the work from areas related to distributed systems,
emphasis is placed on the distribution of available
knowledge on the status and behaviour of the supervised system.
Frohlich et al [14] and Roos et al [13] have addressed the
question of the ability of a set of agents to determine an
overall diagnosis according to the shape of this distribution.
They have shown that obtaining a minimal overall
diagnosis is NP-hard in the case of spatially distributed information,
and that the complexity of obtaining the diagnosis is
independent of the communication costs engendered during its
establishment [13].</p>
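<p>To make this hardness result concrete: in consistency-based diagnosis, a minimal diagnosis is a minimal hitting set of the conflict sets, a problem that is NP-hard in general. A brute-force sketch (component and conflict names are illustrative):</p>

```python
# Enumerate the minimum-cardinality sets of components that intersect
# ("hit") every conflict set; these are the minimal diagnoses.
from itertools import combinations

def minimal_diagnoses(components, conflicts):
    """Return all smallest component sets hitting every conflict set."""
    for size in range(len(components) + 1):
        hits = [set(c) for c in combinations(components, size)
                if all(set(c) & conf for conf in conflicts)]
        if hits:
            return hits  # exponential in the worst case
    return []

conflicts = [{"A", "B"}, {"B", "C"}]
print(minimal_diagnoses(["A", "B", "C"], conflicts))  # [{'B'}]
```

<p>The exhaustive enumeration above is only tractable for small systems, which is precisely why practical approaches prune the space of potential solutions.</p>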
      <p>
        Given these theoretical results, reducing the space of
potential solutions is generally based on a hierarchical
structure of the diagnosis agents [
        <xref ref-type="bibr" rid="ref13">12</xref>
] and on the choice of
not returning to previously excluded explanations. Although
never backtracking on past decisions guarantees the convergence
and termination of the algorithm, it is a source of diagnosis
errors in an asynchronous environment. The best-effort
approach we chose allows us to reduce these diagnosis
errors, and the termination of the diagnosis algorithm is
guaranteed through the anytime decision-making process of
our agents.
      </p>
      <p>To our knowledge, the work of Nejdl et al [15] is the only
one that addresses the distribution of both the diagnosis
and repair phases. However, placed at a relatively abstract
level of analysis, this work makes the assumptions that
communication links are reliable and that messages can be
exchanged between agents at no cost. In a real setting
these hypotheses are too restrictive: ignoring
the state of the communications may render the supervision system
ineffective or inoperable. Our proposal does not make such
assumptions.</p>
      <p>The problem of online decision-making under uncertainty
is the central point of the work by Horvitz [16] and by Hansen and
Zilberstein [17] on the control of anytime algorithms.
Indeed, they propose a formal framework to dynamically
determine the time to stop a calculation taking into account the
quality of the current solution and the cost of the algorithm
computation.</p>
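<p>A hedged sketch of this anytime-control idea: keep computing while the expected gain in solution quality exceeds the cost of further computation. The quality profile and the time cost below are invented numbers, not taken from [16; 17].</p>

```python
# Stop an anytime computation when the marginal quality gain of one more
# step drops below the (constant) utility cost of that step.

def stop_index(quality_profile, time_cost):
    """Return the step at which marginal quality gain falls below cost."""
    for t in range(1, len(quality_profile)):
        if quality_profile[t] - quality_profile[t - 1] < time_cost:
            return t - 1
    return len(quality_profile) - 1

# Diminishing-returns quality profile; each step costs 0.05 utility:
profile = [0.0, 0.4, 0.6, 0.7, 0.72, 0.73]
print(stop_index(profile, time_cost=0.05))  # 3
```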
<p>The first distinction between their work and ours is that
they determine when to stop the
computation based on the distance between the current solution and
the optimal one. This requires knowing the optimal solution
(or an estimate of it) and being able to dynamically determine
this distance. In our work, talking about the quality of a
solution (i.e., a diagnosis) is meaningless insofar as a diagnosis
is right or wrong, and its “value” is only known a
posteriori. The second point of divergence is that we try to select a
candidate (a diagnosis or a repair) among a set of potential
solutions. The complexity of the task is therefore increased.</p>
      <p>[Fig. 9: (a) Size of the potential diagnosis set; (b) Response time to a malfunction; (c) Communication cost]</p>
    </sec>
    <sec id="sec-10">
      <title>Conclusions</title>
      <p>We presented the first anytime multi-agent architecture for
the supervision of distributed systems that is able to
dynamically adapt its behaviour to the current state of the
supervised system. In particular, the decision model allows each
supervision agent to find a balance between a quick local
diagnosis and repair under uncertainty, and a delayed,
systemic one, based on the respective costs of misdiagnosis and
communication. The distributed consistency algorithm
allows each agent to form a coalition to reduce its uncertainty
or to restore a consistent view of the system state in case
some agents had to act locally with incomplete information at an
earlier stage. Moreover, the intertwining of the diagnosis
and the repair phases allows an efficient reduction of the
diagnosis search space. The overall reduction of 34% in the
Dem@tFactory system down-time, combined with a repair
extra-cost of 7.2%, demonstrates that ADS2 is able to
efficiently supervise complex systems under real-life
assumptions.</p>
<p>A fully autonomous supervision system is presently not
realistic in an industrial context, as humans want to keep
control over what they perceive as critical decisions. ADS2
represents what we see as an acceptable trade-off, as its
degree of autonomy can be easily defined.
To this end, ADS2 organizes the set of known faults and repairs
into several subclasses: those whose repair plan can be
triggered automatically, and those whose final repair
decision rests with a human supervisor. The risk aversion of the
users defines the size of these two respective sets. While
confidence in the efficiency of the autonomous supervision
of complex and distributed systems is not yet common,
we believe that the work presented herein provides a step
towards this goal.
</p>
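<p>As a hedged illustration of this autonomy split (the risk threshold and the fault data are hypothetical), the partition between automatic and human-approved repairs can be sketched as a simple thresholding of each fault's worst-case risk:</p>

```python
# Faults whose worst-case repair risk stays below a user-chosen threshold
# are repaired automatically; the rest are escalated to a human supervisor.

def split_by_autonomy(faults, risk_threshold):
    """Partition faults into automatic and human-approved repair classes."""
    auto = [f for f in faults if f["risk"] <= risk_threshold]
    manual = [f for f in faults if f["risk"] > risk_threshold]
    return auto, manual

faults = [{"name": "paper jam", "risk": 2},
          {"name": "line stop", "risk": 9}]
auto, manual = split_by_autonomy(faults, risk_threshold=5)
print([f["name"] for f in auto], [f["name"] for f in manual])
```

<p>Lowering the threshold models a more risk-averse user: more faults end up in the human-approved class.</p>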
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Kephart</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Chess</surname>
          </string-name>
          .
          <article-title>The vision of autonomic computing</article-title>
          .
          <source>Computer</source>
          ,
          <volume>36</volume>
          (
          <issue>1</issue>
          ):
          <fpage>41</fpage>
          -
          <lpage>50</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Salehie</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tahvildari</surname>
          </string-name>
          .
          <article-title>Self-adaptive software: Landscape and research challenges</article-title>
          .
          <source>ACM Trans. Auton. Adapt. Syst.</source>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Lynch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Paterson</surname>
          </string-name>
          .
          <article-title>Impossibility of distributed consensus with one faulty process</article-title>
          .
          <source>Journal of the ACM (JACM)</source>
          ,
          <volume>32</volume>
          (
          <issue>2</issue>
          ):
          <fpage>374</fpage>
          -
          <lpage>382</lpage>
          ,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamport</surname>
          </string-name>
          .
          <article-title>Paxos made simple</article-title>
          .
          <source>ACM SIGACT News</source>
          ,
          <volume>32</volume>
          :
          <fpage>18</fpage>
          -
          <lpage>25</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>De Prisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lampson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Lynch</surname>
          </string-name>
          .
          <article-title>Revisiting the paxos algorithm</article-title>
          .
          <source>Distributed Algorithms</source>
          , pages
          <fpage>111</fpage>
          -
          <lpage>125</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Marie-Odile</given-names>
            <surname>Cordier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pencolé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Travé-Massuyes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Vidal</surname>
          </string-name>
          .
          <article-title>Self-healability = diagnosability + repairability</article-title>
          .
          <source>In The 18th International Workshop on Principles of Diagnosis</source>
          , volume
          <volume>7</volume>
          , pages
          <fpage>251</fpage>
          -
          <lpage>258</lpage>
          . Citeseer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Philip</given-names>
            <surname>Koopman</surname>
          </string-name>
          .
          <article-title>Elements of the self-healing system problem space</article-title>
          .
          <source>In ICSEWADS03</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chmiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gawinecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kaczmarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szymczak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Paprzycki</surname>
          </string-name>
          .
          <article-title>Efficiency of JADE agent platform</article-title>
          . volume
          <volume>13</volume>
          , pages
          <fpage>159</fpage>
          -
          <lpage>172</lpage>
          . IOS Press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Betta</surname>
          </string-name>
          and Antonio Pietrosanto.
          <article-title>Instrument fault detection and isolation: state of the art and new research trends</article-title>
          . volume
          <volume>49</volume>
          , pages
          <fpage>100</fpage>
          -
          <lpage>107</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lafortune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Teneketzis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sampath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sengupta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Sinnamohideen</surname>
          </string-name>
          .
          <article-title>Failure diagnosis of dynamic systems: an approach based on discrete event systems</article-title>
          .
          <source>In Proc. American Control Conference</source>
          , volume
          <volume>3</volume>
          , pages
          <fpage>2058</fpage>
          -
          <lpage>2071</lpage>
          , June 25-27,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Christos G.</given-names>
            <surname>Cassandras</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stéphane</given-names>
            <surname>Lafortune</surname>
          </string-name>
          .
          <source>Introduction to Discrete Event Systems</source>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wörn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Längle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brighenti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Revuelta</given-names>
            <surname>Seijo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Senior</surname>
          </string-name>
          ,
          <string-name>
            <surname>M A S. Bobi</surname>
          </string-name>
          , and JV. Collado.
          <article-title>Diamond: distributed multi-agent architecture for monitoring and diagnosis</article-title>
          .
          <source>Production Planning &amp; Control</source>
          ,
          <volume>15</volume>
          :
          <fpage>189</fpage>
          -
          <lpage>200</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [15] W. Nejdl and M. Werner. Distributed intelligent agents for control, diagnosis and repair. RWTH Aachen, Informatik, Tech. Rep., 1994.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [17] E.A. Hansen and S. Zilberstein. Monitoring and control of anytime algorithms: A dynamic programming approach. Artificial Intelligence, 126:139-157, 2001.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>