<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Introduction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nora Faci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zahia Guessoum</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Olivier Marin</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University Pierre and Marie Curie, LIP6 - OASIS and SRC Teams</institution>
          ,
          <addr-line>8 Rue du Capitaine Scott, 75015 Paris</addr-line>
          ,
          <country country="FR">FRANCE</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University Reims Champagne-Ardenne, CReSTIC-MODECO Team</institution>
          ,
          <addr-line>Rue des Crayeres BP 1035, 51687 Reims</addr-line>
          ,
          <country country="FR">FRANCE</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Fault tolerance is an important property of large-scale multi-agent systems, because the failure rate grows with the number of hosts, the number of deployed agents, and the duration of the computation. Several approaches have been introduced to deal with some aspects of the fault-tolerance problem. However, most existing solutions are ad hoc, and no existing multi-agent architecture or platform provides a fault-tolerance service that can be reused to facilitate the design and implementation of reliable multi-agent systems. We have therefore developed a fault-tolerant multi-agent platform, named DimaX, which deals with fail-stop failures such as bugs and machine breakdowns. It brings fault tolerance to multi-agent applications by using replication techniques and is based on a replication framework named DARX.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>DimaX</title>
      <p>This section first defines the type of failures DimaX deals with. It then presents the DimaX services for developing fault-tolerant multi-agent systems (MAS).</p>
      <sec id="sec-1-1">
        <title>Fault Model</title>
        <p>
          The most generally accepted failure classification can be found in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]:
1. A crash failure means a component stops producing output; it is the simplest failure to
contend with.
2. An omission failure is a transient crash failure: the faulty component will eventually resume
its output production.
3. A timing failure occurs when output is produced outside its specified time frame.
4. An arbitrary (or byzantine) failure equates to the production of arbitrary output values at
arbitrary times.
        </p>
        <p>
          Given this classification, two types of failure models are usually considered in distributed
environments [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]:
• fail-silent, where the considered system allows only crash failures, and
• fail-uncontrolled, where any type of failure may occur.
        </p>
        <p>In this work we focus on the fail-silent model. An agent failure is defined as its abnormal termination due to a failure in an underlying resource: a bug in the underlying operating system, a local host crash, or a network disconnection.</p>
      </sec>
      <sec id="sec-1-2">
        <title>DimaX Services</title>
        <p>
          DimaX is the result of an integration of a multi-agent platform (named DIMA[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]) and a fault-tolerance framework (named DARX [
          <xref ref-type="bibr" rid="ref1 ref15">15, 1</xref>
          ]). Figure 1 gives an overview of DimaX and its main components and services. DimaX is built on three levels: system (i.e., the DARX middleware), application (i.e., agents) and control. At the application level, DIMA provides a set of libraries to build multi-agent applications. At the middleware level, DARX provides the mechanisms necessary for distributing, observing and replicating agents as services. Thus, a DimaX server offers the following services: naming, fault detection, observation and replication. At the control level, DimaX provides a replication control mechanism that runs automatically in cooperation with the observation service [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. This mechanism decides which agent to replicate and where to replicate it.
        </p>
        <sec id="sec-1-2-1">
          <title>Naming Service</title>
          <p>One of the problems raised by distributing a multi-agent system is locating agents when messages are sent. A naming server maintains the list (i.e., white pages) of all the agents within its administration domain. When an agent is created, it is registered at both the DimaX server and the naming server. To send a message to another agent, an agent only needs to know the application-level identifier of the receiver. However, transmitting the message through DimaX servers requires the physical location of the receiver (i.e., an IP address and a port number). The local DimaX server requests this information from the naming server and stores it locally in a cache. The cache thus contains the locations of the agents that have already been contacted, which prevents a DimaX server from repeating the same lookup several times.</p>
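          <p>The cached lookup described above can be sketched as follows. This is a minimal illustration, not the DimaX implementation: the class CachedNameResolver, its methods, and the use of a plain function as a stand-in for the remote naming server are all assumptions made for the example.</p>
          <preformat><![CDATA[
```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: a server resolving agent names through a local cache. */
final class CachedNameResolver {
    // Stand-in for a request to the remote naming server (assumption).
    private final Function<String, InetSocketAddress> namingServer;
    // Locations of the agents that have already been contacted.
    private final Map<String, InetSocketAddress> cache = new HashMap<>();
    private int remoteLookups = 0;

    CachedNameResolver(Function<String, InetSocketAddress> namingServer) {
        this.namingServer = namingServer;
    }

    /** Returns the physical location of an agent, asking the naming server only on a cache miss. */
    InetSocketAddress resolve(String agentName) {
        InetSocketAddress cached = cache.get(agentName);
        if (cached != null) {
            return cached;
        }
        remoteLookups++;
        InetSocketAddress address = namingServer.apply(agentName);
        if (address != null) {
            cache.put(agentName, address);
        }
        return address;
    }

    /** Drops a cached entry, for example after a failure notification. */
    void invalidate(String agentName) {
        cache.remove(agentName);
    }

    /** Number of requests actually sent to the naming server. */
    int remoteLookups() {
        return remoteLookups;
    }
}
```
]]></preformat>
          <p>Resolving the same agent twice triggers a single request to the naming server; invalidating the entry forces a new request on the next resolution.</p>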
          <p>[Figure 1: Overview of DimaX and its main components and services. Application level (DIMA): agents. Middleware level (DARX): naming, failure detection, observation and replication services. Control level.]</p>
          <p>Failure detection is an essential aspect of any fault-tolerant system, since a faulty agent must first be recognized. The DARX fault detection service is based on the heartbeat technique: a process p periodically sends an "I am alive" message to the other processes to inform them that it is safe (see Figure 2). This technique has two parameters:
• the heartbeat period: the time between two emissions of the "I am alive" message,
• the timeout delay: the time that a process q waits after the last reception of an "I am alive" message from p before it suspects p; q keeps suspecting p until a new "I am alive" message from p is received.</p>
          <p>
            The detection results may be incorrect: q may conclude that p has crashed while p is actually safe but its transmissions are delayed for some reason (e.g., communication load). To overcome this problem, one solution is to estimate the arrival date of the next "I am alive" message and to add a dynamic safety margin. These values are functions of the quality of service of the network and of the application [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ].
          </p>
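          <p>As an illustration of the heartbeat technique, the sketch below shows a detector with a fixed timeout delay. It is a simplified assumption-based example: the class HeartbeatDetector and its methods are hypothetical, time is passed in explicitly as milliseconds, and the dynamic margin mentioned above is deliberately left out.</p>
          <preformat><![CDATA[
```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a heartbeat-based failure detector with a fixed timeout delay. */
final class HeartbeatDetector {
    private final long timeoutMillis;
    // Time of the last "I am alive" message received from each process.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    HeartbeatDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records an "I am alive" message received from a process at the given time. */
    void onHeartbeat(String process, long nowMillis) {
        lastHeartbeat.put(process, nowMillis);
    }

    /**
     * A process is suspected when no heartbeat arrived within the timeout delay.
     * A process that was never heard from is also suspected (a choice made for this sketch).
     */
    boolean isSuspected(String process, long nowMillis) {
        Long last = lastHeartbeat.get(process);
        return last == null || nowMillis - last > timeoutMillis;
    }
}
```
]]></preformat>
          <p>A late heartbeat clears the suspicion, which mirrors the case where p was wrongly suspected because its messages were delayed.</p>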
          <p>When a server detects a failure of another DimaX server, its naming module removes all the
replicated agents the faulty server hosted from the list and replaces these agents by their replicas
located on other hosts. The replacement is initiated by the failure notification.</p>
          <p>[Figure 2: Heartbeat-based fault detection. P emits "I am alive" messages every heartbeat period; at Q, P is considered SAFE while messages arrive within the timeout delay and FAULTY otherwise.]</p>
          <p>The observation service is fundamental for controlling replication. An observation module collects data at two levels:
• system level: data about the execution environment of the MAS, such as CPU time and mean time between failures,
• application level: information about the dynamic characteristics of the MAS, such as the interaction events among agents (e.g., sent and received messages).</p>
          <p>The observation service relies on an organization of reactive agents, named host-monitors and agent-monitors (see Figure 3). An agent-monitor is associated with each agent of the application (named domain agent) and a host-monitor is associated with each host. These monitoring agents are hierarchically organized: each agent-monitor communicates with only one host-monitor, and host-monitors exchange their local information to build global information (global number of messages, global quantity of exchanged information, etc.).</p>
          <p>
            After each time interval Δt, the host-monitor sends the collected events and data to the corresponding agent-monitors. When the criticality<sup>1</sup> of a domain agent changes significantly, its agent-monitor notifies its host-monitor. The latter informs the other host-monitors so that they update the global information. In turn, agent-monitors are informed by their host-monitor when the global information changes significantly (see [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] for more details).
          </p>
          <p>[Figure 3: Observation architecture. Agent level: Domain Agent1 to Domain Agent5. Observation level: Agent_Monitor1 to Agent_Monitor5, attached to Host_Monitor i and Host_Monitor j. Labeled links: SendMessage, Event, Control.]</p>
          <p>
            Replication is an effective way to achieve fault tolerance in distributed systems. It has proved its
efficiency [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. We therefore propose to use replication mechanisms to avoid failures of multi-agent systems. Replication allows a multi-agent system to run without interruption in spite of failures. A replicated agent (see Section 2.4) is an entity that possesses two or more copies of its behavior (or replicas) on different hosts. There are two main types of replication protocols:
• active replication, in which all replicas concurrently process all input messages, and
• passive replication, in which only one of the replicas processes all input messages and periodically transmits its current state to the other replicas in order to maintain consistency.
          </p>
          <p>Active replication strategies provide fast recovery but lead to a high overhead: if the degree of replication is n, the n replicas are activated simultaneously. Passive replication minimizes processor utilization by activating redundant replicas only in case of failure: if the active replica is found to be faulty, a new active replica is elected among the passive ones and the execution is restarted from the last saved state. This technique requires fewer CPU resources than active replication, but it needs checkpoint management, which remains expensive in processing time and space.</p>
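          <p>The passive strategy can be illustrated with the following sketch. It is not DARX code: the class PassiveReplicationGroup, the integer "state" of a replica and the election of the first remaining replica are assumptions chosen to keep the example small.</p>
          <preformat><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of passive replication: one active replica, checkpoints sent to passive ones. */
final class PassiveReplicationGroup {
    /** A replica whose whole state is the running sum of the messages it processed. */
    static final class Replica {
        final String host;
        int state;

        Replica(String host) {
            this.host = host;
        }
    }

    private final List<Replica> replicas = new ArrayList<>();
    private int activeIndex = 0;

    PassiveReplicationGroup(List<String> hosts) {
        for (String host : hosts) {
            replicas.add(new Replica(host));
        }
    }

    Replica active() {
        return replicas.get(activeIndex);
    }

    /** Only the active replica processes input messages. */
    void deliver(int message) {
        active().state += message;
    }

    /** Transmits the active replica's current state to every passive replica. */
    void checkpoint() {
        int saved = active().state;
        for (Replica replica : replicas) {
            replica.state = saved;
        }
    }

    /** Removes the faulty active replica; a passive one resumes from the last saved state. */
    void failover() {
        replicas.remove(activeIndex);
        if (replicas.isEmpty()) {
            throw new IllegalStateException("no replica left");
        }
        activeIndex = 0; // simplistic election: first remaining replica
    }
}
```
]]></preformat>
          <p>In this sketch, any message processed after the last checkpoint is lost on failover, which is the price paid for not running all replicas concurrently.</p>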
          <p>
            Many toolkits (e.g., see [
            <xref ref-type="bibr" rid="ref20 ref4">4, 20</xref>
            ]) use only one of these techniques. So, they may suffer from the
disadvantages of the used technique. Contrary to these approaches, DimaX relies on the DARX
replication framework [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], which uses both techniques in an adaptive manner, depending on the evolution of the MAS context. The designer can dynamically change replication strategies during the MAS execution.
          </p>
          <p><sup>1</sup> The criticality of an agent, with respect to an organization of agents it belongs to, is the measure of the potential impact of the failure of that individual agent on the failure of the whole organization.</p>
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>Control of Replication in DimaX</title>
        <p>Replication has been successfully applied to several distributed applications. These applications are characterized by a small number of components whose criticality is often static. The number of replicas and the replication strategy are therefore explicitly and statically defined by the designer before runtime. Multi-agent applications, however, are more complex than traditional distributed ones: they have dynamic organizational structures, adaptive agent behaviors and a large number of agents, so the criticality of agents may evolve dynamically during the course of the computation. Our solution is a replication control mechanism which dynamically decides which agents should be replicated and with what strategy (how many replicas and where to create them). This control mechanism dynamically estimates the agents' criticality. We have experimented with two strategies based on organizational concepts to estimate the criticality of an agent.</p>
        <p>
          The first strategy we studied is based on the concept of role. A role, within an organization,
represents a pattern of services, activities and relations. As such, it captures some information about
the relative importance of roles and their interdependencies. A role analysis thus represents the
set of interaction events resulting from the domain agent interactions (sent and received messages).
These events are then used to determine the roles of the agent. This strategy is described in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
A second alternative strategy that we studied is based on the concept of dependency. Intuitively,
the more an agent has other agents depending on it, the more it is critical in the organization.
The dependencies are inferred through the analysis of communication between agents. That second
strategy is described in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
DimaX offers several libraries and mechanisms to facilitate the design and implementation of fault-tolerant multi-agent systems. These libraries and mechanisms are provided by DIMA and DARX.
        </p>
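        <p>To make the intuition behind the dependency-based strategy concrete, the sketch below estimates criticality from observed messages. It is only an illustration under strong assumptions and not the method of the cited work: it assumes that sending a message to an agent means depending on it, and it defines criticality as the fraction of the other known agents that depend on the agent. The class DependencyCriticality and its methods are hypothetical.</p>
        <preformat><![CDATA[
```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: criticality estimated from who sends messages to whom. */
final class DependencyCriticality {
    // receiver -> distinct agents that sent it at least one message (assumed to depend on it)
    private final Map<String, Set<String>> dependents = new HashMap<>();
    private final Set<String> agents = new HashSet<>();

    /** Records one observed interaction event. */
    void onMessage(String sender, String receiver) {
        agents.add(sender);
        agents.add(receiver);
        if (!sender.equals(receiver)) {
            dependents.computeIfAbsent(receiver, k -> new HashSet<>()).add(sender);
        }
    }

    /** Fraction of the other known agents that depend on the given agent, in [0, 1]. */
    double criticality(String agent) {
        int others = agents.size() - (agents.contains(agent) ? 1 : 0);
        if (others <= 0) {
            return 0.0;
        }
        Set<String> d = dependents.get(agent);
        return d == null ? 0.0 : (double) d.size() / others;
    }
}
```
]]></preformat>
        <p>Repeated messages from the same sender count once, so the estimate reflects how many agents depend on a given agent rather than how much traffic it receives.</p>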
        <sec id="sec-1-3-1">
          <title>DIMA Agent Behaviors</title>
          <p>DIMA is a Java multi-agent platform. Its kernel is a framework of proactive components which represent autonomous and proactive entities. A simple DIMA agent architecture consists of a proactive component, an agent engine, and a communication component (see Figure 4).</p>
          <p>A proactive component (the AgentBehavior class) represents an autonomous and proactive entity. It provides the basic structure to represent behaviors. The main functionalities of a proactive component may be extended in subclasses. An instance of AgentBehavior describes:
• The goal of the proactive component, which is implicitly or explicitly described by the method isAlive().
• The basic behaviors of the proactive component. A behavior is a sequence of actions that change the internal state or send a message to other components.</p>
          <p>The class AgentBehavior and its subclasses represent the internal activity of the agent. The instance method proactivityLoop() (see Table 2.4.1), used by startUp(), defines the basic loop of the agents. An agent engine is provided to launch and support the agent activity. AgentEngine implements Runnable and redefines the method run():</p>
          <preformat>public class AgentEngine extends ProactiveComponentEngine implements Runnable {
    protected ProactiveComponent proactivity;
    public Thread thread;

    // Runs the whole life cycle of the wrapped proactive component.
    public void run() {
        proactivity.startUp();
    }
}</preformat>
          <table-wrap>
            <table>
              <thead>
                <tr><th>Methods</th><th>Description</th></tr>
              </thead>
              <tbody>
                <tr><td>public abstract boolean isAlive()</td><td>Tests if the agent has not reached its goal.</td></tr>
                <tr><td>public abstract void step()</td><td>Represents an execution cycle of the agent.</td></tr>
                <tr><td>public void proactivityLoop()</td><td>Represents the control of agent behavior.</td></tr>
                <tr><td>public void startUp()</td><td>Initializes and activates the control of agent behavior.</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <preformat>// Control of agent behavior: repeat execution cycles until the goal is reached.
public void proactivityLoop() {
    while (this.isAlive()) {
        this.preActivity();
        this.step();
        this.postActivity();
    }
}

// Initializes and activates the control of agent behavior.
public void startUp() {
    this.proactivityInitialize();
    this.proactivityLoop();
    this.proactivityTerminate();
}</preformat>
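          <p>The life cycle above can be tried out without DIMA using the stand-in below. SketchBehavior only mimics the method names described in this section and is not the actual AgentBehavior class; CountdownBehavior is a made-up example whose goal is reached when its counter hits zero.</p>
          <preformat><![CDATA[
```java
/** Stand-in for the life cycle described above; NOT the actual DIMA AgentBehavior class. */
abstract class SketchBehavior {
    /** True while the goal has not been reached. */
    abstract boolean isAlive();

    /** One execution cycle of the agent. */
    abstract void step();

    // Optional hooks, empty by default.
    void preActivity() {}
    void postActivity() {}
    void proactivityInitialize() {}
    void proactivityTerminate() {}

    /** Repeats execution cycles until the goal is reached. */
    void proactivityLoop() {
        while (isAlive()) {
            preActivity();
            step();
            postActivity();
        }
    }

    /** Initializes, runs and terminates the control of the behavior. */
    void startUp() {
        proactivityInitialize();
        proactivityLoop();
        proactivityTerminate();
    }
}

/** Example behavior: counts down to zero, then has reached its goal. */
final class CountdownBehavior extends SketchBehavior {
    int remaining;
    int steps = 0;

    CountdownBehavior(int start) {
        this.remaining = start;
    }

    @Override
    boolean isAlive() {
        return remaining > 0;
    }

    @Override
    void step() {
        remaining--;
        steps++;
    }
}
```
]]></preformat>
          <p>Calling startUp() on a CountdownBehavior created with 3 executes exactly three cycles; an engine in the style of AgentEngine would simply call startUp() from the run() method of its thread.</p>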
          <p>DIMA also provides several services, such as the directory facilitator service. DIMA can easily be used to build MASs. To make them reliable, we integrated DIMA agents with DarX tasks. Before describing the result of this integration, we define DarX tasks.</p>
        </sec>
        <sec id="sec-1-3-2">
          <title>DarX Tasks</title>
          <p>
            DARX [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] is a framework for designing reliable distributed applications which include a set of distributed communicating entities (named DarX tasks). It includes transparent replication management. DARX handles replication groups. Each of these groups consists of software entities (the replicas) which represent the same DarX task (see Figure 5). A DarX task can be replicated several times and with different replication strategies. It is wrapped into a TaskShell which is responsible for replication group management. To maintain consistency between the different replicas, the TaskShell delivers received messages to all active replicas. It also periodically updates the state of the passive replicas; this requires suspending the DarX task and then resuming it. When it receives several identical replies from different replicas of the same task, it uses a filter mechanism to forward the first reply and discard the redundant ones.
          </p>
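          <p>The filter mechanism can be pictured with the sketch below. It is an assumption-based illustration rather than the TaskShell code: it supposes that every reply carries the identifier of the request it answers, which the text above does not state, and the class ReplyFilter is hypothetical.</p>
          <preformat><![CDATA[
```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of a filter that forwards the first reply and discards redundant ones. */
final class ReplyFilter {
    // Identifiers of the requests for which a reply has already been forwarded (assumed to exist).
    private final Set<Long> answered = new HashSet<>();

    /** Returns the reply if it is the first one for this request, otherwise an empty result. */
    Optional<String> filter(long requestId, String reply) {
        // Set.add returns false when the identifier was already present.
        return answered.add(requestId) ? Optional.of(reply) : Optional.empty();
    }
}
```
]]></preformat>
          <p>With active replication, n replicas produce n identical replies to the same request; only the first one passes the filter.</p>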
          <p>The TaskShell sends outgoing messages through its encapsulated DarXCommInterface. The communication between distinct TaskShells is performed via a proxy: the RemoteTask (see Figure 7).</p>
          <p>Thus, the sender of a message does not need to know the number of replicas of the receiver; the RemoteTask of the receiver delegates the messages to the corresponding TaskShell, which transmits them to all replicas of the same agent. Replication has a cost in communication, but DARX optimizes it by piggybacking application-level messages on the "I am alive" messages (see Section 2.2.2).</p>
          <p>[Figure: class diagram showing DarX task; TaskShell; DarXTaskEngine with run(), startTask() and terminateTask(); DarXComComponent with sendMessage() and receiveMessage().]</p>
          <p>Figure 6 gives the main classes used to model fault-tolerant agents. As the DarXTask is an active entity and each fault-tolerant agent needs to have the structure of a DarXTask, the DarXTask needs to be autonomous and proactive. To make the DarXTask autonomous, we encapsulate the DIMA agent behavior into the DarXTask (see Figure 7). This agent architecture allows the agent to be replicated several times. As the DARX middleware and the DIMA platform both provide mechanisms for execution control, communication and naming, but at different levels, their integration requires a set of additional components. This set transparently calls DARX services (e.g., replication, naming) when executing multi-agent applications developed with DIMA; at the application level, no code modification is required. It controls the execution of agents built under DimaX and offers a communication interface between remote agents, through DimaX servers.</p>
          <p>[Figure 7: DimaX agent architecture. Labels: Duplicated Messages; TaskShell; RemoteTask (group proxy); request buffer; reply buffer; reply; DarXCommInterface; DarXMessage (sender: agentID, content: DIMA message); DIMA Messages; DarXTaskEngine (a specific DarX task); DIMA Agent Behavior; DarXTaskExecutor; DarX Communication Component.]</p>
          <p>A DimaX agent is a DIMA agent encapsulated in a particular entity, the DarXTaskEngine.</p>
          <p>[Figure 8: Class diagram. CommunicatingAgentBehavior (activate(), activateWithDarX(), step(), readAllMessage(), sendMessage()) with two subclasses: AgentFact (couple: Vector; step(), result(), isAlive(): bool, proactivityInitialize()) and AgentMult (couple: Vector; step(), multiply()).]</p>
          <p>The DarXTaskEngine is a DarXTask with the autonomous behaviors of the original DIMA agent. It includes the agent engine (called the DarXTaskExecutor), which executes the life cycle of the agent (see the proactivityLoop method). For consistency reasons, the execution of the agent life cycle may be suspended during the creation and/or update of the replicas. When a DimaX agent sends messages to other agents, DimaX provides communication mechanisms to locate the agents and deliver the messages to them. This delivery is performed through the communication component of DimaX agents (DarXComComponent), which delegates the transmission of DIMA messages to the associated DarXCommInterface. This communication interface enables DARX entities to communicate with each other. So, at the application level, the agents exchange DIMA messages, which are transmitted via the DARX middleware.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Example</title>
      <p>The aim of DimaX is to augment an already built MAS with fault-tolerance capabilities. This section therefore presents a MAS which has been developed with DIMA and shows how to make it fault tolerant.</p>
      <p>To exemplify DimaX, we propose the factorial toy problem (n!). This toy problem gives some insight into distributed problem solving. We consider two kinds of agents:
1. AgentFact: these agents have the needed behavior to compute a factorial but they do not
have the behavior to multiply numbers.
2. AgentMult: these agents have a behavior to compute a multiplication.</p>
      <p>These agents are implemented as subclasses of the CommunicatingAgentBehavior class (see Figure 8). To compute n!, AgentFact creates a list (named couple) with the numbers from 1 to n:</p>
      <preformat>public void proactivityInitialize() {
    for (int i = 1; i &lt;= n; i++) {
        couple.addElement(i);
    }
}</preformat>
      <p>Then, it sends requests to AgentMult with all the couples of numbers it can form from the list. When it receives a result, it puts it into the list:</p>
      <preformat>public void result(int i) {
    couple.addElement(i);
    // nbRequests is the number of requests still awaiting a response
    nbRequests--;
}</preformat>
      <p>If the list has more than one element, new requests are then sent to AgentMult. AgentFact repeats this action while the list contains more than one number or while it has not received the responses to all the requests it sent. This test is performed by its isAlive() method as follows:</p>
      <preformat>public boolean isAlive() {
    return (couple.size() &gt; 1) || !hasAllResponses();
}</preformat>
      <p>The AgentFact behavior is defined by:</p>
      <preformat>public void step() {
    readAllMessages();
    while (couple.size() &gt; 1) {
        sendMessage("multiply", couple.elementAt(0), couple.elementAt(1), new AgentName("multiplier"));
        // Remove the two operands just sent; after the first removal the
        // second operand has moved to index 0.
        couple.remove(0);
        couple.remove(0);
        nbRequests++;
    }
}</preformat>
      <p>The AgentMult behavior is as follows:</p>
      <preformat>public void step() {
    readAllMessages();
}</preformat>
      <p>The multiply action of AgentMult is:</p>
      <preformat>public void multiply(int a, int b) {
    int c = a * b;
    sendMessage("result", c, new AgentName("factorial"));
}</preformat>
      <p>The initialization of the MAS is performed as:</p>
      <preformat>public static void main(String[] args) {
    // agents behavior initialization
    AgentFact a = new AgentFact("factorial");
    AgentMult b = new AgentMult("multiplier");

    // agents activation on the same machine
    a.activate();
    b.activate();
}</preformat>
      <p>After the designer has built the agents' behavior with the DIMA multi-agent platform, he/she uses the activateWithDarX method to deploy the MAS and to endow it with fault-tolerance capabilities (i.e., replication). This activation method encapsulates the agent behavior in a DarXTask (see Section 2.4) and registers the agent in the system (i.e., with the naming service). Its parameters are the URL and port of the host where the agent will be replicated. The deployment can be performed as follows:</p>
      <preformat>public static void main(String[] args) {
    AgentFact a = new AgentFact("factorial");
    AgentMult b = new AgentMult("multiplier");
    // agents activation on two different machines
    a.activateWithDarX(url1, port1);
    b.activateWithDarX(url2, port2);
}</preformat>
      <p>As we can see, distribution and replication have not required any modification of the agents' code; only the activation call changes. The distribution cost is therefore minimal. Thus, DimaX facilitates the development of fault-tolerant MAS by developers not trained in fault-tolerance techniques. They only need to focus on problem-solving issues such as the agents' behavior and their interactions. The factorial example is very simple, but the solution is similar for more complex applications.</p>
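      <p>The reduction performed by the two agents can be checked with the plain Java sketch below. It is a sequential stand-in written for illustration, with no DIMA or DARX classes: the method multiply plays the role of AgentMult, the method factorial plays the role of AgentFact, and a method call replaces the exchange of "multiply" and "result" messages. The class name PairwiseFactorial and the use of BigInteger are choices made for this example.</p>
      <preformat><![CDATA[
```java
import java.math.BigInteger;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sequential stand-in for the AgentFact / AgentMult interaction. */
final class PairwiseFactorial {
    /** Plays the role of AgentMult: multiplies one couple of numbers. */
    static BigInteger multiply(BigInteger a, BigInteger b) {
        return a.multiply(b);
    }

    /** Plays the role of AgentFact: reduces the list 1..n by pairwise products. */
    static BigInteger factorial(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        Deque<BigInteger> couple = new ArrayDeque<>();
        for (int i = 1; i <= n; i++) {
            couple.addLast(BigInteger.valueOf(i));
        }
        if (couple.isEmpty()) {
            return BigInteger.ONE; // 0! = 1
        }
        // Same loop condition as isAlive(): stop when a single number is left.
        while (couple.size() > 1) {
            BigInteger a = couple.removeFirst();
            BigInteger b = couple.removeFirst();
            couple.addLast(multiply(a, b)); // the "result" is put back into the list
        }
        return couple.getFirst();
    }
}
```
]]></preformat>
      <p>Because multiplication is associative and commutative, the order in which couples are taken from the list does not affect the final value, which is why the requests can be served by remote or replicated multipliers in any order.</p>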
    </sec>
    <sec id="sec-3">
      <title>DimaX Features</title>
      <p>
        This section presents the main features that DimaX provides for the development and deployment of fault-tolerant large-scale MAS: scalability, reusability, robustness, and adaptability.
1. Scalability. A platform is said to be scalable if it can handle an increase in problem size (number of agents) and complexity without suffering a noticeable loss of performance. In DimaX, the components of the different services are organized hierarchically in order to minimize the communication overhead they cause. DimaX also maintains the global state of the MAS (e.g., the average number of exchanged messages) in a distributed manner. This reduces remote accesses and avoids the bottleneck that a central component would create. Moreover, the messages used by the failure detection service are combined, by piggybacking, with the messages of the other services and those of the application.
2. Reusability. To facilitate the design and implementation of fault-tolerant large-scale MAS by developers not trained in fault-tolerance techniques, DimaX provides several component libraries to build multi-agent systems: decision components, communication components, and interaction protocols. For example, the library of interaction protocols provides a generic implementation of interaction protocols, which are reusable components.
3. Robustness. The robustness of MAS is almost always a major concern when they are applied to critical domains such as spacecraft or medicine. It is important that this kind of application runs without interruption in spite of failures such as crashes. DimaX achieves robustness of MASs by using adaptive replication mechanisms. To evaluate the reliability of the platform, we ran robustness tests based on fault injection techniques. The results show that our platform gives the application an interesting degree of robustness. The platform must also continue to deliver its services despite the failure of one of its service components. The failure of a machine or a connection often involves the failure of the associated DimaX server. However, in our solution, the fault-tolerance protocols are agent-dependent and not place-dependent, i.e., the mechanisms built for providing the continuity of the computation are integrated in the replication groups, and not in the server.
4. Adaptability. To deal with the limited resources available for replicating agents, a good replication mechanism should adapt the replication strategy to the evolution of the environment. Thus, we have introduced, in DimaX, a multi-agent monitoring architecture to control replication. This architecture implements our adaptation mechanisms to define the agent criticality. These mechanisms rely on organizational concepts like role and interdependence graph [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Moreover, because resources are heterogeneous (i.e., hosts have different and dynamic characteristics), DimaX uses an adaptive approach to resource management for determining the number of replicas and their placement.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>
        In the multi-agent literature [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], we can find a large number of multi-agent platforms, but only a few offer fault-tolerance mechanisms. Several corrective solutions to the fault-tolerance problem have been proposed. The diagnostic approaches ([
        <xref ref-type="bibr" rid="ref10 ref11 ref8">8, 10, 11</xref>
        ]) are examples of such solutions. For instance,
Kaminka et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] propose a monitoring approach to detect, diagnose and recover from faults. They use models of relations between mental states of agents and adopt a procedural plan-recognition based approach to identify inconsistencies. However, the adaptation is only structural: the relation models may change but the contents of plans are static. Their main hypothesis is that any failure comes from incompleteness of beliefs. Diagnostic approaches are attractive but complex: they need deep knowledge about the behavior of the system, and it is not always possible to have a precise description of the whole multi-agent system. Exception handling approaches are other examples of corrective solutions. Contrary to diagnostic approaches, where fault recovery is performed, these approaches focus on error recovery ([
        <xref ref-type="bibr" rid="ref13 ref19">13, 19</xref>
        ]). For instance,
Souchon et al.[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] propose an exception handling system (named SAGE), designed for MASs, that
addresses some exception handling problems (e.g., the exception propagation) related to MAS issues
such as preservation of the agent paradigm features and concurrency. To summarize, the corrective
approaches are not suitable for critical applications where the diagnosis, or exception propagation,
and correction must be done in real-time.
      </p>
      <p>
        Kumar et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] advocate a fault-tolerance approach based on broker teams. A broker accepts requests, locates capable agents, routes requests and responses, etc. They use multiple brokers which form a team with appropriate commitments. The team members should recover from broker failures insofar as they have team and/or individual commitments, such as connecting to a registered agent which gets disconnected. In other words, this brokering knowledge is shared among the members. This work presents some interesting results, but stays at the theoretical stage. Moreover, it does not address scalability and reusability issues.
      </p>
      <p>
        Cougaar [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is a Java-based architecture for the construction of large-scale distributed agent-based applications. An agent is a set of problem-solving behaviors interacting via blackboards. If an agent is unable to contact a member of its community, it can send a health alert message to a health monitor. This agent is responsible for the recovery of agents. For instance, the recovery of a domain agent consists either in retrieving an appropriate community state needed to pursue the problem solving, or in re-joining its community, which has begun a new problem-solving stage. However, the approach lacks adaptability; no guarantee is given that the MAS will correctly pursue its goals in spite of agent failures. Failures could cause deadlock situations in which the agents' progress in the problem solving depends on each other.
      </p>
      <p>
        The FATMAS methodology [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] mainly provides four models used to design and implement the target system, and a fault-tolerance technique in which only a certain number of agents are replicated. Here, an agent is critical if it performs at least one task that cannot be performed by any other agent in the system. If the agent is non-critical, then it is not replicated and its tasks are replicated in other agents. If it is a critical agent, then it must be replicated. FATMAS proposes guidelines for the analysis and the design of fault-tolerant MAS. Moreover, it provides agent and task replication, which reduces the replication cost. However, the approach addresses closed MAS only: the agent criticality is defined at design time and the replication is static.
      </p>
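      <p>
        The criticality rule described above can be illustrated with a minimal sketch. It is not taken from
the FATMAS methodology itself: the function names, the dictionary layout and the example agents are
hypothetical, chosen only to show how "performs at least one task that no other agent performs" can be
computed.
      </p>
      <preformat preformat-type="code">
```python
from collections import defaultdict


def critical_agents(capabilities):
    """Return the agents that perform at least one task no other agent performs.

    `capabilities` maps an agent name to the set of tasks it can perform
    (a hypothetical data layout assumed for this sketch).
    """
    performers = defaultdict(set)
    for agent, tasks in capabilities.items():
        for task in tasks:
            performers[task].add(agent)
    # A task with exactly one performer makes that performer critical.
    return {next(iter(agents)) for agents in performers.values() if len(agents) == 1}


def replication_plan(capabilities):
    """Critical agents are replicated; the others rely on task redundancy."""
    critical = critical_agents(capabilities)
    return {
        agent: "replicate agent" if agent in critical else "rely on task redundancy"
        for agent in capabilities
    }


# Hypothetical example system, fixed at design time.
capabilities = {
    "planner": {"plan", "schedule"},
    "scheduler": {"schedule"},
    "monitor": {"observe"},
}
plan = replication_plan(capabilities)
```
      </preformat>
      <p>
        In this toy system, the planner and the monitor are critical because each owns a task that nobody else
performs, whereas the scheduler is not. Because the capabilities are fixed before execution, the resulting
plan cannot follow changes in the system at run time.
      </p>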
      <p>
        A. Fedoruk and R. Deters [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] propose to use proxies to make agent replication transparent,
i.e. to enable the replicas of an agent to act as a single entity with respect to the other agents.
The proxy manages the state of the replicas. All the external and internal communications of the
group are redirected to the proxy. However, this increases the workload of the proxy, which is a
quasi-central entity. To make it reliable, they propose to build a hierarchy of proxies for each group
of replicas. This approach lacks reusability, in particular concerning the replication control.
      </p>
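      <p>
        The idea of a proxy that hides a group of replicas can be sketched as follows. This is only an
illustration under stated assumptions: the class and method names are hypothetical, and the sketch
forwards every message to all replicas (an active replication style chosen for simplicity), which is not
necessarily the protocol used in the cited work.
      </p>
      <preformat preformat-type="code">
```python
class Replica:
    """One copy of an agent; it simply records the messages it has handled."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.log = []

    def handle(self, message):
        if not self.alive:
            raise RuntimeError(self.name + " has failed")
        self.log.append(message)
        return "ack:" + message


class ReplicaGroupProxy:
    """Single entry point that hides a group of replicas from other agents."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def send(self, message):
        # Forward the message to every replica so that their states stay
        # aligned, and answer with the reply of the first live replica.
        reply = None
        for replica in self.replicas:
            try:
                result = replica.handle(message)
            except RuntimeError:
                continue  # failed replicas are skipped; callers never see them
            if reply is None:
                reply = result
        if reply is None:
            raise RuntimeError("all replicas have failed")
        return reply


# Usage: the caller talks to the proxy only, even after one replica fails.
group = ReplicaGroupProxy([Replica("r1"), Replica("r2")])
group.replicas[0].alive = False
answer = group.send("bid 42")
```
      </preformat>
      <p>
        Every message passes through the single proxy object, which makes the workload concentration and the
single point of failure discussed above visible in the sketch.
      </p>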
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we presented a new fault-tolerant multi-agent platform named DimaX. The design
and implementation of fault-tolerant large-scale multi-agent systems require dealing with
problems related to distribution and fault tolerance. To this end, DimaX provides several services: a
naming service for agent localization, a fault detection service for recognizing faulty agents, an
observation service for collecting relevant information, and a replication service for supporting replication
techniques. Thanks to these services and their implementation, DimaX offers useful features such as
scalability, reusability, robustness, and adaptability for fault-tolerant MAS development. In particular, we
achieve robustness by using replication techniques. Contrary to other approaches (e.g., diagnosis),
replication enables us to run critical multi-agent applications without interruption. Moreover,
our control of replication makes it possible to change replication strategies dynamically, to better adapt
to the evolution of the MAS context.</p>
      <p>To generalize our approach, future work will propose a design methodology for fault-tolerant
large-scale MAS. The principles developed in our approach to the failure problem in MAS will be
the basis of the methodology.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bertier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Marin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Sens</surname>
          </string-name>
          .
          <article-title>Performance analysis of a hierarchical failure detector</article-title>
          .
          <source>In Proceedings of the International Conference on Dependable Systems and Networks (DSN'2003)</source>
          , pages
          <fpage>635</fpage>
          -
          <lpage>644</lpage>
          , San Francisco, USA,
          <year>June 2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fedoruk</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Deters</surname>
          </string-name>
          .
          <article-title>Improving fault-tolerance in MAS with dynamic proxy replicate groups</article-title>
          .
          <source>In IAT</source>
          , pages
          <fpage>364</fpage>
          -
          <lpage>370</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <collab>FIPA Foundation for Intelligent Physical Agents</collab>
          .
          <article-title>FIPA ACL message structure specification</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guerraoui</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Schiper</surname>
          </string-name>
          .
          <article-title>Software-based replication for fault-tolerance</article-title>
          .
          <source>IEEE Computer</source>
          ,
          <volume>30</volume>
          (
          <issue>3</issue>
          ):
          <fpage>68</fpage>
          -
          <lpage>74</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guessoum</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.P.</given-names>
            <surname>Briot</surname>
          </string-name>
          .
          <article-title>From active object to autonomous agents</article-title>
          .
          <source>IEEE Concurrency</source>
          ,
          <volume>7</volume>
          (
          <issue>3</issue>
          ):
          <fpage>68</fpage>
          -
          <lpage>78</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guessoum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.P.</given-names>
            <surname>Briot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Marin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Sens</surname>
          </string-name>
          .
          <article-title>Dynamic and adaptive replication for large-scale reliable multi-agent systems</article-title>
          .
          <source>In Proc. Second Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS '03)</source>
          ,
          <source>LNCS 2603</source>
          , pages
          <fpage>182</fpage>
          -
          <lpage>198</lpage>
          , Oregon, USA, May
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guessoum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Faci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J-P.</given-names>
            <surname>Briot</surname>
          </string-name>
          .
          <article-title>Adaptive replication of large scale MASs: Towards a fault-tolerant multiagent platform</article-title>
          . In Springer Verlag,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hagg</surname>
          </string-name>
          .
          <article-title>A sentinel approach to fault handling in multi-agent systems</article-title>
          . volume
          <volume>1286</volume>
          <source>of LNCS</source>
          , pages
          <fpage>190</fpage>
          -
          <lpage>195</lpage>
          . Springer-Verlag,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Helsinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Thome</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Wright</surname>
          </string-name>
          .
          <article-title>Cougaar: a scalable, distributed multi-agent architecture</article-title>
          .
          <source>In SMC (2)</source>
          , pages
          <fpage>1910</fpage>
          -
          <lpage>1917</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Horling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Benyo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Lesser</surname>
          </string-name>
          .
          <article-title>Using self-diagnosis to adapt organizational structures</article-title>
          .
          <source>In Proc. 5th International Conference on Autonomous Agents</source>
          , pages
          <fpage>529</fpage>
          -
          <lpage>536</lpage>
          , Montreal, Canada,
          <year>June 2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.A.</given-names>
            <surname>Kaminka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.V.</given-names>
            <surname>Pynadath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Tambe</surname>
          </string-name>
          .
          <article-title>Monitoring teams by overhearing: A multiagent plan-recognition approach</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <volume>17</volume>
          (
          <issue>1</issue>
          ):
          <fpage>83</fpage>
          -
          <lpage>135</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.I.</given-names>
            <surname>Kistijantoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Morgan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.K.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.C.</given-names>
            <surname>Little</surname>
          </string-name>
          .
          <article-title>Component replication in distributed systems: a case study using enterprise</article-title>
          .
          <source>In 22nd International Symposium on Reliable Distributed Systems (SRDS'03)</source>
          , pages
          <fpage>89</fpage>
          -
          <lpage>99</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rodriguez-Aguilar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Dellarocas</surname>
          </string-name>
          .
          <article-title>Using domain-independent exception handling services to enable robust open multi-agent systems: the case of agent death</article-title>
          .
          <source>Journal of Autonomous Agents and Multi-Agent Systems</source>
          ,
          <volume>7</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>179</fpage>
          -
          <lpage>189</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.R.</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <article-title>Towards a fault-tolerant multiagent system architecture</article-title>
          .
          <source>In Proc. of 4th International Conference on Autonomous Agents</source>
          , pages
          <fpage>459</fpage>
          -
          <lpage>466</lpage>
          , New York, USA,
          <year>June 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>O.</given-names>
            <surname>Marin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.P.</given-names>
            <surname>Briot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guessoum</surname>
          </string-name>
          .
          <article-title>Towards adaptive fault-tolerance for distributed multi-agents systems</article-title>
          .
          <source>In Proc. Fourth European Research Seminar on Advances in Distributed Systems (ERSADS'01)</source>
          , pages
          <fpage>195</fpage>
          -
          <lpage>201</lpage>
          , Bertinoro, Italy, May
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mellouli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Moulin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.W.</given-names>
            <surname>Mineau</surname>
          </string-name>
          .
          <article-title>Towards a modelling methodology for fault-tolerant multi-agent systems</article-title>
          .
          <source>In Informatica Journal 28</source>
          , pages
          <fpage>31</fpage>
          -
          <lpage>40</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Powell</surname>
          </string-name>
          .
          <article-title>Delta-4: A generic architecture for dependable distributed computing</article-title>
          . In Springer Verlag,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P-M.</given-names>
            <surname>Ricordel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Demazeau</surname>
          </string-name>
          .
          <article-title>From analysis to deployment: A multi-agent platform survey</article-title>
          . volume
          <volume>1972</volume>
          <source>of LNAI</source>
          , pages
          <fpage>93</fpage>
          -
          <lpage>106</lpage>
          . Springer-Verlag,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Souchon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Urtado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vauttier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Dony</surname>
          </string-name>
          .
          <article-title>A proposition of exception handling in multiagent systems</article-title>
          .
          <source>In Proc. Second Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS '03)</source>
          , LNCS 2603, page 8, Oregon, USA, May
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>van Renesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Birman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Maffeis</surname>
          </string-name>
          .
          <article-title>A flexible group communication system</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>39</volume>
          (
          <issue>4</issue>
          ):
          <fpage>76</fpage>
          -
          <lpage>83</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>