<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">DimaX: A Fault-Tolerant Multi-Agent Platform</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Nora</forename><surname>Faci</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">CReSTIC-MODECO Team</orgName>
								<orgName type="institution">University Reims Champagne-Ardenne</orgName>
								<address>
									<addrLine>Rue des Crayeres BP 1035</addrLine>
									<postCode>51687</postCode>
									<settlement>Reims</settlement>
									<country key="FR">FRANCE</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Zahia</forename><surname>Guessoum</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Olivier</forename><surname>Marin</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">LIP6-OASIS and SRC Teams</orgName>
								<orgName type="institution">University Pierre and Marie Curie</orgName>
								<address>
									<addrLine>8 Rue du Capitaine Scott</addrLine>
									<postCode>75015</postCode>
									<settlement>Paris</settlement>
									<country key="FR">FRANCE</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">DimaX: A Fault-Tolerant Multi-Agent Platform</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">D8783622389390DA3F2B43EF2E957787</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T14:54+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Fault tolerance is an important property of large-scale multi-agent systems, as the failure rate grows with the number of hosts, the number of deployed agents, and the duration of computation. Several approaches have been introduced to deal with some aspects of the fault-tolerance problem. However, most existing solutions are ad hoc. Thus, no existing multi-agent architecture or platform provides a fault-tolerance service that can be reused to facilitate the design and implementation of reliable multi-agent systems. We have therefore developed a fault-tolerant multi-agent platform (named DimaX) which deals with fail-stop failures such as bugs and/or machine breakdowns. It brings fault tolerance to multi-agent applications by using replication techniques. It is based on a replication framework (named DARX).</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Fault tolerance is a relevant problem in multi-agent systems (MAS). Nowadays, MASs are naturally employed to build distributed applications. In particular, we are interested in large-scale MASs which are physically distributed and characterized by a dynamic environment with limited resources. As the failure rate grows with the number of hosts, the number of deployed agents, and the duration of computation, these applications are subject to more failures.</p><p>To deal with some aspects of the fault-tolerance problem, several approaches <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b12">13]</ref> were introduced. To detect faults in MAS and recover from them, Hagg <ref type="bibr" target="#b7">[8]</ref> introduced the sentinel concept. In his project, agents interact to achieve functionalities. The designer associates a sentinel with each functionality. These sentinels observe the different agents and detect functionality deviations in order to diagnose faults and repair them. Kumar et al. <ref type="bibr" target="#b13">[14]</ref> proposed a team of brokers to recover from faults regardless of their cause. A broker offers several services, such as searching for appropriate agents for a given task. As a task can be performed by several agents, an agent failure remains transparent as long as there are safe agents. These approaches provide interesting solutions. However, they are ad hoc and are suitable only for small-scale multi-agent applications; they cannot be reused to build other multi-agent applications. For instance, Hagg's sentinels are specific to each MAS, as are Kumar's brokers, which use domain knowledge to deliver their services. 
Thus, no existing multi-agent architecture or platform provides a fault-tolerance service that can be reused to facilitate the design and implementation of reliable multi-agent systems.</p><p>The aim of this paper is to present a fault-tolerant multi-agent platform (named DimaX). The design of fault-tolerant MAS requires dealing with problems related to distribution and fault tolerance. DimaX offers several services, such as naming, fault detection and recovery. To make MAS reliable, DimaX uses replication techniques. Moreover, DimaX provides developers with libraries of reusable components for building MAS.</p><p>The remainder of this paper is organized as follows. Section 2 presents our DimaX platform. Section 3 shows how DimaX can be used through a toy problem. Section 4 gives the DimaX features provided for fault-tolerant MAS development. Section 5 discusses the related work. Finally, Section 6 summarizes our approach and ongoing work.</p><p>The present section defines the type of failures DimaX deals with. Then, it presents the DimaX services for developing fault-tolerant MAS.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Fault Model</head><p>The most generally accepted failure classification can be found in <ref type="bibr" target="#b16">[17]</ref>:</p><p>1. A crash failure means a component stops producing output; it is the simplest failure to contend with.</p><p>2. An omission failure is a transient crash failure: the faulty component will eventually resume its output production.</p><p>3. A timing failure occurs when output is produced outside its specified time frame.</p><p>4. An arbitrary (or byzantine) failure equates to the production of arbitrary output values at arbitrary times.</p><p>Given this classification, two types of failure models are usually considered in distributed environments <ref type="bibr" target="#b16">[17]</ref>:</p><p>• fail-silent, where the considered system allows only crash failures, and</p><p>• fail-uncontrolled, where any type of failure may occur.</p><p>In this work we focus on the fail-silent model. An agent failure is defined as its abnormal termination due to failure in an underlying resource. This could be either a bug in the underlying operating system, or a local host crash or a network disconnection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">DimaX Services</head><p>DimaX is the result of an integration of a multi-agent platform (named DIMA <ref type="bibr" target="#b4">[5]</ref>) and a fault tolerance framework (named DARX <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b0">1]</ref>). Figure <ref type="figure" target="#fig_0">1</ref> gives an overview of DimaX and its main components and services. DimaX is founded on three levels: system (i.e., DARX middleware), application (i.e., agents) and control. At the application level, DIMA provides a set of libraries to build multi-agent applications. Moreover, DARX provides the mechanisms necessary for distributing, observing and replicating agents as services. These mechanisms operate at the middleware level. Thus, a DimaX server offers the following services: naming, fault detection, observation and replication. At the control level, DimaX provides a control mechanism of replication which is automatically performed with cooperation of the observation service <ref type="bibr" target="#b6">[7]</ref>. This mechanism decides which agent to replicate and where to replicate it.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1">Naming Service</head><p>One of the problems related to the distribution of multi-agent systems is locating agents at the time of message sending. A naming server maintains the list (i.e., white pages) of all the agents within its administration domain. When an agent is created, it is registered at both the DimaX server and the naming server. To send messages to another agent, an agent needs to know the application-level identifier of the receiver. However, the transmission of these messages through DimaX servers requires some knowledge about the physical location of the receiver (i.e., the IP address and a port number). The local DimaX server requests this information from the naming server and stores it locally in a cache. The cache thus contains the list of agents which have been contacted, which spares a DimaX server from repeating the same search several times.</p></div>
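<div xmlns="http://www.tei-c.org/ns/1.0"><p>The lookup path described above can be illustrated with a minimal Java sketch. It is not DimaX code: the class CachingResolver, its methods, and the use of a Function to stand in for the remote naming server are assumptions made for illustration only. The sketch shows the one property stated in the text, namely that the naming server is queried only on a cache miss.</p><p><code lang="java"><![CDATA[
```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of the address cache kept by a local DimaX server. */
class CachingResolver {
    // Maps an application-level agent identifier to a physical "ip:port" address.
    private final Map<String, String> cache = new HashMap<>();
    // Stands in for the remote naming server (white pages); assumed interface.
    private final Function<String, String> namingServer;
    private int remoteLookups = 0;

    CachingResolver(Function<String, String> namingServer) {
        this.namingServer = namingServer;
    }

    /** Returns the address of an agent, asking the naming server only on a cache miss. */
    String resolve(String agentId) {
        String address = cache.get(agentId);
        if (address == null) {
            remoteLookups++;
            address = namingServer.apply(agentId);
            if (address != null) {
                cache.put(agentId, address);
            }
        }
        return address;
    }

    /** Drops a possibly stale entry so that the next lookup goes back to the naming server. */
    void invalidate(String agentId) {
        cache.remove(agentId);
    }

    /** Number of requests actually sent to the naming server. */
    int remoteLookups() {
        return remoteLookups;
    }
}
```
]]></code></p><p>In such a design, a cached entry would presumably have to be invalidated when the hosting server is reported faulty, so that the next lookup returns the address of a replica.</p></div>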
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2">Fault Detection Service</head><p>Failure detection is an essential aspect of any fault-tolerant system; indeed, it is necessary to recognize a faulty agent. The DARX fault detection service is based on the heartbeat technique: a process p sends an I am alive message to other processes, such as q, to inform them that it is safe (see Figure <ref type="figure" target="#fig_1">2</ref>). This technique has two parameters:</p><p>• the heartbeat period: the time between two emissions of the I am alive message,</p><p>• the timeout delay: the time that may elapse after the last reception of an I am alive message from p before q starts suspecting p; q then suspects p until a new I am alive message from p is received.</p><p>The detection results may be incorrect: q may detect that p has crashed while p is actually safe but its transmissions are delayed for some reason (e.g., communication load). To overcome this problem, one solution is to estimate the arrival date of the next I am alive message, with a dynamic margin. These values are functions of the quality of service of the network and the application <ref type="bibr" target="#b0">[1]</ref>. When a server detects the failure of another DimaX server, its naming module removes from its list all the replicated agents hosted by the faulty server and replaces them with their replicas located on other hosts. The replacement is initiated by the failure notification.</p></div>
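<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the heartbeat technique with an estimated arrival date, the following Java sketch keeps, for one monitored process p, a smoothed estimate of the interval between I am alive messages. It is only a sketch under stated assumptions: the class HeartbeatMonitor, the exponential smoothing with weight ALPHA, and the fixed safety margin are illustrative choices and are not the estimation method of DARX, which is described in <ref type="bibr" target="#b0">[1]</ref>.</p><p><code lang="java"><![CDATA[
```java
/** Hypothetical heartbeat monitor kept by a process q for one remote process p. */
class HeartbeatMonitor {
    // Smoothing weight for the interval estimate; an arbitrary illustrative value.
    private static final double ALPHA = 0.25;

    private final long safetyMarginMs; // assumed fixed part of the margin
    private long lastArrivalMs;        // arrival date of the last "I am alive" message
    private double meanIntervalMs;     // running estimate of the heartbeat period

    HeartbeatMonitor(long nowMs, long heartbeatPeriodMs, long safetyMarginMs) {
        this.lastArrivalMs = nowMs;
        this.meanIntervalMs = heartbeatPeriodMs;
        this.safetyMarginMs = safetyMarginMs;
    }

    /** Records an "I am alive" message and adapts the interval estimate to the observed delay. */
    void onHeartbeat(long nowMs) {
        long observed = nowMs - lastArrivalMs;
        meanIntervalMs = (1 - ALPHA) * meanIntervalMs + ALPHA * observed;
        lastArrivalMs = nowMs;
    }

    /** Estimated latest arrival date of the next heartbeat. */
    long deadlineMs() {
        return lastArrivalMs + Math.round(meanIntervalMs) + safetyMarginMs;
    }

    /** q suspects p once the deadline has passed; a later heartbeat lifts the suspicion. */
    boolean isSuspected(long nowMs) {
        return nowMs > deadlineMs();
    }
}
```
]]></code></p><p>With a period of 100 ms and a margin of 50 ms, the first deadline is 150 ms after creation. A heartbeat that arrives late (after 200 ms) raises the interval estimate to 125 ms, so the next deadline is more tolerant of the observed delay.</p></div>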
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.3">Observation Service</head><p>The functionalities of the observation service are fundamental for controlling replication. An observation module collects data at two levels:</p><p>• system level: data about the execution environment of the MAS, such as CPU time and mean time between failures,</p><p>• application level: information about its dynamic characteristics, such as the interaction events among agents (e.g., the sent and received messages).</p><p>The observation service relies on an organization of reactive agents (named host-monitors and agent-monitors) (see Figure <ref type="figure" target="#fig_2">3</ref>). An agent-monitor is associated with each agent of the application (named domain agents) and a host-monitor is associated with each host. These monitoring agents (agent-monitors and host-monitors) are hierarchically organized. Each agent-monitor communicates with only one host-monitor. Host-monitors exchange their local information to build global information (e.g., global number of messages, global quantity of exchanged information).</p><p>After each interval of time ∆t, the host-monitor sends the collected events and data to the corresponding agent-monitors. When the criticality<ref type="foot" target="#foot_0">1</ref> of the domain agent is significantly modified, the agent-monitor notifies its host-monitor. The latter informs the other host-monitors so that they update global information. In turn, agent-monitors are informed by their host-monitor when global information changes significantly (see <ref type="bibr" target="#b6">[7]</ref> for more details).</p></div>
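<div xmlns="http://www.tei-c.org/ns/1.0"><p>The rule that an agent-monitor notifies its host-monitor only when criticality is significantly modified can be sketched as a simple threshold test. The text does not define what counts as significant, so the class AgentMonitorSketch and its absolute threshold are assumptions made purely for illustration.</p><p><code lang="java"><![CDATA[
```java
/** Hypothetical agent-monitor that reports criticality only on significant change. */
class AgentMonitorSketch {
    private final double threshold; // assumed meaning of "significantly modified"
    private double lastReported;    // last criticality sent to the host-monitor
    private int notifications = 0;

    AgentMonitorSketch(double initialCriticality, double threshold) {
        this.lastReported = initialCriticality;
        this.threshold = threshold;
    }

    /**
     * Called after each observation interval with the newly computed criticality.
     * Returns true when the host-monitor would be notified.
     */
    boolean update(double criticality) {
        if (Math.abs(criticality - lastReported) >= threshold) {
            lastReported = criticality;
            notifications++;
            return true;
        }
        return false;
    }

    /** Number of notifications sent to the host-monitor so far. */
    int notifications() {
        return notifications;
    }
}
```
]]></code></p><p>Filtering small variations in this way keeps the traffic between monitors low, which is consistent with the hierarchical organization described above.</p></div>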
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.4">Replication Service</head><p>Replication is an effective way to achieve fault tolerance in distributed systems, and its efficiency has been demonstrated <ref type="bibr" target="#b11">[12]</ref>. We therefore propose to use replication mechanisms to avoid failures of multi-agent systems. Replication makes it possible to run multi-agent systems without interruption, in spite of failures.</p><p>A replicated agent (see Section 2.4) is an entity that possesses two or more copies of its behavior (or replicas) on different hosts. There are two main types of replication protocols:</p><p>• active replication, in which all replicas process concurrently all input messages, and</p><p>• passive replication, in which only one of the replicas processes all input messages and periodically transmits its current state to the other replicas in order to maintain consistency.</p><p>Active replication strategies provide fast recovery but lead to a high overhead: if the degree of replication is n, the n replicas are activated simultaneously. Passive replication minimizes processor utilization by activating redundant replicas only in case of failure. That is, if the active replica is found to be faulty, a new replica is elected among the set of passive ones and the execution is restarted from the last saved state. This technique requires fewer CPU resources than the active one, but it needs checkpoint management, which remains expensive in processing time and space.</p><p>Many toolkits (e.g., see <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b19">20]</ref>) use only one of these techniques, and so may suffer from its disadvantages. In contrast to these approaches, DimaX relies on the DARX replication framework <ref type="bibr" target="#b14">[15]</ref>, which uses both techniques in an adaptive manner, depending on the evolution of the MAS context. The designer can dynamically change replication strategies during the MAS execution.</p></div>
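<div xmlns="http://www.tei-c.org/ns/1.0"><p>The passive protocol described above (one active replica, periodic state transfer, election of a passive replica on failure) can be sketched in a few lines of Java. This is an illustrative toy, not the DARX implementation: the classes CounterReplica and PassiveGroup, the integer state, and the choice of the first passive replica as the new leader are all assumptions.</p><p><code lang="java"><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical replica whose whole state is a single integer. */
class CounterReplica {
    final String host;
    int state;

    CounterReplica(String host) {
        this.host = host;
    }
}

/** Hypothetical passive replication group: one active replica, the others hold checkpoints. */
class PassiveGroup {
    private CounterReplica active;
    private final List<CounterReplica> passive = new ArrayList<>();

    PassiveGroup(String activeHost, String... passiveHosts) {
        active = new CounterReplica(activeHost);
        for (String h : passiveHosts) {
            passive.add(new CounterReplica(h));
        }
    }

    /** Only the active replica processes input messages. */
    void process(int increment) {
        active.state += increment;
    }

    /** Periodic checkpoint: the active state is copied to every passive replica. */
    void checkpoint() {
        for (CounterReplica r : passive) {
            r.state = active.state;
        }
    }

    /** On failure of the active replica, a passive one takes over from the last saved state. */
    void onActiveFailure() {
        if (passive.isEmpty()) {
            throw new IllegalStateException("no replica left");
        }
        active = passive.remove(0);
    }

    int state() {
        return active.state;
    }

    String activeHost() {
        return active.host;
    }
}
```
]]></code></p><p>The sketch makes the trade-off visible: work processed after the last checkpoint is lost on failover, whereas an active scheme would avoid the loss at the price of running every replica on every message.</p></div>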
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Control of Replication in DimaX</head><p>Replication has been successfully applied to several distributed applications. These distributed applications are characterized by a small number of components, and the criticality of these components is often static. So, the number of replicas and the replication strategy are explicitly and statically defined by the designer before runtime. However, multi-agent applications are more complex than traditional distributed ones. They have dynamic organizational structures, adaptive agent behaviors and a large number of agents. So, the criticality of agents may evolve dynamically during the course of computation. Our solution is a replication control mechanism which decides, dynamically, which agent should be replicated and with what strategy (how many replicas and where to create them). This control mechanism dynamically estimates the agents' criticality. We have experimented with two strategies based on organizational concepts to estimate the criticality of an agent.</p><p>The first strategy we studied is based on the concept of role. A role, within an organization, represents a pattern of services, activities and relations. As such, it captures some information about the relative importance of roles and their interdependencies. A role analysis thus represents the set of interaction events resulting from the domain agent interactions (sent and received messages). These events are then used to determine the roles of the agent. This strategy is described in <ref type="bibr" target="#b5">[6]</ref>.</p><p>The second strategy that we studied is based on the concept of dependency. Intuitively, the more agents depend on a given agent, the more critical it is in the organization. The dependencies are inferred through the analysis of communication between agents. This second strategy is described in <ref type="bibr" target="#b6">[7]</ref>.</p></div>
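<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the dependency-based intuition concrete, the Java sketch below estimates criticality from observed messages and derives a number of copies from it. The formulas are assumptions chosen for simplicity and are not those of <ref type="bibr" target="#b6">[7]</ref>: criticality is taken as the share of all observed messages that an agent receives, and a hypothetical budget of extra replicas is shared in proportion to it. The class CriticalityEstimator and its methods are likewise hypothetical.</p><p><code lang="java"><![CDATA[
```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical criticality estimator based on observed message traffic. */
class CriticalityEstimator {
    // Number of messages received by each known agent.
    private final Map<String, Integer> received = new HashMap<>();
    private int total = 0;

    /** Records one observed message; the sender is treated as depending on the receiver. */
    void onMessage(String sender, String receiver) {
        received.merge(receiver, 1, Integer::sum);
        received.putIfAbsent(sender, 0);
        total++;
    }

    /** Assumed measure: fraction of all observed messages addressed to this agent. */
    double criticality(String agent) {
        if (total == 0) {
            return 0.0;
        }
        return received.getOrDefault(agent, 0) / (double) total;
    }

    /** One base copy plus a share of the extra-replica budget, rounded down. */
    int copies(String agent, int extraReplicaBudget) {
        return 1 + (int) Math.floor(criticality(agent) * extraReplicaBudget);
    }
}
```
]]></code></p><p>Because the counts are updated as messages are observed, the estimate follows the evolution of the organization, which is the point of controlling replication dynamically rather than fixing it at design time.</p></div>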
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">DimaX Agents</head><p>DimaX offers several libraries and mechanisms to facilitate the design and implementation of fault-tolerant multi-agent systems. These libraries and mechanisms are provided by DIMA and DARX.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.1">DIMA Agent Behaviors</head><p>DIMA is a Java multi-agent platform. Its kernel is a framework of proactive components which represent autonomous and proactive entities. A simple DIMA agent architecture consists of a proactive component, an agent engine, and a communication component (see Figure <ref type="figure" target="#fig_3">4</ref>).</p><p>A proactive component (the AgentBehavior class) represents an autonomous and proactive entity. It provides the basic structure to represent behaviors. The main functionalities of a proactive component may be extended in the subclasses. An instance of AgentBehavior describes:</p><p>• The goal of the proactive component, which is implicitly or explicitly described by the method isAlive().</p><p>• The basic behaviors of the proactive component. A behavior is a sequence of actions that change the internal state or send a message to other components.</p><p>The class AgentBehavior and its subclasses represent the internal activity of the agent. The instance method proactivityLoop() (see Table <ref type="table" target="#tab_1">1</ref>), used by startUp(), defines the basic loop of the agents. An agent engine is provided to launch and support the agent activity. AgentEngine implements Runnable. In the latter, the method run has been redefined (see the listing given with Table <ref type="table" target="#tab_1">1</ref>).</p><p>DIMA also provides several services like the directory facilitator service. DIMA can be used easily to build MASs. To make them reliable, we integrated DIMA agents and DarX tasks. Before describing the result of this integration, we define the DarX tasks.</p></div>
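<div xmlns="http://www.tei-c.org/ns/1.0"><p>The control flow of a proactive component can be tried out with the self-contained Java sketch below. The method names isAlive(), step(), proactivityLoop(), startUp() and the pre/post hooks follow Table <ref type="table" target="#tab_1">1</ref>, but the classes BehaviorSketch and CountdownBehavior are hypothetical stand-ins written for this illustration and are not part of the DIMA library.</p><p><code lang="java"><![CDATA[
```java
/** Hypothetical stand-in for a proactive component, mirroring the methods of Table 1. */
abstract class BehaviorSketch {
    /** Tests whether the agent has not yet reached its goal. */
    public abstract boolean isAlive();

    /** One execution cycle of the agent. */
    public abstract void step();

    // Optional hooks; empty by default in this sketch.
    public void proactivityInitialize() {}
    public void proactivityTerminate() {}
    public void preActivity() {}
    public void postActivity() {}

    /** Basic loop: run cycles until the goal is reached. */
    public void proactivityLoop() {
        while (isAlive()) {
            preActivity();
            step();
            postActivity();
        }
    }

    /** Initializes the behavior, runs the loop, then terminates. */
    public void startUp() {
        proactivityInitialize();
        proactivityLoop();
        proactivityTerminate();
    }
}

/** Toy behavior whose goal is reached after a fixed number of steps. */
class CountdownBehavior extends BehaviorSketch {
    private int remaining;
    int steps = 0;

    CountdownBehavior(int n) {
        this.remaining = n;
    }

    @Override
    public boolean isAlive() {
        return remaining > 0;
    }

    @Override
    public void step() {
        remaining--;
        steps++;
    }
}
```
]]></code></p><p>An engine in the spirit of AgentEngine would then only need to call startUp() from its run method, for example on a dedicated thread.</p></div>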
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.2">DarX Tasks</head><p>DARX <ref type="bibr" target="#b14">[15]</ref> is a framework for designing reliable distributed applications which include a set of distributed communicating entities (named DarX tasks). It includes transparent replication management. DARX handles replication groups. Each of these groups consists of software entities (the replicas) which represent the same DarX task (see Figure <ref type="figure" target="#fig_4">5</ref>). A DarX task can be replicated several times and with different replication strategies. It is wrapped into a TaskShell which is responsible for replication group management. To maintain coherence between the different replicas, the TaskShell delivers received messages to all active replicas. It also periodically updates the state of the passive replicas; this requires suspending the DarX task and then resuming it. When the TaskShell receives several identical replies from different replicas of the same task, it uses a filter mechanism to forward the first reply and discard the redundant ones.</p><p>The TaskShell sends outgoing messages through its encapsulated DarXCommInterface. The communication between distinct TaskShells is performed via a proxy: the RemoteTask (see Figure <ref type="figure" target="#fig_6">7</ref>).</p><p>Thus, the sender of a message does not need to know the number of replicas of the receiver; the RemoteTask of the receiver delegates the messages to the corresponding TaskShell, which transmits them to all replicas of the same agent. Replication has a cost in communication, but DARX optimizes it by piggybacking application-level messages on the I am alive messages (see Section 2.2.2).</p></div>
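<div xmlns="http://www.tei-c.org/ns/1.0"><p>The filter mechanism mentioned above (forward the first reply, discard identical replies from the other replicas) can be sketched as follows. The text does not say how DARX recognizes identical replies, so this sketch assumes that each reply carries the identifier of the request it answers; the class ReplyFilter and its methods are hypothetical.</p><p><code lang="java"><![CDATA[
```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical duplicate filter of the kind a TaskShell could apply to replica replies. */
class ReplyFilter {
    // Identifiers of requests for which a reply has already been forwarded.
    private final Set<String> answered = new HashSet<>();
    // Replies actually forwarded, in order; stands in for the outgoing channel.
    private final List<String> forwarded = new ArrayList<>();

    /**
     * Handles a reply produced by one replica.
     * Returns true if it was forwarded, false if it was a redundant copy.
     */
    boolean onReply(String requestId, String payload) {
        if (!answered.add(requestId)) {
            return false; // another replica has already answered this request
        }
        forwarded.add(payload);
        return true;
    }

    List<String> forwarded() {
        return forwarded;
    }
}
```
]]></code></p><p>With active replication of degree n, each request produces up to n replies, and only one of them leaves the replication group, so the replication remains invisible to the receiver.</p></div>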
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.3">Fault-Tolerant Agents</head><p>Figure <ref type="figure" target="#fig_5">6</ref> gives the main classes used to model fault-tolerant agents. As the DarXTask is an active entity and each fault-tolerant agent needs to have the structure of a DarXTask, the DarXTask needs to be autonomous and proactive. To make the DarXTask autonomous, we encapsulate the DIMA agent behavior into the DarXTask (see Figure <ref type="figure" target="#fig_6">7</ref>). This agent architecture makes it possible to replicate the agent several times. As the DARX middleware and the DIMA platform both provide mechanisms for execution control, communication and naming but at different levels, their integration requires a set of additional components. This set transparently calls DARX services (e.g., replication, naming) when executing multi-agent applications developed with DIMA; at the application level, no code modification is required. It controls the execution of agents built under DimaX and offers a communication interface between remote agents, through DimaX servers.</p><p>The class diagram of Figure <ref type="figure" target="#fig_7">8</ref> comprises CommunicatingAgentBehavior (activate(), activateWithDarX(), step(), readAllMessage(), sendMessage()), AgentFact (couple: Vector; step(), isAlive(): bool, result(), proactivityInitialize()) and AgentMult (couple: Vector; step(), multiply()).</p><p>The DarXTaskEngine is a DarXTask with the autonomous behaviors of the original DIMA agent. It includes the agent engine (called the DarXTaskExecutor) which executes the lifecycle of the agent (see the proactivityLoop method). For coherence reasons, the execution of the agent lifecycle may be suspended during the creation and/or updates of the replicas. 
When a DimaX agent sends messages to other agents, DimaX provides communication mechanisms to locate agents and deliver messages to them. This delivery is realized through the communication component of DimaX agents (DarXComComponent), which delegates the DIMA message transmissions to the associated DarXCommInterface. This communication interface enables DARX entities to communicate with each other. So, at the application level, the agents exchange DIMA messages which are transmitted via the DARX middleware.</p><p>This section presents the main features that DimaX provides for the development and deployment of fault-tolerant large-scale MAS: scalability, reusability, robustness, and adaptability.</p><p>1. Scalability. A platform is said to be scalable if it can handle increases in problem size (number of agents) and complexity without suffering a noticeable loss of performance.</p><p>In DimaX, the proposed solution is to organize the components of the different services hierarchically in order to minimize the communication overhead they cause. DimaX also provides the global state of the MAS (e.g., the average number of exchanged messages) in a distributed manner. This reduces remote access and avoids bottlenecks, contrary to the case of a central component. Moreover, the messages used by the failure detection service are piggybacked with the messages of the other services and those of the application.</p><p>2. Reusability. To facilitate the design and implementation of fault-tolerant large-scale MAS for developers not trained in fault-tolerance techniques, DimaX provides several component libraries to build multi-agent systems: decision components, communication components, and interaction protocols. For example, the library of interaction protocols provides a generic implementation of interaction protocols. Interaction protocols are reusable components.</p><p>3. Robustness. 
The robustness of MAS is almost always a major concern when they are applied to critical domains like spacecraft or medicine. It is important that this kind of application runs without interruption, in spite of failures such as crashes. DimaX achieves robustness of MASs by using adaptive replication mechanisms. To evaluate the reliability of the platform, we ran a robustness test based on fault injection techniques. The results show that our platform gives the application an interesting degree of robustness. Also, the platform must continue to deliver its services despite the failure of one of its service components. The failure of a machine or a connection often involves the failure of the associated DimaX server. However, in our solution, the fault tolerance protocols are agent-dependent and not place-dependent, i.e., the mechanisms built for providing the continuity of the computation are integrated in the replication groups, and not in the server.</p><p>4. Adaptability. To deal with the problem of limited resources for replicating agents, a good replication mechanism should adapt the replication strategy to the evolution of the environment. Thus, we have introduced, in DimaX, a multi-agent monitoring architecture to control replication. This architecture implements our adaptation mechanisms to define the agent criticality. These mechanisms rely on organizational concepts like role and interdependence graph <ref type="bibr" target="#b6">[7]</ref>. Moreover, due to the problem of heterogeneous resources (i.e., different and dynamic characteristics of the hosts), DimaX uses an adaptive approach to resource management for determining the number of replicas and their placement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Related Work</head><p>In the multi-agent literature <ref type="bibr" target="#b17">[18]</ref>, we can find a large number of multi-agent platforms, but only a few offer fault-tolerance mechanisms. Several corrective solutions to the fault tolerance problem have been proposed. The diagnostic approaches (<ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>) are examples of such solutions. For instance, Kaminka et al. <ref type="bibr" target="#b10">[11]</ref> propose a monitoring approach to detect, diagnose and recover from faults. They use models of relations between mental states of agents. They adopt a procedural plan-recognition-based approach to identify inconsistencies. However, the adaptation is only structural: the relation models may change but the contents of plans are static. Their main hypothesis is that any failure comes from incompleteness of beliefs. The diagnostic approaches are attractive. However, they are complex; they need deep knowledge about the behavior of the system. It is not always possible to have a precise description of the whole multi-agent system. Exception handling approaches are other examples of corrective solutions. Contrary to diagnostic approaches, where fault recovery is performed, these approaches focus on error recovery (<ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b18">19]</ref>). For instance, Souchon et al. <ref type="bibr" target="#b18">[19]</ref> propose an exception handling system (named SAGE), designed for MASs, that addresses some exception handling problems (e.g., exception propagation) related to MAS issues such as preservation of the agent paradigm features and concurrency. To summarize, the corrective approaches are not suitable for critical applications where the diagnosis, or exception propagation, and correction must be done in real time. Kumar et al. 
<ref type="bibr" target="#b13">[14]</ref> advocate a fault-tolerance approach using broker teams. A broker accepts requests, locates capable agents, routes requests and responses, etc. They use multiple brokers which form a team with appropriate commitments. The team members should recover from broker failures insofar as they have team and/or individual commitments, such as connecting to a registered agent which gets disconnected. In other words, this brokering knowledge is shared among the members. This work presents some interesting results, but stays at the theoretical stage. Moreover, it does not address scalability and reusability issues.</p><p>Cougaar <ref type="bibr" target="#b8">[9]</ref> is a Java-based architecture for the construction of large-scale distributed agent-based applications. An agent is a set of problem solving behaviors interacting via blackboards. If an agent is unable to contact a member of its community, it can send a health alert message to a health monitor. This agent is responsible for the recovery of agents. For instance, the recovery of a domain agent consists either of retrieving an appropriate community state needed to pursue the problem solving or of rejoining its community, which has begun a new problem solving stage. However, the approach lacks adaptability; no guarantee is given that the MAS will correctly pursue its goals in spite of agent failures. Failures could cause deadlock situations, since the agents depend on each other to make progress in the problem solving.</p><p>The FATMAS methodology <ref type="bibr" target="#b15">[16]</ref> provides mainly four models used to design and implement the target system and a fault-tolerance technique where only a certain number of agents will be replicated. Here, an agent is critical if it performs at least one task that cannot be performed by any other agent in the system. If the agent is non-critical, then it is not replicated and its tasks are replicated in other agents. 
If it is a critical agent, then it must be replicated. FATMAS proposes guidelines for the analysis and the design of fault-tolerant MAS. Moreover, it provides agent and task replication. This reduces the replication cost. However, the approach addresses only closed MAS; the agent criticality is defined at design time, and the replication is static.</p><p>A. Fedoruk and R. Deters <ref type="bibr" target="#b1">[2]</ref> propose to use proxies to make the use of agent replication transparent, i.e., to enable the replicas of an agent to act as a single entity with respect to the other agents. The proxy manages the state of the replicas. All the external and internal communications of the group are redirected to the proxy. However, this increases the workload of the proxy, which is a quasi-central entity. To make it reliable, they propose to build a hierarchy of proxies for each group of replicas. This approach lacks reusability, in particular concerning the replication control.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this paper, we presented a new fault-tolerant multi-agent platform named DimaX. The design and the implementation of fault-tolerant large-scale multi-agent systems require dealing with problems related to distribution and fault tolerance. For that, DimaX provides several services, namely a naming service for agent localization, a fault detection service for recognizing faulty agents, an observation service for collecting relevant information, and a replication service for supporting replication techniques. Thanks to these services and their implementation, DimaX has interesting features like scalability, reusability, robustness, and adaptability for fault-tolerant MAS development. Thus, we achieve robustness by using replication techniques. Contrary to other approaches (e.g., diagnosis), replication enables us to run critical multi-agent applications without interruption. Moreover, our control of replication makes it possible to change replication strategies dynamically, to better adapt to the evolution of the MAS context.</p><p>To generalize our approach, future work will propose a design methodology for fault-tolerant large-scale MAS. The principles developed in our approach to the failure problem in MAS will be the basis of the methodology.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Overview of DimaX</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The heartbeat technique</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Architecture of the Observation Service</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: DIMA agent architecture</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Communication between DarX tasks. [Diagram labels: CommunicatingAgentBehavior, DarXTaskExecutor, DarXComComponent, +sendMessage(), +receiveMessage(), DarXTaskEngine, +run(), +startTask(), +terminateTask(), DarXTask]</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Fault-tolerant Agent Model</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: DimaX Agent Architecture A DimaX agent is a DIMA agent encapsulated in a particular entity, the DarXTaskEngine.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: UML Diagram of Factorial Application</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 :</head><label>1</label><figDesc>Main methods of AgentBehavior Class</figDesc><table><row><cell cols="2">class AgentEngine extends ProactiveComponentEngine implements Runnable {</cell></row><row><cell cols="2">protected ProactiveComponent proactivity;</cell></row><row><cell cols="2">public Thread thread;</cell></row><row><cell cols="2">public void run() { proactivity.startUp(); } }</cell></row><row><cell>Methods</cell><cell>Description</cell></row><row><cell>public abstract boolean isAlive()</cell><cell>Tests if the agent has not reached its goal.</cell></row><row><cell>public abstract void step()</cell><cell>Represents an execution cycle of the agent.</cell></row><row><cell>void proactivityLoop()</cell><cell>Represents the control of agent behavior. public void proactivityLoop() { while (this.isAlive()) { this.preActivity(); this.step(); this.postActivity(); } }</cell></row><row><cell>public void startUp()</cell><cell>Initializes and activates the control of agent behavior. public void startUp() { this.proactivityInitialize(); this.proactivityLoop(); this.proactivityTerminate(); }</cell></row></table></figure>
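<div xmlns="http://www.tei-c.org/ns/1.0"><p>The control flow summarized in Table 1 can be sketched as a small, self-contained Java program. This is only an illustrative sketch and not the DIMA source code: the class names SketchAgentBehavior, CountdownAgent and ProactivityLoopDemo are hypothetical, and the empty default hooks are an assumption made so that the example compiles on its own.</p>
<p><![CDATA[
```java
// Hypothetical stand-in for the abstract behavior class summarized in Table 1.
abstract class SketchAgentBehavior {
    // True while the agent has not reached its goal.
    public abstract boolean isAlive();

    // One execution cycle of the agent.
    public abstract void step();

    // Hooks around the loop; empty by default (an assumption of this sketch).
    protected void proactivityInitialize() {}
    protected void proactivityTerminate() {}
    protected void preActivity() {}
    protected void postActivity() {}

    // Controls the agent behavior: repeat cycles while the agent is alive.
    public void proactivityLoop() {
        while (isAlive()) {
            preActivity();
            step();
            postActivity();
        }
    }

    // Initializes the agent, runs the control loop, then terminates.
    public void startUp() {
        proactivityInitialize();
        proactivityLoop();
        proactivityTerminate();
    }
}

// Toy agent (not from the paper): its goal is reached after a fixed number of steps.
class CountdownAgent extends SketchAgentBehavior {
    int remaining;
    int steps = 0;

    CountdownAgent(int start) { remaining = start; }

    @Override public boolean isAlive() { return remaining > 0; }

    @Override public void step() { remaining--; steps++; }
}

class ProactivityLoopDemo {
    public static void main(String[] args) {
        CountdownAgent agent = new CountdownAgent(3);
        agent.startUp();
        System.out.println("steps executed: " + agent.steps);
    }
}
```
]]></p>
<p>In this sketch a concrete agent only supplies isAlive() and step(); the order of initialization, looping and termination stays in the base class, which is the division of responsibilities that Table 1 describes.</p></div>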
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The criticality of an agent, with respect to an organization of agents it belongs to, is the measure of the potential impact of the failure of that individual agent on the failure of the whole organization.</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Example</head><p>The aim of DimaX is to augment an already built MAS with fault-tolerance capabilities. This section therefore presents a MAS developed with DIMA and shows how to make it fault-tolerant.</p><p>To illustrate DimaX, we use the Factorial toy problem (n!), which gives some insight into distributed problem solving. We consider two kinds of agents:</p><p>1. AgentFact: these agents have the behavior needed to compute a factorial, but they do not have the behavior to multiply numbers.</p><p>2. AgentMult: these agents have the behavior needed to compute a multiplication.</p><p>These agents are implemented as subclasses of the CommunicatingAgentBehavior class (see Figure <ref type="figure">8</ref>). To compute n!, AgentFact creates a list (named couple) with the numbers from 1 to n: public void proactivityInitialize() { for (int i = 1; i &lt;= n; i++) { couple.addElement(i); } } Then, it sends requests to AgentMult with all possible couples of numbers. When it receives a result, it puts it into the list: public void result(int i) { couple.addElement(i); nbRequests--; /* nbRequests is the number of pending requests */ } If the list has more than one element, new requests are then sent to AgentMult. AgentFact repeats this action while the list contains more than one number or it has not yet received the responses to all the requests it sent. 
This test is performed by its isAlive() method as follows: public boolean isAlive() { return (couple.size() &gt; 1) || !hasAllResponses(); } The AgentFact behavior is defined by: public void step() { readAllMessages(); while (couple.size() &gt; 1) { sendMessage("multiply", couple.elementAt(0), couple.elementAt(1), new AgentName("multiplier")); couple.remove(0); couple.remove( <ref type="formula">1</ref> After the designer has built the agents' behavior with the DIMA multi-agent platform, he/she uses the activateWithDarX method to deploy the MAS and to endow it with fault-tolerance capabilities (i.e., replication). This activation method encapsulates the agent behavior in a DarXTask (see Section 2.4) and registers the agent in the system (i.e., with the naming service). Its parameters are the URL and port of the host where the agent will be replicated. The deployment can be performed as follows: public static void main(String[] args) { AgentFact a = new AgentFact("factorial"); AgentMult b = new AgentMult("multiplier"); /* agents activation on two different machines */ a.activateWithDarX(url1, port1); b.activateWithDarX(url2, port2); } As we can see, distribution and replication have not required any code modification. The distribution cost is therefore minimal. Thus, DimaX facilitates the development of fault-tolerant MAS for developers not trained in fault-tolerance techniques: they need only focus on problem-solving issues such as the agents' behavior and their interactions. The factorial example is very simple; however, the solution is similar even if the application is more complex.</p></div>			</div>
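			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The interaction between AgentFact and AgentMult described in the Example section can be approximated by a single-process Java sketch. It uses no DIMA, DarX or DimaX classes: the class FactorialSketch, the in-memory mailbox that stands in for message passing, and the use of BigInteger are all assumptions made for illustration. Only the names couple and nbRequests and the loop condition are taken from the text.</p>
<p><![CDATA[
```java
import java.math.BigInteger;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Single-process sketch of the Factorial example (hypothetical class, no agent platform).
class FactorialSketch {
    // Stand-in for AgentMult: answers one "multiply" request.
    static BigInteger multiply(BigInteger a, BigInteger b) {
        return a.multiply(b);
    }

    // Stand-in for AgentFact: reduces the list 1..n pairwise until one number remains.
    static BigInteger factorial(int n) {
        if (n < 0) throw new IllegalArgumentException("n must be non-negative");
        List<BigInteger> couple = new ArrayList<>();     // pending operands, as in the text
        Deque<BigInteger> mailbox = new ArrayDeque<>();  // simulated replies from AgentMult
        int nbRequests = 0;                               // requests not yet answered

        // proactivityInitialize(): fill the list with the numbers from 1 to n.
        for (int i = 1; i <= n; i++) couple.add(BigInteger.valueOf(i));
        if (couple.isEmpty()) couple.add(BigInteger.ONE); // 0! = 1 by convention

        // isAlive(): more than one operand left, or some replies still pending.
        while (couple.size() > 1 || nbRequests > 0) {
            // readAllMessages(): every reply is put back into the list.
            while (!mailbox.isEmpty()) {
                couple.add(mailbox.poll());
                nbRequests--;
            }
            // step(): send one request per pair of operands.
            while (couple.size() > 1) {
                BigInteger a = couple.remove(0);
                BigInteger b = couple.remove(0); // index 0 again: the list has shifted
                mailbox.add(multiply(a, b));     // the reply is queued immediately here
                nbRequests++;
            }
        }
        return couple.get(0);
    }

    public static void main(String[] args) {
        System.out.println("5! = " + factorial(5));
    }
}
```
]]></p>
<p>Running the main method prints "5! = 120". In a distributed deployment the replies would arrive asynchronously, which is why the loop condition checks the pending requests as well as the size of the list.</p></div>
			</div>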
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Performance analysis of a hierarchical failure detector</title>
		<author>
			<persName><forename type="first">M</forename><surname>Bertier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Marin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sens</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Dependable Systems and Networks (DSN&apos;2003)</title>
				<meeting>the International Conference on Dependable Systems and Networks (DSN&apos;2003)<address><addrLine>San Francisco, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2003-06">June 2003</date>
			<biblScope unit="page" from="635" to="644" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Improving fault-tolerance in MAS with dynamic proxy replicate groups</title>
		<author>
			<persName><forename type="first">A</forename><surname>Fedoruk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Deters</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IAT</title>
				<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="364" to="370" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m">FIPA Foundation for Intelligent Physical Agents</title>
				<imprint/>
	</monogr>
	<note>FIPA ACL message structure specification</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Software-based replication for fault-tolerance</title>
		<author>
			<persName><forename type="first">R</forename><surname>Guerraoui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Schiper</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Computer</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="68" to="74" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">From active object to autonomous agents</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Guessoum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Briot</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Concurrency</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="68" to="78" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Dynamic and adaptive replication for large-scale reliable multi-agent systems</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Guessoum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Briot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Marin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hamel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sens</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Second Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS &apos;03)</title>
				<meeting>Second Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS &apos;03)<address><addrLine>Oregon, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2001-05">May 2001</date>
			<biblScope unit="volume">2603</biblScope>
			<biblScope unit="page" from="182" to="198" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Adaptive replication of large scale mass: Towards a fault-tolerant multiagent platform</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Guessoum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Faci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J-P</forename><surname>Briot</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<publisher>Springer Verlag</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A sentinel approach to fault handling in multi-agent systems</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hagg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">LNCS</title>
		<imprint>
			<biblScope unit="volume">1286</biblScope>
			<biblScope unit="page" from="190" to="195" />
			<date type="published" when="1997">1997</date>
			<publisher>Springer-Verlag</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Cougaar: a scalable, distributed multi-agent architecture</title>
		<author>
			<persName><forename type="first">A</forename><surname>Helsinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Thome</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wright</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SMC</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="1910" to="1917" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Using self-diagnosis to adapt organizational structures</title>
		<author>
			<persName><forename type="first">B</forename><surname>Horling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Benyo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Lesser</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 5th International Conference on Autonomous Agents</title>
				<meeting>5th International Conference on Autonomous Agents<address><addrLine>Montreal, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2001-06">June 2001</date>
			<biblScope unit="page" from="529" to="536" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Monitoring teams by overhearing: A multiagent plan-recognition approach</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Kaminka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">V</forename><surname>Pynadath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tambe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="83" to="135" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Component replication in distributed systems: a case study using enterprise</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">I</forename><surname>Kistijantoro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Morgan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Shrivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Little</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">22nd International Symposium on Reliable Distributed Systems (SRDS&apos;03)</title>
				<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="89" to="99" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Using domain-independent exception handling services to enable robust open multi-agent systems: the case of agent death</title>
		<author>
			<persName><forename type="first">M</forename><surname>Klein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rodriguez-Aguilar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dellarocas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Autonomous Agents and Multi-Agent Systems</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">1-2</biblScope>
			<biblScope unit="page" from="179" to="189" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Towards a fault-tolerant multiagent system architecture</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">R</forename><surname>Cohen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of 4th International Conference on Autonomous Agents</title>
				<meeting>4th International Conference on Autonomous Agents<address><addrLine>New York, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2000-06">June 2000</date>
			<biblScope unit="page" from="459" to="466" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Towards adaptive fault-tolerance for distributed multi-agents systems</title>
		<author>
			<persName><forename type="first">O</forename><surname>Marin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Briot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Guessoum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Fourth European Research Seminar on Advances in Distributed Systems (ERSADS&apos;01)</title>
				<meeting>Fourth European Research Seminar on Advances in Distributed Systems (ERSADS&apos;01)<address><addrLine>Bertinoro, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2001-05">May 2001</date>
			<biblScope unit="page" from="195" to="201" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Towards a modelling methodology for fault-tolerant multi-agent systems</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mellouli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Moulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">W</forename><surname>Mineau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Informatica Journal</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="31" to="40" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Delta-4: A generic architecture for dependable distributed computing</title>
		<author>
			<persName><forename type="first">D</forename><surname>Powell</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1991">1991</date>
			<publisher>Springer Verlag</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">From analysis to deployment: A multi-agent platform survey</title>
		<author>
			<persName><forename type="first">P-M</forename><surname>Ricordel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Demazeau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">LNAI</title>
		<imprint>
			<biblScope unit="page" from="93" to="106" />
			<biblScope unit="volume">1972</biblScope>
			<date type="published" when="2004">2004</date>
			<publisher>Springer-Verlag</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A proposition of exception handling in multiagent systems</title>
		<author>
			<persName><forename type="first">F</forename><surname>Souchon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Urtado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vauttier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dony</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Second Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS &apos;03)</title>
				<meeting>Second Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS &apos;03)<address><addrLine>Oregon, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2003-05">May 2003</date>
			<biblScope unit="volume">2603</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A flexible group communication system</title>
		<author>
			<persName><forename type="first">R</forename><surname>Van Renesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Birman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Maffeis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="76" to="83" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
