Chronicles for On-line Diagnosis of Distributed Systems Xavier Le Guillou? Irisa – Université de Rennes 1 Campus de Beaulieu 35042 Rennes Cédex xleguill@irisa.fr Abstract. The formalism of chronicles has been proposed to monitor and diag- nose dynamic physical systems. Even if efficient chronicle recognition algorithms exist, it is now well-known that distributed approaches are better suited to mon- itor actual systems. In this article, we adapt the chronicle-based approach to a distributed context and illustrate this work on the monitoring of software compo- nents. Key words: on-line diagnosis, distributed systems, chronicle recognition 1 Introduction Monitoring and diagnosing dynamic systems have become very active topics in research and development for a few years. Besides continuous models based on differential equa- tions, essentially used in control theory and discrete event systems based on finite state machines (automata, Petri nets, . . . ), a formalism commonly used for on-line monitor- ing, in particular by people from the artificial intelligence community, is the one of chronicles. This formalism, proposed in [1], has been widely used and extended [2–4]. A chronicle describes a situation that is worth identifying within the diagnosis con- text. It is made up a set of events and temporal constraints between those events. As a consequence, this formalism fits particularly well problems that consider a temporal dimension. The set of interesting chronicles constitutes the base of chronicles. Then, monitoring the system consists in analyzing flows of events, and recognizing on fly pat- terns described by the base of chronicles. Efficient algorithms exist which explain that this approach has been used for industrial applications as well as medical ones [2, 5, 6]. One of the key issues of model-based approaches for on-line monitoring is the size of the model which is generally too large when dealing with real applications. Dis- tributed or decentralized approaches have been proposed to cope with this problem, like [7–10]. The idea is to consider the system as a set of interacting components instead of a unique entity. The behavior of the system is thus described by a set of local component models and by the synchronization constraints between the component models. Considering chronicle-based approaches, to our knowledge, no distributed approaches exist and the contribution of this paper consists in adapting the chronicle-based ap- proach to distributed systems. ? under the direction of M.-O. Cordier Proceedings of CAiSE-DC 2008 29 This work has been motivated by an application that aims at monitoring the behavior of software components, and more precisely of web services within the context of the WS-D IAMOND (Web Service DIAgnosability, MONitoring and Diagnosis) European project.In this context, a request is sent to a web service which collaborates with other services to provide the adequate reply. Faults may propagate from one service to another and diagnosing them is a crucial issue, in order to react properly. We use a simplified example of an e-foodshop to illustrate our proposal. We first recall the principles of the chronicle recognition approach and give basic definitions in Sect. 2. We introduce in Sect. 3 the simplified example that will be used all along this paper. In Sect. 4, we show how to extend the chronicle-based approach to distributed systems. We first describe the architecture of a chronicle-based distributed system (4.1). Then we extend the chronicle formalism to deal with synchronization constraints (4.2). We describe in 4.3 a push-pull algorithm able to compute a global di- agnosis from the local diagnoses, computed by locally distributed chronicle recognition systems, by checking the synchronization constraints. After an illustrative example in 4.4, we compare our proposal to related work in Sect. 5 and conclude in Sect. 6. 2 Chronicle Recognition Approach The chronicle recognition approach (first introduced in [1]) relies on a set of patterns, named chronicles, which constitutes the chronicle base. Let us recall the formalism and the chronicle recognition algorithm. 2.1 Formalism of Chronicles A chronicle is a set of observable events which are time-constrained and is characteristic of a situation. An event type defines what is observed within the system, e.g. the name of an activity act, the name augmented with the fact that act is starting (namely act+ ) or ending (namely act− ), the name enriched with observable parameters act(?var1 , . . . , ?varn ) or a combination of those possibilities. E denotes the set of possible event types. An event is a pair (e, ?t) where e ∈ E is an event type and ?t the occurrence date of the event. A chronicle (model) C is a pair (S, T ) where S is a set of events and T a set of con- straints between their occurrence dates. When its variables and its occurrence dates are instantiated, a chronicle is called a chronicle instance. 2.2 Chronicle Recognition A chronicle recognition tool, called CRS (Chronicle Recognition System), has been developed by C. Dousson1 . It is in charge of analyzing the input stream of events and of identifying, on the fly, any pattern matching a situation described by a chronicle. Chronicles are compiled into temporal constraint networks which are processed by ef- ficient graph algorithms. CRS is based on a complete forecast of the possible dates for 1 http://crs.elibel.tm.fr/ 30 Proceedings of CAiSE-DC 2008 each event that has not occurred yet. This set (called temporal window) is reduced by propagation of the dates of observed events through the temporal constraint network. When a new event arrives in the input stream, new instances of chronicles are generated in the set of hypotheses, which is managed as a tree. Instances are discarded as soon as possible, when constraints are violated or when temporal windows become empty. [1,3] a b Chronicle model (a,1) (a,3) (b,5) time (a,1) b,[2,4] (a,1) b,[3,4] - Discarded - I1 (a,3) b,[4,6] (a,3) b,[5,6] I2 (a,3) (b,5) I3 Fig. 1. Principle of chronicle recognition Figure 1 shows the principle of the recognition algorithm on a very simple example: a single chronicle model is defined, containing only two events: (a, ?ta ) and (b, ?tb ), with ?ta +1 ≤?tb ≤?ta +3. When event (a, 1) is received, instance I1 is created, which updates the temporal window of the related node b. When a new event (a, 3) occurs, a new instance I2 is created and the forthcoming temporal window of I1 is updated. When event (b, 5) is received, instance I3 is created (from I2) and I1 is destroyed as no more event (b, ?tb ) could match the temporal constraints from now on. Instance I2 is still waiting for another potential event (b, ?tb ) before ?tb > 6. As all the events of I3 are instantiated, this instance is recognized. 3 Motivating Example To illustrate the ideas developed in this paper, we consider an orchestration of three web services, a shop, a supplier and a warehouse, that provide e-shopping capabilities to users. This application keeps the essential properties of the applications we aim to monitor. In particular, we consider closed environments, where a workflow-like descrip- tion of each web service (Fig. 2) involved in the processing of the request is supposed to be available. A customer wants to place an order and selects items on the shop. This list of items is transferred to a supplier which sends a reservation request to a warehouse, for each item of the list. The warehouse returns an acknowledgement to the supplier for each item request and, at the end of the item list, the supplier sends a list of the available items to the shop which forwards it to the customer. The customer agreement terminates the process. Faults may happen during this process. Figure 2.a shows two of them (represented by pentagons), related with the shop. First, when placing his order, the customer may Proceedings of CAiSE-DC 2008 31 Fig. 2. (a) Workflow of the SHOP and reduced workflows of (b) the SUPP and (c) the WH make a data acquisition error, which may result in unexpected items on his reserva- tion list. Then, a timeout may occur when calling the supplier. We consider that only timeouts may occur on the supplier (Fig. 2.b), when calling the warehouse. On the warehouse (Fig. 2.c), things are more complicated. First, an item may be out of stock, resulting in an incomplete reservation list. Then, an internal error may happen, resulting in a denial of service. Figure 3 presents two processes that may result in the same observation on the shop, i.e. a cancellation of the order due to an incorrect reservation list: (a) a data acquisition error, ordering “eggs and teak” instead of “eggs and tea”, for instance, and (b) a stock error happening on the warehouse. Here, we notice that two distinct errors that happen on two distinct services can result in the same local problem, hence the necessity of diagnosing the system globally in order to repair in an adequate way. SHOP SUPP WH SHOP SUPP WH {eggs,teak} {eggs,tea} {eggs} {eggs} avail avail {teak} {tea} avail notAvail {eggs,teak} {eggs} (a) (b) Fig. 3. Two scenarii that may result in a cancelled order 32 Proceedings of CAiSE-DC 2008 4 Extension to Distributed Environments Diagnosing distributed systems thanks to chronicles requires to define a modular di- agnosis architecture capable of merging diagnoses provided by local chronicle-based diagnosers and to enrich the chronicle formalism with synchronization constraints. 4.1 Architecture Figure 4 summarizes our chronicle-based approach architecture. This decentralized sys- tem is composed of a global diagnoser (or broker) in charge of merging the local diag- noses sent by each service and sending global diagnoses to a repair module. Services are composed of the web service itself, logs generated in real time by the web service, a base of chronicles generated off-line, a local diagnoser that uses the logs to instantiate chronicles from the base. Broker (global diagnoser) Local diagnoser 1 Local diagnoser 2 logs 1 base of chronicles 1 logs 2 base of chronicles 2 Web service 1 Web service 2 ... ... Fig. 4. General architecture of a distributed system 4.2 Extension of the Formalism of Chronicles As a fault occurring on a service often propagates to other services, we base our ap- proach on the merging of local diagnoses. As a consequence, we enrich the initial for- malism of chronicles with synchronization constraints that allow the broker to spot homologous chronicles and merge them. Before defining a distributed chronicle, let us firstly define what is a synchronization point. The status of a variable is a boolean that denotes if the value of a chronicle variable is normal (¬err) or abnormal (err) in a given execution case. A synchronization vari- able is a pair (?var, status) where ?var is a (non temporal) chronicle variable and status the status of this variable inside a given chronicle model. A synchronization point is a tuple (e, {vars}, servtype ) where e is an event type, {vars} a set of synchronization variables linked with this event type and servtype a type of remote service the local service communicates with. An instance of a syn- chronization point is a synchronization point in which variables are instantiated, and servtype is instantiated as the effective address of the remote service. Proceedings of CAiSE-DC 2008 33 A synchronization point is incoming if it corresponds to a servremote → servlocal communication, outgoing for the contrary (see example chronicle below). Referring to Fig. 2.a and Sect. 3, here is one of the two synchronization points on the SHOP, which is instantiated as follows, in the execution case of an external error (see Fig. 6): (ChkN Reserve+ , {(?SHOP listIn, err)}, supplier). It expresses the fact that the error is coming from the supplier, through the ?SHOPlistIn variable, which is received by the SHOP at the end of the ChkNReserve activity. A distributed chronicle is a classical chronicle enriched with a “color” and a “syn- chronization” part, so that we can merge it with chronicles from adjacent services. The color of a chronicle K represents the degree of importance of a chronicle and its capacity to trigger a global diagnosis process. Two colors are used: red for faults that may trigger the broker and green for normal behaviors and non critical faults. Distributed chronicle: a distributed chronicle is a tuple CD = (S, T , O, I, K) where S is a set of events, T a graph of constraints between their occurrence dates, O and I are respectively two sets of outgoing and incoming synchronization points, and K is the color of the chronicle. Let us consider the chronicle describing the external error case. We have the distributed chronicle model CD = (S, T , O, I, K): S = { (ReceiveOrder, ?t1 ), (ChkN Reserve− (?SHOP listOut), ?t2 ), (ChkN Reserve+ (?SHOP listIn), ?t3 ), (SendBill, ?t4 ), (ReceiveConf irm, ?t5 ), (F orwardOrder, ?t6 ) } T = {?t1