=Paper= {{Paper |id=Vol-343/paper-3 |storemode=property |title=Chronicles for On-line Diagnosis of Distributed Systems |pdfUrl=https://ceur-ws.org/Vol-343/paper3.pdf |volume=Vol-343 |authors=Xavier Le Guillou }} ==Chronicles for On-line Diagnosis of Distributed Systems== https://ceur-ws.org/Vol-343/paper3.pdf
Chronicles for On-line Diagnosis of Distributed Systems

                                      Xavier Le Guillou?

                                  Irisa – Université de Rennes 1
                                       Campus de Beaulieu
                                       35042 Rennes Cédex
                                     xleguill@irisa.fr



         Abstract. The formalism of chronicles has been proposed to monitor and diag-
         nose dynamic physical systems. Even if efficient chronicle recognition algorithms
         exist, it is now well-known that distributed approaches are better suited to mon-
         itor actual systems. In this article, we adapt the chronicle-based approach to a
         distributed context and illustrate this work on the monitoring of software compo-
         nents.

         Key words: on-line diagnosis, distributed systems, chronicle recognition


1 Introduction

Monitoring and diagnosing dynamic systems have become very active topics in research
and development for a few years. Besides continuous models based on differential equa-
tions, essentially used in control theory and discrete event systems based on finite state
machines (automata, Petri nets, . . . ), a formalism commonly used for on-line monitor-
ing, in particular by people from the artificial intelligence community, is the one of
chronicles. This formalism, proposed in [1], has been widely used and extended [2–4].
A chronicle describes a situation that is worth identifying within the diagnosis con-
text. It is made up a set of events and temporal constraints between those events. As
a consequence, this formalism fits particularly well problems that consider a temporal
dimension. The set of interesting chronicles constitutes the base of chronicles. Then,
monitoring the system consists in analyzing flows of events, and recognizing on fly pat-
terns described by the base of chronicles. Efficient algorithms exist which explain that
this approach has been used for industrial applications as well as medical ones [2, 5, 6].
    One of the key issues of model-based approaches for on-line monitoring is the size
of the model which is generally too large when dealing with real applications. Dis-
tributed or decentralized approaches have been proposed to cope with this problem, like
[7–10]. The idea is to consider the system as a set of interacting components instead of a
unique entity. The behavior of the system is thus described by a set of local component
models and by the synchronization constraints between the component models.
    Considering chronicle-based approaches, to our knowledge, no distributed approaches
exist and the contribution of this paper consists in adapting the chronicle-based ap-
proach to distributed systems.
?
    under the direction of M.-O. Cordier
                                            Proceedings of CAiSE-DC 2008               29

     This work has been motivated by an application that aims at monitoring the behavior
of software components, and more precisely of web services within the context of the
WS-D IAMOND (Web Service DIAgnosability, MONitoring and Diagnosis) European
project.In this context, a request is sent to a web service which collaborates with other
services to provide the adequate reply. Faults may propagate from one service to another
and diagnosing them is a crucial issue, in order to react properly. We use a simplified
example of an e-foodshop to illustrate our proposal.
     We first recall the principles of the chronicle recognition approach and give basic
definitions in Sect. 2. We introduce in Sect. 3 the simplified example that will be used
all along this paper. In Sect. 4, we show how to extend the chronicle-based approach to
distributed systems. We first describe the architecture of a chronicle-based distributed
system (4.1). Then we extend the chronicle formalism to deal with synchronization
constraints (4.2). We describe in 4.3 a push-pull algorithm able to compute a global di-
agnosis from the local diagnoses, computed by locally distributed chronicle recognition
systems, by checking the synchronization constraints. After an illustrative example in
4.4, we compare our proposal to related work in Sect. 5 and conclude in Sect. 6.


2 Chronicle Recognition Approach
The chronicle recognition approach (first introduced in [1]) relies on a set of patterns,
named chronicles, which constitutes the chronicle base. Let us recall the formalism and
the chronicle recognition algorithm.

2.1 Formalism of Chronicles
A chronicle is a set of observable events which are time-constrained and is characteristic
of a situation.
An event type defines what is observed within the system, e.g. the name of an activity
act, the name augmented with the fact that act is starting (namely act+ ) or ending
(namely act− ), the name enriched with observable parameters act(?var1 , . . . , ?varn )
or a combination of those possibilities. E denotes the set of possible event types. An
event is a pair (e, ?t) where e ∈ E is an event type and ?t the occurrence date of the
event.
A chronicle (model) C is a pair (S, T ) where S is a set of events and T a set of con-
straints between their occurrence dates. When its variables and its occurrence dates are
instantiated, a chronicle is called a chronicle instance.

2.2 Chronicle Recognition
A chronicle recognition tool, called CRS (Chronicle Recognition System), has been
developed by C. Dousson1 . It is in charge of analyzing the input stream of events and
of identifying, on the fly, any pattern matching a situation described by a chronicle.
Chronicles are compiled into temporal constraint networks which are processed by ef-
ficient graph algorithms. CRS is based on a complete forecast of the possible dates for
 1
     http://crs.elibel.tm.fr/
30        Proceedings of CAiSE-DC 2008

each event that has not occurred yet. This set (called temporal window) is reduced by
propagation of the dates of observed events through the temporal constraint network.
When a new event arrives in the input stream, new instances of chronicles are generated
in the set of hypotheses, which is managed as a tree. Instances are discarded as soon as
possible, when constraints are violated or when temporal windows become empty.


                                                       [1,3]
                                               a                 b       Chronicle model



                    (a,1)                      (a,3)                                 (b,5)           time



                    (a,1)     b,[2,4]        (a,1)             b,[3,4]             - Discarded -
               I1
                                             (a,3)             b,[4,6]              (a,3)          b,[5,6]
                                        I2

                                                                                    (a,3)           (b,5)
                                                                              I3



                            Fig. 1. Principle of chronicle recognition



     Figure 1 shows the principle of the recognition algorithm on a very simple example:
a single chronicle model is defined, containing only two events: (a, ?ta ) and (b, ?tb ),
with ?ta +1 ≤?tb ≤?ta +3. When event (a, 1) is received, instance I1 is created, which
updates the temporal window of the related node b. When a new event (a, 3) occurs, a
new instance I2 is created and the forthcoming temporal window of I1 is updated.
When event (b, 5) is received, instance I3 is created (from I2) and I1 is destroyed as
no more event (b, ?tb ) could match the temporal constraints from now on. Instance I2
is still waiting for another potential event (b, ?tb ) before ?tb > 6. As all the events of
I3 are instantiated, this instance is recognized.


3 Motivating Example

To illustrate the ideas developed in this paper, we consider an orchestration of three
web services, a shop, a supplier and a warehouse, that provide e-shopping capabilities
to users. This application keeps the essential properties of the applications we aim to
monitor. In particular, we consider closed environments, where a workflow-like descrip-
tion of each web service (Fig. 2) involved in the processing of the request is supposed
to be available.
     A customer wants to place an order and selects items on the shop. This list of items
is transferred to a supplier which sends a reservation request to a warehouse, for each
item of the list. The warehouse returns an acknowledgement to the supplier for each
item request and, at the end of the item list, the supplier sends a list of the available
items to the shop which forwards it to the customer. The customer agreement terminates
the process.
     Faults may happen during this process. Figure 2.a shows two of them (represented
by pentagons), related with the shop. First, when placing his order, the customer may
                                                                Proceedings of CAiSE-DC 2008           31




  Fig. 2. (a) Workflow of the SHOP and reduced workflows of (b) the SUPP and (c) the WH



make a data acquisition error, which may result in unexpected items on his reserva-
tion list. Then, a timeout may occur when calling the supplier. We consider that only
timeouts may occur on the supplier (Fig. 2.b), when calling the warehouse. On the
warehouse (Fig. 2.c), things are more complicated. First, an item may be out of stock,
resulting in an incomplete reservation list. Then, an internal error may happen, resulting
in a denial of service.
     Figure 3 presents two processes that may result in the same observation on the shop,
i.e. a cancellation of the order due to an incorrect reservation list: (a) a data acquisition
error, ordering “eggs and teak” instead of “eggs and tea”, for instance, and (b) a stock
error happening on the warehouse. Here, we notice that two distinct errors that happen
on two distinct services can result in the same local problem, hence the necessity of
diagnosing the system globally in order to repair in an adequate way.


                       SHOP             SUPP               WH   SHOP            SUPP              WH
                          {eggs,teak}                              {eggs,tea}
                                                  {eggs}                               {eggs}
                                                  avail                                 avail
                                                  {teak}                                {tea}
                                                  avail                                notAvail
                              {eggs,teak}                              {eggs}


                                            (a)                                  (b)




                   Fig. 3. Two scenarii that may result in a cancelled order
32        Proceedings of CAiSE-DC 2008

4 Extension to Distributed Environments
Diagnosing distributed systems thanks to chronicles requires to define a modular di-
agnosis architecture capable of merging diagnoses provided by local chronicle-based
diagnosers and to enrich the chronicle formalism with synchronization constraints.

4.1 Architecture
Figure 4 summarizes our chronicle-based approach architecture. This decentralized sys-
tem is composed of a global diagnoser (or broker) in charge of merging the local diag-
noses sent by each service and sending global diagnoses to a repair module. Services
are composed of the web service itself, logs generated in real time by the web service,
a base of chronicles generated off-line, a local diagnoser that uses the logs to instantiate
chronicles from the base.


                                                              Broker
                                                        (global diagnoser)



                                  Local diagnoser 1                             Local diagnoser 2


                         logs 1        base of chronicles 1            logs 2        base of chronicles 2


                                   Web service 1                                 Web service 2


                                                       ...                                           ...


                     Fig. 4. General architecture of a distributed system




4.2 Extension of the Formalism of Chronicles
As a fault occurring on a service often propagates to other services, we base our ap-
proach on the merging of local diagnoses. As a consequence, we enrich the initial for-
malism of chronicles with synchronization constraints that allow the broker to spot
homologous chronicles and merge them.

   Before defining a distributed chronicle, let us firstly define what is a synchronization
point.
The status of a variable is a boolean that denotes if the value of a chronicle variable is
normal (¬err) or abnormal (err) in a given execution case. A synchronization vari-
able is a pair (?var, status) where ?var is a (non temporal) chronicle variable and
status the status of this variable inside a given chronicle model.
A synchronization point is a tuple (e, {vars}, servtype ) where e is an event type,
{vars} a set of synchronization variables linked with this event type and servtype a
type of remote service the local service communicates with. An instance of a syn-
chronization point is a synchronization point in which variables are instantiated, and
servtype is instantiated as the effective address of the remote service.
                                            Proceedings of CAiSE-DC 2008              33

A synchronization point is incoming if it corresponds to a servremote → servlocal
communication, outgoing for the contrary (see example chronicle below).

Referring to Fig. 2.a and Sect. 3, here is one of the two synchronization points on the
SHOP, which is instantiated as follows, in the execution case of an external error (see
Fig. 6):
(ChkN Reserve+ , {(?SHOP listIn, err)}, supplier).
It expresses the fact that the error is coming from the supplier, through the ?SHOPlistIn
variable, which is received by the SHOP at the end of the ChkNReserve activity.

    A distributed chronicle is a classical chronicle enriched with a “color” and a “syn-
chronization” part, so that we can merge it with chronicles from adjacent services.
The color of a chronicle K represents the degree of importance of a chronicle and its
capacity to trigger a global diagnosis process. Two colors are used: red for faults that
may trigger the broker and green for normal behaviors and non critical faults.
Distributed chronicle: a distributed chronicle is a tuple CD = (S, T , O, I, K) where
S is a set of events, T a graph of constraints between their occurrence dates, O and I
are respectively two sets of outgoing and incoming synchronization points, and K is the
color of the chronicle.

Let us consider the chronicle describing the external error case. We have the distributed
chronicle model CD = (S, T , O, I, K):
 S = { (ReceiveOrder, ?t1 ),
        (ChkN Reserve− (?SHOP listOut), ?t2 ),
        (ChkN Reserve+ (?SHOP listIn), ?t3 ),
        (SendBill, ?t4 ),
        (ReceiveConf irm, ?t5 ),
        (F orwardOrder, ?t6 )
 }
 T = {?t1